OCR & ASR

Document Purpose: This document details the OCR and ASR engine configurations, language support, and optimization strategies for the Entheory.AI platform.

1. OCR (Optical Character Recognition)

1.1 Engine Selection

Engine	Strengths	Weaknesses	Use Case
Tesseract 5	Open-source, good Hindi support, on-prem	Lower accuracy on complex layouts	Default for most documents
PaddleOCR	High accuracy, table extraction	Heavier compute requirements	Complex forms, tables
Google Cloud Vision	Highest accuracy, handwriting	Cloud-only, cost per page	Fallback for low-confidence
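
The selection logic implied by the table can be sketched as a small routing helper. This is a minimal illustration, not the platform's actual implementation: the `OcrResult` type, the engine identifiers, and the 0.80 confidence threshold are all assumptions, since the source does not state how confidence is measured or where the fallback cut-off sits.

```python
from dataclasses import dataclass


@dataclass
class OcrResult:
    engine: str        # which engine produced the text
    text: str
    confidence: float  # mean confidence, assumed normalized to 0..1


# Hypothetical threshold; the real cut-off is not given in this document.
FALLBACK_THRESHOLD = 0.80


def pick_primary_engine(has_tables: bool) -> str:
    """Tesseract by default; PaddleOCR for complex forms and tables."""
    return "paddleocr" if has_tables else "tesseract"


def needs_cloud_fallback(result: OcrResult,
                         threshold: float = FALLBACK_THRESHOLD) -> bool:
    """True when an on-prem result is too uncertain to keep."""
    return (result.engine != "google_cloud_vision"
            and result.confidence < threshold)
```

Keeping the fallback decision separate from the primary choice means the cloud engine is only billed for pages the on-prem engines could not read confidently.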

1.2 Language Configuration

# OCR Language Configuration
tesseract:
  languages:
    - eng        # English
    - hin        # Hindi
    - eng+hin    # Bilingual: both language models loaded together for code-switched text

  # Future additions
  planned:
    - tam        # Tamil (Q2 2025)
    - tel        # Telugu (Q3 2025)
    - kan        # Kannada (Q4 2025)
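
Tesseract accepts several languages at once by joining their codes with `+`, which is what the `eng+hin` entry expresses. The sketch below shows one way a configured language list could be turned into that argument and passed to `pytesseract`. The helper names are hypothetical, and it assumes `pytesseract`, Pillow, and the relevant traineddata files are installed.

```python
def tesseract_lang_arg(languages: list[str]) -> str:
    """Join language codes into the 'eng+hin' form Tesseract's -l option expects."""
    seen: list[str] = []
    for code in languages:
        # Entries may already be combined (e.g. "eng+hin"); split and de-duplicate.
        for part in code.split("+"):
            if part and part not in seen:
                seen.append(part)
    return "+".join(seen)


def ocr_image(path: str, languages: list[str]) -> str:
    """Run Tesseract on one image file using the combined language argument."""
    # Imported lazily so tesseract_lang_arg stays dependency-free.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(path), lang=tesseract_lang_arg(languages)
    )
```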

1.3 Document Preprocessing

Step	Purpose	Tool
Deskew	Correct rotation	OpenCV
Denoise	Remove scanner artifacts	OpenCV bilateral filter
Binarize	Improve contrast	Adaptive thresholding
DPI Normalization	Standardize resolution	Pillow (300 DPI target)
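
As an illustration of the DPI normalization step, the arithmetic for rescaling a scan to the 300 DPI target can be written as follows. This is a sketch under the assumption that the source DPI is known from image metadata; the function name is hypothetical.

```python
TARGET_DPI = 300  # target resolution from the table above


def normalized_size(width_px: int, height_px: int, source_dpi: float,
                    target_dpi: int = TARGET_DPI) -> tuple[int, int]:
    """Return the pixel size an image needs so that it represents target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    scale = target_dpi / source_dpi
    return round(width_px * scale), round(height_px * scale)
```

For example, an A4 page scanned at 150 DPI (1240 x 1754 px) would be upscaled to 2480 x 3508 px; the resulting size could then be passed to Pillow's `Image.resize`.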

1.4 Post-Processing

Step	Purpose	Implementation
Spell Correction	Fix OCR errors	SymSpell + medical dictionary
Layout Analysis	Preserve structure	Document segmentation
Table Extraction	Structured tables	Camelot / PaddleOCR tables
Entity Linking	Normalize terms	Medical ontology lookup
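
The dictionary-based spell correction step can be illustrated with a small stand-in. Note that this sketch uses Python's standard-library `difflib` rather than SymSpell, purely to stay dependency-free; the vocabulary and the 0.85 similarity cut-off are made-up examples, not the platform's medical dictionary or tuning.

```python
import difflib

# Tiny illustrative vocabulary; a real deployment would load a medical dictionary.
MEDICAL_TERMS = ("paracetamol", "amoxicillin", "metformin", "hemoglobin")


def correct_token(token: str, vocabulary=MEDICAL_TERMS, cutoff: float = 0.85) -> str:
    """Replace a token with its closest dictionary term, if one is similar enough."""
    lowered = token.lower()
    if lowered in vocabulary:
        return token
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text: str) -> str:
    """Apply token-level correction to whitespace-separated OCR output."""
    return " ".join(correct_token(tok) for tok in text.split())
```

A high cut-off keeps ordinary words such as "take" or "daily" untouched while still repairing near-misses such as "paracetmol".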

1.5 Use Case Links

Use Case ID	Name	Link
PROC-001	Queue OCR Job	View
PROC-002	Detect Document Language	View
PROC-003	Execute Tesseract Engine	View
PROC-007	Classify Document Type	View
PROC-009	Extract Structured Fields	View

2. ASR (Automatic Speech Recognition)

2.1 Engine Selection

Engine	Model Size	Languages	Use Case
Whisper Large V3	~3GB	Multilingual	Production inference
Whisper Medium	~1.5GB	Multilingual	Low-latency fallback
Custom Fine-tuned	Variable	Hindi + English	Medical terminology

2.2 Model Configuration

# ASR Configuration
whisper:
  model: "large-v3"
  language: "auto"  # Auto-detect with Hindi priority
  task: "transcribe"

  # Performance tuning
  beam_size: 5
  best_of: 5
  compression_ratio_threshold: 2.4
  no_speech_threshold: 0.6

  # Medical vocabulary boost
  initial_prompt: "Consultation with patient. Medical terms include..."
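
One way this configuration could be handed to the open-source `openai-whisper` package is sketched below. It assumes that package is the runtime (the source does not say which Whisper implementation is used) and treats `"auto"` as a platform-level alias, since `openai-whisper` itself auto-detects the language when `language` is `None`. The "Hindi priority" behaviour is not implemented here, and the helper names are hypothetical.

```python
def to_transcribe_kwargs(cfg: dict) -> dict:
    """Translate the platform-style config into keyword arguments for transcribe()."""
    # "model" selects the checkpoint and is not a transcribe() argument.
    kwargs = {key: value for key, value in cfg.items() if key != "model"}
    # Assumption: "auto" means "let Whisper detect the language".
    if kwargs.get("language") == "auto":
        kwargs["language"] = None
    return kwargs


def transcribe(path: str, cfg: dict) -> str:
    """Load the configured model and return the transcript text for one file."""
    import whisper  # openai-whisper; imported lazily because it is a heavy dependency

    model = whisper.load_model(cfg["model"])
    return model.transcribe(path, **to_transcribe_kwargs(cfg))["text"]
```

In `openai-whisper`, `beam_size` applies when decoding at temperature 0 and `best_of` applies when sampling at higher temperatures, so both values can be set together through `transcribe()`.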

2.3 Speaker Diarization

Component	Technology	Purpose
Segmentation	PyAnnote	Detect speaker changes
Embedding	ECAPA-TDNN	Speaker identification
Clustering	Spectral	Group speaker segments
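
Once diarization has produced speaker turns, they still have to be matched to transcript segments. A minimal sketch of that matching step is shown below, assigning each ASR segment to the speaker turn it overlaps most. The tuple layouts and speaker labels are assumptions for illustration, not the output format of any specific diarization library.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(asr_segments, speaker_turns):
    """Label each (start, end, text) segment with the best-overlapping speaker.

    speaker_turns is a list of (start, end, speaker) tuples.
    Segments with no overlapping turn are labelled "unknown".
    """
    labelled = []
    for start, end, text in asr_segments:
        best = max(
            speaker_turns,
            key=lambda turn: overlap(start, end, turn[0], turn[1]),
            default=None,
        )
        if best is None or overlap(start, end, best[0], best[1]) == 0.0:
            speaker = "unknown"
        else:
            speaker = best[2]
        labelled.append((speaker, text))
    return labelled
```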

2.4 Code-Switching Handling

Hinglish (Hindi-English mix) is common in Indian clinical settings:

Challenge	Solution
Language switching mid-sentence	Whisper's multilingual capability
Medical terms in English	Medical vocabulary injection
Hindi transcription accuracy	Fine-tuning on Indian clinical audio
Regional accents	Accent-aware model fine-tuning

2.5 Audio Preprocessing

Step	Purpose	Tool
Noise Reduction	Remove background noise	DeepFilterNet
Normalization	Consistent volume	PyDub
Segmentation	Split long recordings	Voice Activity Detection
Format Conversion	Standardize format	FFmpeg (16 kHz, mono)
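
The format conversion step could be driven from Python roughly as follows. This is a sketch: the function name and the choice of 16-bit PCM WAV output are assumptions, while `-ac`, `-ar`, `-c:a`, and `-y` are standard FFmpeg options.

```python
def ffmpeg_to_wav_cmd(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Build an FFmpeg command that converts src to mono WAV at sample_rate Hz."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", src,               # input file
        "-ac", "1",              # one audio channel (mono)
        "-ar", str(sample_rate), # output sample rate in Hz
        "-c:a", "pcm_s16le",     # 16-bit PCM (assumed output encoding)
        dst,
    ]
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with uploaded file names.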

2.6 Use Case Links

Use Case ID	Name	Link
CAP-001	Start Recording / Start Encounter	View
CAP-002	Stop / Pause Recording	View
PROC-005	Queue ASR Job	View
PROC-006	Execute Whisper Engine	View
PROC-011	Identify Speaker Turns	View
PROC-012	Generate Encounter Note	View

3. Performance Benchmarks

3.1 OCR Accuracy

Document Type	English	Hindi	Hinglish
Printed Lab Reports	96%	88%	85%
Printed Prescriptions	94%	85%	82%
Scanned Forms	92%	82%	78%
Handwritten Notes	75%	65%	60%

3.2 ASR Accuracy (Word Error Rate, lower is better)

Audio Type	English	Hindi	Hinglish
Quiet Room	6%	10%	14%
Clinic Background	9%	14%	18%
Mobile Recording	12%	18%	22%
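
For reference, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The sketch below implements that standard definition. It does no text normalization, and whether the benchmark above normalizes casing, punctuation, or script is not stated in this document.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, the reference "the patient has fever" against the hypothesis "the patient had a fever" has one substitution and one insertion over four reference words, giving a WER of 0.5.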

3.3 Processing Speed

Task	Target	Current
OCR per page	<5 seconds	~3 seconds
ASR (real-time factor)	<0.5x	~0.4x
Full encounter processing	<2 minutes	~90 seconds
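
The real-time factor in the table is assumed here to follow the usual definition, processing time divided by audio duration, so values below 1.0 mean faster than real time. A small sketch of the check against the target, with hypothetical function names:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def meets_target(rtf: float, target: float = 0.5) -> bool:
    """True when the measured factor is under the target from the table."""
    return rtf < target
```

Under this definition, transcribing a 60-second clip in 24 seconds gives a factor of 0.4, which meets the <0.5x target.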

4. Infrastructure

4.1 GPU Requirements

Workload	GPU Type	Memory	Count
Whisper Large	A10/A100	24GB	2+
Tesseract/PaddleOCR	CPU or T4	8GB	2+
Batch Processing	A100	40GB	1+

4.2 Queue Architecture

Audio Upload → [ASR Queue] → Whisper Workers → [NLP Queue] → SOAP Generation
                    ↓
Document Upload → [OCR Queue] → Tesseract Workers → [Entity Queue] → NER

See: Pipelines & Ingestion for full architecture
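
The staged, queue-connected shape of the audio path can be illustrated in-process with Python's standard library. This is only a toy model of the diagram: the production queue technology is not named in this document, and `fake_whisper` and `fake_soap` are placeholder handlers standing in for the Whisper workers and SOAP generation.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def run_stage(inbox: queue.Queue, outbox: queue.Queue, handler) -> None:
    """Consume items from inbox, process them, and forward results to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        outbox.put(handler(item))


def fake_whisper(audio_id: str) -> str:
    return f"transcript({audio_id})"


def fake_soap(transcript: str) -> str:
    return f"soap({transcript})"


def process_encounters(audio_ids):
    """Push audio IDs through ASR queue -> worker -> NLP queue -> worker."""
    asr_queue, nlp_queue, done = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=run_stage, args=(asr_queue, nlp_queue, fake_whisper)),
        threading.Thread(target=run_stage, args=(nlp_queue, done, fake_soap)),
    ]
    for worker in workers:
        worker.start()
    for audio_id in audio_ids:
        asr_queue.put(audio_id)
    asr_queue.put(STOP)
    for worker in workers:
        worker.join()

    results = []
    while True:
        item = done.get()
        if item is STOP:
            break
        results.append(item)
    return results
```

The document path (OCR queue, Tesseract workers, entity queue, NER) would follow the same pattern with different handlers.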

AI & ML Overview – High-level AI capabilities
Safety & Evaluation – Model validation and safety
MCP OCR Servers – OCR deployment options
Processing Use Cases – Detailed use cases

Document Owner: AI/ML Engineering Team
Last Updated: 2024-12-09
Next Review: Quarterly (aligned with model releases)