OCR & ASR
Document Purpose: This document details the OCR and ASR engine configurations, language support, and optimization strategies for the Entheory.AI platform.
1. OCR (Optical Character Recognition)
1.1 Engine Selection
| Engine | Strengths | Weaknesses | Use Case |
| --- | --- | --- | --- |
| Tesseract 5 | Open-source, good Hindi support, on-prem | Lower accuracy on complex layouts | Default for most documents |
| PaddleOCR | High accuracy, table extraction | Heavier compute requirements | Complex forms, tables |
| Google Cloud Vision | Highest accuracy, handwriting | Cloud-only, cost per page | Fallback for low-confidence |
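The routing implied by the engine table can be sketched as follows. This is a minimal illustration, not the platform's actual dispatcher: the `OcrResult` type, the engine keys, the `has_tables` flag, and the 0.80 threshold are all assumptions made for the example, since the document does not state the real cut-off.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical cut-off; the document does not specify the actual value.
LOW_CONFIDENCE_THRESHOLD = 0.80


@dataclass
class OcrResult:
    engine: str
    text: str
    confidence: float  # mean confidence in [0, 1]


def pick_primary_engine(has_tables: bool) -> str:
    """Complex forms and tables go to PaddleOCR; everything else to Tesseract."""
    return "paddleocr" if has_tables else "tesseract"


def run_with_fallback(
    page: bytes,
    has_tables: bool,
    engines: Dict[str, Callable[[bytes], OcrResult]],
    threshold: float = LOW_CONFIDENCE_THRESHOLD,
) -> OcrResult:
    """Run the primary engine, then try the cloud fallback on low confidence."""
    result = engines[pick_primary_engine(has_tables)](page)
    if result.confidence < threshold and "cloud_vision" in engines:
        fallback = engines["cloud_vision"](page)
        # Keep whichever result the engines are more confident about.
        if fallback.confidence > result.confidence:
            return fallback
    return result
```

Passing the engines in as callables keeps the sketch independent of any particular OCR client library, and makes the cloud fallback optional for on-prem deployments.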
1.2 Language Configuration
# OCR Language Configuration
tesseract:
  languages:
    - eng      # English
    - hin      # Hindi
    - eng+hin  # Bilingual (code-switch)
  # Future additions
  planned:
    - tam      # Tamil (Q2 2025)
    - tel      # Telugu (Q3 2025)
    - kan      # Kannada (Q4 2025)
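Tesseract accepts several language packs joined with `+` in its `-l` option, which is how the `eng+hin` entry above works. A small sketch of turning detected languages into that argument is shown below; the function name, the `SUPPORTED` tuple, and the fallback to `eng` are assumptions for illustration.

```python
from typing import Iterable

# Mirrors the currently enabled language packs in the configuration above.
SUPPORTED = ("eng", "hin")


def tesseract_lang_arg(detected: Iterable[str], default: str = "eng") -> str:
    """Build the value for Tesseract's -l option from detected language codes.

    Unsupported codes (for example planned languages such as "tam") are
    ignored; if nothing supported remains, the assumed default is used.
    """
    seen = set(detected)
    wanted = [code for code in SUPPORTED if code in seen]
    return "+".join(wanted) if wanted else default
```

For a bilingual page, `tesseract_lang_arg(["hin", "eng"])` yields `eng+hin`, which could then be passed as `tesseract input.png output -l eng+hin`.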
1.3 Document Preprocessing
| Step | Purpose | Tool |
| --- | --- | --- |
| Deskew | Correct rotation | OpenCV |
| Denoise | Remove scanner artifacts | OpenCV bilateral filter |
| Binarize | Improve contrast | Adaptive thresholding |
| DPI Normalization | Standardize resolution | Pillow (300 DPI target) |
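As an illustration of the DPI normalization step, the sketch below computes the pixel dimensions a scan would need to reach the 300 DPI target. It only does the arithmetic; the function name is hypothetical, and the source DPI is assumed to be known (for example from image metadata).

```python
from typing import Tuple


def size_at_target_dpi(
    width_px: int, height_px: int, source_dpi: float, target_dpi: float = 300
) -> Tuple[int, int]:
    """Return the (width, height) in pixels that preserves physical page size
    when resampling from source_dpi to target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    factor = target_dpi / source_dpi
    return round(width_px * factor), round(height_px * factor)
```

An A4 page scanned at 150 DPI (1240 x 1754 pixels) would be upsampled to 2480 x 3508 pixels; the resulting tuple is the kind of size one would hand to Pillow's `Image.resize`.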
1.4 Post-Processing
| Step | Purpose | Implementation |
| --- | --- | --- |
| Spell Correction | Fix OCR errors | SymSpell + medical dictionary |
| Layout Analysis | Preserve structure | Document segmentation |
| Table Extraction | Structured tables | Camelot / PaddleOCR tables |
| Entity Linking | Normalize terms | Medical ontology lookup |
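The idea behind dictionary-based spell correction can be sketched with the standard library alone. Note that this uses `difflib` as a stand-in and is not SymSpell; the term list and the 0.8 similarity cut-off are illustrative assumptions.

```python
import difflib
from typing import Sequence

# Illustrative vocabulary; a real medical dictionary would be far larger.
MEDICAL_TERMS = ["paracetamol", "metformin", "hemoglobin", "creatinine"]


def correct_token(
    token: str, vocabulary: Sequence[str] = MEDICAL_TERMS, cutoff: float = 0.8
) -> str:
    """Replace a token with its closest dictionary term, if one is close enough.

    Tokens already in the vocabulary, or with no sufficiently similar term,
    are returned unchanged.
    """
    lowered = token.lower()
    if lowered in vocabulary:
        return token
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token
```

A typical OCR confusion such as `paracetam0l` (zero for the letter o) maps back to `paracetamol`, while unrelated tokens pass through untouched. The cut-off trades missed corrections against false ones and would need tuning on real OCR output.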
1.5 Use Case Links
| Use Case ID | Name | Link |
| --- | --- | --- |
| PROC-001 | Queue OCR Job | View |
| PROC-002 | Detect Document Language | View |
| PROC-003 | Execute Tesseract Engine | View |
| PROC-007 | Classify Document Type | View |
| PROC-009 | Extract Structured Fields | View |
2. ASR (Automatic Speech Recognition)
2.1 Engine Selection
| Engine | Model Size | Languages | Use Case |
| --- | --- | --- | --- |
| Whisper Large V3 | ~3GB | Multilingual | Production inference |
| Whisper Medium | ~1.5GB | Multilingual | Low-latency fallback |
| Custom Fine-tuned | Variable | Hindi + English | Medical terminology |
2.2 Model Configuration
# ASR Configuration
whisper:
  model: "large-v3"
  language: "auto"  # Auto-detect with Hindi priority
  task: "transcribe"
  # Performance tuning
  beam_size: 5
  best_of: 5
  compression_ratio_threshold: 2.4
  no_speech_threshold: 0.6
  # Medical vocabulary boost
  initial_prompt: "Consultation with patient. Medical terms include..."
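To make `compression_ratio_threshold` more concrete: Whisper measures how well the decoded text compresses with zlib, and treats highly compressible output (typically a phrase repeated in a loop) as a failed decode. The sketch below reproduces that measure with the standard library; the `looks_degenerate` helper is a hypothetical name added for illustration.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of raw UTF-8 size to zlib-compressed size; higher means more repetitive."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_degenerate(text: str, threshold: float = 2.4) -> bool:
    """True when the text is repetitive enough to exceed the configured threshold."""
    return compression_ratio(text) > threshold
```

A transcript such as `"thank you " * 50` compresses extremely well and is flagged, whereas an ordinary clinical sentence stays well below 2.4.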
2.3 Speaker Diarization
| Component | Technology | Purpose |
| --- | --- | --- |
| Segmentation | PyAnnote | Detect speaker changes |
| Embedding | ECAPA-TDNN | Speaker identification |
| Clustering | Spectral | Group speaker segments |
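Once diarization has produced speaker turns, they still need to be aligned with ASR segments. The document does not describe how this is done, so the following is only one plausible approach: label each transcript segment with the speaker whose turn overlaps it most. The tuple layouts and the `UNKNOWN` label are assumptions.

```python
from typing import List, Tuple

Turn = Tuple[float, float, str]     # (start_s, end_s, speaker label)
Segment = Tuple[float, float, str]  # (start_s, end_s, transcript text)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: List[Segment], turns: List[Turn]) -> List[Tuple[str, str]]:
    """Pair each transcript segment with the speaker of the most-overlapping turn."""
    labelled = []
    for start, end, text in segments:
        best = max(turns, key=lambda t: overlap(start, end, t[0], t[1]), default=None)
        if best is None or overlap(start, end, best[0], best[1]) == 0.0:
            labelled.append(("UNKNOWN", text))
        else:
            labelled.append((best[2], text))
    return labelled
```

Segments that fall entirely outside every turn are kept and marked `UNKNOWN` rather than dropped, so no transcript text is lost.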
2.4 Code-Switching Handling
Hinglish (Hindi-English mix) is common in Indian clinical settings:
| Challenge | Solution |
| --- | --- |
| Language switching mid-sentence | Whisper's multilingual capability |
| Medical terms in English | Medical vocabulary injection |
| Hindi transcription accuracy | Fine-tuning on Indian clinical audio |
| Regional accents | Accent-aware model fine-tuning |
2.5 Audio Preprocessing
| Step | Purpose | Tool |
| --- | --- | --- |
| Noise Reduction | Remove background noise | DeepFilterNet |
| Normalization | Consistent volume | PyDub |
| Segmentation | Split long recordings | Voice Activity Detection |
| Format Conversion | Standardize format | FFmpeg (16kHz, mono) |
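The format conversion step can be expressed as a single FFmpeg invocation. The sketch below only builds the argument list; the helper name and file paths are illustrative, while `-ar` (sample rate), `-ac` (channel count), and `-y` (overwrite output) are standard FFmpeg options.

```python
from typing import List


def ffmpeg_convert_cmd(
    src: str, dst: str, sample_rate: int = 16000, channels: int = 1
) -> List[str]:
    """Build an FFmpeg command that resamples audio to 16 kHz mono by default."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", src,               # input recording
        "-ar", str(sample_rate), # output sample rate in Hz
        "-ac", str(channels),    # number of output channels
        dst,
    ]
```

The list can be passed directly to `subprocess.run(cmd, check=True)`; using a list rather than a shell string avoids quoting problems with file names that contain spaces.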
2.6 Use Case Links
| Use Case ID | Name | Link |
| --- | --- | --- |
| CAP-001 | Start Recording / Start Encounter | View |
| CAP-002 | Stop / Pause Recording | View |
| PROC-005 | Queue ASR Job | View |
| PROC-006 | Execute Whisper Engine | View |
| PROC-011 | Identify Speaker Turns | View |
| PROC-012 | Generate Encounter Note | View |
3. Accuracy and Performance
3.1 OCR Accuracy
| Document Type | English | Hindi | Hinglish |
| --- | --- | --- | --- |
| Printed Lab Reports | 96% | 88% | 85% |
| Printed Prescriptions | 94% | 85% | 82% |
| Scanned Forms | 92% | 82% | 78% |
| Handwritten Notes | 75% | 65% | 60% |
3.2 ASR Accuracy (Word Error Rate)
| Audio Type | English | Hindi | Hinglish |
| --- | --- | --- | --- |
| Quiet Room | 6% | 10% | 14% |
| Clinic Background | 9% | 14% | 18% |
| Mobile Recording | 12% | 18% | 22% |
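For reference, word error rate is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. A minimal sketch follows; it splits on whitespace only and does no text normalization, which a real evaluation would likely add.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

With the reference `patient has high fever since monday` and the hypothesis `patient has fever since sunday`, there is one deletion and one substitution over six reference words, giving a WER of about 33%. Lower values are better, and WER can exceed 100% when the hypothesis contains many insertions.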
3.3 Processing Speed
| Task | Target | Current |
| --- | --- | --- |
| OCR per page | <5 seconds | ~3 seconds |
| ASR (real-time factor) | <0.5x | ~0.4x |
| Full encounter processing | <2 minutes | ~90 seconds |
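The real-time factor in the table is processing time divided by audio duration, so values below 1.0 mean transcription runs faster than playback. A short sketch, with hypothetical function names and the 0.5 target taken from the table:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; lower is faster."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def meets_rtf_target(processing_seconds: float, audio_seconds: float,
                     target: float = 0.5) -> bool:
    """True when the measured real-time factor is below the target."""
    return real_time_factor(processing_seconds, audio_seconds) < target
```

For example, a 10-minute recording (600 s) transcribed in 240 s has a real-time factor of 0.4, which is within the target.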
4. Infrastructure
4.1 GPU Requirements
| Workload | GPU Type | Memory | Count |
| --- | --- | --- | --- |
| Whisper Large | A10/A100 | 24GB | 2+ |
| Tesseract/PaddleOCR | CPU or T4 | 8GB | 2+ |
| Batch Processing | A100 | 40GB | 1+ |
4.2 Queue Architecture
Audio Upload → [ASR Queue] → Whisper Workers → [NLP Queue] → SOAP Generation
↓
Document Upload → [OCR Queue] → Tesseract Workers → [Entity Queue] → NER
See: Pipelines & Ingestion for full architecture
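The two-stage flow in the diagram can be sketched with in-process queues. This is purely illustrative: the actual queue technology is not named in this document, and the worker functions below are stubs standing in for the real Whisper and Tesseract workers.

```python
import queue
from typing import Callable


def run_stage(inbox: queue.Queue, outbox: queue.Queue,
              worker: Callable[[dict], dict]) -> int:
    """Drain inbox, apply worker to each job, and forward results to outbox.

    Returns the number of jobs processed.
    """
    processed = 0
    while True:
        try:
            job = inbox.get_nowait()
        except queue.Empty:
            return processed
        outbox.put(worker(job))
        processed += 1


# Stub workers (hypothetical); real ones would call the ASR and OCR engines.
def whisper_worker(job: dict) -> dict:
    return {**job, "transcript": f"<transcript of {job['audio']}>"}


def tesseract_worker(job: dict) -> dict:
    return {**job, "text": f"<text of {job['document']}>"}


asr_queue, nlp_queue = queue.Queue(), queue.Queue()
ocr_queue, entity_queue = queue.Queue(), queue.Queue()

asr_queue.put({"id": 1, "audio": "encounter.wav"})
ocr_queue.put({"id": 2, "document": "lab_report.pdf"})

run_stage(asr_queue, nlp_queue, whisper_worker)       # ASR Queue -> NLP Queue
run_stage(ocr_queue, entity_queue, tesseract_worker)  # OCR Queue -> Entity Queue
```

Each stage only knows its inbox and outbox, so the ASR and OCR paths can scale their worker counts independently.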
Document Owner: AI/ML Engineering Team
Last Updated: 2024-12-09
Next Review: Quarterly (aligned with model releases)