OCR & ASR

Document Purpose: This document details the OCR and ASR engine configurations, language support, and optimization strategies for the Entheory.AI platform.


1. OCR (Optical Character Recognition)

1.1 Engine Selection

| Engine | Strengths | Weaknesses | Use Case |
|---|---|---|---|
| Tesseract 5 | Open-source, good Hindi support, on-prem | Lower accuracy on complex layouts | Default for most documents |
| PaddleOCR | High accuracy, table extraction | Heavier compute requirements | Complex forms, tables |
| Google Cloud Vision | Highest accuracy, handwriting | Cloud-only, cost per page | Fallback for low-confidence results |
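
The routing implied by the "Use Case" column can be sketched as follows. This is a minimal illustration, not the platform's actual router: the `DocumentTraits` fields, the function names, and the 0.80 confidence cutoff are all assumptions, since the document does not state what counts as "low confidence".

```python
from dataclasses import dataclass

# Hypothetical cutoff: the document does not define "low confidence".
LOW_CONFIDENCE_THRESHOLD = 0.80


@dataclass
class DocumentTraits:
    """Layout hints for one document (hypothetical fields)."""
    has_tables: bool = False
    is_complex_form: bool = False


def select_primary_engine(traits: DocumentTraits) -> str:
    """Pick the first-pass engine from the document's layout traits."""
    if traits.has_tables or traits.is_complex_form:
        return "paddleocr"
    return "tesseract"  # default for most documents


def needs_cloud_fallback(mean_confidence: float,
                         threshold: float = LOW_CONFIDENCE_THRESHOLD) -> bool:
    """Return True when the first-pass result should be re-run on Google Cloud Vision."""
    return mean_confidence < threshold
```

Keeping the fallback decision separate from engine selection means the per-page cloud cost is only incurred after an on-prem pass has already been scored.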

1.2 Language Configuration

# OCR Language Configuration
tesseract:
  languages:
    - eng        # English
    - hin        # Hindi
    - eng+hin    # Bilingual (code-switch)

  # Future additions
  planned:
    - tam        # Tamil (Q2 2025)
    - tel        # Telugu (Q3 2025)
    - kan        # Kannada (Q4 2025)
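
As a rough sketch of how this configuration might reach the engine: Tesseract's command line takes languages through `-l`, joined with `+` for multi-language recognition. The helper names below are hypothetical, and the set of enabled codes simply mirrors the `languages` list above.

```python
# Codes enabled in the configuration above; planned languages are rejected.
ENABLED_LANGUAGES = {"eng", "hin"}


def build_lang_argument(detected: list[str]) -> str:
    """Join detected language codes into Tesseract's `-l` value, e.g. "eng+hin"."""
    unique = list(dict.fromkeys(detected))  # de-duplicate while keeping order
    if not unique:
        raise ValueError("at least one language is required")
    unsupported = [code for code in unique if code not in ENABLED_LANGUAGES]
    if unsupported:
        raise ValueError(f"language(s) not enabled: {', '.join(unsupported)}")
    return "+".join(unique)


def build_tesseract_command(image_path: str, output_base: str,
                            detected: list[str]) -> list[str]:
    """Assemble the CLI call: tesseract <image> <outputbase> -l <langs>."""
    return ["tesseract", image_path, output_base, "-l", build_lang_argument(detected)]
```

For example, `build_tesseract_command("scan.png", "out", ["eng", "hin"])` yields a command ending in `-l eng+hin`, while a planned code such as `tam` raises an error until it is enabled.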

1.3 Document Preprocessing

| Step | Purpose | Tool |
|---|---|---|
| Deskew | Correct rotation | OpenCV |
| Denoise | Remove scanner artifacts | OpenCV bilateral filter |
| Binarize | Improve contrast | Adaptive thresholding |
| DPI Normalization | Standardize resolution | Pillow (300 DPI target) |
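
The arithmetic behind the DPI normalization step can be sketched as below: a page keeps its physical size when its pixel dimensions are scaled by target DPI over source DPI. The function name is hypothetical; only the 300 DPI target comes from the table.

```python
TARGET_DPI = 300  # target stated in the preprocessing table


def normalized_size(width_px: int, height_px: int, source_dpi: float,
                    target_dpi: int = TARGET_DPI) -> tuple[int, int]:
    """Return the pixel size a page needs to keep its physical size at target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    scale = target_dpi / source_dpi
    return round(width_px * scale), round(height_px * scale)
```

An A4 page scanned at 150 DPI (1240 x 1754 pixels) would be resized to 2480 x 3508; the resulting tuple is the kind of value Pillow's `Image.resize` accepts.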

1.4 Post-Processing

| Step | Purpose | Implementation |
|---|---|---|
| Spell Correction | Fix OCR errors | SymSpell + medical dictionary |
| Layout Analysis | Preserve structure | Document segmentation |
| Table Extraction | Structured tables | Camelot / PaddleOCR tables |
| Entity Linking | Normalize terms | Medical ontology lookup |
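
To illustrate dictionary-backed spell correction, here is a small stand-in that uses Python's standard-library `difflib` rather than SymSpell, so it runs without extra dependencies. The five-term vocabulary and the 0.8 similarity cutoff are assumptions for the example; a real deployment would load a full medical lexicon.

```python
import difflib

# Hypothetical mini-dictionary standing in for the medical lexicon.
MEDICAL_TERMS = ["paracetamol", "metformin", "hemoglobin", "creatinine", "amoxicillin"]


def correct_token(token: str, vocabulary: list[str] = MEDICAL_TERMS,
                  cutoff: float = 0.8) -> str:
    """Replace a token with its closest dictionary entry, or keep it unchanged."""
    lowered = token.lower()
    if lowered in vocabulary:
        return lowered
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text: str) -> str:
    """Apply token-level correction to whitespace-separated OCR output."""
    return " ".join(correct_token(tok) for tok in text.split())
```

Typical OCR confusions such as `0` for `o` or `l` for `i` are close enough to the dictionary entry to be repaired, while tokens with no near match (numbers, ordinary words) pass through untouched.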

Related use cases:

| Use Case ID | Name |
|---|---|
| PROC-001 | Queue OCR Job |
| PROC-002 | Detect Document Language |
| PROC-003 | Execute Tesseract Engine |
| PROC-007 | Classify Document Type |
| PROC-009 | Extract Structured Fields |

2. ASR (Automatic Speech Recognition)

2.1 Engine Selection

| Engine | Model Size | Languages | Use Case |
|---|---|---|---|
| Whisper Large V3 | ~3 GB | Multilingual | Production inference |
| Whisper Medium | ~1.5 GB | Multilingual | Low-latency fallback |
| Custom Fine-tuned | Variable | Hindi + English | Medical terminology |

2.2 Model Configuration

# ASR Configuration
whisper:
  model: "large-v3"
  language: "auto"  # Auto-detect with Hindi priority
  task: "transcribe"

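  # Performance tuning
  beam_size: 5   # beams used when decoding at temperature 0
  best_of: 5     # candidates used only when sampling at temperature > 0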
  compression_ratio_threshold: 2.4
  no_speech_threshold: 0.6

  # Medical vocabulary boost
  initial_prompt: "Consultation with patient. Medical terms include..."
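
One way this section could be handed to the open-source `openai-whisper` package is sketched below. The helper name is hypothetical, and it assumes `"auto"` is this platform's own spelling for auto-detection: Whisper itself auto-detects when `language` is `None`. The "Hindi priority" behaviour mentioned in the config comment is not modelled here.

```python
from typing import Any


def to_transcribe_kwargs(config: dict[str, Any]) -> dict[str, Any]:
    """Translate the `whisper:` config section into keyword arguments for transcribe()."""
    # `model` selects the checkpoint at load time, so it is not a transcribe() argument.
    kwargs = {key: value for key, value in config.items() if key != "model"}
    # Assumption: "auto" means auto-detect, which Whisper expresses as language=None.
    if kwargs.get("language") == "auto":
        kwargs["language"] = None
    return kwargs


config = {
    "model": "large-v3",
    "language": "auto",
    "task": "transcribe",
    "beam_size": 5,
    "best_of": 5,
    "compression_ratio_threshold": 2.4,
    "no_speech_threshold": 0.6,
    "initial_prompt": "Consultation with patient. Medical terms include...",
}
kwargs = to_transcribe_kwargs(config)
# With openai-whisper installed, the call would look like:
#   model = whisper.load_model(config["model"])
#   result = model.transcribe("encounter.wav", **kwargs)
```

Keeping the translation in one function makes it easy to see which config keys are passed through unchanged and which are platform conventions.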

2.3 Speaker Diarization

| Component | Technology | Purpose |
|---|---|---|
| Segmentation | PyAnnote | Detect speaker changes |
| Embedding | ECAPA-TDNN | Speaker identification |
| Clustering | Spectral | Group speaker segments |
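
Once these components have produced speaker-labelled turns, the turns still have to be matched to transcript segments. A common approach, shown here as a sketch rather than the platform's confirmed method, is to give each transcript segment the speaker whose turns overlap it the longest. The `Turn` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    """One diarization turn, with times in seconds."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(seg_start: float, seg_end: float,
                   turns: list[Turn]) -> Optional[str]:
    """Return the speaker whose turns overlap the transcript segment the most."""
    totals: dict[str, float] = {}
    for turn in turns:
        shared = overlap(seg_start, seg_end, turn.start, turn.end)
        if shared > 0:
            totals[turn.speaker] = totals.get(turn.speaker, 0.0) + shared
    return max(totals, key=totals.get) if totals else None
```

A segment that falls entirely outside every turn returns `None`, which leaves the caller free to mark it as unattributed instead of guessing.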

2.4 Code-Switching Handling

Hinglish (Hindi-English mix) is common in Indian clinical settings:

| Challenge | Solution |
|---|---|
| Language switching mid-sentence | Whisper's multilingual capability |
| Medical terms in English | Medical vocabulary injection |
| Hindi transcription accuracy | Fine-tuning on Indian clinical audio |
| Regional accents | Accent-aware model fine-tuning |
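
For evaluation or monitoring, mid-sentence switching in a transcript can be measured with a simple script check. The sketch below tags each token as Devanagari (Unicode block U+0900 to U+097F) or Latin and counts the changes. It is an illustration with hypothetical function names, and it will not catch Hindi written in Latin script.

```python
def script_of(token: str) -> str:
    """Classify a token as 'devanagari', 'latin', or 'other' by its characters."""
    devanagari = sum(1 for ch in token if "\u0900" <= ch <= "\u097f")
    latin = sum(1 for ch in token if ch.isascii() and ch.isalpha())
    if devanagari == 0 and latin == 0:
        return "other"  # digits, punctuation, other scripts
    return "devanagari" if devanagari >= latin else "latin"


def count_switches(text: str) -> int:
    """Count script changes between consecutive Devanagari/Latin tokens."""
    scripts = [s for s in map(script_of, text.split()) if s != "other"]
    return sum(1 for prev, cur in zip(scripts, scripts[1:]) if prev != cur)
```

A sentence such as "मरीज को fever है" contains two switches: into English for the medical term and back into Hindi.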

2.5 Audio Preprocessing

| Step | Purpose | Tool |
|---|---|---|
| Noise Reduction | Remove background noise | DeepFilterNet |
| Normalization | Consistent volume | PyDub |
| Segmentation | Split long recordings | Voice Activity Detection |
| Format Conversion | Standardize format | FFmpeg (16 kHz, mono) |
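
The format conversion step maps onto two standard FFmpeg audio options, `-ar` (sample rate) and `-ac` (channel count). The sketch below only builds the command; the function name and file names are placeholders, and running it is left to `subprocess.run` or the job worker.

```python
import shlex


def build_ffmpeg_command(src: str, dst: str, sample_rate: int = 16000,
                         channels: int = 1) -> list[str]:
    """Build an FFmpeg call that resamples audio to 16 kHz mono by default."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", src,                # input file
        "-ar", str(sample_rate),  # output sample rate in Hz
        "-ac", str(channels),     # number of output channels
        dst,
    ]


command = build_ffmpeg_command("encounter.m4a", "encounter.wav")
print(shlex.join(command))
```

Building the command as a list rather than a string avoids shell-quoting problems when file names contain spaces.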

Related use cases:

| Use Case ID | Name |
|---|---|
| CAP-001 | Start Recording / Start Encounter |
| CAP-002 | Stop / Pause Recording |
| PROC-005 | Queue ASR Job |
| PROC-006 | Execute Whisper Engine |
| PROC-011 | Identify Speaker Turns |
| PROC-012 | Generate Encounter Note |

3. Performance Benchmarks

3.1 OCR Accuracy

| Document Type | English | Hindi | Hinglish |
|---|---|---|---|
| Printed Lab Reports | 96% | 88% | 85% |
| Printed Prescriptions | 94% | 85% | 82% |
| Scanned Forms | 92% | 82% | 78% |
| Handwritten Notes | 75% | 65% | 60% |

3.2 ASR Accuracy (Word Error Rate, lower is better)

| Audio Type | English | Hindi | Hinglish |
|---|---|---|---|
| Quiet Room | 6% | 10% | 14% |
| Clinic Background | 9% | 14% | 18% |
| Mobile Recording | 12% | 18% | 22% |
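
For reference, word error rate is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition; it does no text normalization (casing, punctuation, numerals), which a real evaluation would need to decide on.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1] / len(ref)
```

For example, the reference "patient has high fever" against the hypothesis "patient has fever" has one deletion over four reference words, a WER of 25%.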

3.3 Processing Speed

| Task | Target | Current |
|---|---|---|
| OCR per page | <5 seconds | ~3 seconds |
| ASR (real-time factor) | <0.5x | ~0.4x |
| Full encounter processing | <2 minutes | ~90 seconds |
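
The real-time factor is processing time divided by audio duration, so values below 1.0 mean faster than real time. A small sketch of the calculation, with hypothetical function names and the 0.5 target taken from the table:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def meets_target(rtf: float, target: float = 0.5) -> bool:
    """Check a measurement against the <0.5x target."""
    return rtf < target
```

At an RTF of 0.4, a 5-minute recording takes about 2 minutes of ASR compute.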

4. Infrastructure

4.1 GPU Requirements

| Workload | GPU Type | Memory | Count |
|---|---|---|---|
| Whisper Large | A10/A100 | 24 GB | 2+ |
| Tesseract/PaddleOCR | CPU or T4 | 8 GB | 2+ |
| Batch Processing | A100 | 40 GB | 1+ |

4.2 Queue Architecture

Audio Upload → [ASR Queue] → Whisper Workers → [NLP Queue] → SOAP Generation
                    ↓
Document Upload → [OCR Queue] → Tesseract Workers → [Entity Queue] → NER

See: Pipelines & Ingestion for full architecture
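
The staged flow in the diagram can be illustrated with an in-process sketch of the audio path. It uses Python's standard `queue` and `threading` modules as a stand-in for whatever message broker the platform actually runs, which this document does not name. The handler functions are placeholders, not real Whisper or SOAP-generation calls.

```python
import queue
import threading

STOP = object()  # sentinel that tells a worker to shut down


def run_stage(inbox: queue.Queue, outbox: queue.Queue, handler) -> threading.Thread:
    """Start a worker that moves jobs from inbox to outbox through handler."""
    def worker() -> None:
        while True:
            job = inbox.get()
            if job is STOP:
                outbox.put(STOP)  # propagate shutdown downstream
                break
            outbox.put(handler(job))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


# Placeholder handlers; real workers would call Whisper and the SOAP generator.
def fake_transcribe(audio_id: str) -> str:
    return f"transcript({audio_id})"


def fake_soap(transcript: str) -> str:
    return f"soap({transcript})"


asr_queue, nlp_queue, results = queue.Queue(), queue.Queue(), queue.Queue()
workers = [run_stage(asr_queue, nlp_queue, fake_transcribe),
           run_stage(nlp_queue, results, fake_soap)]

asr_queue.put("audio-001")
asr_queue.put(STOP)
for worker in workers:
    worker.join()

note = results.get()
```

Because each stage only knows its inbox and outbox, the OCR path (OCR Queue, Tesseract workers, Entity Queue, NER) could be wired the same way with different handlers.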



Document Owner: AI/ML Engineering Team
Last Updated: 2024-12-09
Next Review: Quarterly (aligned with model releases)