AI & ML Overview¶

Document Purpose: This document provides an overview of AI/ML capabilities in the Entheory.AI platform, including OCR, ASR, NLP, and future AI features.

Executive Summary¶

Entheory.AI uses AI/ML to transform unstructured clinical data (scanned documents, audio recordings, free-text notes) into structured, searchable patient records. Our models are optimized for India-first deployment with Hindi + English support.

Related Documentation:
- OCR & ASR Details – Engine selection and configuration
- Safety & Evaluation – Clinical safety and model validation
- MCP OCR Servers – OCR architecture trade-offs

AI Processing Pipeline¶

sequenceDiagram
    participant U as User/Clinician
    participant API as API Gateway
    participant Q as Job Queue
    participant OCR as OCR Worker
    participant ASR as ASR Worker
    participant NLP as NLP Pipeline
    participant B as Patient Bundle

    rect rgb(255, 240, 245)
        Note over U,B: Document Processing Flow
        U->>API: Upload PDF document
        API->>Q: Enqueue OCR job
        Q->>OCR: Process document
        OCR->>OCR: Language detection
        OCR->>OCR: Tesseract inference
        OCR->>NLP: Extract entities
        NLP->>NLP: NER (medications, diagnoses)
        NLP->>B: Update patient bundle
    end

    rect rgb(240, 255, 245)
        Note over U,B: Audio Processing Flow
        U->>API: Upload audio recording
        API->>Q: Enqueue ASR job
        Q->>ASR: Process audio
        ASR->>ASR: Whisper transcription
        ASR->>ASR: Speaker diarization
        ASR->>NLP: Generate SOAP note
        NLP->>NLP: Structure clinical entities
        NLP->>B: Update patient bundle
    end

    B-->>API: Return updated data
    API-->>U: Display in UI
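
The routing step at the top of both flows can be sketched roughly as follows. This is a minimal illustration, not the platform's gateway code: the extension sets, the `Job` fields, and the in-memory `deque` standing in for the job queue are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from pathlib import Path

# Hypothetical extension lists; the gateway's real routing rules are not documented here.
DOCUMENT_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}
AUDIO_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg"}


@dataclass
class Job:
    kind: str        # "ocr" for documents, "asr" for audio
    filename: str
    patient_id: str


def route_upload(filename: str, patient_id: str) -> Job:
    """Pick the OCR or ASR pipeline from the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in DOCUMENT_EXTENSIONS:
        return Job("ocr", filename, patient_id)
    if suffix in AUDIO_EXTENSIONS:
        return Job("asr", filename, patient_id)
    raise ValueError(f"Unsupported upload type: {filename}")


# An in-memory deque stands in for the real job queue.
queue = deque()
queue.append(route_upload("discharge_summary.pdf", "P-001"))
queue.append(route_upload("consult.WAV", "P-001"))
print([job.kind for job in queue])
```

A production gateway would more likely inspect the MIME type or file signature than trust the extension.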

1. AI/ML Capabilities by Use Case¶

1.1 Processing Pipelines (12 Use Cases)¶

Use Case ID	Name	AI/ML Component
PROC-001	Queue OCR Job	Job orchestration
PROC-002	Detect Document Language	Language detection model
PROC-003	Execute Tesseract Engine	OCR inference
PROC-004	Process OCR Output	Post-processing
PROC-007	Classify Document Type	Document classification
PROC-008	Redact Sensitive Entities	NER for PII detection
PROC-009	Extract Structured Fields	Field extraction
PROC-010	Summarize Document Content	LLM summarization
PROC-005	Queue ASR Job	Job orchestration
PROC-006	Execute Whisper Engine	ASR inference
PROC-011	Identify Speaker Turns	Speaker diarization
PROC-012	Generate Encounter Note	LLM note generation

1.2 NLP/NLU Pipelines (6 Use Cases)¶

Use Case ID	Name	AI/ML Component
NLP-101	Generate Structured SOAP Notes	LLM structuring
NLP-102a	Extract Medications (RxNorm)	Medical NER
NLP-102b	Extract Diagnoses (ICD-10)	Medical NER
NLP-102c	Extract Procedures & Symptoms	Medical NER
NLP-103	Summarization + Noise Filtering	LLM summarization
NLP-104	EMR Field Mapping	Entity linking

1.3 Oncology AI (10+ Use Cases)¶

Use Case ID	Name	AI/ML Component
ONC-001	Extract Tumor Location	Medical NER
ONC-002	Extract Histopathology Findings	Pathology NER
ONC-003	Extract Cancer Stage (TNM)	Staging extraction
ONC-011	Detect RECIST Lesions	Radiology AI
ONC-014	Auto-score Response	RECIST classifier
ONC-040	Parse NGS Reports	Genomics NER
ONC-042	Map to Actionable Therapies	Knowledge graph

1.4 Imaging AI (2 Use Cases)¶

Use Case ID	Name	AI/ML Component
IMG-015	AI Inference Scheduling	Vision model orchestration
ONC-012	Track Lesion Progression	Lesion tracking AI

2. Model Stack¶

2.1 OCR (Optical Character Recognition)¶

Component	Technology	Purpose
Primary Engine	Tesseract 5	Open-source, Hindi + English
Alternative	PaddleOCR	Higher accuracy for complex layouts
Cloud Fallback	Google Cloud Vision	High-confidence fallback for low-quality scans
Language Packs	Hindi, English, Tamil (planned)	Bilingual medical documents

Use Cases: PROC-001 through PROC-004, PROC-007 through PROC-010
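
One way to read the primary / alternative / cloud-fallback arrangement is as an ordered chain that stops at the first engine whose confidence is acceptable. The sketch below assumes that interpretation. `OcrResult`, `run_with_fallback`, the stub engines, and the 0.80 threshold are hypothetical, and no real OCR library is called.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class OcrResult:
    engine: str
    text: str
    mean_confidence: float  # normalised to the range 0.0-1.0


OcrEngine = Callable[[bytes], OcrResult]


def run_with_fallback(image: bytes,
                      engines: Sequence[OcrEngine],
                      min_confidence: float = 0.80) -> OcrResult:
    """Try engines in order and return the first result that meets the
    threshold; if none does, return the highest-confidence result seen."""
    if not engines:
        raise ValueError("at least one OCR engine is required")
    best = None
    for engine in engines:
        result = engine(image)
        if result.mean_confidence >= min_confidence:
            return result
        if best is None or result.mean_confidence > best.mean_confidence:
            best = result
    return best


# Stub engines standing in for a local engine and a cloud fallback.
def stub_local_engine(image: bytes) -> OcrResult:
    return OcrResult("local", "Tab. Metformin 500 mg", 0.62)


def stub_cloud_engine(image: bytes) -> OcrResult:
    return OcrResult("cloud", "Tab. Metformin 500 mg BD", 0.94)


result = run_with_fallback(b"scan-bytes", [stub_local_engine, stub_cloud_engine])
print(result.engine, result.mean_confidence)
```

Ordering the cheapest engine first keeps most pages on local inference and sends only low-quality scans to the paid fallback.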

2.2 ASR (Automatic Speech Recognition)¶

Component	Technology	Purpose
Primary Engine	Whisper (Large V3)	Multi-lingual, code-switching support
Diarization	PyAnnote	Speaker turn identification
Medical Vocabulary	Custom fine-tuning	Medical terminology accuracy
Noise Handling	DeepFilterNet	Audio enhancement pre-processing

Use Cases: PROC-005, PROC-006, CAP-001
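
Transcription and diarization produce two separate timelines that have to be merged before a note can be drafted. A common approach, sketched here under the assumption that both stages emit start and end times in seconds, is to label each transcript segment with the speaker turn it overlaps most. The `Segment` and `Turn` types are illustrative and do not mirror the output classes of Whisper or PyAnnote.

```python
from dataclasses import dataclass


@dataclass
class Segment:          # one transcribed span from the ASR engine
    start: float
    end: float
    text: str


@dataclass
class Turn:             # one speaker turn from the diarizer
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each segment with the speaker whose turn overlaps it the most."""
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labelled.append((best_speaker, seg.text))
    return labelled


segments = [
    Segment(0.0, 2.5, "What brings you in today?"),
    Segment(2.6, 6.0, "I have had a cough for two weeks."),
]
turns = [Turn(0.0, 2.55, "SPEAKER_00"), Turn(2.55, 6.2, "SPEAKER_01")]
for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```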

2.3 NLP/NLU¶

Component	Technology	Purpose
Medical NER	BioBERT / MedSpaCy	Entity extraction (drugs, diagnoses)
Code Mapping	Custom + UMLS	RxNorm, ICD-10, SNOMED linking
Summarization	LLM (GPT-4 / Claude / Gemini)	Document and encounter summaries
SOAP Generation	LLM + Templates	Structured clinical notes
Knowledge-Augmented	LLM + Medical Ontologies	High-accuracy inference on small models (4B)

2.3.1 Strategic Architecture: The Efficiency Frontier¶

Entheory.AI prioritizes Knowledge-Augmented Generation (KAG) to reduce LLM hallucination while maintaining computational efficiency:

Ontology Alignment: Instead of relying on raw 1T+ parameter models, we use specialized 4B-parameter models grounded in medical knowledge graphs (SNOMED CT, ICD-11).
Efficiency Benchmarks: This architecture achieves 88-90% accuracy on clinical tasks, on par with models 100x larger, which enables deployment in low-resource and edge environments.
Medical Hierarchy: Context is provided via structured medical hierarchies rather than simple text retrieval, ensuring clinical relevance and safety.
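
To make the idea of hierarchy-based context concrete, the sketch below builds a prompt from a concept's ancestor path and rejects model output that names a concept outside the ontology. It is an illustration only. The tiny ontology, its identifiers (invented placeholders, not SNOMED CT or ICD codes), and the function names are assumptions, and the production prompt format is not documented here.

```python
# Toy ontology as child -> parent links. All identifiers are invented placeholders.
PARENT = {
    "C:lung_adenocarcinoma": "C:nsclc",
    "C:nsclc": "C:lung_cancer",
    "C:lung_cancer": "C:malignant_neoplasm",
}
LABEL = {
    "C:lung_adenocarcinoma": "Adenocarcinoma of lung",
    "C:nsclc": "Non-small cell lung cancer",
    "C:lung_cancer": "Lung cancer",
    "C:malignant_neoplasm": "Malignant neoplasm",
}


def ancestors(concept_id: str) -> list:
    """Walk parent links from a concept up to the root."""
    chain = []
    current = PARENT.get(concept_id)
    while current is not None:
        chain.append(current)
        current = PARENT.get(current)
    return chain


def build_grounded_prompt(question: str, concept_id: str) -> str:
    """Prefix the question with the concept's root-to-leaf ontology path."""
    if concept_id not in LABEL:
        raise KeyError(f"Unknown concept: {concept_id}")
    path = [concept_id] + ancestors(concept_id)
    hierarchy = " > ".join(LABEL[c] for c in reversed(path))
    return (f"Ontology path: {hierarchy}\n"
            f"Question: {question}\n"
            "Answer using only concepts on this path.")


def is_grounded(proposed_concept_id: str) -> bool:
    """Accept a model-proposed concept only if it exists in the ontology."""
    return proposed_concept_id in LABEL


print(build_grounded_prompt("What is the histologic subtype?", "C:lung_adenocarcinoma"))
```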

Use Cases: NLP-101, ONC-002

2.4 Oncology-Specific¶

Component	Technology	Purpose
TNM Extraction	Rule-based + NER	Cancer staging
Biomarker Analysis	Pattern matching + LLM	IHC panel interpretation
Genomics Parsing	Custom VCF parser	NGS variant extraction
RECIST Scoring	Rule-based	Treatment response assessment

Use Cases: ONC-001 through ONC-062
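
The rule-based half of TNM extraction can be approximated with a regular expression over compact staging strings. The pattern below is a simplified sketch and not the production rule set: it assumes the common `pT2N1M0` style with an optional `c`/`p`/`y` prefix, and it ignores descriptors such as `(m)`, `(sn)`, or `r` prefixes.

```python
import re

# Simplified pattern: optional c/p/y prefix, then T, N and M categories,
# optionally separated by spaces or commas.
TNM_PATTERN = re.compile(
    r"\b(?P<prefix>[cpy]{0,2})"
    r"T(?P<t>[0-4]|is|[Xx])(?P<t_sub>[a-d])?[\s,]*"
    r"N(?P<n>[0-3]|[Xx])(?P<n_sub>[a-c])?[\s,]*"
    r"M(?P<m>[01]|[Xx])(?P<m_sub>[a-c])?\b"
)


def extract_tnm(text: str) -> list:
    """Return one dict per TNM expression found in the text."""
    results = []
    for match in TNM_PATTERN.finditer(text):
        results.append({
            "prefix": match.group("prefix"),
            "T": "T" + match.group("t") + (match.group("t_sub") or ""),
            "N": "N" + match.group("n") + (match.group("n_sub") or ""),
            "M": "M" + match.group("m") + (match.group("m_sub") or ""),
        })
    return results


print(extract_tnm("Final pathological stage: pT2N1M0."))
print(extract_tnm("Post-therapy staging ypT1a N0 M0; earlier cT3, N2, M1a."))
```

Free-text variants such as "T2 with one positive node" fall outside any such pattern, which is presumably where the NER component takes over.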

3. India-Specific Optimizations¶

3.1 Language Support¶

Language	OCR	ASR	NLP	Status
English	✅	✅	✅	Production
Hindi	✅	✅	✅	Production
Hinglish (Code-switch)	✅	✅	🔄	Beta
Tamil	🔄	🔄	🔄	Planned Q2 2025
Telugu	🔄	🔄	🔄	Planned Q3 2025
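
For bilingual documents, a cheap first signal is the share of Devanagari letters in a text sample, which can then select the Tesseract language packs (`eng`, `hin`, or the combined `hin+eng`). The sketch below assumes a text sample is already available, for example from a PDF text layer or a quick first OCR pass. The thresholds are arbitrary illustrative values, and this heuristic cannot detect Hinglish written in Latin script.

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Devanagari block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    devanagari = sum(1 for ch in letters if "\u0900" <= ch <= "\u097f")
    return devanagari / len(letters)


def tesseract_languages(sample: str, low: float = 0.10, high: float = 0.90) -> str:
    """Map the script mix to a Tesseract language string (thresholds are illustrative)."""
    ratio = devanagari_ratio(sample)
    if ratio < low:
        return "eng"
    if ratio > high:
        return "hin"
    return "hin+eng"


print(tesseract_languages("Patient has fever and cough"))
print(tesseract_languages("रोगी को बुखार है"))
print(tesseract_languages("BP सामान्य है, sugar अधिक है"))
```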

3.2 Medical Terminology¶

Drug Names: Mapped to Indian generic brands + RxNorm
Diagnoses: ICD-10 with India-specific codes
Procedures: SNOMED + India-specific procedure codes
Abbreviations: Common Indian clinical abbreviations

Use Cases: IN-ONC-003, IN-ONC-004
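
A minimal sketch of the terminology layer, assuming simple lookup tables: Indian brand names resolve to a generic ingredient, and common abbreviations are expanded before NER runs. The table contents and function names are illustrative. A real deployment would load curated dictionaries and then link the generic name to an RxNorm concept, a step omitted here.

```python
# Illustrative lookup tables; real dictionaries would be curated and far larger.
BRAND_TO_GENERIC = {
    "crocin": "paracetamol",
    "dolo 650": "paracetamol",
    "glycomet": "metformin",
}
ABBREVIATIONS = {
    "bd": "twice daily",
    "od": "once daily",
    "tds": "three times daily",
    "c/o": "complains of",
    "k/c/o": "known case of",
}


def normalize_drug(name: str):
    """Return the generic ingredient for a known brand name, else None."""
    return BRAND_TO_GENERIC.get(name.strip().lower())


def expand_abbreviations(text: str) -> str:
    """Expand known abbreviations token by token, keeping trailing punctuation."""
    expanded = []
    for token in text.split():
        core = token.rstrip(".,;:")
        trailing = token[len(core):]
        replacement = ABBREVIATIONS.get(core.lower())
        expanded.append(replacement + trailing if replacement else token)
    return " ".join(expanded)


print(normalize_drug("Dolo 650"))
print(expand_abbreviations("K/c/o DM, c/o fever. Tab Glycomet BD"))
```

Abbreviations are ambiguous across specialties, so a context-free table like this is only a starting point.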

4. Model Lifecycle¶

4.1 Training & Fine-Tuning¶

Use Case ID	Name	Purpose
ML-001a	Curate Training Dataset	Data preparation
ML-001b	Execute Fine-tuning Run	Model training
ML-002	Dialect Evaluation & Benchmarking	Performance testing
ML-003	Continuous Quality Feedback Loop	RLHF pipeline

4.2 Deployment & Monitoring¶

Use Case ID	Name	Purpose
OPS-302	Monitor Inference Time & Failures	Performance tracking
QAS-001	Record Model Failures	Error tracking
QAS-004	Model Drift Detection	Quality monitoring
OPS-303	Human-in-the-Loop Correction	Feedback capture
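
The document does not say how drift is detected. One widely used option is the Population Stability Index (PSI) computed over model confidence scores, comparing a recent window against a baseline. The sketch below assumes scores in the range 0 to 1, and the function names are hypothetical. The 0.1 and 0.2 cut-offs often quoted for PSI are rules of thumb, not platform settings.

```python
import math


def confidence_histogram(scores, bins: int = 10, eps: float = 1e-6):
    """Bucket scores in [0, 1] into equal-width bins and return proportions.
    Empty bins are floored at eps so the logarithm below stays defined."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = [0] * bins
    for score in scores:
        index = min(max(int(score * bins), 0), bins - 1)
        counts[index] += 1
    return [max(count / len(scores), eps) for count in counts]


def population_stability_index(baseline, current, bins: int = 10) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    expected = confidence_histogram(baseline, bins)
    actual = confidence_histogram(current, bins)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


baseline_scores = [0.9] * 80 + [0.7] * 20   # confidence scores at release time
recent_scores = [0.9] * 40 + [0.5] * 60     # confidence scores this week
psi = population_stability_index(baseline_scores, recent_scores)
print(f"PSI = {psi:.2f}", "-> investigate" if psi > 0.2 else "-> stable")
```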

5. Performance Targets¶

Model	Metric	Target	Current
OCR (English)	Character accuracy	>95%	~93%
OCR (Hindi)	Character accuracy	>85%	~82%
ASR (English)	Word Error Rate	<10%	~8%
ASR (Hindi)	Word Error Rate	<15%	~14%
ASR (Hinglish)	Word Error Rate	<20%	~18%
NER (Medications)	F1 Score	>90%	~88%
NER (Diagnoses)	F1 Score	>85%	~84%
TNM Staging	Accuracy	>90%	~87%
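
For reference, the two text-level metrics in the table are both derived from edit distance: Word Error Rate is the word-level Levenshtein distance divided by the number of reference words, and character accuracy is one minus the character-level equivalent. The sketch below shows the standard definitions. The evaluation harness used to produce the figures above may differ in tokenization and text normalization.

```python
def edit_distance(reference, hypothesis) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_item == hyp_item else 1)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_accuracy(reference: str, hypothesis: str) -> float:
    """1 - character error rate, floored at zero."""
    if not reference:
        raise ValueError("reference must not be empty")
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))


wer = word_error_rate("patient has fever and cough", "patient has a fever and cold")
print(f"WER = {wer:.0%}")                       # one insertion + one substitution over 5 words
print(f"Character accuracy = {character_accuracy('metformin', 'metfonnin'):.1%}")
```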

6. AI Safety & Transparency¶

6.1 Design Principles¶

No Black Box Decisions: All AI outputs require clinician review before action
Confidence Scores: Low-confidence outputs flagged for manual review
Provenance: Every AI-generated field traceable to source document
Audit Trail: All AI inferences logged with model version
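
Taken together, these principles suggest that every AI-generated field carries its confidence, source, and model version alongside the value. One possible record shape is sketched below. The field names, the 0.85 threshold, and the sample identifiers are assumptions, not the platform schema.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

LOW_CONFIDENCE_THRESHOLD = 0.85  # hypothetical cut-off for flagging


@dataclass(frozen=True)
class AiField:
    name: str                 # e.g. "medication"
    value: str
    confidence: float         # model confidence in the range 0.0-1.0
    source_document_id: str   # provenance: which document the value came from
    source_page: int          # provenance: where in that document
    model_name: str
    model_version: str        # audit trail: which model produced the value
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    @property
    def flagged_low_confidence(self) -> bool:
        return self.confidence < LOW_CONFIDENCE_THRESHOLD


def audit_entry(ai_field: AiField) -> dict:
    """Flatten the field into a log-friendly dict, including the review flag."""
    entry = asdict(ai_field)
    entry["flagged_low_confidence"] = ai_field.flagged_low_confidence
    return entry


extracted = AiField("medication", "metformin 500 mg", 0.78,
                    "doc-123", 2, "medical-ner", "1.4.0")
print(audit_entry(extracted))
```

Because the dataclass is frozen, a clinician's correction would be stored as a new record rather than overwriting the AI output, which keeps the audit trail intact.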

6.2 Human-in-the-Loop¶

Stage	AI Role	Human Role
OCR/ASR	Generate text	Review low-confidence segments
NER	Suggest entities	Confirm/correct before EMR push
SOAP Notes	Draft structure	Approve before finalization
Alerts	Flag potential issues	Acknowledge and act
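
The NER row above implies a gate: a suggested entity cannot reach the EMR until a clinician has confirmed or corrected it. A minimal sketch of that gate follows, with hypothetical status names, placeholder codes, and a plain list standing in for the EMR.

```python
from dataclasses import dataclass
from typing import Optional

REVIEWED_STATUSES = {"confirmed", "corrected"}


@dataclass
class EntitySuggestion:
    text: str
    code: str                          # placeholder code value, not a real ICD/RxNorm code
    status: str = "suggested"          # suggested -> confirmed | corrected
    reviewed_by: Optional[str] = None


def confirm(suggestion: EntitySuggestion, clinician_id: str) -> None:
    """Clinician accepts the AI suggestion as-is."""
    suggestion.status = "confirmed"
    suggestion.reviewed_by = clinician_id


def correct(suggestion: EntitySuggestion, clinician_id: str, new_code: str) -> None:
    """Clinician replaces the suggested code with the right one."""
    suggestion.code = new_code
    suggestion.status = "corrected"
    suggestion.reviewed_by = clinician_id


def push_to_emr(suggestion: EntitySuggestion, emr: list) -> None:
    """Refuse to write anything a clinician has not reviewed."""
    if suggestion.status not in REVIEWED_STATUSES:
        raise PermissionError("Entity must be confirmed or corrected before EMR push")
    emr.append((suggestion.text, suggestion.code, suggestion.reviewed_by))


emr_records = []
suggestion = EntitySuggestion("type 2 diabetes", "CODE-A")
try:
    push_to_emr(suggestion, emr_records)
except PermissionError as error:
    print("Blocked:", error)
correct(suggestion, "dr-rao", "CODE-B")
push_to_emr(suggestion, emr_records)
print(emr_records)
```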

See: Safety & Evaluation for detailed safety protocols

7. Future Roadmap¶

Phase 1: Current (MVP)¶

✅ OCR (English + Hindi)
✅ ASR (English + Hindi)
✅ Basic NER (medications, diagnoses)
✅ SOAP note generation

Phase 2: Near-Term (6-12 months)¶

🔄 Regional language support (Tamil, Telugu)
🔄 Improved Hinglish handling
🔄 Document summarization
🔄 Clinical trial eligibility screening

Phase 3: Future (12-24 months)¶

📋 Cohort selection for research
📋 Predictive analytics (outcome forecasting)
📋 Radiology AI (lesion detection)
📋 Drug interaction prediction

Document Owner: AI/ML Engineering Team
Last Updated: 2024-12-09
Next Review: Quarterly (aligned with model releases)