AI & ML Overview
Document Purpose: This document provides an overview of AI/ML capabilities in the Entheory.AI platform, including OCR, ASR, NLP, and future AI features.
Executive Summary
Entheory.AI uses AI/ML to transform unstructured clinical data (scanned documents, audio recordings, free-text notes) into structured, searchable patient records. The models are tuned for India-first deployment, with Hindi and English supported in production.
Related Documentation:
- OCR & ASR Details – Engine selection and configuration
- Safety & Evaluation – Clinical safety and model validation
- MCP OCR Servers – OCR architecture trade-offs
AI Processing Pipeline
sequenceDiagram
participant U as User/Clinician
participant API as API Gateway
participant Q as Job Queue
participant OCR as OCR Worker
participant ASR as ASR Worker
participant NLP as NLP Pipeline
participant B as Patient Bundle
rect rgb(255, 240, 245)
Note over U,B: Document Processing Flow
U->>API: Upload PDF document
API->>Q: Enqueue OCR job
Q->>OCR: Process document
OCR->>OCR: Language detection
OCR->>OCR: Tesseract inference
OCR->>NLP: Extract entities
NLP->>NLP: NER (medications, diagnoses)
NLP->>B: Update patient bundle
end
rect rgb(240, 255, 245)
Note over U,B: Audio Processing Flow
U->>API: Upload audio recording
API->>Q: Enqueue ASR job
Q->>ASR: Process audio
ASR->>ASR: Whisper transcription
ASR->>ASR: Speaker diarization
ASR->>NLP: Generate SOAP note
NLP->>NLP: Structure clinical entities
NLP->>B: Update patient bundle
end
B-->>API: Return updated data
API-->>U: Display in UI
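The two flows in the diagram share one shape: a queued job goes to an engine-specific worker, the resulting text goes through NLP, and the extracted entities land in the patient bundle. The sketch below illustrates that routing only. Every name in it (`Job`, `PatientBundle`, `run_ocr`, `run_asr`, `extract_entities`) is hypothetical, and the engines are stubbed rather than calling Tesseract or Whisper.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    kind: str        # "ocr" for documents, "asr" for audio
    patient_id: str
    payload: str     # stand-in for the uploaded file contents


@dataclass
class PatientBundle:
    patient_id: str
    entities: list = field(default_factory=list)


def run_ocr(payload: str) -> str:
    # Placeholder for language detection + OCR inference.
    return payload


def run_asr(payload: str) -> str:
    # Placeholder for transcription + speaker diarization.
    return payload


def extract_entities(text: str) -> list:
    # Placeholder NER: a real pipeline would call a medical NER model.
    known = {"metformin": "medication", "diabetes": "diagnosis"}
    return [(w, known[w]) for w in text.lower().split() if w in known]


WORKERS = {"ocr": run_ocr, "asr": run_asr}


def process_queue(queue: deque, bundles: dict) -> None:
    """Drain the queue: worker -> NLP -> patient bundle, as in the diagram."""
    while queue:
        job = queue.popleft()
        text = WORKERS[job.kind](job.payload)
        bundle = bundles.setdefault(job.patient_id, PatientBundle(job.patient_id))
        bundle.entities.extend(extract_entities(text))


queue = deque([
    Job("ocr", "p1", "Diabetes follow-up"),
    Job("asr", "p1", "continue metformin"),
])
bundles = {}
process_queue(queue, bundles)
```

Keeping the worker lookup in a single table means a new modality would only need a new entry in `WORKERS`; the NLP and bundle steps stay shared.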
1. AI/ML Capabilities by Use Case
1.1 Processing Pipelines (12 Use Cases)
| Use Case ID | Name | AI/ML Component |
| --- | --- | --- |
| PROC-001 | Queue OCR Job | Job orchestration |
| PROC-002 | Detect Document Language | Language detection model |
| PROC-003 | Execute Tesseract Engine | OCR inference |
| PROC-004 | Process OCR Output | Post-processing |
| PROC-007 | Classify Document Type | Document classification |
| PROC-008 | Redact Sensitive Entities | NER for PII detection |
| PROC-009 | Extract Structured Fields | Field extraction |
| PROC-010 | Summarize Document Content | LLM summarization |
| PROC-005 | Queue ASR Job | Job orchestration |
| PROC-006 | Execute Whisper Engine | ASR inference |
| PROC-011 | Identify Speaker Turns | Speaker diarization |
| PROC-012 | Generate Encounter Note | LLM note generation |
1.2 NLP/NLU Pipelines (6 Use Cases)
| Use Case ID | Name | AI/ML Component |
| --- | --- | --- |
| NLP-101 | Generate Structured SOAP Notes | LLM structuring |
| NLP-102a | Extract Medications (RxNorm) | Medical NER |
| NLP-102b | Extract Diagnoses (ICD-10) | Medical NER |
| NLP-102c | Extract Procedures & Symptoms | Medical NER |
| NLP-103 | Summarization + Noise Filtering | LLM summarization |
| NLP-104 | EMR Field Mapping | Entity linking |
1.3 Oncology AI (10+ Use Cases)
| Use Case ID | Name | AI/ML Component |
| --- | --- | --- |
| ONC-001 | Extract Tumor Location | Medical NER |
| ONC-002 | Extract Histopathology Findings | Pathology NER |
| ONC-003 | Extract Cancer Stage (TNM) | Staging extraction |
| ONC-011 | Detect RECIST Lesions | Radiology AI |
| ONC-014 | Auto-score Response | RECIST classifier |
| ONC-040 | Parse NGS Reports | Genomics NER |
| ONC-042 | Map to Actionable Therapies | Knowledge graph |
1.4 Imaging AI (2 Use Cases)
| Use Case ID | Name | AI/ML Component |
| --- | --- | --- |
| IMG-015 | AI Inference Scheduling | Vision model orchestration |
| ONC-012 | Track Lesion Progression | Lesion tracking AI |
2. Model Stack
2.1 OCR (Optical Character Recognition)
| Component | Technology | Purpose |
| --- | --- | --- |
| Primary Engine | Tesseract 5 | Open-source, Hindi + English |
| Alternative | PaddleOCR | Higher accuracy for complex layouts |
| Cloud Fallback | Google Cloud Vision | High-confidence fallback for low-quality scans |
| Language Packs | Hindi, English, Tamil (planned) | Bilingual medical documents |
Use Cases: PROC-001 through PROC-004, PROC-007 through PROC-010
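One way to read the primary / alternative / cloud-fallback split is as a confidence-gated cascade: try the cheapest engine first and escalate only when its confidence is too low. The sketch below assumes that reading. The `Engine` callable shape, the `ocr_with_fallback` name, and the 0.80 threshold are illustrative assumptions, and the engines are stubs rather than real Tesseract, PaddleOCR, or Cloud Vision calls.

```python
from typing import Callable, List, Tuple

# Each engine takes image bytes and returns (text, mean confidence in [0, 1]).
Engine = Callable[[bytes], Tuple[str, float]]


def ocr_with_fallback(
    image: bytes,
    engines: List[Tuple[str, Engine]],
    min_confidence: float = 0.80,  # assumed threshold, not from the source
) -> Tuple[str, str, float]:
    """Try engines in order; return (engine name, text, confidence)."""
    if not engines:
        raise ValueError("at least one engine is required")
    best = None
    for name, engine in engines:
        text, conf = engine(image)
        if best is None or conf > best[2]:
            best = (name, text, conf)
        if conf >= min_confidence:
            break  # good enough: skip slower or paid engines
    return best


# Stub engines standing in for the real ones.
def fake_tesseract(image: bytes) -> Tuple[str, float]:
    return ("Rx: Metf0rmin 500mg", 0.62)


def fake_paddle(image: bytes) -> Tuple[str, float]:
    return ("Rx: Metformin 500mg", 0.91)


def fake_cloud(image: bytes) -> Tuple[str, float]:
    raise AssertionError("cloud fallback should not be reached here")


result = ocr_with_fallback(
    b"scan-bytes",
    [("tesseract", fake_tesseract), ("paddleocr", fake_paddle), ("cloud_vision", fake_cloud)],
)
```

If no engine clears the threshold, the function still returns the best attempt, so the caller can route it to manual review instead of dropping the document.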
2.2 ASR (Automatic Speech Recognition)
| Component | Technology | Purpose |
| --- | --- | --- |
| Primary Engine | Whisper (Large V3) | Multi-lingual, code-switching support |
| Diarization | PyAnnote | Speaker turn identification |
| Medical Vocabulary | Custom fine-tuning | Medical terminology accuracy |
| Noise Handling | DeepFilterNet | Audio enhancement pre-processing |
Use Cases: PROC-005, PROC-006, PROC-011, CAP-001
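Transcription and diarization each produce time-stamped output, so the two have to be merged before a transcript can say who spoke. A common approach, sketched here as an assumption about how that merge could work, labels each transcript segment with the speaker whose turn overlaps it the most. The tuple layouts and the `assign_speakers` name are hypothetical and do not mirror the Whisper or PyAnnote output formats.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, text) from the ASR engine
Turn = Tuple[float, float, str]     # (start_s, end_s, speaker) from diarization


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    segments: List[Segment], turns: List[Turn]
) -> List[Tuple[Optional[str], str]]:
    """Label each segment with the speaker whose turn overlaps it most."""
    labelled = []
    for start, end, text in segments:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            ov = overlap(start, end, t_start, t_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        labelled.append((best_speaker, text))  # None if no turn overlaps
    return labelled


segments = [
    (0.0, 2.5, "Any fever since yesterday?"),
    (2.6, 4.0, "Haan, thoda sa."),
    (4.1, 6.0, "Okay, continue paracetamol."),
]
turns = [
    (0.0, 2.55, "SPEAKER_00"),
    (2.55, 4.05, "SPEAKER_01"),
    (4.05, 6.0, "SPEAKER_00"),
]
labelled = assign_speakers(segments, turns)
```

Segments that overlap no turn keep a `None` speaker, which gives the review step something explicit to flag.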
2.3 NLP/NLU
| Component | Technology | Purpose |
| --- | --- | --- |
| Medical NER | BioBERT / MedSpaCy | Entity extraction (drugs, diagnoses) |
| Code Mapping | Custom + UMLS | RxNorm, ICD-10, SNOMED linking |
| Summarization | LLM (GPT-4 / Claude / Gemini) | Document and encounter summaries |
| SOAP Generation | LLM + Templates | Structured clinical notes |
| Knowledge-Augmented | LLM + Medical Ontologies | High-accuracy inference on small models (4B) |
2.5 Strategic Architecture: The Efficiency Frontier
Entheory.AI prioritizes Knowledge-Augmented Generation (KAG) to reduce LLM hallucination while keeping compute costs low:
- Ontology Alignment: Instead of relying on raw 1T+ parameter models, we use specialized 4B-parameter models grounded in medical knowledge graphs (SNOMED CT, ICD-11).
- Efficiency Benchmarks: This architecture achieves 88-90% accuracy on clinical tasks, on par with models 100x larger, which makes deployment in low-resource and edge environments practical.
- Medical Hierarchy: Context is supplied as structured medical hierarchies rather than plain text retrieval, to improve clinical relevance and safety.
Use Cases: NLP-101, ONC-002
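To make "context via structured medical hierarchies" concrete, the sketch below walks a toy is-a hierarchy and places the ancestor chain in front of the question. It is an illustration under stated assumptions: the `PARENT` table is a hand-written stand-in with plain labels rather than real SNOMED CT or ICD codes, and `ancestor_path` and `build_grounded_prompt` are hypothetical helper names.

```python
from typing import Dict, List, Optional

# Toy is-a hierarchy: child -> parent. Labels are illustrative, not real codes.
PARENT: Dict[str, Optional[str]] = {
    "Invasive ductal carcinoma": "Carcinoma of breast",
    "Carcinoma of breast": "Malignant neoplasm of breast",
    "Malignant neoplasm of breast": "Malignant neoplasm",
    "Malignant neoplasm": None,
}


def ancestor_path(concept: str) -> List[str]:
    """Return the concept followed by its ancestors, most specific first."""
    if concept not in PARENT:
        raise KeyError(f"unknown concept: {concept}")
    path: List[str] = []
    current: Optional[str] = concept
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def build_grounded_prompt(question: str, concept: str) -> str:
    """Prefix the question with the concept's hierarchy, broadest term first."""
    hierarchy = " > ".join(reversed(ancestor_path(concept)))
    return (
        f"Ontology context: {hierarchy}\n"
        "Answer using only concepts from this hierarchy.\n"
        f"Question: {question}"
    )


prompt = build_grounded_prompt(
    "What staging information is reported?", "Invasive ductal carcinoma"
)
```

Unlike free-text retrieval, the context here is a fixed path through the ontology, so the same concept always yields the same grounding.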
2.4 Oncology-Specific
| Component | Technology | Purpose |
| --- | --- | --- |
| TNM Extraction | Rule-based + NER | Cancer staging |
| Biomarker Analysis | Pattern matching + LLM | IHC panel interpretation |
| Genomics Parsing | Custom VCF parser | NGS variant extraction |
| RECIST Scoring | Rule-based | Treatment response assessment |
Use Cases: ONC-001 through ONC-062
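Rule-based RECIST scoring comes down to a few threshold checks on the sum of target-lesion diameters. The sketch below uses the RECIST 1.1 target-lesion thresholds as they are commonly summarized: at least a 30% decrease from baseline for partial response, and at least a 20% increase plus at least 5 mm absolute increase from the nadir for progression. It is deliberately simplified: non-target lesions and the lymph-node short-axis rule are ignored, and the function name and signature are assumptions rather than the platform's scorer.

```python
def recist_target_response(
    baseline_sum_mm: float,
    nadir_sum_mm: float,
    current_sum_mm: float,
    new_lesions: bool = False,
) -> str:
    """Classify target-lesion response as CR, PR, SD, or PD (simplified)."""
    if baseline_sum_mm <= 0:
        raise ValueError("baseline sum of diameters must be positive")
    if nadir_sum_mm < 0 or current_sum_mm < 0:
        raise ValueError("sums of diameters cannot be negative")

    if new_lesions:
        return "PD"  # any new lesion counts as progression
    if current_sum_mm == 0:
        return "CR"  # all target lesions have disappeared

    # Progression is measured against the smallest sum seen so far (nadir).
    increase = current_sum_mm - nadir_sum_mm
    if increase >= 5 and increase >= 0.20 * nadir_sum_mm:
        return "PD"

    # Partial response is measured against the baseline sum.
    if baseline_sum_mm - current_sum_mm >= 0.30 * baseline_sum_mm:
        return "PR"
    return "SD"


print(recist_target_response(100, 100, 65))  # PR: 35% below baseline
print(recist_target_response(100, 60, 75))   # PD: +15 mm and +25% from nadir
print(recist_target_response(100, 90, 85))   # SD: neither threshold met
```

Progression is checked before partial response because a patient can remain more than 30% below baseline while having regrown at least 20% from the nadir.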
3. India-Specific Optimizations
3.1 Language Support
| Language | OCR | ASR | NLP | Status |
| --- | --- | --- | --- | --- |
| English | ✅ | ✅ | ✅ | Production |
| Hindi | ✅ | ✅ | ✅ | Production |
| Hinglish (Code-switch) | ✅ | ✅ | 🔄 | Beta |
| Tamil | 🔄 | 🔄 | 🔄 | Planned Q2 2025 |
| Telugu | 🔄 | 🔄 | 🔄 | Planned Q3 2025 |
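For mixed Hindi and English text, a cheap first signal is the share of letters that fall in the Devanagari Unicode block (U+0900 to U+097F). The sketch below uses that ratio on first-pass text to choose a Tesseract language setting, where `eng` and `hin` are Tesseract's language codes and `+` combines them. The thresholds and function names are assumptions, and the heuristic cannot detect Hinglish written in Latin script, which would need a model-based detector.

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are in the Devanagari block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    devanagari = sum(1 for ch in letters if "\u0900" <= ch <= "\u097F")
    return devanagari / len(letters)


def tesseract_lang(text: str, low: float = 0.10, high: float = 0.90) -> str:
    """Pick a Tesseract language string; thresholds are illustrative."""
    ratio = devanagari_ratio(text)
    if ratio >= high:
        return "hin"
    if ratio <= low:
        return "eng"
    return "hin+eng"  # mixed-script page: load both language packs


english = tesseract_lang("Patient has fever")
hindi = tesseract_lang("मरीज को बुखार है")
mixed = tesseract_lang("BP normal है, दवा continue करें")
```

Loading both packs only for mixed pages avoids the extra inference cost on single-language documents.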
3.2 Medical Terminology
- Drug Names: Mapped to Indian generic brands + RxNorm
- Diagnoses: ICD-10 with India-specific codes
- Procedures: SNOMED + India-specific procedure codes
- Abbreviations: Common Indian clinical abbreviations
Use Cases: IN-ONC-003, IN-ONC-004
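A dictionary pass is the simplest form of this normalization: Indian brand names resolve to a generic ingredient, and local abbreviations expand to full phrases before code mapping. The tables below are tiny hand-written examples, not the platform's vocabularies, and `normalize_note` is a hypothetical helper. No RxNorm or ICD codes are shown because assigning them would require the real terminology services.

```python
# Hypothetical lookup tables; a real deployment would load curated vocabularies.
BRAND_TO_GENERIC = {
    "crocin": "paracetamol",
    "dolo": "paracetamol",
    "glycomet": "metformin",
}
ABBREVIATIONS = {
    "od": "once daily",
    "bd": "twice daily",
    "tds": "three times daily",
    "c/o": "complains of",
}


def normalize_note(text: str) -> str:
    """Replace known brand names and abbreviations, token by token."""
    out = []
    for token in text.split():
        core = token.strip(".,;:").lower()
        if core in BRAND_TO_GENERIC:
            out.append(BRAND_TO_GENERIC[core])
        elif core in ABBREVIATIONS:
            out.append(ABBREVIATIONS[core])
        else:
            out.append(token)  # unknown tokens pass through unchanged
    return " ".join(out)


normalized = normalize_note("c/o fever, Tab Crocin 500mg BD")
```

Here `normalized` is `"complains of fever, Tab paracetamol 500mg twice daily"`. Punctuation attached to a replaced token is dropped, one of several simplifications a production normalizer would need to handle.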
4. Model Lifecycle
4.1 Training & Fine-Tuning
| Use Case ID | Name | Purpose |
| --- | --- | --- |
| ML-001a | Curate Training Dataset | Data preparation |
| ML-001b | Execute Fine-tuning Run | Model training |
| ML-002 | Dialect Evaluation & Benchmarking | Performance testing |
| ML-003 | Continuous Quality Feedback Loop | RLHF pipeline |
4.2 Deployment & Monitoring
| Use Case ID | Name | Purpose |
| --- | --- | --- |
| OPS-302 | Monitor Inference Time & Failures | Performance tracking |
| QAS-001 | Record Model Failures | Error tracking |
| QAS-004 | Model Drift Detection | Quality monitoring |
| OPS-303 | Human-in-the-Loop Correction | Feedback capture |
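Drift detection can start from something as simple as comparing the distribution of model confidence scores in a recent window against a reference window. The sketch below computes the Population Stability Index (PSI) over fixed bins. This is one common technique rather than a description of how QAS-004 is implemented, and the bin count and smoothing constant are assumptions.

```python
import math
from typing import List, Sequence


def histogram(scores: Sequence[float], bins: int = 10) -> List[float]:
    """Share of scores per equal-width bin over [0, 1]."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = [0] * bins
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
        counts[min(int(s * bins), bins - 1)] += 1  # 1.0 goes in the last bin
    return [c / len(scores) for c in counts]


def psi(
    baseline: Sequence[float],
    recent: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,  # smoothing so empty bins do not produce log(0)
) -> float:
    """Population Stability Index between two score samples."""
    expected = histogram(baseline, bins)
    actual = histogram(recent, bins)
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for a, e in zip(actual, expected)
    )


baseline_scores = [0.95] * 50 + [0.85] * 50
shifted_scores = [0.55] * 50 + [0.65] * 50
drift = psi(baseline_scores, shifted_scores)
```

Identical samples give a PSI of 0. A frequently quoted rule of thumb treats values above roughly 0.25 as a significant shift, but the alerting threshold should be calibrated on real traffic.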
5. Performance Metrics
| Model | Metric | Target | Current |
| --- | --- | --- | --- |
| OCR (English) | Character accuracy | >95% | ~93% |
| OCR (Hindi) | Character accuracy | >85% | ~82% |
| ASR (English) | Word Error Rate | <10% | ~8% |
| ASR (Hindi) | Word Error Rate | <15% | ~14% |
| ASR (Hinglish) | Word Error Rate | <20% | ~18% |
| NER (Medications) | F1 Score | >90% | ~88% |
| NER (Diagnoses) | F1 Score | >85% | ~84% |
| TNM Staging | Accuracy | >90% | ~87% |
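Word Error Rate is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words: WER = (substitutions + deletions + insertions) / N. The sketch below implements that standard definition with a dynamic-programming edit distance. The function name and the whitespace tokenization are simplifying assumptions; real evaluations usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("patient has high fever", "patient has fever")
```

Here `wer` is 0.25: one deleted word out of four. Character accuracy for OCR can be computed the same way over characters instead of words, as 1 minus the character error rate.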
6. AI Safety & Transparency
6.1 Design Principles
- No Black Box Decisions: All AI outputs require clinician review before action
- Confidence Scores: Low-confidence outputs flagged for manual review
- Provenance: Every AI-generated field traceable to source document
- Audit Trail: All AI inferences logged with model version
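These principles suggest a record that carries its confidence, its source, and the model version alongside every extracted value. The sketch below shows one possible layout. The `AIField` class, its field names, the 0.85 review threshold, and the sample identifiers are all hypothetical and are not taken from the platform's schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AIField:
    name: str                 # e.g. "medication"
    value: str
    confidence: float         # model confidence in [0, 1]
    source_document_id: str   # provenance: where the value came from
    source_page: int
    model_name: str
    model_version: str        # logged for the audit trail

    def flagged_low_confidence(self, threshold: float = 0.85) -> bool:
        """True when the field should be prioritized for manual review."""
        return self.confidence < threshold


def audit_line(field: AIField) -> str:
    """Serialize one inference as a JSON line for an append-only audit log."""
    return json.dumps(asdict(field), sort_keys=True)


field = AIField(
    name="medication",
    value="metformin 500 mg",
    confidence=0.62,
    source_document_id="doc-123",     # placeholder identifier
    source_page=2,
    model_name="example-ner",         # placeholder model name
    model_version="0.0.0-example",    # placeholder version
)
line = audit_line(field)
```

Making the record immutable (`frozen=True`) means a clinician correction would be stored as a new record rather than an overwrite, which keeps the original inference traceable.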
6.2 Human-in-the-Loop
| Stage | AI Role | Human Role |
| --- | --- | --- |
| OCR/ASR | Generate text | Review low-confidence segments |
| NER | Suggest entities | Confirm/correct before EMR push |
| SOAP Notes | Draft structure | Approve before finalization |
| Alerts | Flag potential issues | Acknowledge and act |
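The "confirm/correct before EMR push" rule can be enforced in code as a gate that refuses to export anything a clinician has not reviewed. The sketch below models that gate with a small status enum. The class and function names are hypothetical, and `push_to_emr` only builds the records it would send instead of calling a real EMR.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ReviewStatus(Enum):
    SUGGESTED = "suggested"   # produced by the model, not yet reviewed
    CONFIRMED = "confirmed"   # accepted as-is by a clinician
    CORRECTED = "corrected"   # changed by a clinician


@dataclass
class EntitySuggestion:
    text: str
    code: str
    status: ReviewStatus = ReviewStatus.SUGGESTED
    reviewed_by: Optional[str] = None

    def confirm(self, clinician_id: str) -> None:
        self.status = ReviewStatus.CONFIRMED
        self.reviewed_by = clinician_id

    def correct(self, clinician_id: str, new_code: str) -> None:
        self.code = new_code
        self.status = ReviewStatus.CORRECTED
        self.reviewed_by = clinician_id


def push_to_emr(entities: List[EntitySuggestion]) -> List[dict]:
    """Return export records; refuse if any entity is still unreviewed."""
    pending = [e.text for e in entities if e.status is ReviewStatus.SUGGESTED]
    if pending:
        raise PermissionError(f"unreviewed entities: {pending}")
    return [
        {"text": e.text, "code": e.code, "reviewed_by": e.reviewed_by}
        for e in entities
    ]


suggestions = [
    EntitySuggestion("type 2 diabetes", "E11.9"),
    EntitySuggestion("hypertension", "I15.0"),
]
suggestions[0].confirm("dr-001")
suggestions[1].correct("dr-001", "I10")  # clinician fixes the suggested code
records = push_to_emr(suggestions)
```

Raising instead of silently skipping unreviewed entities makes a partial push impossible, which is the safer default for clinical data.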
See: Safety & Evaluation for detailed safety protocols
7. Future Roadmap
Phase 1: Current (MVP)
- ✅ OCR (English + Hindi)
- ✅ ASR (English + Hindi)
- ✅ Basic NER (medications, diagnoses)
- ✅ SOAP note generation
Phase 2: Near-Term (6-12 months)
- 🔄 Regional language support (Tamil, Telugu)
- 🔄 Improved Hinglish handling
- 🔄 Document summarization
- 🔄 Clinical trial eligibility screening
Phase 3: Future (12-24 months)
- 📋 Cohort selection for research
- 📋 Predictive analytics (outcome forecasting)
- 📋 Radiology AI (lesion detection)
- 📋 Drug interaction prediction
Document Owner: AI/ML Engineering Team
Last Updated: 2024-12-09
Next Review: Quarterly (aligned with model releases)