Safety & Evaluation
Document Purpose: Defines how Entheory.AI ensures clinical safety, validates AI model outputs, and maintains human oversight of all AI-generated content.
1. Safety Design Principles
1.1 Core Principles
| Principle | Description |
|---|---|
| No Autonomous Decisions | AI suggests, humans decide. No AI output directly affects patient care without clinician review. |
| Transparency | All AI outputs show confidence scores and source provenance. |
| Fail-Safe | Low-confidence outputs are routed to manual review queues. |
| Auditability | Every AI inference is logged with model version, timestamp, and input/output. |
| Reversibility | All AI-generated content can be edited or rejected by clinicians. |
1.2 Human-in-the-Loop Design
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  AI Model   │ ──▶ │ Confidence  │ ──▶ │   Review    │ ──▶ │ EMR/Action  │
│  Inference  │     │ Assessment  │     │   Queue     │     │ (Approved)  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                           │
                           ▼
                   High Confidence?
                   ├── Yes: Fast-track (still needs approval)
                   └── No: Manual Review Required
```
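A minimal sketch of this gate in Python, assuming a simple in-memory queue; the `InferenceResult` shape and the 0.70 cut-off are illustrative, not the production contract. The point it encodes: high confidence only reorders the queue, it never grants approval.

```python
from dataclasses import dataclass

@dataclass
class InferenceResult:
    """Illustrative AI output envelope (field names are assumptions)."""
    content: str
    confidence: float         # 0.0-1.0, from the confidence assessment stage
    fast_track: bool = False  # high confidence: surfaced first, still unapproved
    approved: bool = False    # only a clinician review can flip this

def route_to_review(result: InferenceResult, queue: list, threshold: float = 0.70) -> None:
    """Every output enters the review queue; nothing bypasses human approval."""
    result.fast_track = result.confidence >= threshold
    if result.fast_track:
        queue.insert(0, result)  # fast-track items jump the queue, nothing more
    else:
        queue.append(result)

queue: list[InferenceResult] = []
route_to_review(InferenceResult("Paclitaxel 175mg/m²", confidence=0.92), queue)
route_to_review(InferenceResult("Cisplatin 80mg/m²", confidence=0.55), queue)
assert all(not r.approved for r in queue)  # clinician decision is still pending
```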
2. Validation Framework
2.1 Input Validation
| Check | Purpose | Details |
|---|---|---|
| File Type | Reject executables, unexpected formats | API upload |
| Size Limits | Prevent resource exhaustion | 10 MB per document, 60 min audio |
| Language Detection | Route to correct model | OCR/ASR preprocessing |
| Quality Assessment | Flag low-quality inputs | Blurry scans, noisy audio |
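A sketch of the file-type and size checks, assuming uploads arrive as a filename plus byte size and (for audio) a duration; the allow-list is illustrative, while the 10 MB / 60 min limits come from the table.

```python
import os

ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".tiff", ".wav", ".mp3"}  # illustrative allow-list
MAX_DOCUMENT_BYTES = 10 * 1024 * 1024   # 10 MB per document
MAX_AUDIO_SECONDS = 60 * 60             # 60 min audio

def validate_upload(filename: str, size_bytes: int, audio_seconds: float | None = None) -> list[str]:
    """Return a list of rejection reasons; an empty list means the upload passes."""
    errors = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        errors.append(f"unexpected file type: {ext or 'none'}")
    if size_bytes > MAX_DOCUMENT_BYTES:
        errors.append(f"document exceeds 10 MB limit ({size_bytes} bytes)")
    if audio_seconds is not None and audio_seconds > MAX_AUDIO_SECONDS:
        errors.append(f"audio exceeds 60 min limit ({audio_seconds / 60:.0f} min)")
    return errors

print(validate_upload("scan.exe", 2_048))            # rejected: executable
print(validate_upload("chemo_order.pdf", 400_000))   # passes: []
```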
2.2 Output Validation
| Layer | Validation | Action if Failed |
|---|---|---|
| Schema | FHIR R4 compliance | Reject, log error |
| Clinical | Medical code validity (ICD-10, RxNorm) | Flag for review |
| Confidence | Below threshold (e.g., <70%) | Route to manual queue |
| Consistency | Cross-check with existing patient data | Highlight discrepancies |
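A sketch of the layers applied in order, collecting the resulting actions. The actions follow the table; the FHIR and code-validity checks are stubbed, since the real validators are external services.

```python
def check_schema(resource: dict) -> bool:
    # Stub: a real implementation would validate against FHIR R4 with a FHIR library.
    return resource.get("resourceType") is not None

def check_codes(resource: dict) -> bool:
    # Stub: a real implementation would look codes up in ICD-10 / RxNorm terminology services.
    return bool(resource.get("code"))

def validate_output(resource: dict, confidence: float, threshold: float = 0.70) -> list[str]:
    """Apply the validation layers in order and collect the resulting actions."""
    actions = []
    if not check_schema(resource):
        return ["reject_and_log"]            # schema failure short-circuits everything else
    if not check_codes(resource):
        actions.append("flag_for_review")
    if confidence < threshold:
        actions.append("route_to_manual_queue")
    # The consistency layer would cross-check existing patient data here (omitted).
    return actions or ["pass"]

print(validate_output({"resourceType": "MedicationRequest", "code": "RxNorm:56946"}, 0.92))
print(validate_output({"resourceType": "MedicationRequest"}, 0.55))
```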
2.3 Use Case Links
| Use Case ID | Name | Validation Type |
|---|---|---|
| QAS-001 | Record Model Failures | Error tracking |
| QAS-002 | Perform Clinical Safety Review | Human review |
| QAS-003 | Track Audit Violations | Compliance |
| QAS-004 | Model Drift Detection | Quality monitoring |
3. Model Evaluation
3.1 Accuracy Metrics
| Model | Metric | Target | Measurement |
|---|---|---|---|
| OCR | Character Accuracy | >90% | Edit distance vs. ground truth |
| ASR | Word Error Rate (WER) | <15% | Levenshtein distance |
| NER | F1 Score | >85% | Precision/recall on labeled data |
| Classification | Accuracy | >90% | Confusion matrix analysis |
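Both the OCR and ASR targets reduce to the same edit-distance computation, over characters for character accuracy and over tokens for WER. A minimal sketch (production evaluation would likely use established tooling, but the arithmetic is just this):

```python
def levenshtein(ref: list, hyp: list) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution (free if equal)
    return dp[-1]

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

def char_accuracy(reference: str, hypothesis: str) -> float:
    return 1 - levenshtein(list(reference), list(hypothesis)) / len(reference)

print(f"WER: {word_error_rate('start paclitaxel today', 'start paclitaxel to day'):.2f}")
```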
3.2 Benchmark Datasets
| Dataset | Size | Purpose |
|---|---|---|
| OCR-Medical-EN | 5,000 pages | English medical document OCR |
| OCR-Medical-HI | 2,000 pages | Hindi/bilingual document OCR |
| ASR-Clinical-EN | 100 hours | English clinical audio |
| ASR-Clinical-HI | 50 hours | Hindi/Hinglish clinical audio |
| NER-Oncology | 10,000 annotations | Cancer-specific entities |
3.3 Evaluation Pipeline
1. Sample production data (anonymized)
2. Generate model predictions
3. Compare against human annotations
4. Compute accuracy metrics
5. Identify error categories
6. Feed back to retraining pipeline
Use Cases: ML-002
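A compressed sketch of steps 2-5, assuming predictions and human annotations arrive paired per sample; `run_model` stands in for the real inference call, and the error buckets reuse the categories from section 5.1.

```python
from collections import Counter

def evaluate(samples: list[dict], run_model) -> dict:
    """Compare model predictions against human annotations and bucket the errors."""
    correct, errors = 0, Counter()
    for sample in samples:
        prediction = run_model(sample["input"])       # step 2: generate prediction
        if prediction == sample["annotation"]:        # step 3: compare to annotation
            correct += 1
        elif prediction is None:
            errors["omission"] += 1                   # step 5: categorize the error
        elif prediction not in sample["input"]:
            errors["hallucination"] += 1              # output not grounded in input
        else:
            errors["extraction_error"] += 1
    return {"accuracy": correct / len(samples),       # step 4: compute the metric
            "error_categories": dict(errors)}

samples = [{"input": "Inj Paclitaxel 175mg/m2 IV", "annotation": "Paclitaxel 175mg/m2"}]
print(evaluate(samples, run_model=lambda text: "Paclitaxel 175mg/m2"))
```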
4. Confidence Calibration
4.1 Confidence Score Interpretation
| Score Range | Interpretation | Action |
|---|---|---|
| 90-100% | High confidence | Fast-track approval |
| 70-89% | Medium confidence | Standard review |
| 50-69% | Low confidence | Mandatory manual review |
| <50% | Very low confidence | Flag as failed, require reprocessing |
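A sketch mapping a raw score to its band and action; the cut-offs come straight from the table above, everything else is illustrative.

```python
# (band label, action) keyed by inclusive lower bound on the confidence score
BANDS = [
    (0.90, ("high", "fast_track_approval")),
    (0.70, ("medium", "standard_review")),
    (0.50, ("low", "mandatory_manual_review")),
    (0.00, ("very_low", "fail_and_reprocess")),
]

def confidence_band(score: float) -> tuple[str, str]:
    """Return the (band, action) pair for a score in [0, 1]."""
    for lower_bound, band in BANDS:
        if score >= lower_bound:
            return band
    raise ValueError(f"score out of range: {score}")

assert confidence_band(0.92) == ("high", "fast_track_approval")
assert confidence_band(0.68) == ("low", "mandatory_manual_review")
```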
4.2 Confidence Display
```json
{
  "extraction": "Paclitaxel 175mg/m²",
  "confidence": 0.92,
  "source": {
    "document": "chemo_order_2024-01-15.pdf",
    "page": 2,
    "coordinates": {"x": 120, "y": 340, "w": 200, "h": 30}
  },
  "model_version": "ner-med-v2.3.1"
}
```
4.3 Visual Indicators
| Confidence | UI Indicator |
|---|---|
| High (90%+) | Green checkmark |
| Medium (70-89%) | Yellow warning |
| Low (50-69%) | Orange flag + highlight |
| Very Low (<50%) | Red alert + manual review required |
5. Error Handling
5.1 Error Categories
| Category | Example | Handling |
|---|---|---|
| Hallucination | AI invents a medication not in the source | Flag for review, log for retraining |
| Omission | AI misses a critical finding | Human review catches, feedback loop |
| Misclassification | Wrong document type | Reprocess with correct classifier |
| Format Error | Invalid FHIR resource | Reject, log, alert engineering |
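Of these categories, hallucination is the most mechanically checkable: an extracted value that cannot be located anywhere in the source text is suspect by construction. A sketch, assuming extractions arrive as plain strings alongside the source text:

```python
import re

def find_unsupported(extractions: list[str], source_text: str) -> list[str]:
    """Return extractions that cannot be grounded in the source (candidate hallucinations)."""
    normalized_source = re.sub(r"\s+", " ", source_text.lower())
    return [e for e in extractions
            if re.sub(r"\s+", " ", e.lower()) not in normalized_source]

source = "Plan: Inj Paclitaxel 175mg/m2 IV over 3 hours."
print(find_unsupported(["Paclitaxel 175mg/m2", "Carboplatin AUC5"], source))
# ['Carboplatin AUC5'] → flag for review and log for retraining
```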
5.2 Feedback Loop
Clinician flags error → Error logged with correction → Added to training set → Model retrained → Improved accuracy
Use Cases: ML-003, OPS-303
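A sketch of the logging half of this loop, assuming corrections accumulate in a JSONL file that the retraining pipeline later consumes; the field names and file path are illustrative.

```python
import datetime
import json

def log_correction(original: str, corrected: str, error_category: str,
                   path: str = "corrections.jsonl") -> None:
    """Append a clinician correction so the retraining pipeline can pick it up."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "original": original,
        "corrected": corrected,
        "category": error_category,  # e.g., hallucination or omission (section 5.1)
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_correction("Paclitaxel 175mg", "Paclitaxel 175mg/m2", "extraction_error")
```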
6. Model Drift Detection
6.1 Monitoring Metrics
| Metric | Threshold | Alert |
|---|---|---|
| Accuracy drop | >5% from baseline | P1 alert |
| Confidence shift | Mean drops >10% | P2 alert |
| Rejection rate | >20% increase | P2 alert |
| Latency spike | >2x baseline | P1 alert |
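A sketch of the accuracy-drop check, assuming a rolling window of recent evaluation scores is compared against a frozen baseline; the 5% threshold and the P1 label mirror the table.

```python
from statistics import mean

def check_accuracy_drift(baseline_accuracy: float, recent_scores: list[float],
                         max_drop: float = 0.05) -> str | None:
    """Return an alert level if rolling accuracy falls too far below baseline."""
    rolling = mean(recent_scores)
    if baseline_accuracy - rolling > max_drop:
        return "P1"  # accuracy dropped more than 5% from baseline
    return None

baseline = 0.93
recent = [0.86, 0.88, 0.85, 0.87]   # rolling window of recent evaluation runs
print(check_accuracy_drift(baseline, recent))  # "P1" → trigger the drift response below
```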
6.2 Drift Response
- Detect: Automated monitoring flags drift
- Investigate: Review recent data/model changes
- Rollback: Revert to previous stable model if needed
- Retrain: Address root cause, update training data
- Deploy: Roll out fixed model with A/B testing
Use Cases: QAS-004
7. Compliance & Audit
7.1 Audit Requirements
| Requirement | Implementation |
|---|---|
| DPDP Act | Consent before AI processing, data retention limits |
| NABH | Documentation of AI-assisted workflows |
| ABDM | FHIR compliance for AI-generated resources |
| Internal | Monthly accuracy audits, quarterly safety reviews |
7.2 Audit Trail Fields
```json
{
  "event_type": "ai_inference",
  "timestamp": "2024-12-09T10:15:00Z",
  "model": "ocr-tesseract-v5.3",
  "model_version": "5.3.1-hindi",
  "input_hash": "sha256:abc123...",
  "output_hash": "sha256:def456...",
  "confidence": 0.87,
  "reviewed_by": null,
  "approved": false,
  "patient_id": "ABHA-12345",
  "encounter_id": "ENC-2024-001234"
}
```
Use Cases: SEC-002
Document Owner: AI/ML Engineering Team + Clinical Safety
Last Updated: 2024-12-09
Next Review: Quarterly (aligned with safety audits)