Safety & Evaluation

Document Purpose: This document defines how Entheory.AI ensures clinical safety, validates AI model outputs, and maintains human oversight for all AI-generated content.


1. Safety Design Principles

1.1 Core Principles

| Principle | Description |
|---|---|
| No Autonomous Decisions | AI suggests, humans decide. No AI output directly affects patient care without clinician review. |
| Transparency | All AI outputs show confidence scores and source provenance. |
| Fail-Safe | Low-confidence outputs routed to manual review queues. |
| Auditability | Every AI inference logged with model version, timestamp, input/output. |
| Reversibility | All AI-generated content can be edited or rejected by clinicians. |

1.2 Human-in-the-Loop Design

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ AI Model    │ ──▶ │ Confidence  │ ──▶ │ Review      │ ──▶ │ EMR/Action  │
│ Inference   │     │ Assessment  │     │ Queue       │     │ (Approved)  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                           │
                           ▼
                    High Confidence?
                    ├── Yes: Fast-track (still needs approval)
                    └── No: Manual Review Required
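
A minimal sketch of this routing step, assuming a simple two-queue design; the threshold, queue names, and `Inference` type below are illustrative rather than the production implementation:

```python
from dataclasses import dataclass

@dataclass
class Inference:
    output: dict
    confidence: float  # 0.0 - 1.0, as produced by the model

FAST_TRACK_THRESHOLD = 0.90  # assumed value; see section 4.1 for the bands

def route_for_review(inference: Inference) -> str:
    """Every inference lands in a review queue; confidence only picks which one."""
    if inference.confidence >= FAST_TRACK_THRESHOLD:
        return "fast_track_review"   # lighter review, still requires clinician approval
    return "manual_review"           # mandatory detailed review
```

Both branches end in a review queue: confidence decides how heavy the review is, never whether a clinician sees the output.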

2. Validation Framework

2.1 Input Validation

| Check | Purpose | Use Case |
|---|---|---|
| File Type | Reject executables, unexpected formats | API upload |
| Size Limits | Prevent resource exhaustion | 10 MB per document, 60 min audio |
| Language Detection | Route to correct model | OCR/ASR preprocessing |
| Quality Assessment | Flag low-quality inputs | Blurry scans, noisy audio |
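
These checks could be expressed as a single validation gate. The sketch below reuses the limits from the table; the constant names, MIME types, and function signature are assumptions:

```python
ALLOWED_TYPES = {"application/pdf", "image/png", "image/jpeg", "audio/wav"}
MAX_DOC_BYTES = 10 * 1024 * 1024   # 10 MB per document
MAX_AUDIO_SECONDS = 60 * 60        # 60 minutes of audio

def validate_upload(content_type: str, size_bytes: int,
                    duration_s: float | None = None) -> list[str]:
    """Return a list of validation errors; an empty list means the upload is accepted."""
    errors = []
    if content_type not in ALLOWED_TYPES:
        errors.append(f"unsupported file type: {content_type}")
    if size_bytes > MAX_DOC_BYTES:
        errors.append("document exceeds 10 MB limit")
    if duration_s is not None and duration_s > MAX_AUDIO_SECONDS:
        errors.append("audio exceeds 60 minute limit")
    return errors
```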

2.2 Output Validation

| Layer | Validation | Action if Failed |
|---|---|---|
| Schema | FHIR R4 compliance | Reject, log error |
| Clinical | Medical code validity (ICD-10, RxNorm) | Flag for review |
| Confidence | Below threshold (e.g., <70%) | Route to manual queue |
| Consistency | Cross-check with existing patient data | Highlight discrepancies |

| Use Case ID | Name | Validation Type |
|---|---|---|
| QAS-001 | Record Model Failures | Error tracking |
| QAS-002 | Perform Clinical Safety Review | Human review |
| QAS-003 | Track Audit Violations | Compliance |
| QAS-004 | Model Drift Detection | Quality monitoring |
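
A rough sketch of how the layers might be applied in sequence, assuming the 70% threshold from the table; the FHIR check is reduced to a placeholder and the parameter names (`known_codes` for ICD-10/RxNorm membership) are hypothetical:

```python
CONFIDENCE_THRESHOLD = 0.70  # assumed default, see section 4.1

def validate_output(resource: dict, confidence: float, known_codes: set[str]) -> list[str]:
    """Apply the validation layers in order and collect the resulting actions."""
    actions = []
    if resource.get("resourceType") is None:      # stand-in for a full FHIR R4 schema check
        actions.append("reject_and_log")
    codes = {c.get("code") for c in resource.get("code", {}).get("coding", [])}
    if not codes <= known_codes:                  # ICD-10 / RxNorm membership check
        actions.append("flag_for_review")
    if confidence < CONFIDENCE_THRESHOLD:
        actions.append("route_to_manual_queue")
    return actions
```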

3. Model Evaluation

3.1 Accuracy Metrics

| Model | Metric | Target | Measurement |
|---|---|---|---|
| OCR | Character Accuracy | >90% | Edit distance vs. ground truth |
| ASR | Word Error Rate (WER) | <15% | Levenshtein distance |
| NER | F1 Score | >85% | Precision/recall on labeled data |
| Classification | Accuracy | >90% | Confusion matrix analysis |
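
To make the ASR metric concrete, the sketch below computes word error rate as token-level Levenshtein distance divided by the reference length; this is the standard definition, not necessarily the exact evaluator used in production:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over word tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

A target of WER < 15% therefore means fewer than roughly 15 word-level edits per 100 reference words.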

3.2 Benchmark Datasets

| Dataset | Size | Purpose |
|---|---|---|
| OCR-Medical-EN | 5,000 pages | English medical document OCR |
| OCR-Medical-HI | 2,000 pages | Hindi/bilingual document OCR |
| ASR-Clinical-EN | 100 hours | English clinical audio |
| ASR-Clinical-HI | 50 hours | Hindi/Hinglish clinical audio |
| NER-Oncology | 10,000 annotations | Cancer-specific entities |

3.3 Evaluation Pipeline

1. Sample production data (anonymized)
2. Generate model predictions
3. Compare against human annotations
4. Compute accuracy metrics
5. Identify error categories
6. Feed back to retraining pipeline

Use Cases: ML-002
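
A minimal sketch of steps 3-5, assuming model predictions and human annotations have already been paired for an anonymized sample; the function name and return fields are assumptions:

```python
def evaluate(pairs: list[tuple[str, str]]) -> dict:
    """pairs: (model_prediction, human_annotation) for each sampled item."""
    correct = sum(1 for pred, gold in pairs if pred == gold)
    error_cases = [(pred, gold) for pred, gold in pairs if pred != gold]
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "error_cases": error_cases,  # categorized later and fed to retraining (step 6)
    }
```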


4. Confidence Calibration

4.1 Confidence Score Interpretation

| Score Range | Interpretation | Action |
|---|---|---|
| 90-100% | High confidence | Fast-track approval |
| 70-89% | Medium confidence | Standard review |
| 50-69% | Low confidence | Mandatory manual review |
| <50% | Very low confidence | Flag as failed, require reprocessing |
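
The table translates directly into a routing function; the band boundaries below come from the table, while the function and action names are illustrative:

```python
def confidence_action(score: float) -> str:
    """Map a model confidence score (0.0-1.0) to the review action from section 4.1."""
    if score >= 0.90:
        return "fast_track_approval"
    if score >= 0.70:
        return "standard_review"
    if score >= 0.50:
        return "mandatory_manual_review"
    return "failed_requires_reprocessing"
```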

4.2 Confidence Display

```json
{
  "extraction": "Paclitaxel 175mg/m²",
  "confidence": 0.92,
  "source": {
    "document": "chemo_order_2024-01-15.pdf",
    "page": 2,
    "coordinates": {"x": 120, "y": 340, "w": 200, "h": 30}
  },
  "model_version": "ner-med-v2.3.1"
}
```

4.3 Visual Indicators

| Confidence | UI Indicator |
|---|---|
| High (90%+) | Green checkmark |
| Medium (70-89%) | Yellow warning |
| Low (50-69%) | Orange flag + highlight |
| Very Low (<50%) | Red alert + manual review required |

5. Error Handling

5.1 Error Categories

| Category | Example | Handling |
|---|---|---|
| Hallucination | AI invents a medication not present in the source | Flag for review, log for retraining |
| Omission | AI misses a critical finding | Caught by human review; fed into the feedback loop |
| Misclassification | Wrong document type | Reprocess with the correct classifier |
| Format Error | Invalid FHIR resource | Reject, log, alert engineering |
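
As one illustration, hallucinations can be caught with a grounding check that requires every extraction to be traceable to the source text it was read from; the check below is deliberately simple and the names are hypothetical:

```python
def is_grounded(extracted_text: str, source_text: str) -> bool:
    """Return False if the extraction cannot be located in the source (possible hallucination)."""
    return extracted_text.strip().lower() in source_text.lower()
```

Extractions that fail the check would be flagged for review and logged for retraining, per the handling column above.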

5.2 Feedback Loop

Clinician flags error → Error logged with correction → Added to training set → Model retrained → Improved accuracy

Use Cases: ML-003, OPS-303
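
One possible shape for the correction record that enters this loop, assuming a simple append-only JSONL store; the field names and file path are illustrative, not a defined schema:

```python
import json
from datetime import datetime, timezone

def log_correction(inference_id: str, original: str, corrected: str, category: str) -> dict:
    record = {
        "inference_id": inference_id,
        "original_output": original,
        "clinician_correction": corrected,
        "error_category": category,  # e.g. "hallucination", "omission"
        "flagged_at": datetime.now(timezone.utc).isoformat(),
    }
    # Illustrative store; corrections join the training set at the next retraining cycle.
    with open("corrections.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```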


6. Model Drift Detection

6.1 Monitoring Metrics

| Metric | Threshold | Alert |
|---|---|---|
| Accuracy drop | >5% from baseline | P1 alert |
| Confidence shift | Mean drops >10% | P2 alert |
| Rejection rate | >20% increase | P2 alert |
| Latency spike | >2x baseline | P1 alert |
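
A minimal drift check mirroring the table, assuming the accuracy threshold is measured in absolute percentage points and the rejection-rate threshold is a relative increase; metric names and the baseline source are illustrative:

```python
def check_drift(current: dict, baseline: dict) -> list[tuple[str, str]]:
    """Compare current metrics against the baseline and return (metric, alert level) pairs."""
    alerts = []
    if baseline["accuracy"] - current["accuracy"] > 0.05:
        alerts.append(("accuracy_drop", "P1"))
    if baseline["mean_confidence"] - current["mean_confidence"] > 0.10:
        alerts.append(("confidence_shift", "P2"))
    if current["rejection_rate"] > baseline["rejection_rate"] * 1.20:
        alerts.append(("rejection_rate", "P2"))
    if current["p95_latency_ms"] > baseline["p95_latency_ms"] * 2:
        alerts.append(("latency_spike", "P1"))
    return alerts
```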

6.2 Drift Response

  1. Detect: Automated monitoring flags drift
  2. Investigate: Review recent data/model changes
  3. Rollback: Revert to previous stable model if needed
  4. Retrain: Address root cause, update training data
  5. Deploy: Roll out fixed model with A/B testing

Use Cases: QAS-004


7. Compliance & Audit

7.1 Audit Requirements

| Requirement | Implementation |
|---|---|
| DPDP Act | Consent before AI processing, data retention limits |
| NABH | Documentation of AI-assisted workflows |
| ABDM | FHIR compliance for AI-generated resources |
| Internal | Monthly accuracy audits, quarterly safety reviews |

7.2 Audit Trail Fields

```json
{
  "event_type": "ai_inference",
  "timestamp": "2024-12-09T10:15:00Z",
  "model": "ocr-tesseract-v5.3",
  "model_version": "5.3.1-hindi",
  "input_hash": "sha256:abc123...",
  "output_hash": "sha256:def456...",
  "confidence": 0.87,
  "reviewed_by": null,
  "approved": false,
  "patient_id": "ABHA-12345",
  "encounter_id": "ENC-2024-001234"
}
```

Use Cases: SEC-002
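
A sketch of how such a record might be assembled, assuming SHA-256 over the raw input bytes and the canonicalized JSON output; fields other than the hashes simply echo the example above:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(model: str, version: str, input_bytes: bytes,
                 output: dict, confidence: float) -> dict:
    """Build an ai_inference audit event; review fields stay empty until clinician sign-off."""
    return {
        "event_type": "ai_inference",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "model_version": version,
        "input_hash": "sha256:" + hashlib.sha256(input_bytes).hexdigest(),
        "output_hash": "sha256:" + hashlib.sha256(
            json.dumps(output, sort_keys=True).encode()).hexdigest(),
        "confidence": confidence,
        "reviewed_by": None,   # filled in after clinician review
        "approved": False,     # flipped on explicit approval
    }
```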



Document Owner: AI/ML Engineering Team + Clinical Safety
Last Updated: 2024-12-09
Next Review: Quarterly (aligned with safety audits)