Safety & Evaluation

Document Purpose: This document defines how Entheory.AI ensures clinical safety, validates AI model outputs, and maintains human oversight for all AI-generated content.


1. Safety Design Principles

1.1 Core Principles

| Principle | Description |
|---|---|
| No Autonomous Decisions | AI suggests, humans decide. No AI output directly affects patient care without clinician review. |
| Transparency | All AI outputs show confidence scores and source provenance. |
| Fail-Safe | Low-confidence outputs routed to manual review queues. |
| Auditability | Every AI inference logged with model version, timestamp, input/output. |
| Reversibility | All AI-generated content can be edited or rejected by clinicians. |

1.2 Human-in-the-Loop Design

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ AI Model    │ ──▶ │ Confidence  │ ──▶ │ Review      │ ──▶ │ EMR/Action  │
│ Inference   │     │ Assessment  │     │ Queue       │     │ (Approved)  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                           │
                           ▼
                    High Confidence?
                    ├── Yes: Fast-track (still needs approval)
                    └── No: Manual Review Required
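
A minimal sketch of this routing step, assuming a simple two-queue design; the threshold, queue names, and `Inference` type below are illustrative rather than the production implementation:

```python
from dataclasses import dataclass

@dataclass
class Inference:
    output: dict
    confidence: float  # 0.0 - 1.0, as produced by the model

FAST_TRACK_THRESHOLD = 0.90  # assumed value; see section 4.1 for the bands

def route_for_review(inference: Inference) -> str:
    """Every inference lands in a review queue; confidence only picks which one."""
    if inference.confidence >= FAST_TRACK_THRESHOLD:
        return "fast_track_review"   # lighter review, still requires clinician approval
    return "manual_review"           # mandatory detailed review
```

Both branches end in a review queue: confidence decides how heavy the review is, never whether a clinician sees the output.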

2. Validation Framework

2.1 Input Validation

| Check | Purpose | Use Case |
|---|---|---|
| File Type | Reject executables, unexpected formats | API upload |
| Size Limits | Prevent resource exhaustion | 10 MB per document, 60 min audio |
| Language Detection | Route to correct model | OCR/ASR preprocessing |
| Quality Assessment | Flag low-quality inputs | Blurry scans, noisy audio |
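
These checks could be expressed as a single validation gate. The sketch below reuses the limits from the table; the constant names, MIME types, and function signature are assumptions:

```python
ALLOWED_TYPES = {"application/pdf", "image/png", "image/jpeg", "audio/wav"}
MAX_DOC_BYTES = 10 * 1024 * 1024   # 10 MB per document
MAX_AUDIO_SECONDS = 60 * 60        # 60 minutes of audio

def validate_upload(content_type: str, size_bytes: int,
                    duration_s: float | None = None) -> list[str]:
    """Return a list of validation errors; an empty list means the upload is accepted."""
    errors = []
    if content_type not in ALLOWED_TYPES:
        errors.append(f"unsupported file type: {content_type}")
    if size_bytes > MAX_DOC_BYTES:
        errors.append("document exceeds 10 MB limit")
    if duration_s is not None and duration_s > MAX_AUDIO_SECONDS:
        errors.append("audio exceeds 60 minute limit")
    return errors
```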

2.2 Output Validation

| Layer | Validation | Action if Failed |
|---|---|---|
| Schema | FHIR R4 compliance | Reject, log error |
| Clinical | Medical code validity (ICD-10, RxNorm) | Flag for review |
| Confidence | Below threshold (e.g., <70%) | Route to manual queue |
| Consistency | Cross-check with existing patient data | Highlight discrepancies |

| Use Case ID | Name | Validation Type |
|---|---|---|
| QAS-001 | Record Model Failures | Error tracking |
| QAS-002 | Perform Clinical Safety Review | Human review |
| QAS-003 | Track Audit Violations | Compliance |
| QAS-004 | Model Drift Detection | Quality monitoring |
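
A rough sketch of how the layers might be applied in sequence, assuming the 70% threshold from the table; the FHIR check is reduced to a placeholder and the parameter names (`known_codes` for ICD-10/RxNorm membership) are hypothetical:

```python
CONFIDENCE_THRESHOLD = 0.70  # assumed default, see section 4.1

def validate_output(resource: dict, confidence: float, known_codes: set[str]) -> list[str]:
    """Apply the validation layers in order and collect the resulting actions."""
    actions = []
    if resource.get("resourceType") is None:      # stand-in for a full FHIR R4 schema check
        actions.append("reject_and_log")
    codes = {c.get("code") for c in resource.get("code", {}).get("coding", [])}
    if not codes <= known_codes:                  # ICD-10 / RxNorm membership check
        actions.append("flag_for_review")
    if confidence < CONFIDENCE_THRESHOLD:
        actions.append("route_to_manual_queue")
    return actions
```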

3. Model Evaluation

3.1 Accuracy Metrics

| Model | Metric | Target | Measurement |
|---|---|---|---|
| OCR | Character Accuracy | >90% | Edit distance vs. ground truth |
| ASR | Word Error Rate (WER) | <15% | Levenshtein distance |
| NER | F1 Score | >85% | Precision/recall on labeled data |
| Classification | Accuracy | >90% | Confusion matrix analysis |
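
To make the ASR metric concrete, the sketch below computes word error rate as token-level Levenshtein distance divided by the reference length; this is the standard definition, not necessarily the exact evaluator used in production:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over word tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

A target of WER < 15% therefore means fewer than roughly 15 word-level edits per 100 reference words.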

3.2 Benchmark Datasets

| Dataset | Size | Purpose |
|---|---|---|
| OCR-Medical-EN | 5,000 pages | English medical document OCR |
| OCR-Medical-HI | 2,000 pages | Hindi/bilingual document OCR |
| ASR-Clinical-EN | 100 hours | English clinical audio |
| ASR-Clinical-HI | 50 hours | Hindi/Hinglish clinical audio |
| NER-Oncology | 10,000 annotations | Cancer-specific entities |

3.3 Evaluation Pipeline

1. Sample production data (anonymized)
2. Generate model predictions
3. Compare against human annotations
4. Compute accuracy metrics
5. Identify error categories
6. Feed back to retraining pipeline

Use Cases: ML-002
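
A minimal sketch of steps 3-5, assuming model predictions and human annotations have already been paired for an anonymized sample; the function name and return fields are assumptions:

```python
def evaluate(pairs: list[tuple[str, str]]) -> dict:
    """pairs: (model_prediction, human_annotation) for each sampled item."""
    correct = sum(1 for pred, gold in pairs if pred == gold)
    error_cases = [(pred, gold) for pred, gold in pairs if pred != gold]
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "error_cases": error_cases,  # categorized later and fed to retraining (step 6)
    }
```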


4. Confidence Calibration

4.1 Confidence Score Interpretation

| Score Range | Interpretation | Action |
|---|---|---|
| 90-100% | High confidence | Fast-track approval |
| 70-89% | Medium confidence | Standard review |
| 50-69% | Low confidence | Mandatory manual review |
| <50% | Very low confidence | Flag as failed, require reprocessing |
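
The table translates directly into a routing function; the band boundaries below come from the table, while the function and action names are illustrative:

```python
def confidence_action(score: float) -> str:
    """Map a model confidence score (0.0-1.0) to the review action from section 4.1."""
    if score >= 0.90:
        return "fast_track_approval"
    if score >= 0.70:
        return "standard_review"
    if score >= 0.50:
        return "mandatory_manual_review"
    return "failed_requires_reprocessing"
```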

4.2 Confidence Display

```json
{
  "extraction": "Paclitaxel 175mg/m²",
  "confidence": 0.92,
  "source": {
    "document": "chemo_order_2024-01-15.pdf",
    "page": 2,
    "coordinates": {"x": 120, "y": 340, "w": 200, "h": 30}
  },
  "model_version": "ner-med-v2.3.1"
}
```

4.3 Visual Indicators

| Confidence | UI Indicator |
|---|---|
| High (90%+) | Green checkmark |
| Medium (70-89%) | Yellow warning |
| Low (50-69%) | Orange flag + highlight |
| Very Low (<50%) | Red alert + manual review required |

5. Error Handling

5.1 Error Categories

| Category | Example | Handling |
|---|---|---|
| Hallucination | AI invents a medication not present in the source | Flag for review, log for retraining |
| Omission | AI misses a critical finding | Caught by human review; fed into the feedback loop |
| Misclassification | Wrong document type | Reprocess with the correct classifier |
| Format Error | Invalid FHIR resource | Reject, log, alert engineering |
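
As one illustration, hallucinations can be caught with a grounding check that requires every extraction to be traceable to the source text it was read from; the check below is deliberately simple and the names are hypothetical:

```python
def is_grounded(extracted_text: str, source_text: str) -> bool:
    """Return False if the extraction cannot be located in the source (possible hallucination)."""
    return extracted_text.strip().lower() in source_text.lower()
```

Extractions that fail the check would be flagged for review and logged for retraining, per the handling column above.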

5.2 Feedback Loop

Clinician flags error → Error logged with correction → Added to training set → Model retrained → Improved accuracy

Use Cases: ML-003, OPS-303
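
One possible shape for the correction record that enters this loop, assuming a simple append-only JSONL store; the field names and file path are illustrative, not a defined schema:

```python
import json
from datetime import datetime, timezone

def log_correction(inference_id: str, original: str, corrected: str, category: str) -> dict:
    record = {
        "inference_id": inference_id,
        "original_output": original,
        "clinician_correction": corrected,
        "error_category": category,  # e.g. "hallucination", "omission"
        "flagged_at": datetime.now(timezone.utc).isoformat(),
    }
    # Illustrative store; corrections join the training set at the next retraining cycle.
    with open("corrections.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```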


6. Model Drift Detection

6.1 Monitoring Metrics

| Metric | Threshold | Alert |
|---|---|---|
| Accuracy drop | >5% from baseline | P1 alert |
| Confidence shift | Mean drops >10% | P2 alert |
| Rejection rate | >20% increase | P2 alert |
| Latency spike | >2x baseline | P1 alert |
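
A minimal drift check mirroring the table, assuming the accuracy threshold is measured in absolute percentage points and the rejection-rate threshold is a relative increase; metric names and the baseline source are illustrative:

```python
def check_drift(current: dict, baseline: dict) -> list[tuple[str, str]]:
    """Compare current metrics against the baseline and return (metric, alert level) pairs."""
    alerts = []
    if baseline["accuracy"] - current["accuracy"] > 0.05:
        alerts.append(("accuracy_drop", "P1"))
    if baseline["mean_confidence"] - current["mean_confidence"] > 0.10:
        alerts.append(("confidence_shift", "P2"))
    if current["rejection_rate"] > baseline["rejection_rate"] * 1.20:
        alerts.append(("rejection_rate", "P2"))
    if current["p95_latency_ms"] > baseline["p95_latency_ms"] * 2:
        alerts.append(("latency_spike", "P1"))
    return alerts
```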

6.2 Drift Response

  1. Detect: Automated monitoring flags drift
  2. Investigate: Review recent data/model changes
  3. Rollback: Revert to previous stable model if needed
  4. Retrain: Address root cause, update training data
  5. Deploy: Roll out fixed model with A/B testing

Use Cases: QAS-004


7. Compliance & Audit

7.1 Audit Requirements

| Requirement | Implementation |
|---|---|
| DPDP Act | Consent before AI processing, data retention limits |
| NABH | Documentation of AI-assisted workflows |
| ABDM | FHIR compliance for AI-generated resources |
| Internal | Monthly accuracy audits, quarterly safety reviews |

7.2 Audit Trail Fields

```json
{
  "event_type": "ai_inference",
  "timestamp": "2024-12-09T10:15:00Z",
  "model": "ocr-tesseract-v5.3",
  "model_version": "5.3.1-hindi",
  "input_hash": "sha256:abc123...",
  "output_hash": "sha256:def456...",
  "confidence": 0.87,
  "reviewed_by": null,
  "approved": false,
  "patient_id": "ABHA-12345",
  "encounter_id": "ENC-2024-001234"
}
```

Use Cases: SEC-002
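
A sketch of how such a record might be assembled, assuming SHA-256 over the raw input bytes and the canonicalized JSON output; fields other than the hashes simply echo the example above:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(model: str, version: str, input_bytes: bytes,
                 output: dict, confidence: float) -> dict:
    """Build an ai_inference audit event; review fields stay empty until clinician sign-off."""
    return {
        "event_type": "ai_inference",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "model_version": version,
        "input_hash": "sha256:" + hashlib.sha256(input_bytes).hexdigest(),
        "output_hash": "sha256:" + hashlib.sha256(
            json.dumps(output, sort_keys=True).encode()).hexdigest(),
        "confidence": confidence,
        "reviewed_by": None,   # filled in after clinician review
        "approved": False,     # flipped on explicit approval
    }
```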



Document Owner: AI/ML Engineering Team + Clinical Safety
Last Updated: 2024-12-09
Next Review: Quarterly (aligned with safety audits)