
Ops & Knowledge Overview

Document Purpose: This document provides an overview of operational capabilities, knowledge management, and support infrastructure for the Entheory.AI platform. It covers the operational use cases, monitoring practices, and cross-cutting knowledge required to run the platform reliably.


Executive Summary

Entheory.AI's operational infrastructure is designed for hospital-grade reliability—99.5% uptime SLA, comprehensive audit trails, and proactive monitoring. This document maps operational capabilities to specific use cases and provides guidance for day-to-day operations.

Incident Response Workflow

stateDiagram-v2
    [*] --> Detected: Alert Triggered

    Detected --> Acknowledged: On-call acknowledges
    Detected --> Escalated: No ack in 5 min

    Acknowledged --> Investigating: Start investigation
    Escalated --> Investigating: Escalation team investigates

    Investigating --> Mitigating: Root cause identified
    Investigating --> Escalated: >30 min, no progress

    Mitigating --> Resolved: Fix deployed
    Mitigating --> Rollback: Fix failed

    Rollback --> Mitigating: Retry fix

    Resolved --> PostMortem: Within 48 hours
    PostMortem --> [*]: RCA documented

    note right of Detected
        Ack targets:
        P0: 5 min
        P1: 15 min
        P2: Next business day
    end note

1. Operational Capabilities by Use Case

1.1 Core Operations (9 Use Cases)

| Use Case ID | Name | Priority | Description |
| --- | --- | --- | --- |
| OPS-001 | Monitor Pipeline Health | P0 | Real-time monitoring of ingestion, processing, and integration pipelines |
| OPS-002 | Review Failed Jobs | P0 | Dashboard and workflows for triaging failed pipeline jobs |
| OPS-003 | Rotate API Keys | P0 | Secure rotation of API credentials without downtime |
| OPS-004 | Reprocess Historical Backlog | P1 | Bulk reprocessing of historical data after fixes or upgrades |
| OPS-005 | Manage Feature Flags | P2 | Controlled rollout of features via feature flags |
| OPS-006 | Run Disaster Recovery Drill | P0 | Scheduled DR exercises to validate backup and recovery |
| OPS-301 | Job Queue & Worker Orchestration | P0 | Manage distributed job queues and worker scaling |
| OPS-302 | Monitor Inference Time & Failures | P1 | Track OCR/ASR model performance and failure rates |
| OPS-303 | Human-in-the-Loop Correction | P1 | Manual review queue for low-confidence AI outputs |
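
To make OPS-002 concrete, the sketch below shows one way a failed-job triage pass could work: group failures by pipeline and error type, requeue the ones that look transient, and route the rest to human review (OPS-303). It is a minimal illustration only; the `FailedJob` shape, the error categories, and the retry limit are assumptions, not the platform's actual job API.

```python
from collections import Counter
from dataclasses import dataclass

# Error types treated as transient; retrying them is usually safe. Illustrative list.
TRANSIENT_ERRORS = {"TimeoutError", "ConnectionError", "WorkerLost"}

@dataclass
class FailedJob:
    job_id: str
    pipeline: str       # e.g. "ocr", "asr", "hl7_ingest" (hypothetical names)
    error_type: str     # exception class name recorded by the worker
    attempts: int

def triage(failed_jobs: list[FailedJob], max_attempts: int = 3):
    """Split failed jobs into 'retry' and 'needs human review' buckets."""
    retry, review = [], []
    for job in failed_jobs:
        if job.error_type in TRANSIENT_ERRORS and job.attempts < max_attempts:
            retry.append(job)
        else:
            review.append(job)
    # Counting failures by (pipeline, error type) helps spot systemic issues quickly.
    summary = Counter((j.pipeline, j.error_type) for j in failed_jobs)
    return retry, review, summary
```

In practice the `retry` bucket would be pushed back onto the job queue and the `summary` counts would feed the Pipeline Health dashboard described in section 2.1.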

1.2 Security & Compliance Operations (7 Use Cases)

| Use Case ID | Name | Priority | Description |
| --- | --- | --- | --- |
| SEC-001 | Enforce Data Retention Policies | P0 | Automated data lifecycle management per DPDP Act |
| SEC-002 | Audit Access Trails | P0 | Comprehensive logging of all data access events |
| SEC-003 | Process Consent Revocation | P0 | Handle patient requests to revoke data processing consent |
| SEC-004 | Detect Anomalous Login Patterns | P1 | ML-based detection of suspicious access patterns |
| SEC-401a | Verify Doctor Authorization | P0 | Role-based access control for clinicians |
| SEC-401b | Verify Patient Consent | P0 | Real-time consent verification before data access |
| SEC-402 | Data Encryption & Masking | P0 | At-rest and in-transit encryption, PII masking |
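
SEC-402 covers PII masking alongside encryption. As a minimal sketch, assuming PII fields are identified by name and masked by redacting all but the last two characters (both assumptions, not the platform's actual masking policy):

```python
import copy

# Illustrative field list; a real policy would come from configuration, not code.
PII_FIELDS = {"name", "phone", "email", "abha_id", "address"}

def mask_record(record: dict, keep_last: int = 2) -> dict:
    """Return a copy of the record with PII values partially masked."""
    masked = copy.deepcopy(record)
    for key, value in masked.items():
        if key in PII_FIELDS and isinstance(value, str) and value:
            visible = value[-keep_last:] if len(value) > keep_last else ""
            masked[key] = "*" * max(len(value) - keep_last, 0) + visible
    return masked

# Example: mask_record({"name": "Asha Rao", "mrn": "12345"})
# -> {"name": "******ao", "mrn": "12345"}
```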

1.3 Quality & Safety Operations (4 Use Cases)

| Use Case ID | Name | Priority | Description |
| --- | --- | --- | --- |
| QAS-001 | Record Model Failures | P0 | Track OCR/ASR/NLP model failures for retraining |
| QAS-002 | Perform Clinical Safety Review | P0 | Clinical review workflows for AI-generated content |
| QAS-003 | Track Audit Violations | P1 | Monitor compliance violations and remediation |
| QAS-004 | Model Drift Detection | P1 | Detect performance degradation in deployed models |
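
For QAS-004, one simple drift signal is a drop in mean model confidence between a baseline window and a recent window. The sketch below is illustrative; the window contents and the 0.05 threshold are assumptions, not tuned values from the platform:

```python
from statistics import mean

def drift_detected(baseline_scores: list[float],
                   recent_scores: list[float],
                   max_drop: float = 0.05) -> bool:
    """Flag drift when mean confidence drops by more than `max_drop`
    relative to the baseline window. Threshold is illustrative only."""
    if not baseline_scores or not recent_scores:
        return False
    return mean(baseline_scores) - mean(recent_scores) > max_drop

# Example: drift_detected([0.92, 0.91, 0.93], [0.84, 0.86, 0.85]) -> True
```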

2. Monitoring & Observability

2.1 Dashboard Overview

| Dashboard | Purpose | Key Metrics |
| --- | --- | --- |
| Pipeline Health | Overall ingestion/processing status | Job success rate, queue depth, processing latency |
| Integration Status | External system connectivity | HL7/FHIR endpoint uptime, message flow rates |
| Model Performance | OCR/ASR/NLP accuracy | Accuracy %, failure rate, avg confidence scores |
| Security & Compliance | Access patterns and audit status | Login anomalies, consent coverage, audit completion |
| Clinical Alerts | Active alerts requiring attention | Critical lab alerts, treatment delays, unacknowledged items |
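
As a rough illustration of how the Pipeline Health numbers could be derived, the sketch below computes success rate, p95 processing latency, and queue depth from a list of job records. The record shape (`status`, `latency_s`) is an assumption for the example, not the platform's actual schema:

```python
def pipeline_health(jobs: list[dict]) -> dict:
    """Compute Pipeline Health dashboard figures from raw job records.
    Each record is assumed to carry a 'status' and, when finished, a 'latency_s'."""
    total = len(jobs)
    succeeded = sum(1 for j in jobs if j["status"] == "succeeded")
    latencies = sorted(j["latency_s"] for j in jobs if j["status"] == "succeeded")
    # Nearest-rank style p95; good enough for a dashboard sketch.
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else None
    return {
        "success_rate": succeeded / total if total else None,
        "p95_latency_s": p95,
        "queue_depth": sum(1 for j in jobs if j["status"] == "queued"),
    }
```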

2.2 Alert Tiers

| Tier | Response Time | Examples |
| --- | --- | --- |
| P0 - Critical | 15 minutes | Integration down, data loss risk, security incident |
| P1 - High | 1 hour | Elevated failure rates, consent expiry, model degradation |
| P2 - Medium | 4 hours | Queue backlog, feature flag issues, minor integration errors |
| P3 - Low | Next business day | Dashboard cosmetic issues, non-critical log cleanup |

2.3 Key SLIs/SLOs

| Service Level Indicator | Target SLO |
| --- | --- |
| Platform Availability | 99.5% uptime |
| API Response Time (p95) | <2 seconds |
| Ingestion Latency (HL7 → Bundle) | <30 seconds |
| OCR Processing Time | <60 seconds per document |
| ASR Processing Time | <2x audio duration |
| Job Failure Rate | <1% |
| Data Completeness | >80% of modalities ingested |
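
The weekly SLO review (section 7.1) tracks error budget burn against these targets. As a worked example, a 99.5% availability SLO over a 30-day window allows 0.5% of 43,200 minutes, about 216 minutes of downtime; the sketch below turns observed downtime into a remaining-budget fraction. The function shape is illustrative, not platform code:

```python
def error_budget_remaining(slo: float, window_minutes: int,
                           downtime_minutes: float) -> float:
    """Fraction of the error budget left in the window.
    For slo=0.995 and a 30-day window, the budget is ~216 minutes."""
    budget = (1.0 - slo) * window_minutes
    return max(0.0, 1.0 - downtime_minutes / budget)

# Example: error_budget_remaining(0.995, 30 * 24 * 60, downtime_minutes=54) -> 0.75
```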

3. On-Call & Incident Response

3.1 On-Call Structure

  • Primary On-Call: Platform engineer (rotating weekly)
  • Secondary On-Call: Backend engineer (escalation backup)
  • Clinical Escalation: Medical informatics lead (for clinical safety issues)
  • Leadership Escalation: CTO/Founder (for P0 incidents lasting >30 minutes)

3.2 Incident Severity Classification

| Severity | Definition | Examples |
| --- | --- | --- |
| SEV-1 | Complete platform outage or data loss | Database failure, complete integration down |
| SEV-2 | Major feature unavailable | OCR pipeline down, timeline not loading |
| SEV-3 | Partial degradation | Slow response times, intermittent errors |
| SEV-4 | Minor issue | UI glitch, non-critical job failures |

3.3 Incident Response Process

  1. Detection → Alert fires or user report received
  2. Acknowledgement → On-call acknowledges within SLA (see the timer sketch after this list)
  3. Triage → Severity classification, impact assessment
  4. Mitigation → Apply immediate fix or workaround
  5. Resolution → Root cause fix deployed and verified
  6. Post-Mortem → Document lessons learned (SEV-1/2 only)
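
The acknowledgement and escalation steps can be timer-driven. The sketch below mirrors the "no ack in 5 min" transition from the workflow diagram above; the P1 window and the function shape are illustrative assumptions rather than the platform's paging logic:

```python
from datetime import datetime, timedelta
from typing import Optional

# Acknowledgement windows before auto-escalation. The 5-minute P0 window
# matches the workflow diagram; the P1 value is illustrative.
ACK_WINDOWS = {"P0": timedelta(minutes=5), "P1": timedelta(minutes=15)}

def should_escalate(tier: str, alert_fired_at: datetime,
                    acked_at: Optional[datetime], now: datetime) -> bool:
    """True when an unacknowledged alert has outlived its ack window."""
    window = ACK_WINDOWS.get(tier)
    if window is None or acked_at is not None:
        return False
    return now - alert_fired_at > window
```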

4. Key Knowledge Areas

4.1 Domain Knowledge

| Area | Topic | Reference |
| --- | --- | --- |
| Clinical | Oncology terminology, TNM staging, RECIST criteria | Medical Ontologies |
| Regulatory | ABDM/ABHA, DPDP Act, NABH requirements | Security Use Cases |
| Interoperability | HL7 v2, FHIR R4, DICOM | APIs & Interoperability |
| AI/ML | OCR (Tesseract), ASR (Whisper), NLP pipelines | AI & ML Overview |

4.2 Technical Knowledge

| Area | Topic | Reference |
| --- | --- | --- |
| Architecture | System components, data flow | High-Level Architecture |
| Data Model | Patient bundle structure, FHIR resources | Data Model |
| Pipelines | Ingestion, processing, integration flows | Pipelines & Ingestion |
| Backend | API design, job queues, database | Backend Implementation |
| Frontend | React UI, timeline visualization | Frontend Implementation |
| DevOps | CI/CD, infrastructure, monitoring | DevOps & SRE |

5. Operational Runbooks Summary

Detailed runbooks are maintained in Playbooks. Key runbooks include:

| Runbook | Purpose | Trigger |
| --- | --- | --- |
| Incident Response | Handle platform incidents | P0/P1 alert |
| Data Backfill | Reprocess historical data | Post-upgrade, data fix |
| Hospital Onboarding | Deploy to new hospital | New customer |
| Integration Debugging | Troubleshoot HL7/FHIR issues | Integration errors |
| Model Rollback | Revert to previous OCR/ASR model | Model degradation detected |
| DR Drill | Execute disaster recovery test | Monthly scheduled |
| Key Rotation | Rotate API keys/certificates | Scheduled or security incident |
| Consent Processing | Handle consent revocation requests | Patient request |
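
OPS-003 and the Key Rotation runbook call for rotating credentials without downtime, which usually means a dual-key overlap: issue a new key, keep the old one valid for a grace window so clients can cut over, then let it expire. The sketch below assumes a hypothetical in-memory credential store and is not the platform's actual procedure:

```python
import secrets
from datetime import datetime, timedelta, timezone

def rotate_api_key(store: dict, client_id: str, overlap_hours: int = 24) -> str:
    """Zero-downtime rotation sketch: issue a new key and schedule existing
    keys to expire after an overlap window instead of immediately.
    `store` stands in for a real credential store (e.g. a secrets manager)."""
    new_key = secrets.token_urlsafe(32)
    now = datetime.now(timezone.utc)
    cutoff = now + timedelta(hours=overlap_hours)
    entry = store.setdefault(client_id, {"keys": []})
    for key in entry["keys"]:
        # Cap every existing key's lifetime at the overlap cutoff.
        if key["expires_at"] is None or key["expires_at"] > cutoff:
            key["expires_at"] = cutoff
    entry["keys"].append({"value": new_key, "expires_at": None})  # None = active
    return new_key
```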

6. Cross-Functional Dependencies

| Function | Ops Dependency | Communication Channel |
| --- | --- | --- |
| Engineering | Deploy coordination, feature flags | Slack #eng-ops |
| Clinical Team | Safety review escalations | Slack #clinical-safety |
| Customer Success | Hospital escalations | Slack #customer-ops |
| Security | Incident response, audit requests | Slack #security |
| Leadership | SEV-1 escalations, DR status | Email + Phone |

7. Continuous Improvement

7.1 Regular Reviews

| Review | Frequency | Focus |
| --- | --- | --- |
| SLO Review | Weekly | Service level compliance, error budget burn |
| Incident Review | After each SEV-1/2 | Root cause, prevention measures |
| Capacity Planning | Monthly | Infrastructure scaling needs |
| Model Performance | Weekly | OCR/ASR accuracy trends, retraining needs |
| Compliance Audit | Quarterly | DPDP, NABH, ABDM alignment |

7.2 Improvement Backlog

Track operational improvements across four themes:

  • Automation: Reduce manual toil
  • Observability: Better alerts and dashboards
  • Resilience: Chaos engineering, DR improvements
  • Documentation: Runbook updates, knowledge base expansion

Document Owner: Platform Operations Team
Last Updated: 2024-12-09
Next Review: Post-pilot deployment review