Ops & Knowledge Overview
Document Purpose: This document provides an overview of operational capabilities, knowledge management, and support infrastructure for the Entheory.AI platform. It covers the operational use cases, monitoring practices, and cross-cutting knowledge required to run the platform reliably.
Executive Summary
Entheory.AI's operational infrastructure is designed for hospital-grade reliability—99.5% uptime SLA, comprehensive audit trails, and proactive monitoring. This document maps operational capabilities to specific use cases and provides guidance for day-to-day operations.
Incident Response Workflow
```mermaid
stateDiagram-v2
    [*] --> Detected: Alert Triggered
    Detected --> Acknowledged: On-call acknowledges
    Detected --> Escalated: No ack in 5 min
    Acknowledged --> Investigating: Start investigation
    Escalated --> Investigating: Escalation team investigates
    Investigating --> Mitigating: Root cause identified
    Investigating --> Escalated: >30 min, no progress
    Mitigating --> Resolved: Fix deployed
    Mitigating --> Rollback: Fix failed
    Rollback --> Mitigating: Retry fix
    Resolved --> PostMortem: Within 48 hours
    PostMortem --> [*]: RCA documented

    note right of Detected
        P0: 5 min response
        P1: 15 min response
        P2: Next business day
    end note
```
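The transitions above map cleanly onto tooling. Below is a minimal Python sketch of the same state machine, using hypothetical state names taken from the diagram; it is illustrative only, not the alerting system's actual implementation.

```python
from enum import Enum


class IncidentState(str, Enum):
    DETECTED = "Detected"
    ACKNOWLEDGED = "Acknowledged"
    ESCALATED = "Escalated"
    INVESTIGATING = "Investigating"
    MITIGATING = "Mitigating"
    ROLLBACK = "Rollback"
    RESOLVED = "Resolved"
    POST_MORTEM = "PostMortem"


# Allowed transitions, mirroring the diagram above.
TRANSITIONS: dict[IncidentState, set[IncidentState]] = {
    IncidentState.DETECTED: {IncidentState.ACKNOWLEDGED, IncidentState.ESCALATED},
    IncidentState.ACKNOWLEDGED: {IncidentState.INVESTIGATING},
    IncidentState.ESCALATED: {IncidentState.INVESTIGATING},
    IncidentState.INVESTIGATING: {IncidentState.MITIGATING, IncidentState.ESCALATED},
    IncidentState.MITIGATING: {IncidentState.RESOLVED, IncidentState.ROLLBACK},
    IncidentState.ROLLBACK: {IncidentState.MITIGATING},
    IncidentState.RESOLVED: {IncidentState.POST_MORTEM},
    IncidentState.POST_MORTEM: set(),
}


def advance(current: IncidentState, target: IncidentState) -> IncidentState:
    """Move an incident to a new state, rejecting transitions the workflow does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current.value} -> {target.value}")
    return target


if __name__ == "__main__":
    state = IncidentState.DETECTED
    for nxt in (IncidentState.ACKNOWLEDGED, IncidentState.INVESTIGATING,
                IncidentState.MITIGATING, IncidentState.RESOLVED, IncidentState.POST_MORTEM):
        state = advance(state, nxt)
        print(f"Incident is now {state.value}")
```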
1. Operational Capabilities by Use Case
1.1 Core Operations (9 Use Cases)
| Use Case ID | Name | Priority | Description |
|---|---|---|---|
| OPS-001 | Monitor Pipeline Health | P0 | Real-time monitoring of ingestion, processing, and integration pipelines |
| OPS-002 | Review Failed Jobs | P0 | Dashboard and workflows for triaging failed pipeline jobs |
| OPS-003 | Rotate API Keys | P0 | Secure rotation of API credentials without downtime |
| OPS-004 | Reprocess Historical Backlog | P1 | Bulk reprocessing of historical data after fixes or upgrades |
| OPS-005 | Manage Feature Flags | P2 | Controlled rollout of features via feature flags |
| OPS-006 | Run Disaster Recovery Drill | P0 | Scheduled DR exercises to validate backup and recovery |
| OPS-301 | Job Queue & Worker Orchestration | P0 | Manage distributed job queues and worker scaling |
| OPS-302 | Monitor Inference Time & Failures | P1 | Track OCR/ASR model performance and failure rates |
| OPS-303 | Human-in-the-Loop Correction | P1 | Manual review queue for low-confidence AI outputs |
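As an illustration of OPS-002 (Review Failed Jobs), the sketch below groups failed pipeline jobs by pipeline and error type so the on-call can triage the noisiest failure causes first. The `FailedJob` record, field names, and sample errors are assumptions for illustration, not the job queue's actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FailedJob:
    job_id: str
    pipeline: str
    error_type: str
    retryable: bool


def triage_summary(jobs: list[FailedJob]) -> list[tuple[str, int]]:
    """Rank failure causes by frequency so the noisiest ones are reviewed first."""
    counts = Counter(f"{job.pipeline}:{job.error_type}" for job in jobs)
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical sample data standing in for a real failed-jobs query.
    failures = [
        FailedJob("j-101", "ocr", "timeout", retryable=True),
        FailedJob("j-102", "ocr", "timeout", retryable=True),
        FailedJob("j-103", "hl7-ingest", "schema_mismatch", retryable=False),
    ]
    for cause, count in triage_summary(failures):
        print(f"{cause}: {count} failed job(s)")
```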
1.2 Security & Compliance Operations (7 Use Cases)
| Use Case ID | Name | Priority | Description |
|---|---|---|---|
| SEC-001 | Enforce Data Retention Policies | P0 | Automated data lifecycle management per DPDP Act |
| SEC-002 | Audit Access Trails | P0 | Comprehensive logging of all data access events |
| SEC-003 | Process Consent Revocation | P0 | Handle patient requests to revoke data processing consent |
| SEC-004 | Detect Anomalous Login Patterns | P1 | ML-based detection of suspicious access patterns |
| SEC-401a | Verify Doctor Authorization | P0 | Role-based access control for clinicians |
| SEC-401b | Verify Patient Consent | P0 | Real-time consent verification before data access |
| SEC-402 | Data Encryption & Masking | P0 | At-rest and in-transit encryption, PII masking |
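To make SEC-401a and SEC-401b concrete, here is a minimal sketch of an access gate that checks both the clinician's role and an active, unexpired patient consent before any record is released. The role names and the `ConsentRecord` shape are assumed for illustration and do not describe the platform's actual authorization API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

AUTHORIZED_ROLES = {"doctor", "clinical_reviewer"}  # assumed role names


@dataclass
class ConsentRecord:
    patient_id: str
    revoked: bool
    expires_at: datetime


def can_access(role: str, consent: ConsentRecord, now: datetime | None = None) -> bool:
    """Allow access only when the role is authorized and consent is active and unexpired."""
    now = now or datetime.now(timezone.utc)
    if role not in AUTHORIZED_ROLES:
        return False
    if consent.revoked or consent.expires_at <= now:
        return False
    return True


if __name__ == "__main__":
    consent = ConsentRecord("pat-42", revoked=False,
                            expires_at=datetime(2026, 1, 1, tzinfo=timezone.utc))
    print(can_access("doctor", consent))         # True while consent is active
    print(can_access("billing_clerk", consent))  # False: role not authorized
```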
1.3 Quality & Safety Operations (4 Use Cases)
| Use Case ID | Name | Priority | Description |
|---|---|---|---|
| QAS-001 | Record Model Failures | P0 | Track OCR/ASR/NLP model failures for retraining |
| QAS-002 | Perform Clinical Safety Review | P0 | Clinical review workflows for AI-generated content |
| QAS-003 | Track Audit Violations | P1 | Monitor compliance violations and remediation |
| QAS-004 | Model Drift Detection | P1 | Detect performance degradation in deployed models |
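For QAS-004, one common heuristic is to compare a recent window of model confidence scores against a baseline window and flag a drop beyond a threshold. The sketch below shows that heuristic; it is not necessarily the drift method the platform uses, and the sample scores are invented.

```python
from statistics import mean


def drift_detected(baseline: list[float], recent: list[float],
                   max_drop: float = 0.05) -> bool:
    """Flag drift when the recent mean confidence falls more than `max_drop` below baseline."""
    if not baseline or not recent:
        raise ValueError("Both windows must contain confidence scores")
    return mean(baseline) - mean(recent) > max_drop


if __name__ == "__main__":
    baseline_scores = [0.93, 0.91, 0.95, 0.92]  # e.g., last month's OCR confidences
    recent_scores = [0.84, 0.86, 0.82, 0.85]    # e.g., this week's confidences
    print(drift_detected(baseline_scores, recent_scores))  # True: mean dropped ~0.08
```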
2. Monitoring & Observability
2.1 Dashboard Overview
| Dashboard | Purpose | Key Metrics |
|---|---|---|
| Pipeline Health | Overall ingestion/processing status | Job success rate, queue depth, processing latency |
| Integration Status | External system connectivity | HL7/FHIR endpoint uptime, message flow rates |
| Model Performance | OCR/ASR/NLP accuracy | Accuracy %, failure rate, avg confidence scores |
| Security & Compliance | Access patterns and audit status | Login anomalies, consent coverage, audit completion |
| Clinical Alerts | Active alerts requiring attention | Critical lab alerts, treatment delays, unacknowledged items |
2.2 Alert Tiers
| Tier | Response Time | Examples |
|---|---|---|
| P0 - Critical | 15 minutes | Integration down, data loss risk, security incident |
| P1 - High | 1 hour | Elevated failure rates, consent expiry, model degradation |
| P2 - Medium | 4 hours | Queue backlog, feature flag issues, minor integration errors |
| P3 - Low | Next business day | Dashboard cosmetic issues, non-critical log cleanup |
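The tiers above translate naturally into a small routing configuration. The sketch below maps each tier to its response window from the table so an acknowledgement deadline can be computed when an alert fires; apart from the tier labels and windows, everything here is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Response windows from the alert tier table above.
RESPONSE_WINDOWS = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=4),
    "P3": timedelta(days=1),  # "next business day", approximated here as 24 hours
}


def ack_deadline(tier: str, fired_at: datetime) -> datetime:
    """Return the time by which the on-call must acknowledge an alert of this tier."""
    return fired_at + RESPONSE_WINDOWS[tier]


if __name__ == "__main__":
    fired = datetime(2024, 12, 9, 10, 0, tzinfo=timezone.utc)
    print(ack_deadline("P0", fired))  # 2024-12-09 10:15:00+00:00
```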
2.3 Key SLIs/SLOs
| Service Level Indicator | Target SLO |
|---|---|
| Platform Availability | 99.5% uptime |
| API Response Time (p95) | <2 seconds |
| Ingestion Latency (HL7 → Bundle) | <30 seconds |
| OCR Processing Time | <60 seconds per document |
| ASR Processing Time | <2x audio duration |
| Job Failure Rate | <1% |
| Data Completeness | >80% of modalities ingested |
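Error budget burn, reviewed weekly in Section 7.1, follows directly from these targets. The sketch below computes the fraction of the availability budget remaining for the 99.5% uptime SLO over a 30-day window, using an assumed downtime figure.

```python
def error_budget_remaining(slo: float, window_minutes: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent for an availability SLO."""
    budget_minutes = (1.0 - slo) * window_minutes
    return max(0.0, 1.0 - downtime_minutes / budget_minutes)


if __name__ == "__main__":
    # 99.5% over 30 days allows (1 - 0.995) * 43200 = 216 minutes of downtime.
    window = 30 * 24 * 60
    print(f"Budget left: {error_budget_remaining(0.995, window, downtime_minutes=54):.0%}")
    # 54 of the 216 allowed minutes consumed, so 75% of the budget remains.
```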
3. On-Call & Incident Response
3.1 On-Call Structure
- Primary On-Call: Platform engineer (rotating weekly)
- Secondary On-Call: Backend engineer (escalation backup)
- Clinical Escalation: Medical informatics lead (for clinical safety issues)
- Leadership Escalation: CTO/Founder (for P0 incidents lasting >30 minutes)
3.2 Incident Severity Classification
| Severity | Definition | Examples |
|---|---|---|
| SEV-1 | Complete platform outage or data loss | Database failure, complete integration down |
| SEV-2 | Major feature unavailable | OCR pipeline down, timeline not loading |
| SEV-3 | Partial degradation | Slow response times, intermittent errors |
| SEV-4 | Minor issue | UI glitch, non-critical job failures |
3.3 Incident Response Process
- Detection → Alert fires or user report received
- Acknowledgement → On-call acknowledges within SLA
- Triage → Severity classification, impact assessment
- Mitigation → Apply immediate fix or workaround
- Resolution → Root cause fix deployed and verified
- Post-Mortem → Document lessons learned (SEV-1/2 only)
4. Key Knowledge Areas
4.1 Domain Knowledge
4.2 Technical Knowledge
5. Operational Runbooks Summary
Detailed runbooks are maintained in Playbooks. Key runbooks include:
| Runbook | Purpose | Trigger |
|---|---|---|
| Incident Response | Handle platform incidents | P0/P1 alert |
| Data Backfill | Reprocess historical data | Post-upgrade, data fix |
| Hospital Onboarding | Deploy to new hospital | New customer |
| Integration Debugging | Troubleshoot HL7/FHIR issues | Integration errors |
| Model Rollback | Revert to previous OCR/ASR model | Model degradation detected |
| DR Drill | Execute disaster recovery test | Monthly scheduled |
| Key Rotation | Rotate API keys/certificates | Scheduled or security incident |
| Consent Processing | Handle consent revocation requests | Patient request |
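The Data Backfill runbook (supporting OPS-004) generally reprocesses records in bounded batches so a failed batch can be retried without duplicating completed work. The sketch below illustrates that pattern with a hypothetical `reprocess` callable and record IDs; it is a sketch of the batching idea, not the runbook itself.

```python
from typing import Callable, Sequence


def backfill(record_ids: Sequence[str],
             reprocess: Callable[[str], None],
             batch_size: int = 100) -> list[str]:
    """Reprocess records in fixed-size batches, returning IDs that failed so they can be retried."""
    failed: list[str] = []
    for start in range(0, len(record_ids), batch_size):
        for rid in record_ids[start:start + batch_size]:
            try:
                reprocess(rid)
            except Exception:
                failed.append(rid)  # collect failures; retry after the run completes
    return failed


if __name__ == "__main__":
    # Hypothetical reprocessing function standing in for the real pipeline call.
    def reprocess(rid: str) -> None:
        if rid.endswith("7"):
            raise RuntimeError("simulated failure")

    ids = [f"doc-{i}" for i in range(1, 11)]
    print("Needs retry:", backfill(ids, reprocess, batch_size=4))
```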
6. Cross-Functional Dependencies
| Function | Ops Dependency | Communication Channel |
|---|---|---|
| Engineering | Deploy coordination, feature flags | Slack #eng-ops |
| Clinical Team | Safety review escalations | Slack #clinical-safety |
| Customer Success | Hospital escalations | Slack #customer-ops |
| Security | Incident response, audit requests | Slack #security |
| Leadership | SEV-1 escalations, DR status | Email + Phone |
7. Continuous Improvement
7.1 Regular Reviews
| Review | Frequency | Focus |
|---|---|---|
| SLO Review | Weekly | Service level compliance, error budget burn |
| Incident Review | After each SEV-1/2 | Root cause, prevention measures |
| Capacity Planning | Monthly | Infrastructure scaling needs |
| Model Performance | Weekly | OCR/ASR accuracy trends, retraining needs |
| Compliance Audit | Quarterly | DPDP, NABH, ABDM alignment |
7.2 Improvement Backlog
Operational improvements are tracked in the following areas:
- Automation: Reduce manual toil
- Observability: Better alerts and dashboards
- Resilience: Chaos engineering, DR improvements
- Documentation: Runbook updates, knowledge base expansion
Document Owner: Platform Operations Team
Last Updated: 2024-12-09
Next Review: Post-pilot deployment review