High-Level Architecture¶
System Overview¶
Entheory.AI is built as a modular, event-driven architecture designed for:
- Scalability: Handle 10,000+ oncology patients per instance
- Interoperability: Ingest from multiple heterogeneous hospital systems
- Resilience: Zero data loss, graceful degradation
- Extensibility: Easy to add new data sources and processing pipelines
Architecture Diagram¶
┌────────────────────────────────────────────────────────────────────┐
│ PHYSICIAN LAYER │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ React Web Application (TypeScript) │ │
│ │ • Patient List • Timeline • Labs • Imaging View │ │
│ │ • Responsive Design • Real-time Updates │ │
│ └─────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
│ HTTPS/REST
↓
┌────────────────────────────────────────────────────────────────────┐
│ API GATEWAY LAYER │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ REST APIs │ │ Job Status │ │ FHIR Export │ │
│ │ /api/patients│ │ /api/jobs │ │ ?format=fhir │ │
│ │ /api/upload │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ Authentication & RBAC (JWT) │
└────────────────────────────────────────────────────────────────────┘
│
↓
┌────────────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Query Service │ │ Command Service │ │
│ │ • Read patient data│ │ • Ingest data │ │
│ │ • Cache layer │ │ • Update bundles │ │
│ │ • FHIR generation │ │ • Job management │ │
│ └─────────────────────┘ └─────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
↓ ↓ ↓
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ CANONICAL BUNDLES│ │ MESSAGE QUEUES │ │ OBJECT STORAGE │
│ (JSON Files) │ │ (RabbitMQ/SQS) │ │ (S3/MinIO) │
│ │ │ │ │ │
│ Per-patient JSON │ │ • Ingestion queue│ │ • PDFs │
│ bundle.json │ │ • OCR queue │ │ • Audio files │
│ │ │ • ASR queue │ │ • DICOM images │
│ processed_ │ │ • DLQ (errors) │ │ │
│ patients.json │ │ │ │ │
└──────────────────┘ └──────────────────┘ └──────────────────┘
↑ ↑ ↑
└─────────────────────┴──────────────────────┘
│
┌────────────────────────────────────────────────────────────────────┐
│ INGESTION & PROCESSING LAYER │
│ │
│ ┌────────────┐ ┌─────────┐ ┌──────────┐ ┌────────────┐ │
│ │ HL7 Listener│ │ File │ │ OCR │ │ ASR Worker │ │
│ │ (MLLP) │ │ Watchers│ │ Worker │ │ (Whisper) │ │
│ │ • ADT │ │ • PACS │ │(Tesseract│ │ • Audio │ │
│ │ • ORU (Labs)│ │ • Genomics │ • Eng+Hi │ │ • Eng+Hindi│ │
│ └────────────┘ └─────────┘ └──────────┘ └────────────┘ │
└────────────────────────────────────────────────────────────────────┘
↑ ↑ ↑ ↑
│ │ │ │
┌───────┴────────────────┴────────────────┴────────────────┴────────┐
│ EXTERNAL HOSPITAL SYSTEMS │
│ │
│ ┌──────────┐ ┌─────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ EMR/HIMS │ │ LIS │ │ PACS │ │ Genomics Lab │ │
│ │ (HL7 ADT)│ │ (HL7 ORU)│ │ (JSON) │ │ (JSON) │ │
│ └──────────┘ └─────────┘ └──────────┘ └──────────────┘ │
└────────────────────────────────────────────────────────────────────┘
Component Details¶
1. Frontend (React Web App)¶
Tech Stack:
- React 18 + TypeScript
- React Router for navigation
- Recharts for visualizations
- Axios for API calls
Key Components:
- PatientList.tsx - Search and select patients
- PatientOverview.tsx - Summary view
- Timeline.tsx - Longitudinal event timeline
- LabsView.tsx - Lab results with trends
- ImagingView.tsx - Imaging studies
- NotesView.tsx - Multilingual clinical notes
State Management:
- React Context for global state (current patient)
- Local component state for UI interactions

Performance:
- Code splitting per route
- Lazy loading for large data tables
- Caching API responses (stale-while-revalidate)
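The per-route code splitting and lazy loading above can be expressed with React.lazy and Suspense. A minimal sketch, assuming react-router-dom v6 and default exports for the route components listed under Key Components:

```tsx
// Hypothetical route setup illustrating per-route code splitting.
// Each route's chunk is downloaded only when that route is first visited.
import { lazy, Suspense } from 'react';
import { Routes, Route } from 'react-router-dom';

const PatientList = lazy(() => import('./components/PatientList'));
const Timeline = lazy(() => import('./components/Timeline'));
const LabsView = lazy(() => import('./components/LabsView'));

export function AppRoutes() {
  return (
    <Suspense fallback={<div>Loading…</div>}>
      <Routes>
        <Route path="/" element={<PatientList />} />
        <Route path="/patients/:abhaId/timeline" element={<Timeline />} />
        <Route path="/patients/:abhaId/labs" element={<LabsView />} />
      </Routes>
    </Suspense>
  );
}
```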
2. API Gateway¶
Framework: Express.js / FastAPI
Responsibilities:
- Route HTTP requests to appropriate services
- JWT authentication and RBAC enforcement
- Rate limiting and request validation
- CORS handling
Key Endpoints:
GET /api/patients # List patients
GET /api/patients/:abhaId # Get patient by ID
GET /api/patients/:abhaId?format=fhir # Get FHIR bundle
POST /api/upload/document # Upload for OCR
POST /api/upload/audio # Upload for ASR
POST /api/ingest/fhir # Ingest FHIR bundle
POST /api/ingest/hl7 # Ingest HL7 message
GET /api/jobs/:jobId # Job status
GET /api/datasources # Data source health
GET /health # System health check
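As a minimal sketch of how the gateway serves the same patient resource in canonical or FHIR form (the ?format=fhir endpoint above), assuming Express.js; loadBundle and toFhirR4 are hypothetical stand-ins for the Query Service's bundle loader and FHIR transformer:

```typescript
import express from 'express';

// Hypothetical helpers provided by the Query Service.
declare function loadBundle(abhaId: string): Promise<object | null>;
declare function toFhirR4(bundle: object): object;

const app = express();

app.get('/api/patients/:abhaId', async (req, res) => {
  const bundle = await loadBundle(req.params.abhaId);
  if (!bundle) return res.status(404).json({ error: 'Patient not found' });

  // ?format=fhir returns a FHIR R4 Bundle instead of the canonical JSON.
  if (req.query.format === 'fhir') {
    return res.json(toFhirR4(bundle));
  }
  return res.json(bundle);
});
```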
3. Application Services¶
Query Service (Read Path)¶
Purpose: Fast reads for UI
Components:
- Cache Layer: Redis or in-memory cache for processed_patients.json
- Bundle Loader: Reads canonical patient bundles
- FHIR Transformer: Converts bundles to FHIR R4 on demand
Optimization:
- Cache hit rate >90% for common queries
- Precompute aggregations (test count, event count)
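A cache-aside sketch for the read path, assuming a simple in-memory cache over processed_patients.json with a short TTL; a Redis client could be substituted without changing the shape of the code:

```typescript
import { promises as fs } from 'fs';

const CACHE_PATH = 'src/data/processed_patients.json';
const TTL_MS = 30_000;   // illustrative TTL, not a documented value

let cached: { data: unknown; loadedAt: number } | null = null;

export async function getPatientList(): Promise<unknown> {
  const now = Date.now();
  if (cached && now - cached.loadedAt < TTL_MS) {
    return cached.data;   // cache hit: no disk I/O
  }
  const raw = await fs.readFile(CACHE_PATH, 'utf-8');
  cached = { data: JSON.parse(raw), loadedAt: now };
  return cached.data;
}
```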
Command Service (Write Path)¶
Purpose: Data ingestion and bundle updates
Components:
- HL7 Parser: Parse ADT, ORU messages
- FHIR Parser: Validate and extract FHIR resources
- JSON Ingester: Process file-based feeds
- Bundle Updater: Atomically update patient bundles
- Job Manager: Track async processing jobs

Guarantees:
- Atomicity: Bundle updates are atomic (write temp, then rename)
- Durability: All writes persisted to disk before ACK
- Idempotency: Duplicate messages don't create duplicate data
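The write-temp-then-rename guarantee can be sketched as follows; on POSIX filesystems a rename within the same directory is atomic, so readers never observe a partially written bundle:

```typescript
import { promises as fs } from 'fs';
import path from 'path';

export async function writeBundleAtomically(bundlePath: string, bundle: object): Promise<void> {
  const tmpPath = path.join(path.dirname(bundlePath), `.${path.basename(bundlePath)}.tmp`);
  const json = JSON.stringify(bundle, null, 2);

  const handle = await fs.open(tmpPath, 'w');
  try {
    await handle.writeFile(json, 'utf-8');
    await handle.sync();                 // flush to disk before ACKing upstream
  } finally {
    await handle.close();
  }
  await fs.rename(tmpPath, bundlePath);  // atomic swap over the previous bundle
}
```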
4. Data Storage¶
Canonical Patient Bundles¶
Format: One JSON file per patient
Path: src/data/patients/<abhaId>/bundle.json
Schema:
{
"patientId": "case_001",
"abhaId": "ABHA-12345678901",
"demographics": { ... },
"cancer": { "site": "Breast", "stage": "IIB", ... },
"labs": [ ... ],
"imaging": [ ... ],
"pathology": [ ... ],
"genomics": [ ... ],
"therapy": [ ... ],
"medications": [ ... ],
"documents": [ ... ], // OCR outputs
"transcripts": [ ... ], // ASR outputs
"provenance": { "lastUpdated": "...", "sources": [...] }
}
Why JSON files:
- Easy to inspect and debug
- Version control friendly
- No database schema migrations
- Simple backup (file copy)
Future: May migrate to PostgreSQL + JSONB for better query performance at scale.
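For reference, the same schema expressed as an illustrative TypeScript shape; the array element types are placeholders for the sections elided above, not the full schema:

```typescript
interface PatientBundle {
  patientId: string;
  abhaId: string;
  demographics: Record<string, unknown>;
  cancer: { site: string; stage: string; [key: string]: unknown };
  labs: unknown[];
  imaging: unknown[];
  pathology: unknown[];
  genomics: unknown[];
  therapy: unknown[];
  medications: unknown[];
  documents: unknown[];    // OCR outputs
  transcripts: unknown[];  // ASR outputs
  provenance: { lastUpdated: string; sources: string[] };
}
```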
Processed Cache¶
Format: Single aggregated JSON
Path: src/data/processed_patients.json
Purpose:
- Fast patient list queries
- Precomputed summaries (test count, latest vitals)
- Reduce bundle I/O for list views

Regeneration:
- Triggered after every bundle update
- Incremental update (only changed patients)
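A sketch of the incremental update, assuming the cache is keyed by ABHA ID; summarize() is a hypothetical function that produces the precomputed fields:

```typescript
import { promises as fs } from 'fs';

interface BundleLike { abhaId: string; labs: unknown[] }
interface PatientSummary { abhaId: string; testCount: number }

declare function summarize(bundle: BundleLike): PatientSummary;

export async function updateProcessedCache(
  cachePath: string,
  changedBundles: BundleLike[],
): Promise<void> {
  const cache: Record<string, PatientSummary> =
    JSON.parse(await fs.readFile(cachePath, 'utf-8'));

  for (const bundle of changedBundles) {
    cache[bundle.abhaId] = summarize(bundle);   // touch only the changed entries
  }
  // The real pipeline would reuse the atomic temp-then-rename write shown earlier.
  await fs.writeFile(cachePath, JSON.stringify(cache, null, 2), 'utf-8');
}
```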
Object Storage (S3/MinIO)¶
Stores:
- Uploaded PDFs (OCR input)
- Audio files (ASR input)
- DICOM images (if downloaded from PACS)
- Original HL7 messages (for audit/debugging)
Organization:
s3://entheory-hospital1/
├─ documents/
│ └─ <patientId>/
│ └─ doc_<timestamp>.pdf
├─ audio/
│ └─ <patientId>/
│ └─ audio_<timestamp>.mp3
├─ hl7/
│ └─ <date>/
│ └─ oru_<messageId>.txt
└─ dicom/
└─ <studyId>/
└─ <seriesId>/
└─ <instanceId>.dcm
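A sketch of writing an uploaded document into this layout, assuming the AWS SDK v3 S3 client (MinIO works with the same client via a custom endpoint); the bucket name follows the example above:

```typescript
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});   // endpoint and credentials come from deployment config

export async function storeDocument(patientId: string, pdf: Buffer): Promise<string> {
  const key = `documents/${patientId}/doc_${Date.now()}.pdf`;
  await s3.send(new PutObjectCommand({
    Bucket: 'entheory-hospital1',
    Key: key,
    Body: pdf,
    ContentType: 'application/pdf',
  }));
  return key;   // recorded on the OCR job so the worker can fetch the file later
}
```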
5. Message Queues¶
Technology: RabbitMQ (preferred) or AWS SQS
Queues:
| Queue Name | Purpose | Consumer |
|---|---|---|
| lab-ingestion-queue | HL7 ORU lab messages | Lab ingestion worker |
| imaging-ingestion-queue | PACS JSON feeds | Imaging worker |
| ocr-processing-queue | Documents to OCR | OCR worker (Tesseract) |
| asr-processing-queue | Audio to transcribe | ASR worker (Whisper) |
| hl7-dlq | Failed HL7 messages | Manual review |
| ocr-dlq | Failed OCR jobs | Manual review |
| asr-dlq | Failed ASR jobs | Manual review |
Guarantees:
- At-least-once delivery
- Visibility timeout: 5 minutes (if a worker crashes, the message is requeued)
- Dead letter queue for repeated failures (after 3 retries)
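A consumer sketch for one of these queues, assuming RabbitMQ via amqplib; dead-letter routing after repeated rejections is assumed to be configured on the queue itself, and handleLabMessage is a hypothetical handler:

```typescript
import amqp from 'amqplib';

declare function handleLabMessage(payload: unknown): Promise<void>;

export async function startLabConsumer(url: string): Promise<void> {
  const connection = await amqp.connect(url);
  const channel = await connection.createChannel();
  await channel.prefetch(1);   // one in-flight message per worker

  await channel.consume('lab-ingestion-queue', async (msg) => {
    if (!msg) return;
    try {
      await handleLabMessage(JSON.parse(msg.content.toString()));
      channel.ack(msg);                 // success: remove from the queue
    } catch (err) {
      // requeue=false so a repeatedly failing message is dead-lettered (hl7-dlq)
      channel.nack(msg, false, false);
    }
  }, { noAck: false });
}
```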
6. Processing Workers¶
OCR Worker¶
Technology: Tesseract 5.x
Process:
1. Dequeue job from ocr-processing-queue
2. Download PDF from S3
3. Detect language (langdetect)
4. Render PDF pages to images (Tesseract does not read PDFs directly), then invoke Tesseract: tesseract page.png output -l eng|hin --oem 3 --psm 1
5. Extract text and confidence
6. Update patient bundle with extracted text
7. Validate bundle, regenerate cache
8. Mark job completed or failed
Parallelism: 10 workers, each processing 1 document at a time
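A sketch of the OCR step itself (steps 4-5), assuming pdftoppm from poppler-utils is available to rasterize pages; the Tesseract flags mirror the command above, and per-word confidence would come from Tesseract's TSV output rather than plain text:

```typescript
import { execFile } from 'child_process';
import { promisify } from 'util';
import { promises as fs } from 'fs';

const run = promisify(execFile);

export async function ocrPdf(pdfPath: string, lang: 'eng' | 'hin'): Promise<string> {
  // Rasterize the PDF into page-001.png, page-002.png, ... at 300 DPI.
  await run('pdftoppm', ['-png', '-r', '300', pdfPath, 'page']);

  const pages = (await fs.readdir('.'))
    .filter((f) => f.startsWith('page-') && f.endsWith('.png'))
    .sort();

  let text = '';
  for (const page of pages) {
    // Output base "stdout" makes Tesseract print the recognized text to stdout.
    const { stdout } = await run('tesseract', [page, 'stdout', '-l', lang, '--oem', '3', '--psm', '1']);
    text += stdout + '\n';
  }
  return text;
}
```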
ASR Worker¶
Technology: OpenAI Whisper (large-v3)
Process:
1. Dequeue job from asr-processing-queue
2. Download audio from S3
3. Invoke Whisper: whisper audio.mp3 --model large-v3 --language en|hi
4. Extract transcript with timestamps
5. Update patient bundle
6. Validate and cache
7. Mark job completed/failed
Parallelization: GPU-accelerated, 2-3 concurrent jobs (memory bound)
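Because the large-v3 model is memory bound, the worker caps concurrency. A sketch of a small in-process semaphore around the Whisper CLI call shown above:

```typescript
import { execFile } from 'child_process';
import { promisify } from 'util';

const run = promisify(execFile);
const MAX_CONCURRENT = 2;   // per the memory-bound limit above

let active = 0;
const waiting: Array<() => void> = [];

async function acquire(): Promise<void> {
  if (active < MAX_CONCURRENT) { active++; return; }
  await new Promise<void>((resolve) => waiting.push(resolve));
  active++;
}

function release(): void {
  active--;
  waiting.shift()?.();
}

export async function transcribe(audioPath: string, language: 'en' | 'hi'): Promise<void> {
  await acquire();
  try {
    await run('whisper', [audioPath, '--model', 'large-v3', '--language', language]);
  } finally {
    release();
  }
}
```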
7. External Integrations¶
HL7 v2 Listener¶
Technology: MLLP (Minimal Lower Layer Protocol) TCP listener
Port: 2575 (configurable)
Message Types:
- ADT^A01 - Admission
- ADT^A03 - Discharge
- ORU^R01 - Lab results
Flow:
Hospital LIS → TCP/MLLP → HL7 Listener → Parse → Enqueue → ACK
Error Handling:
- Malformed messages: Return NACK, log full message
- Patient not found: Return ACK, move to DLQ for manual resolution
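A sketch of the MLLP framing the listener deals with: each message arrives as <VT>…<FS><CR> (0x0B … 0x1C 0x0D) over TCP, and the listener replies with an HL7 ACK in the same framing. parseAndEnqueue and buildAck are hypothetical stand-ins for the parser and ACK/NACK construction:

```typescript
import net from 'net';

const VT = 0x0b, FS = 0x1c, CR = 0x0d;

declare function parseAndEnqueue(hl7: string): Promise<void>;
declare function buildAck(hl7: string, code: 'AA' | 'AE'): string;

function frame(ack: string): Buffer {
  return Buffer.concat([Buffer.from([VT]), Buffer.from(ack), Buffer.from([FS, CR])]);
}

net.createServer((socket) => {
  let buffer = Buffer.alloc(0);

  socket.on('data', async (chunk) => {
    buffer = Buffer.concat([buffer, chunk]);
    let end: number;
    // A complete frame ends with <FS><CR>; several frames may be buffered at once.
    while ((end = buffer.indexOf(Buffer.from([FS, CR]))) !== -1) {
      const start = buffer.indexOf(VT) + 1;
      const message = buffer.subarray(start, end).toString('utf-8');
      buffer = buffer.subarray(end + 2);
      try {
        await parseAndEnqueue(message);
        socket.write(frame(buildAck(message, 'AA')));   // accept
      } catch {
        socket.write(frame(buildAck(message, 'AE')));   // application error (NACK)
      }
    }
  });
}).listen(2575);   // configurable MLLP port, per the section above
```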
File Watchers¶
Technology: Chokidar (Node.js) or inotify (Linux)
Watched Directories:
/mnt/hospital-feeds/
├─ pacs/ # Imaging JSON files
├─ genomics/ # Genomics reports
└─ pathology/ # Pathology PDFs
Debounce: 5 seconds (wait for file write completion)
Process:
1. Detect new file
2. Validate JSON schema (if JSON)
3. Create ingestion job
4. Enqueue
5. Move processed file to /processed/<date>/
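A watcher sketch assuming chokidar, with the 5-second debounce expressed via awaitWriteFinish; enqueueIngestionJob and moveToProcessed are hypothetical helpers for steps 3-5:

```typescript
import chokidar from 'chokidar';
import { promises as fs } from 'fs';

declare function enqueueIngestionJob(filePath: string): Promise<void>;
declare function moveToProcessed(filePath: string): Promise<void>;

chokidar
  .watch('/mnt/hospital-feeds', {
    ignoreInitial: true,
    // Fire only once the file has stopped growing for 5 seconds.
    awaitWriteFinish: { stabilityThreshold: 5000, pollInterval: 500 },
  })
  .on('add', async (filePath) => {
    try {
      if (filePath.endsWith('.json')) {
        // Basic well-formedness check; full schema validation happens in the worker.
        JSON.parse(await fs.readFile(filePath, 'utf-8'));
      }
      await enqueueIngestionJob(filePath);
      await moveToProcessed(filePath);
    } catch (err) {
      console.error('Failed to ingest file', filePath, err);
    }
  });
```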
Data Flow Examples¶
Example 1: Lab Result Ingestion (HL7)¶
1. LIS sends HL7 ORU message via MLLP
↓
2. HL7 Listener receives, validates, sends ACK
↓
3. Parser extracts patient ID, test results
↓
4. Normalize to ABHA ID (query mapping table)
↓
5. Create job, enqueue to "lab-ingestion-queue"
↓
6. Worker dequeues, loads patient bundle
↓
7. Append lab results to bundle.labs[]
↓
8. Validate bundle (JSON schema)
↓
9. Atomically write updated bundle
↓
10. Regenerate processed_patients.json cache
↓
11. Generate FHIR Observation resources
↓
12. Job marked "completed"
Latency: 200-500ms end-to-end
Example 2: Document OCR (Hindi)¶
1. Physician uploads PDF via UI
↓
2. API validates file (size, type)
↓
3. API stores PDF in S3 with hash
↓
4. API creates OCR job, enqueues
↓
5. API returns 202 Accepted with jobId
↓
[Async Processing]
6. OCR worker dequeues job
↓
7. Worker downloads PDF from S3
↓
8. Worker detects language (Hindi)
↓
9. Worker runs Tesseract with "hin" pack
↓
10. Worker extracts text in Devanagari
↓
11. Worker calculates confidence (0.89)
↓
12. Worker updates bundle.documents[] with:
- extractedText
- language: "hi-IN"
- confidence: 0.89
↓
13. Worker validates bundle
↓
14. Worker regenerates cache
↓
15. Job marked "completed"
Latency: 30-60 seconds for a typical 2-3 page document
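On the UI side, the jobId returned in step 5 can simply be polled until the job completes. A small sketch; the { status } response shape is an assumption:

```typescript
export async function waitForJob(jobId: string, intervalMs = 2000): Promise<void> {
  for (;;) {
    const res = await fetch(`/api/jobs/${jobId}`);
    const { status } = await res.json();
    if (status === 'completed') return;
    if (status === 'failed') throw new Error(`Job ${jobId} failed`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```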
Scalability & Performance¶
Current Limits (Single Instance)¶
| Resource | Capacity |
|---|---|
| Patients | 10,000 |
| Concurrent Users | 100 clinicians |
| API Throughput | 1000 req/min |
| HL7 Messages | 10,000/day |
| OCR Jobs | 500/day |
Scaling Strategies¶
Horizontal Scaling:
- Deploy multiple API servers behind load balancer
- Add more queue workers (OCR, ASR)
- Shard file storage by hospital/patient ID range

Vertical Scaling:
- Increase server RAM for in-memory cache
- Add GPUs for faster ASR processing

Caching:
- Redis for processed patient cache
- CDN for static assets (frontend)
- HTTP caching headers for patient API
Security Architecture¶
Authentication¶
- JWT tokens (RS256)
- Issued by hospital SSO/LDAP
- Expiration: 8 hours
- Refresh token flow
Authorization (RBAC)¶
- Roles: Oncologist, Nurse, Data Manager, Admin, Read-Only
- Permissions mapped per API endpoint
- Row-level security: Physicians see only their department's patients (configurable)
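A sketch of the token verification and role check as Express middleware, assuming the jsonwebtoken package; role names follow the list above, and the claim layout is an assumption:

```typescript
import { Request, Response, NextFunction } from 'express';
import jwt, { JwtPayload } from 'jsonwebtoken';

const PUBLIC_KEY = process.env.SSO_PUBLIC_KEY ?? '';   // public key of the hospital SSO issuer

export function requireRole(...roles: string[]) {
  return (req: Request, res: Response, next: NextFunction) => {
    const token = req.headers.authorization?.replace('Bearer ', '');
    if (!token) return res.status(401).json({ error: 'Missing token' });
    try {
      // RS256 tokens are verified against the issuer's public key.
      const claims = jwt.verify(token, PUBLIC_KEY, { algorithms: ['RS256'] }) as JwtPayload;
      if (typeof claims.role !== 'string' || !roles.includes(claims.role)) {
        return res.status(403).json({ error: 'Insufficient role' });
      }
      next();
    } catch {
      return res.status(401).json({ error: 'Invalid or expired token' });
    }
  };
}
```

Usage would look like app.get('/api/patients', requireRole('Oncologist', 'Nurse', 'Admin'), listPatients).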
Encryption¶
- At Rest: AES-256 for bundles, S3 server-side encryption
- In Transit: TLS 1.3 for all connections
- Backups: Encrypted with hospital-provided keys
Audit Logging¶
- Every patient data access logged
- Log fields: userId, patientId, action, timestamp, IP
- Immutable logs (append-only)
- Retention: 7 years (compliance requirement)
Deployment Architecture¶
Option 1: On-Premises (Hospital Data Center)¶
┌─────────────────────────────────────────┐
│ Hospital Network (10.x.x.x) │
│ │
│ ┌──────────────────────────────────┐ │
│ │ Entheory.AI VM/Container │ │
│ │ • API Server │ │
│ │ • Workers (OCR/ASR) │ │
│ │ • RabbitMQ │ │
│ │ • File Storage (NFS/local disk) │ │
│ └──────────────────────────────────┘ │
│ ↕ │
│ ┌──────────────────────────────────┐ │
│ │ Hospital Systems │ │
│ │ • EMR (HL7 sender) │ │
│ │ • LIS (Labs) │ │
│ │ • PACS (file drop) │ │
│ └──────────────────────────────────┘ │
└─────────────────────────────────────────┘
Pros: Data never leaves hospital network, meets security policies
Cons: Hospital IT must maintain VM/containers
Option 2: Cloud (AWS/Azure) with VPN¶
┌──────────────────┐ ┌──────────────────────┐
│ Hospital Network │ │ Cloud VPC │
│ │ │ │
│ EMR, LIS, PACS │◄─────────┤ Entheory.AI App │
│ │ VPN/VPC │ • API Servers │
│ │ Peering │ • Workers (GPU) │
│ │ │ • S3, RDS │
└──────────────────┘ └──────────────────────┘
Pros: Managed services, GPU for ASR, easier scaling
Cons: Requires VPN setup, data governance approval
Monitoring & Observability¶
Metrics (Prometheus + Grafana)¶
System Metrics:
- CPU, memory, and disk usage per service
- API latency (p50, p95, p99)
- Queue depth and processing lag
- Error rates per endpoint

Business Metrics:
- Patients ingested per day
- Data completeness per modality
- OCR/ASR accuracy trends
- Active clinician users
Logging (ELK/Loki)¶
Structured JSON logs:
{
"timestamp": "2024-12-03T10:15:30Z",
"level": "INFO",
"service": "ocr-worker",
"jobId": "ocr_job_789",
"event": "ocr_completed",
"language": "hi-IN",
"confidence": 0.89,
"duration_ms": 34500
}
Alerting (PagerDuty/Slack)¶
Critical Alerts:
- API downtime >1 minute
- DLQ depth >10 messages
- Disk usage >90%
- FHIR validation failure rate >10%

Warning Alerts:
- OCR confidence <0.70 (manual review needed)
- Queue lag >15 minutes
- Cache hit rate <80%
Disaster Recovery¶
Backup Strategy¶
- Bundles: Daily automated backup to S3/Azure Blob (encrypted)
- Object Storage: Native S3 versioning enabled
- Audit Logs: Replicated to separate region
Recovery Procedures¶
- Data Loss: Restore from last night's backup (RPO: 24 hours)
- Server Failure: Redeploy from Docker image, mount backup storage (RTO: 2 hours)
Testing¶
- Quarterly disaster recovery drills
- Automated restore tests monthly
Document Owner: Tech Lead / Architect
Last Updated: 2024-12-03
Related: Data Model | APIs & Interoperability