DevOps & SRE
Document Purpose: This document outlines the CI/CD pipelines, infrastructure, monitoring, and site reliability engineering (SRE) practices for the Entheory.AI platform.
Executive Summary
Entheory.AI follows modern DevOps and SRE practices to ensure reliable, secure, and rapid software delivery. Our infrastructure supports both on-premises hospital deployments and cloud-hosted environments.
Related Documentation:
- Operations Use Cases – Operational workflows
- Ops & Knowledge Overview – Operational capabilities
- Playbooks – Runbooks for common scenarios
1. Environment Architecture
1.1 Environment Overview
```mermaid
flowchart LR
    subgraph Dev["Development"]
        DEV_APP["App Server"]
        DEV_DB["Dev DB"]
        DEV_Q["Queue"]
    end
    subgraph Staging["Staging"]
        STG_APP["App Server"]
        STG_DB["Staging DB"]
        STG_Q["Queue"]
    end
    subgraph Prod["Production"]
        PROD_LB["Load Balancer"]
        PROD_APP1["App Server 1"]
        PROD_APP2["App Server 2"]
        PROD_DB["Primary DB"]
        PROD_DB_R["Replica DB"]
        PROD_Q["Queue Cluster"]
    end
    Dev -->|Promote| Staging -->|Promote| Prod
    style Dev fill:#ccffcc
    style Staging fill:#ffffcc
    style Prod fill:#ffcccc
```
1.2 Environment Details
| Environment | Purpose | URL | Data | Access |
| --- | --- | --- | --- | --- |
| Development | Feature development, testing | dev.entheory.local | Synthetic/anonymized | All engineers |
| Staging | Pre-production testing, UAT | staging.entheory.ai | Subset of anonymized prod | QA + select engineers |
| Production | Live patient data processing | app.entheory.ai | Real PHI | Authorized only |
| DR Site | Disaster recovery (warm standby) | dr.entheory.ai | Replicated from prod | Emergency only |
1.3 Environment Parity
| Aspect | Dev | Staging | Prod |
| --- | --- | --- | --- |
| Infrastructure | Single-node Docker | Multi-node, similar to prod | Full HA cluster |
| Data | Synthetic (100 patients) | Anonymized (1,000 patients) | Real (10,000+ patients) |
| Integrations | Mock HL7/FHIR | Test endpoints | Live hospital systems |
| SSL | Self-signed | Let's Encrypt | AWS ACM |
| Monitoring | Local logs | Full monitoring | Full monitoring + alerting |
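For reference, the dev tier's single-node Docker setup might look like the docker-compose sketch below; service names, credentials, and the choice of RabbitMQ among the queue options in §3.1 are illustrative assumptions, not the actual configuration:

```yaml
# docker-compose.yml — hypothetical single-node dev stack
services:
  app:
    build: .
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://entheory:entheory@db:5432/entheory  # dev-only credentials
      REDIS_URL: redis://cache:6379
    depends_on: [db, cache, queue]
  db:
    image: postgres:15          # matches the version in §3.1
    environment:
      POSTGRES_USER: entheory
      POSTGRES_PASSWORD: entheory
      POSTGRES_DB: entheory
  cache:
    image: redis:7
  queue:
    image: rabbitmq:3-management  # §3.1 lists Kafka/NATS/RabbitMQ; RabbitMQ chosen here for simplicity
  objectstore:
    image: minio/minio            # S3-compatible store for dev
    command: server /data
```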
2. CI/CD Pipeline
2.1 Pipeline Overview
```mermaid
flowchart LR
    subgraph Source["Source"]
        GIT["GitHub"]
    end
    subgraph Build["Build & Test"]
        LINT["Lint & Format"]
        UNIT["Unit Tests"]
        BUILD["Docker Build"]
        SAST["Security Scan"]
    end
    subgraph Deploy["Deploy"]
        DEV_D["Deploy Dev"]
        STG_D["Deploy Staging"]
        PROD_D["Deploy Prod"]
    end
    subgraph Verify["Verify"]
        SMOKE["Smoke Tests"]
        E2E["E2E Tests"]
        PERF["Performance"]
    end
    GIT --> LINT --> UNIT --> BUILD --> SAST
    SAST --> DEV_D --> SMOKE
    SMOKE -->|Manual Approval| STG_D --> E2E
    E2E -->|Manual Approval| PROD_D --> PERF
    style Source fill:#9cf
    style Build fill:#fc9
    style Deploy fill:#9fc
    style Verify fill:#f9c
```
2.2 Pipeline Stages
| Stage | Tools | Duration | Actions |
| --- | --- | --- | --- |
| Lint & Format | ESLint, Prettier, Black | ~1 min | Code style checks |
| Unit Tests | Jest, pytest | ~3 min | Unit test execution |
| Build | Docker, npm | ~5 min | Build containers |
| Security Scan | SonarQube, Snyk | ~3 min | SAST, dependency scan |
| Deploy Dev | GitHub Actions | ~2 min | Auto-deploy to dev |
| Smoke Tests | Playwright | ~2 min | Critical-path tests |
| Deploy Staging | GitHub Actions | ~2 min | Manual trigger |
| E2E Tests | Cypress | ~10 min | Full end-to-end suite |
| Deploy Prod | GitHub Actions | ~5 min | Blue-green deployment |
| Performance | k6, Artillery | ~5 min | Load testing |
2.3 GitHub Actions Workflow
```yaml
# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - run: npm run lint
      - run: npm run test:unit

  build:
    needs: lint-and-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker image
        run: docker build -t entheory-app:${{ github.sha }} .
      - name: Security scan
        uses: snyk/actions/docker@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}  # required by the Snyk action
        with:
          image: entheory-app:${{ github.sha }}

  deploy-staging:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4   # deploy.sh lives in the repo
      - name: Deploy to Staging
        run: ./deploy.sh staging

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Production
        run: ./deploy.sh production
```
3. Infrastructure
3.1 Tech Stack
| Component | Technology | Purpose |
| --- | --- | --- |
| Containerization | Docker | Application packaging |
| Orchestration | Kubernetes / Docker Compose | Container management |
| Load Balancer | NGINX / AWS ALB | Traffic distribution |
| Database | PostgreSQL 15 | Primary data store |
| Cache | Redis 7 | Session + query cache |
| Queue | Kafka / NATS / RabbitMQ | Message queuing |
| Object Storage | S3 / MinIO | PDFs, audio, DICOM |
| Monitoring | Prometheus + Grafana | Metrics and dashboards |
| Logging | Loki / ELK Stack | Centralized logs |
| Secrets | HashiCorp Vault / AWS Secrets Manager | Credential management |
3.2 Deployment Options
| Option | Infrastructure | Use Case | Management |
| --- | --- | --- | --- |
| On-Premises | Hospital data center VMs | Data residency requirements | Hospital IT + Entheory |
| Private Cloud | AWS/Azure VPC with VPN | Hybrid model | Entheory managed |
| SaaS | Shared multi-tenant | Small clinics | Fully managed |
3.3 Kubernetes Architecture (Cloud)
```mermaid
flowchart TB
    subgraph Ingress["Ingress"]
        ALB["AWS ALB"]
        WAF["WAF"]
    end
    subgraph K8s["Kubernetes Cluster"]
        subgraph Frontend["Frontend Pods"]
            FE1["Web App"]
            FE2["Web App"]
        end
        subgraph Backend["Backend Pods"]
            API1["API Server"]
            API2["API Server"]
        end
        subgraph Workers["Worker Pods"]
            OCR["OCR Worker"]
            ASR["ASR Worker"]
            NLP["NLP Worker"]
        end
    end
    subgraph Data["Data Layer"]
        RDS["RDS PostgreSQL"]
        REDIS["ElastiCache Redis"]
        S3["S3 Bucket"]
        SQS["SQS Queues"]
    end
    ALB --> WAF --> Frontend
    Frontend --> Backend
    Backend --> RDS & REDIS & S3
    Backend --> SQS
    SQS --> Workers
    Workers --> S3 & RDS
    style Ingress fill:#ff9
    style K8s fill:#9cf
    style Data fill:#f9c
```
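A hedged sketch of how one backend tier in the diagram could be declared as a Kubernetes Deployment and Service; names, replica counts, resource limits, and the health endpoint are assumptions:

```yaml
# Hypothetical manifest for the API server pods shown above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 2                        # matches the two API pods in the diagram
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: entheory-app:latest       # placeholder tag
          ports:
            - containerPort: 8080
          resources:
            requests: { cpu: 500m, memory: 512Mi }
            limits: { cpu: "1", memory: 1Gi }
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }  # health path is an assumption
---
apiVersion: v1
kind: Service
metadata:
  name: api-server
spec:
  selector:
    app: api-server
  ports:
    - port: 80
      targetPort: 8080
```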
4. Monitoring & Observability
4.1 Metrics (Prometheus + Grafana)
| Category | Metrics | Alert Threshold |
| --- | --- | --- |
| API | Request rate, latency (p50, p95, p99), error rate | p99 > 2 s, errors > 1% |
| Queue | Depth, processing rate, DLQ size | Depth > 1,000, DLQ > 10 |
| Database | Connections, query time, replication lag | Lag > 10 s |
| Infrastructure | CPU, memory, disk, network | CPU > 80%, disk > 85% |
| Business | Patients processed, OCR accuracy | Accuracy < 85% |
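The API thresholds in the table translate directly into Prometheus alerting rules. A minimal sketch, assuming conventional metric names (`http_request_duration_seconds`, `http_requests_total`) rather than confirmed instrumentation:

```yaml
# Hypothetical Prometheus rules for the p99 > 2s and error rate > 1% thresholds.
groups:
  - name: api-alerts
    rules:
      - alert: ApiP99LatencyHigh
        # p99 over the last 5 minutes, computed from histogram buckets
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: P2
        annotations:
          summary: "API p99 latency above 2s"
      - alert: ApiErrorRateHigh
        # share of 5xx responses over all responses
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: P1
        annotations:
          summary: "API 5xx error rate above 1%"
```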
4.2 Dashboards
| Dashboard | Audience | Key Metrics |
| --- | --- | --- |
| System Health | SRE/DevOps | CPU, memory, error rates, uptime |
| API Performance | Backend team | Latency, throughput, slow endpoints |
| Pipeline Status | Data team | Queue depths, processing times, failures |
| Business Metrics | Product/Leadership | Active users, patients processed, data coverage |
4.3 Alerting
```mermaid
flowchart LR
    subgraph Sources["Alert Sources"]
        PROM["Prometheus"]
        LOGS["Log Alerts"]
        SYNTH["Synthetic Monitors"]
    end
    subgraph Rules["Alert Manager"]
        ROUTE["Routing Rules"]
        DEDUP["Deduplication"]
        SILENCE["Silence Rules"]
    end
    subgraph Channels["Notification Channels"]
        PD["PagerDuty"]
        SLACK["Slack"]
        EMAIL["Email"]
    end
    subgraph Responders["Responders"]
        ONCALL["On-Call Engineer"]
        TEAM["Team Channel"]
        MGMT["Management"]
    end
    Sources --> Rules --> Channels --> Responders
```
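The routing and deduplication boxes above correspond to Alertmanager configuration. A minimal sketch; receiver names, grouping keys, and the notification address are illustrative assumptions:

```yaml
# Hypothetical Alertmanager routing matching the flow above.
route:
  group_by: [alertname, service]   # groups repeats of the same alert (deduplication)
  receiver: slack-team             # default channel for unmatched severities (P2)
  routes:
    - match: { severity: P0 }
      receiver: pagerduty-oncall
    - match: { severity: P1 }
      receiver: pagerduty-oncall
    - match: { severity: P3 }
      receiver: email-team
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: <redacted>
  - name: slack-team
    slack_configs:
      - channel: "#incidents"
  - name: email-team
    email_configs:
      - to: devops@entheory.ai     # illustrative address
```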
4.4 Alert Severity
| Severity | Response | Notification | Examples |
| --- | --- | --- | --- |
| P0 Critical | 5 min | PagerDuty + phone | Site down, data breach |
| P1 High | 15 min | PagerDuty | API errors > 5%, DB down |
| P2 Medium | 1 hour | Slack | High latency, queue backlog |
| P3 Low | Next day | Email | Disk warning, minor errors |
5. Logging
5.1 Log Aggregation
| Component | Logs | Retention |
| --- | --- | --- |
| Application | JSON structured logs | 30 days hot, 1 year archive |
| Access Logs | NGINX/ALB access logs | 90 days |
| Audit Logs | Security and compliance | 7 years (immutable) |
| System Logs | OS and container logs | 14 days |
5.2 Log Format
```json
{
  "timestamp": "2024-12-09T10:30:00.123Z",
  "level": "INFO",
  "service": "api-server",
  "traceId": "abc123xyz",
  "spanId": "span456",
  "userId": "dr_aditi_001",
  "message": "Patient record accessed",
  "context": {
    "patientId": "ABHA-12345",
    "action": "VIEW",
    "responseTime": 145
  }
}
```
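These JSON logs are easiest to query once key fields are promoted to labels at ingest. A minimal sketch for the Loki option listed in §3.1, using Promtail's JSON pipeline; file paths and label choices are assumptions:

```yaml
# Hypothetical Promtail scrape config for the log format above.
scrape_configs:
  - job_name: api-server
    static_configs:
      - targets: [localhost]
        labels:
          job: api-server
          __path__: /var/log/entheory/*.log   # assumed log location
    pipeline_stages:
      - json:                                 # parse fields out of the JSON body
          expressions:
            level: level
            service: service
            traceId: traceId
      - labels:                               # index level and service as Loki labels
          level:
          service:
```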
5.3 Log Queries (Common)
| Query | Purpose |
| --- | --- |
| `level:ERROR service:api-server` | API errors |
| `action:VIEW userId:* \| stats by userId` | Access patterns |
| `responseTime:>1000` | Slow requests |
| `traceId:abc123xyz` | Trace a request |
6. SRE Practices
6.1 SLIs, SLOs, and SLAs
| Service | SLI | SLO | SLA |
| --- | --- | --- | --- |
| API Availability | Uptime percentage | 99.9% monthly | 99.5% |
| API Latency | p99 response time | < 500 ms | < 2 s |
| Data Ingestion | HL7 message processing time | < 5 min | < 15 min |
| OCR Processing | Document processing time | < 60 s | < 5 min |
| Data Durability | Data loss incidents | 0 | 0 |
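One way to police the 99.9% availability SLO is a burn-rate alert that fires when the error budget is being consumed too quickly. A sketch, again assuming the `http_requests_total` metric naming from §4.1:

```yaml
# Hypothetical availability SLI recording rule plus a fast-burn alert.
groups:
  - name: slo-availability
    rules:
      - record: sli:api_availability:ratio_rate5m
        # fraction of non-5xx responses over the last 5 minutes
        expr: 1 - (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])))
      - alert: ErrorBudgetFastBurn
        # 14.4x burn rate exhausts a 30-day 0.1% budget in ~2 days
        expr: (1 - sli:api_availability:ratio_rate5m) > 14.4 * 0.001
        for: 5m
        labels:
          severity: P1
```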
6.2 Error Budget
Error Budget = 1 − SLO = 1 − 0.999 = 0.1%. Over a 30-day month (43,200 minutes), that allows ≈ 43.2 minutes of downtime.
| Month | Downtime | Budget Used | Remaining |
| --- | --- | --- | --- |
| Oct 2024 | 5 min | 11.6% | 88.4% |
| Nov 2024 | 0 min | 0% | 100% |
| Dec 2024 | 2 min | 4.6% | 95.4% |
6.3 Incident Management
See: Incident Response Playbook
| Step | Owner | Duration |
| --- | --- | --- |
| Detection | Monitoring | Automated |
| Triage | On-call | 5 min |
| Escalation | On-call | If needed |
| Mitigation | Incident Commander | ASAP |
| Resolution | Engineering | Variable |
| Post-Mortem | SRE Lead | Within 48 h |
7. Disaster Recovery
7.1 RPO/RTO Targets
| Tier | RPO (Data Loss) | RTO (Downtime) | Scope |
| --- | --- | --- | --- |
| Tier 1 (Critical) | 5 min | 1 hour | Patient data, auth |
| Tier 2 (Important) | 1 hour | 4 hours | Processing queues |
| Tier 3 (Standard) | 24 hours | 24 hours | Logs, analytics |
7.2 Backup Strategy
| Data | Method | Frequency | Retention |
| --- | --- | --- | --- |
| PostgreSQL | pg_dump + WAL archiving | Continuous WAL, daily full | 30 days |
| Object Storage | S3 versioning + cross-region replication | Real-time | 90 days |
| Configuration | Git + encrypted secrets backup | On change | Forever |
| Kubernetes State | Velero snapshots | Daily | 14 days |
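The Kubernetes-state row could be implemented with a Velero Schedule resource. A sketch matching the daily/14-day parameters in the table; the namespace selection is an assumption:

```yaml
# Hypothetical Velero Schedule: daily cluster snapshots, 14-day retention.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"          # daily at 02:00
  template:
    includedNamespaces: ["*"]    # assumed: back up all namespaces
    ttl: 336h                    # 14 days
```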
7.3 DR Drill Schedule
| Drill Type | Frequency | Last Run | Next Run |
| --- | --- | --- | --- |
| Tabletop | Quarterly | Nov 2024 | Feb 2025 |
| Backup Restore | Monthly | Dec 2024 | Jan 2025 |
| Full Failover | Annually | Jun 2024 | Jun 2025 |
See: Disaster Recovery Playbook
8. On-Call Rotation
8.1 Schedule
| Role | Hours | Rotation |
| --- | --- | --- |
| Primary On-Call | 24/7 | Weekly |
| Secondary On-Call | Escalation backup | Weekly |
| Incident Commander | Major incidents only | As needed |
8.2 Escalation Path
Alert → Primary On-Call (5 min) → Secondary (10 min) → Engineering Manager (15 min) → CTO (30 min)
8.3 Incident Tooling
| Tool | Purpose |
| --- | --- |
| PagerDuty | Alert routing and escalation |
| Slack #incidents | Incident coordination |
| Zoom/Meet | War room for major incidents |
| Runbook Links | Quick access to playbooks |
9. Security in DevOps
9.1 DevSecOps Practices
| Stage | Security Control |
| --- | --- |
| Code | Pre-commit hooks, secret scanning |
| Build | SAST (SonarQube), dependency scanning (Snyk) |
| Test | DAST (ZAP), penetration testing |
| Deploy | Image signing, vulnerability scanning |
| Runtime | WAF, runtime protection, anomaly detection |
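The Code-stage controls are commonly wired up through a pre-commit configuration. A sketch assuming gitleaks for secret scanning (the table does not name a specific tool):

```yaml
# Hypothetical .pre-commit-config.yaml for the "Code" stage controls.
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4
    hooks:
      - id: gitleaks        # blocks commits that contain likely secrets
```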
9.2 Container Security
| Control | Implementation |
| --- | --- |
| Base Images | Distroless / Alpine (minimal attack surface) |
| Vulnerability Scanning | Trivy in CI, ECR scanning |
| Image Signing | Cosign / Notary |
| Runtime Security | Falco for runtime monitoring |
| Network Policies | Kubernetes NetworkPolicy isolation |
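The NetworkPolicy row might look like the sketch below, which restricts ingress to the worker pods to traffic from the API tier; the pod labels are assumptions based on the §3.3 diagram:

```yaml
# Hypothetical NetworkPolicy isolating worker pods from everything but the API tier.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: workers-ingress-from-api
spec:
  podSelector:
    matchLabels:
      tier: worker           # assumed label on OCR/ASR/NLP pods
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-server   # assumed label on API pods
```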
Document Owner: DevOps/SRE Team
Last Updated: 2024-12-09
Next Review: Quarterly