DevOps & SRE
Document Purpose: This document outlines the CI/CD pipelines, infrastructure, monitoring, and site reliability engineering (SRE) practices for the Entheory.AI platform.
Executive Summary
Entheory.AI follows modern DevOps and SRE practices to ensure reliable, secure, and rapid software delivery. Our infrastructure supports both on-premises hospital deployments and cloud-hosted environments.
Related Documentation:
- Operations Use Cases – Operational workflows
- Ops & Knowledge Overview – Operational capabilities
- Playbooks – Runbooks for common scenarios
1. Environment Architecture
1.1 Environment Overview
```mermaid
flowchart LR
    subgraph Dev["Development"]
        DEV_APP["App Server"]
        DEV_DB["Dev DB"]
        DEV_Q["Queue"]
    end
    subgraph Staging["Staging"]
        STG_APP["App Server"]
        STG_DB["Staging DB"]
        STG_Q["Queue"]
    end
    subgraph Prod["Production"]
        PROD_LB["Load Balancer"]
        PROD_APP1["App Server 1"]
        PROD_APP2["App Server 2"]
        PROD_DB["Primary DB"]
        PROD_DB_R["Replica DB"]
        PROD_Q["Queue Cluster"]
    end
    Dev -->|Promote| Staging -->|Promote| Prod
    style Dev fill:#ccffcc
    style Staging fill:#ffffcc
    style Prod fill:#ffcccc
```
1.2 Environment Details
| Environment | Purpose | URL | Data | Access |
| --- | --- | --- | --- | --- |
| Development | Feature development, testing | dev.entheory.local | Synthetic/anonymized | All engineers |
| Staging | Pre-production testing, UAT | staging.entheory.ai | Subset of anonymized prod | QA + select engineers |
| Production | Live patient data processing | app.entheory.ai | Real PHI | Authorized only |
| DR Site | Disaster recovery (warm standby) | dr.entheory.ai | Replicated from prod | Emergency only |
1.3 Environment Parity
| Aspect | Dev | Staging | Prod |
| --- | --- | --- | --- |
| Infrastructure | Single-node Docker | Multi-node, similar to prod | Full HA cluster |
| Data | Synthetic (100 patients) | Anonymized (1,000 patients) | Real (10,000+ patients) |
| Integrations | Mock HL7/FHIR | Test endpoints | Live hospital systems |
| SSL | Self-signed | Let's Encrypt | AWS ACM |
| Monitoring | Local logs | Full monitoring | Full monitoring + alerting |
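For reference, the dev tier's single-node Docker setup might look like the docker-compose sketch below; service names, credentials, and the choice of RabbitMQ among the queue options in §3.1 are illustrative assumptions, not the actual configuration:

```yaml
# docker-compose.yml — hypothetical single-node dev stack
services:
  app:
    build: .
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://entheory:entheory@db:5432/entheory  # dev-only credentials
      REDIS_URL: redis://cache:6379
    depends_on: [db, cache, queue]
  db:
    image: postgres:15          # matches the version in §3.1
    environment:
      POSTGRES_USER: entheory
      POSTGRES_PASSWORD: entheory
      POSTGRES_DB: entheory
  cache:
    image: redis:7
  queue:
    image: rabbitmq:3-management  # §3.1 lists Kafka/NATS/RabbitMQ; RabbitMQ chosen here for simplicity
  objectstore:
    image: minio/minio            # S3-compatible store for dev
    command: server /data
```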
2. CI/CD Pipeline
2.1 Pipeline Overview
```mermaid
flowchart LR
    subgraph Source["Source"]
        GIT["GitHub"]
    end
    subgraph Build["Build & Test"]
        LINT["Lint & Format"]
        UNIT["Unit Tests"]
        BUILD["Docker Build"]
        SAST["Security Scan"]
    end
    subgraph Deploy["Deploy"]
        DEV_D["Deploy Dev"]
        STG_D["Deploy Staging"]
        PROD_D["Deploy Prod"]
    end
    subgraph Verify["Verify"]
        SMOKE["Smoke Tests"]
        E2E["E2E Tests"]
        PERF["Performance"]
    end
    GIT --> LINT --> UNIT --> BUILD --> SAST
    SAST --> DEV_D --> SMOKE
    SMOKE -->|Manual Approval| STG_D --> E2E
    E2E -->|Manual Approval| PROD_D --> PERF
    style Source fill:#9cf
    style Build fill:#fc9
    style Deploy fill:#9fc
    style Verify fill:#f9c
```
2.2 Pipeline Stages
| Stage | Tools | Duration | Actions |
| --- | --- | --- | --- |
| Lint & Format | ESLint, Prettier, Black | ~1 min | Code style checks |
| Unit Tests | Jest, pytest | ~3 min | Unit test execution |
| Build | Docker, npm | ~5 min | Build containers |
| Security Scan | SonarQube, Snyk | ~3 min | SAST, dependency scan |
| Deploy Dev | GitHub Actions | ~2 min | Auto-deploy to dev |
| Smoke Tests | Playwright | ~2 min | Critical-path tests |
| Deploy Staging | GitHub Actions | ~2 min | Manual trigger |
| E2E Tests | Cypress | ~10 min | Full end-to-end suite |
| Deploy Prod | GitHub Actions | ~5 min | Blue-green deployment |
| Performance | k6, Artillery | ~5 min | Load testing |
2.3 GitHub Actions Workflow
```yaml
# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - run: npm run lint
      - run: npm run test:unit

  build:
    needs: lint-and-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker image
        run: docker build -t entheory-app:${{ github.sha }} .
      - name: Security scan
        uses: snyk/actions/docker@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}  # required by the Snyk action
        with:
          image: entheory-app:${{ github.sha }}

  deploy-staging:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4   # deploy.sh lives in the repo
      - name: Deploy to Staging
        run: ./deploy.sh staging

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Production
        run: ./deploy.sh production
```
3. Infrastructure
3.1 Tech Stack
| Component | Technology | Purpose |
| --- | --- | --- |
| Containerization | Docker | Application packaging |
| Orchestration | Kubernetes / Docker Compose | Container management |
| Load Balancer | NGINX / AWS ALB | Traffic distribution |
| Database | PostgreSQL 15 | Primary data store |
| Cache | Redis 7 | Session + query cache |
| Queue | Kafka / NATS / RabbitMQ | Message queuing |
| Object Storage | S3 / MinIO | PDFs, audio, DICOM |
| Monitoring | Prometheus + Grafana | Metrics and dashboards |
| Logging | Loki / ELK Stack | Centralized logs |
| Secrets | HashiCorp Vault / AWS Secrets Manager | Credential management |
3.2 Deployment Options
| Option | Infrastructure | Use Case | Management |
| --- | --- | --- | --- |
| On-Premises | Hospital data center VMs | Data residency requirements | Hospital IT + Entheory |
| Private Cloud | AWS/Azure VPC with VPN | Hybrid model | Entheory managed |
| SaaS | Shared multi-tenant | Small clinics | Fully managed |
3.3 Kubernetes Architecture (Cloud)
```mermaid
flowchart TB
    subgraph Ingress["Ingress"]
        ALB["AWS ALB"]
        WAF["WAF"]
    end
    subgraph K8s["Kubernetes Cluster"]
        subgraph Frontend["Frontend Pods"]
            FE1["Web App"]
            FE2["Web App"]
        end
        subgraph Backend["Backend Pods"]
            API1["API Server"]
            API2["API Server"]
        end
        subgraph Workers["Worker Pods"]
            OCR["OCR Worker"]
            ASR["ASR Worker"]
            NLP["NLP Worker"]
        end
    end
    subgraph Data["Data Layer"]
        RDS["RDS PostgreSQL"]
        REDIS["ElastiCache Redis"]
        S3["S3 Bucket"]
        SQS["SQS Queues"]
    end
    ALB --> WAF --> Frontend
    Frontend --> Backend
    Backend --> RDS & REDIS & S3
    Backend --> SQS
    SQS --> Workers
    Workers --> S3 & RDS
    style Ingress fill:#ff9
    style K8s fill:#9cf
    style Data fill:#f9c
```
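A hedged sketch of how one backend tier in the diagram could be declared as a Kubernetes Deployment and Service; names, replica counts, resource limits, and the health endpoint are assumptions:

```yaml
# Hypothetical manifest for the API server pods shown above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 2                        # matches the two API pods in the diagram
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: entheory-app:latest       # placeholder tag
          ports:
            - containerPort: 8080
          resources:
            requests: { cpu: 500m, memory: 512Mi }
            limits: { cpu: "1", memory: 1Gi }
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }  # health path is an assumption
---
apiVersion: v1
kind: Service
metadata:
  name: api-server
spec:
  selector:
    app: api-server
  ports:
    - port: 80
      targetPort: 8080
```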
4. Monitoring & Observability
4.1 Metrics (Prometheus + Grafana)
| Category | Metrics | Alert Threshold |
| --- | --- | --- |
| API | Request rate, latency (p50, p95, p99), error rate | p99 > 2 s, errors > 1% |
| Queue | Depth, processing rate, DLQ size | Depth > 1,000, DLQ > 10 |
| Database | Connections, query time, replication lag | Lag > 10 s |
| Infrastructure | CPU, memory, disk, network | CPU > 80%, disk > 85% |
| Business | Patients processed, OCR accuracy | Accuracy < 85% |
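The API thresholds in the table translate directly into Prometheus alerting rules. A minimal sketch, assuming conventional metric names (`http_request_duration_seconds`, `http_requests_total`) rather than confirmed instrumentation:

```yaml
# Hypothetical Prometheus rules for the p99 > 2s and error rate > 1% thresholds.
groups:
  - name: api-alerts
    rules:
      - alert: ApiP99LatencyHigh
        # p99 over the last 5 minutes, computed from histogram buckets
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: P2
        annotations:
          summary: "API p99 latency above 2s"
      - alert: ApiErrorRateHigh
        # share of 5xx responses over all responses
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: P1
        annotations:
          summary: "API 5xx error rate above 1%"
```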
4.2 Dashboards
| Dashboard | Audience | Key Metrics |
| --- | --- | --- |
| System Health | SRE/DevOps | CPU, memory, error rates, uptime |
| API Performance | Backend team | Latency, throughput, slow endpoints |
| Pipeline Status | Data team | Queue depths, processing times, failures |
| Business Metrics | Product/Leadership | Active users, patients processed, data coverage |
4.3 Alerting
```mermaid
flowchart LR
    subgraph Sources["Alert Sources"]
        PROM["Prometheus"]
        LOGS["Log Alerts"]
        SYNTH["Synthetic Monitors"]
    end
    subgraph Rules["Alert Manager"]
        ROUTE["Routing Rules"]
        DEDUP["Deduplication"]
        SILENCE["Silence Rules"]
    end
    subgraph Channels["Notification Channels"]
        PD["PagerDuty"]
        SLACK["Slack"]
        EMAIL["Email"]
    end
    subgraph Responders["Responders"]
        ONCALL["On-Call Engineer"]
        TEAM["Team Channel"]
        MGMT["Management"]
    end
    Sources --> Rules --> Channels --> Responders
```
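The routing and deduplication boxes above correspond to Alertmanager configuration. A minimal sketch; receiver names, grouping keys, and the notification address are illustrative assumptions:

```yaml
# Hypothetical Alertmanager routing matching the flow above.
route:
  group_by: [alertname, service]   # groups repeats of the same alert (deduplication)
  receiver: slack-team             # default channel for unmatched severities (P2)
  routes:
    - match: { severity: P0 }
      receiver: pagerduty-oncall
    - match: { severity: P1 }
      receiver: pagerduty-oncall
    - match: { severity: P3 }
      receiver: email-team
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: <redacted>
  - name: slack-team
    slack_configs:
      - channel: "#incidents"
  - name: email-team
    email_configs:
      - to: devops@entheory.ai     # illustrative address
```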
4.4 Alert Severity
| Severity | Response | Notification | Examples |
| --- | --- | --- | --- |
| P0 Critical | 5 min | PagerDuty + phone | Site down, data breach |
| P1 High | 15 min | PagerDuty | API errors > 5%, DB down |
| P2 Medium | 1 hour | Slack | High latency, queue backlog |
| P3 Low | Next day | Email | Disk warning, minor errors |
5. Logging
5.1 Log Aggregation
| Component | Logs | Retention |
| --- | --- | --- |
| Application | JSON structured logs | 30 days hot, 1 year archive |
| Access Logs | NGINX/ALB access logs | 90 days |
| Audit Logs | Security and compliance | 7 years (immutable) |
| System Logs | OS and container logs | 14 days |
5.2 Log Format
```json
{
  "timestamp": "2024-12-09T10:30:00.123Z",
  "level": "INFO",
  "service": "api-server",
  "traceId": "abc123xyz",
  "spanId": "span456",
  "userId": "dr_aditi_001",
  "message": "Patient record accessed",
  "context": {
    "patientId": "ABHA-12345",
    "action": "VIEW",
    "responseTime": 145
  }
}
```
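These JSON logs are easiest to query once key fields are promoted to labels at ingest. A minimal sketch for the Loki option listed in §3.1, using Promtail's JSON pipeline; file paths and label choices are assumptions:

```yaml
# Hypothetical Promtail scrape config for the log format above.
scrape_configs:
  - job_name: api-server
    static_configs:
      - targets: [localhost]
        labels:
          job: api-server
          __path__: /var/log/entheory/*.log   # assumed log location
    pipeline_stages:
      - json:                                 # parse fields out of the JSON body
          expressions:
            level: level
            service: service
            traceId: traceId
      - labels:                               # index level and service as Loki labels
          level:
          service:
```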
5.3 Log Queries (Common)
| Query | Purpose |
| --- | --- |
| `level:ERROR service:api-server` | API errors |
| `action:VIEW userId:* \| stats by userId` | Access patterns |
| `responseTime:>1000` | Slow requests |
| `traceId:abc123xyz` | Trace a request |
6. SRE Practices
6.1 SLIs, SLOs, and SLAs
| Service | SLI | SLO | SLA |
| --- | --- | --- | --- |
| API Availability | Uptime percentage | 99.9% monthly | 99.5% |
| API Latency | p99 response time | < 500 ms | < 2 s |
| Data Ingestion | HL7 message processing time | < 5 min | < 15 min |
| OCR Processing | Document processing time | < 60 s | < 5 min |
| Data Durability | Data loss incidents | 0 | 0 |
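One way to police the 99.9% availability SLO is a burn-rate alert that fires when the error budget is being consumed too quickly. A sketch, again assuming the `http_requests_total` metric naming from §4.1:

```yaml
# Hypothetical availability SLI recording rule plus a fast-burn alert.
groups:
  - name: slo-availability
    rules:
      - record: sli:api_availability:ratio_rate5m
        # fraction of non-5xx responses over the last 5 minutes
        expr: 1 - (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])))
      - alert: ErrorBudgetFastBurn
        # 14.4x burn rate exhausts a 30-day 0.1% budget in ~2 days
        expr: (1 - sli:api_availability:ratio_rate5m) > 14.4 * 0.001
        for: 5m
        labels:
          severity: P1
```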
6.2 Error Budget
Error Budget = 1 − SLO = 1 − 0.999 = 0.1%. Over a 30-day month (43,200 minutes), that allows ≈ 43.2 minutes of downtime.
| Month | Downtime | Budget Used | Remaining |
| --- | --- | --- | --- |
| Oct 2024 | 5 min | 11.6% | 88.4% |
| Nov 2024 | 0 min | 0% | 100% |
| Dec 2024 | 2 min | 4.6% | 95.4% |
6.3 Incident Management
See: Incident Response Playbook
| Step | Owner | Duration |
| --- | --- | --- |
| Detection | Monitoring | Automated |
| Triage | On-call | 5 min |
| Escalation | On-call | If needed |
| Mitigation | Incident Commander | ASAP |
| Resolution | Engineering | Variable |
| Post-Mortem | SRE Lead | Within 48 h |
7. Disaster Recovery
7.1 RPO/RTO Targets
| Tier | RPO (Data Loss) | RTO (Downtime) | Scope |
| --- | --- | --- | --- |
| Tier 1 (Critical) | 5 min | 1 hour | Patient data, auth |
| Tier 2 (Important) | 1 hour | 4 hours | Processing queues |
| Tier 3 (Standard) | 24 hours | 24 hours | Logs, analytics |
7.2 Backup Strategy
| Data | Method | Frequency | Retention |
| --- | --- | --- | --- |
| PostgreSQL | pg_dump + WAL archiving | Continuous WAL, daily full | 30 days |
| Object Storage | S3 versioning + cross-region replication | Real-time | 90 days |
| Configuration | Git + encrypted secrets backup | On change | Forever |
| Kubernetes State | Velero snapshots | Daily | 14 days |
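The Kubernetes-state row could be implemented with a Velero Schedule resource. A sketch matching the daily/14-day parameters in the table; the namespace selection is an assumption:

```yaml
# Hypothetical Velero Schedule: daily cluster snapshots, 14-day retention.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"          # daily at 02:00
  template:
    includedNamespaces: ["*"]    # assumed: back up all namespaces
    ttl: 336h                    # 14 days
```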
7.3 DR Drill Schedule
| Drill Type | Frequency | Last Run | Next Run |
| --- | --- | --- | --- |
| Tabletop | Quarterly | Nov 2024 | Feb 2025 |
| Backup Restore | Monthly | Dec 2024 | Jan 2025 |
| Full Failover | Annually | Jun 2024 | Jun 2025 |
See: Disaster Recovery Playbook
8. On-Call Rotation
8.1 Schedule
| Role | Hours | Rotation |
| --- | --- | --- |
| Primary On-Call | 24/7 | Weekly |
| Secondary On-Call | Escalation backup | Weekly |
| Incident Commander | Major incidents only | As needed |
8.2 Escalation Path
Alert → Primary On-Call (5 min) → Secondary (10 min) → Engineering Manager (15 min) → CTO (30 min)
8.3 Incident Tooling
| Tool | Purpose |
| --- | --- |
| PagerDuty | Alert routing and escalation |
| Slack #incidents | Incident coordination |
| Zoom/Meet | War room for major incidents |
| Runbook Links | Quick access to playbooks |
9. Security in DevOps
9.1 DevSecOps Practices
| Stage | Security Control |
| --- | --- |
| Code | Pre-commit hooks, secret scanning |
| Build | SAST (SonarQube), dependency scanning (Snyk) |
| Test | DAST (ZAP), penetration testing |
| Deploy | Image signing, vulnerability scanning |
| Runtime | WAF, runtime protection, anomaly detection |
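The Code-stage controls are commonly wired up through a pre-commit configuration. A sketch assuming gitleaks for secret scanning (the table does not name a specific tool):

```yaml
# Hypothetical .pre-commit-config.yaml for the "Code" stage controls.
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4
    hooks:
      - id: gitleaks        # blocks commits that contain likely secrets
```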
9.2 Container Security
| Control | Implementation |
| --- | --- |
| Base Images | Distroless / Alpine (minimal attack surface) |
| Vulnerability Scanning | Trivy in CI, ECR scanning |
| Image Signing | Cosign / Notary |
| Runtime Security | Falco for runtime monitoring |
| Network Policies | Kubernetes NetworkPolicy isolation |
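The NetworkPolicy row might look like the sketch below, which restricts ingress to the worker pods to traffic from the API tier; the pod labels are assumptions based on the §3.3 diagram:

```yaml
# Hypothetical NetworkPolicy isolating worker pods from everything but the API tier.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: workers-ingress-from-api
spec:
  podSelector:
    matchLabels:
      tier: worker           # assumed label on OCR/ASR/NLP pods
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-server   # assumed label on API pods
```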
Document Owner: DevOps/SRE Team
Last Updated: 2024-12-09
Next Review: Quarterly