
DevOps & SRE

Document Purpose: This document outlines the DevOps and site reliability engineering (SRE) practices for the Entheory.AI platform, covering environments, CI/CD pipelines, infrastructure, monitoring, logging, disaster recovery, and on-call operations.


Executive Summary

Entheory.AI follows modern DevOps and SRE practices to ensure reliable, secure, and rapid software delivery. Our infrastructure supports both on-premises hospital deployments and cloud-hosted environments.

Related Documentation:

- Operations Use Cases – Operational workflows
- Ops & Knowledge Overview – Operational capabilities
- Playbooks – Runbooks for common scenarios


1. Environment Architecture

1.1 Environment Overview

```mermaid
flowchart LR
    subgraph Dev["Development"]
        DEV_APP["App Server"]
        DEV_DB["Dev DB"]
        DEV_Q["Queue"]
    end

    subgraph Staging["Staging"]
        STG_APP["App Server"]
        STG_DB["Staging DB"]
        STG_Q["Queue"]
    end

    subgraph Prod["Production"]
        PROD_LB["Load Balancer"]
        PROD_APP1["App Server 1"]
        PROD_APP2["App Server 2"]
        PROD_DB["Primary DB"]
        PROD_DB_R["Replica DB"]
        PROD_Q["Queue Cluster"]
    end

    Dev -->|Promote| Staging -->|Promote| Prod

    style Dev fill:#ccffcc
    style Staging fill:#ffffcc
    style Prod fill:#ffcccc
```

1.2 Environment Details

| Environment | Purpose | URL | Data | Access |
|---|---|---|---|---|
| Development | Feature development, testing | dev.entheory.local | Synthetic/anonymized | All engineers |
| Staging | Pre-production testing, UAT | staging.entheory.ai | Subset of anonymized prod | QA + select engineers |
| Production | Live patient data processing | app.entheory.ai | Real PHI | Authorized only |
| DR Site | Disaster recovery (warm standby) | dr.entheory.ai | Replicated from prod | Emergency only |

1.3 Environment Parity

| Aspect | Dev | Staging | Prod |
|---|---|---|---|
| Infrastructure | Single-node Docker (sketched below) | Multi-node, similar to prod | Full HA cluster |
| Data | Synthetic (100 patients) | Anonymized (1,000 patients) | Real (10,000+ patients) |
| Integrations | Mock HL7/FHIR | Test endpoints | Live hospital systems |
| SSL | Self-signed | Let's Encrypt | AWS ACM |
| Monitoring | Local logs | Full monitoring | Full monitoring + alerting |
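
A minimal docker-compose sketch of the single-node dev stack from the parity table. Service names, image tags, and the queue choice (RabbitMQ, one of the options in §3.1) are illustrative assumptions, not the actual dev configuration:

```yaml
# docker-compose.dev.yml — hypothetical single-node dev stack
services:
  app:
    image: entheory-app:dev            # assumed local build tag
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://dev:dev@db:5432/entheory
      QUEUE_URL: amqp://queue:5672
    depends_on: [db, queue]
  db:
    image: postgres:15                 # matches the tech stack table (§3.1)
    environment:
      POSTGRES_USER: dev
      POSTGRES_PASSWORD: dev
      POSTGRES_DB: entheory
  queue:
    image: rabbitmq:3-management       # assumed queue choice for dev
```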

2. CI/CD Pipeline

2.1 Pipeline Overview

```mermaid
flowchart LR
    subgraph Source["Source"]
        GIT["GitHub"]
    end

    subgraph Build["Build & Test"]
        LINT["Lint & Format"]
        UNIT["Unit Tests"]
        BUILD["Docker Build"]
        SAST["Security Scan"]
    end

    subgraph Deploy["Deploy"]
        DEV_D["Deploy Dev"]
        STG_D["Deploy Staging"]
        PROD_D["Deploy Prod"]
    end

    subgraph Verify["Verify"]
        SMOKE["Smoke Tests"]
        E2E["E2E Tests"]
        PERF["Performance"]
    end

    GIT --> LINT --> UNIT --> BUILD --> SAST
    SAST --> DEV_D --> SMOKE
    SMOKE -->|Manual Approval| STG_D --> E2E
    E2E -->|Manual Approval| PROD_D --> PERF

    style Source fill:#9cf
    style Build fill:#fc9
    style Deploy fill:#9fc
    style Verify fill:#f9c
```

2.2 Pipeline Stages

| Stage | Tools | Duration | Actions |
|---|---|---|---|
| Lint & Format | ESLint, Prettier, Black | ~1 min | Code style checks |
| Unit Tests | Jest, pytest | ~3 min | Unit test execution |
| Build | Docker, npm | ~5 min | Build containers |
| Security Scan | SonarQube, Snyk | ~3 min | SAST, dependency scan |
| Deploy Dev | GitHub Actions | ~2 min | Auto-deploy to dev |
| Smoke Tests | Playwright | ~2 min | Critical path tests |
| Deploy Staging | GitHub Actions | ~2 min | Manual trigger |
| E2E Tests | Cypress | ~10 min | Full end-to-end |
| Deploy Prod | GitHub Actions | ~5 min | Blue-green deployment |
| Performance | k6, Artillery | ~5 min | Load testing |

2.3 GitHub Actions Workflow

```yaml
# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - run: npm run lint
      - run: npm run test:unit

  build:
    needs: lint-and-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker image
        run: docker build -t entheory-app:${{ github.sha }} .
      - name: Security scan
        uses: snyk/actions/docker@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}  # required by the Snyk action
        with:
          image: entheory-app:${{ github.sha }}

  deploy-staging:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: staging  # manual approval gate configured on the environment
    steps:
      - uses: actions/checkout@v4  # needed so deploy.sh is available
      - name: Deploy to Staging
        run: ./deploy.sh staging

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production  # manual approval gate configured on the environment
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Production
        run: ./deploy.sh production
```
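
The production stage is blue-green (per the pipeline stages table). deploy.sh itself is not shown in this document, but on Kubernetes the cutover step often amounts to repointing a Service selector. A hypothetical sketch, with slot labels that are assumptions rather than the actual deploy mechanism:

```yaml
# Hypothetical blue-green cutover: the Service selects whichever slot is live.
# deploy.sh would roll the new image out to the idle slot, run smoke tests,
# then patch this selector (blue -> green) to shift traffic atomically.
apiVersion: v1
kind: Service
metadata:
  name: entheory-app
spec:
  selector:
    app: entheory-app
    slot: blue            # assumed label; flipped to "green" on cutover
  ports:
    - port: 80
      targetPort: 8080
```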

3. Infrastructure

3.1 Tech Stack

| Component | Technology | Purpose |
|---|---|---|
| Containerization | Docker | Application packaging |
| Orchestration | Kubernetes / Docker Compose | Container management |
| Load Balancer | NGINX / AWS ALB | Traffic distribution |
| Database | PostgreSQL 15 | Primary data store |
| Cache | Redis 7 | Session + query cache |
| Queue | Kafka / NATS / RabbitMQ | Message queuing |
| Object Storage | S3 / MinIO | PDFs, audio, DICOM |
| Monitoring | Prometheus + Grafana | Metrics and dashboards |
| Logging | Loki / ELK Stack | Centralized logs |
| Secrets | HashiCorp Vault / AWS Secrets Manager | Credential management |

3.2 Deployment Options

| Option | Infrastructure | Use Case | Management |
|---|---|---|---|
| On-Premises | Hospital data center VMs | Data residency requirements | Hospital IT + Entheory |
| Private Cloud | AWS/Azure VPC with VPN | Hybrid model | Entheory managed |
| SaaS | Shared multi-tenant | Small clinics | Fully managed |

3.3 Kubernetes Architecture (Cloud)

```mermaid
flowchart TB
    subgraph Ingress["Ingress"]
        ALB["AWS ALB"]
        WAF["WAF"]
    end

    subgraph K8s["Kubernetes Cluster"]
        subgraph Frontend["Frontend Pods"]
            FE1["Web App"]
            FE2["Web App"]
        end

        subgraph Backend["Backend Pods"]
            API1["API Server"]
            API2["API Server"]
        end

        subgraph Workers["Worker Pods"]
            OCR["OCR Worker"]
            ASR["ASR Worker"]
            NLP["NLP Worker"]
        end
    end

    subgraph Data["Data Layer"]
        RDS["RDS PostgreSQL"]
        REDIS["ElastiCache Redis"]
        S3["S3 Bucket"]
        SQS["SQS Queues"]
    end

    ALB --> WAF --> Frontend
    Frontend --> Backend
    Backend --> RDS & REDIS & S3
    Backend --> SQS
    SQS --> Workers
    Workers --> S3 & RDS

    style Ingress fill:#ff9
    style K8s fill:#9cf
    style Data fill:#f9c
```
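
As one concrete instance of the worker tier in the diagram, a minimal Deployment sketch for the OCR worker. Pod labels, image name, and the queue wiring are illustrative assumptions:

```yaml
# Hypothetical Deployment for the OCR worker pool shown above
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ocr-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ocr-worker
  template:
    metadata:
      labels:
        app: ocr-worker
    spec:
      containers:
        - name: ocr-worker
          image: entheory-ocr:latest      # assumed image name
          env:
            - name: QUEUE_URL             # SQS queue the worker polls
              valueFrom:
                secretKeyRef:
                  name: worker-config     # assumed secret
                  key: queue-url
          resources:
            requests: { cpu: "500m", memory: "1Gi" }
            limits: { cpu: "2", memory: "4Gi" }
```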

4. Monitoring & Observability

4.1 Metrics (Prometheus + Grafana)

| Category | Metrics | Alert Threshold |
|---|---|---|
| API | Request rate, latency (p50, p95, p99), error rate | p99 > 2s, errors > 1% |
| Queue | Depth, processing rate, DLQ size | Depth > 1000, DLQ > 10 |
| Database | Connections, query time, replication lag | Lag > 10s |
| Infrastructure | CPU, memory, disk, network | CPU > 80%, Disk > 85% |
| Business | Patients processed, OCR accuracy | Accuracy < 85% |
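
The API thresholds in the table translate directly into Prometheus alerting rules. A sketch of the two API alerts, assuming conventional http_request_duration_seconds / http_requests_total metric names that this document does not itself specify (severity labels are illustrative; §4.4 defines the P0–P3 mapping):

```yaml
# prometheus/alerts.yml — sketch of the API alerts from the table above
groups:
  - name: api-alerts
    rules:
      - alert: ApiP99LatencyHigh
        # assumes an http_request_duration_seconds histogram is exported
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API p99 latency above 2s"
      - alert: ApiErrorRateHigh
        # assumes http_requests_total carries a status label
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API 5xx error rate above 1%"
```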

4.2 Dashboards

| Dashboard | Audience | Key Metrics |
|---|---|---|
| System Health | SRE/DevOps | CPU, memory, error rates, uptime |
| API Performance | Backend team | Latency, throughput, slow endpoints |
| Pipeline Status | Data team | Queue depths, processing times, failures |
| Business Metrics | Product/Leadership | Active users, patients processed, data coverage |

4.3 Alerting

```mermaid
flowchart LR
    subgraph Sources["Alert Sources"]
        PROM["Prometheus"]
        LOGS["Log Alerts"]
        SYNTH["Synthetic Monitors"]
    end

    subgraph Rules["Alert Manager"]
        ROUTE["Routing Rules"]
        DEDUP["Deduplication"]
        SILENCE["Silence Rules"]
    end

    subgraph Channels["Notification Channels"]
        PD["PagerDuty"]
        SLACK["Slack"]
        EMAIL["Email"]
    end

    subgraph Responders["Responders"]
        ONCALL["On-Call Engineer"]
        TEAM["Team Channel"]
        MGMT["Management"]
    end

    Sources --> Rules --> Channels --> Responders
```
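
A sketch of how the routing stage above could be expressed in Alertmanager configuration; receiver names, the severity label, and the placeholder credentials are assumptions:

```yaml
# alertmanager.yml — illustrative routing for the flow above
route:
  receiver: slack-team              # default: post to the team channel
  group_by: [alertname, service]    # deduplicate related alerts
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall    # page the on-call engineer
    - matchers:
        - severity="low"
      receiver: email-digest
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"   # placeholder
  - name: slack-team
    slack_configs:
      - channel: "#incidents"
        api_url: "https://hooks.slack.com/services/..."  # placeholder webhook
  - name: email-digest
    email_configs:
      - to: sre@entheory.ai         # hypothetical address; global SMTP config omitted
```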

4.4 Alert Severity

| Severity | Response Time | Notification | Examples |
|---|---|---|---|
| P0 Critical | 5 min | PagerDuty + Phone | Site down, data breach |
| P1 High | 15 min | PagerDuty | API errors > 5%, DB down |
| P2 Medium | 1 hour | Slack | High latency, queue backup |
| P3 Low | Next day | Email | Disk warning, minor errors |

5. Logging

5.1 Log Aggregation

| Component | Logs | Retention |
|---|---|---|
| Application | JSON structured logs | 30 days hot, 1 year archive |
| Access Logs | NGINX/ALB access logs | 90 days |
| Audit Logs | Security and compliance | 7 years (immutable) |
| System Logs | OS and container logs | 14 days |
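
Assuming the Loki option from the tech stack table, retention tiers like these are typically enforced via the compactor. The periods below mirror the table, but the configuration itself is an illustrative sketch (the log_type label is an assumption):

```yaml
# loki-config.yaml (excerpt) — sketch of retention enforcement
compactor:
  retention_enabled: true
  delete_request_store: filesystem    # required when retention is enabled
limits_config:
  retention_period: 720h              # 30 days hot for application logs
  retention_stream:
    - selector: '{log_type="access"}' # assumed stream label
      priority: 1
      period: 2160h                   # 90 days for access logs
# The 7-year immutable audit retention would live in object storage
# (e.g. S3 with object lock), not in Loki itself.
```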

5.2 Log Format (Structured JSON)

```json
{
  "timestamp": "2024-12-09T10:30:00.123Z",
  "level": "INFO",
  "service": "api-server",
  "traceId": "abc123xyz",
  "spanId": "span456",
  "userId": "dr_aditi_001",
  "message": "Patient record accessed",
  "context": {
    "patientId": "ABHA-12345",
    "action": "VIEW",
    "responseTime": 145
  }
}
```

5.3 Log Queries (Common)

| Query | Purpose |
|---|---|
| `level:ERROR service:api-server` | API errors |
| `action:VIEW userId:* \| stats by userId` | Access patterns |
| `responseTime:>1000` | Slow requests |
| `traceId:abc123xyz` | Trace a request |

6. SRE Practices

6.1 SLIs, SLOs, and SLAs

| Service | SLI | SLO | SLA |
|---|---|---|---|
| API Availability | Uptime percentage | 99.9% monthly | 99.5% |
| API Latency | p99 response time | < 500ms | < 2s |
| Data Ingestion | HL7 message processing time | < 5 min | < 15 min |
| OCR Processing | Document processing time | < 60s | < 5 min |
| Data Durability | Data loss incidents | 0 | 0 |

6.2 Error Budget

For the 99.9% monthly availability SLO:

Error Budget = 1 − SLO = 1 − 0.999 = 0.1% = 43.2 min/month of allowed downtime (0.1% of a 30-day month)

| Month | Downtime | Budget Used | Remaining |
|---|---|---|---|
| Oct 2024 | 5 min | 11.6% | 88.4% |
| Nov 2024 | 0 min | 0% | 100% |
| Dec 2024 | 2 min | 4.6% | 95.4% |
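
In practice, alerting on error-budget burn rate catches problems faster than tracking raw downtime. A sketch of a fast-burn Prometheus rule for the 99.9% SLO, reusing the assumed http_requests_total metric from §4.1:

```yaml
# Fast-burn alert: a 14.4x burn rate over 1h consumes ~2% of the 30-day
# budget per hour (the standard multiwindow burn-rate heuristic).
groups:
  - name: slo-burn
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[1h]))
             / sum(rate(http_requests_total[1h]))) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Burning >14x the 99.9% SLO error budget"
```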

6.3 Incident Management

See: Incident Response Playbook

| Step | Owner | Target Timing |
|---|---|---|
| Detection | Monitoring | Automated |
| Triage | On-call | 5 min |
| Escalation | On-call | If needed |
| Mitigation | Incident Commander | ASAP |
| Resolution | Engineering | Variable |
| Post-Mortem | SRE Lead | Within 48h |

7. Disaster Recovery

7.1 RPO/RTO Targets

| Tier | RPO (Data Loss) | RTO (Downtime) | Scope |
|---|---|---|---|
| Tier 1 (Critical) | 5 min | 1 hour | Patient data, auth |
| Tier 2 (Important) | 1 hour | 4 hours | Processing queues |
| Tier 3 (Standard) | 24 hours | 24 hours | Logs, analytics |

7.2 Backup Strategy

| Data | Method | Frequency | Retention |
|---|---|---|---|
| PostgreSQL | pg_dump + WAL archiving | Continuous WAL, daily full | 30 days |
| Object Storage | S3 versioning + cross-region replication | Real-time | 90 days |
| Configuration | Git + encrypted secrets backup | On change | Forever |
| Kubernetes State | Velero snapshots (see sketch below) | Daily | 14 days |
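
The daily Velero snapshots from the table map onto a Velero Schedule resource. A sketch under the table's stated 14-day retention; the backup window and namespace scope are assumptions:

```yaml
# Velero Schedule — daily cluster-state backup, 14-day retention per the table
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"          # 02:00 daily (assumed window)
  template:
    ttl: 336h                    # 14 days
    includedNamespaces: ["*"]    # assumed scope: whole cluster
```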

7.3 DR Drill Schedule

| Drill Type | Frequency | Last Run | Next Run |
|---|---|---|---|
| Tabletop | Quarterly | Nov 2024 | Feb 2025 |
| Backup Restore | Monthly | Dec 2024 | Jan 2025 |
| Full Failover | Annually | Jun 2024 | Jun 2025 |

See: Disaster Recovery Playbook


8. On-Call Rotation

8.1 Schedule

| Role | Hours | Rotation |
|---|---|---|
| Primary On-Call | 24/7 | Weekly |
| Secondary On-Call | Escalation backup | Weekly |
| Incident Commander | Major incidents only | As needed |

8.2 Escalation Path

Alert → Primary On-Call (escalates after 5 min unacknowledged) → Secondary (10 min) → Engineering Manager (15 min) → CTO (30 min)

8.3 On-Call Tools

| Tool | Purpose |
|---|---|
| PagerDuty | Alert routing and escalation |
| Slack #incidents | Incident coordination |
| Zoom/Meet | War room for major incidents |
| Runbook Links | Quick access to playbooks |

9. Security in DevOps

9.1 DevSecOps Practices

| Stage | Security Control |
|---|---|
| Code | Pre-commit hooks, secret scanning (see sketch below) |
| Build | SAST (SonarQube), dependency scanning (Snyk) |
| Test | DAST (ZAP), penetration testing |
| Deploy | Image signing, vulnerability scanning |
| Runtime | WAF, runtime protection, anomaly detection |
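
For the code-stage controls, secret scanning typically runs as a pre-commit hook. A sketch using gitleaks; the document names only the control, so the specific tool choice here is an assumption:

```yaml
# .pre-commit-config.yaml — secret scanning before every commit
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4              # pin to a released version
    hooks:
      - id: gitleaks          # scans staged changes for committed secrets
```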

9.2 Container Security

| Control | Implementation |
|---|---|
| Base Images | Distroless / Alpine (minimal attack surface) |
| Vulnerability Scanning | Trivy in CI, ECR scanning |
| Image Signing | Cosign / Notary |
| Runtime Security | Falco for runtime monitoring |
| Network Policies | Kubernetes NetworkPolicy isolation (see sketch below) |
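
The NetworkPolicy isolation in the last row could look like the following default-deny-plus-allow sketch; the pod labels are assumptions consistent with the earlier worker Deployment example:

```yaml
# Default-deny ingress for worker pods; only the API tier may reach them.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ocr-worker-isolation
spec:
  podSelector:
    matchLabels:
      app: ocr-worker          # assumed label from the Deployment sketch
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-server  # assumed API pod label
```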


Document Owner: DevOps/SRE Team
Last Updated: 2024-12-09
Next Review: Quarterly