Operations Use Cases (OPS)

UC-OPS-001: Monitor Pipeline Health

Purpose: Give DevOps a dashboard view of pipeline health, with alerting on queue backlog.

Main Success Scenario:

1. Prometheus scrapes /metrics endpoint.
2. Grafana dashboard displays:
   - Queue Depths (HL7, OCR, ASR)
   - API Latency (p95, p99)
   - Error Rates (DLQ counts)
3. If Queue Depth > 100, AlertManager pages On-Call.

Acceptance Criteria:

1. [ ] Metrics updated every 15s
2. [ ] Dashboards accessible to Ops team
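
The paging condition in step 3 can be sketched in a few lines. This is only an illustration of the threshold logic; in a real deployment the check would more likely live in a Prometheus alerting rule routed through the alert manager. The `QueueSample` type and `queues_to_page` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

QUEUE_DEPTH_THRESHOLD = 100  # from step 3: page when depth > 100


@dataclass(frozen=True)
class QueueSample:
    queue: str  # e.g. "HL7", "OCR", "ASR"
    depth: int  # messages currently waiting


def queues_to_page(samples: list[QueueSample],
                   threshold: int = QUEUE_DEPTH_THRESHOLD) -> list[str]:
    """Return the names of queues whose depth is strictly above the threshold."""
    return [s.queue for s in samples if s.depth > threshold]


samples = [QueueSample("HL7", 42), QueueSample("OCR", 100), QueueSample("ASR", 250)]
print(queues_to_page(samples))  # ['ASR']
```

Note that a depth of exactly 100 does not page, because the scenario states "> 100".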


UC-OPS-002: Review Failed Jobs

Purpose: Handle DLQ messages.

Main Success Scenario:

1. User logs into Admin Console.
2. Views "Dead Letter Queue".
3. Selects a failed HL7 message.
4. Views error: "Patient Not Found".
5. Manually maps MRN to ABHA ID.
6. Clicks "Replay".
7. System re-queues message.

Acceptance Criteria:

1. [ ] Allows viewing raw payload of failed jobs
2. [ ] Supports bulk replay
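
A minimal in-memory sketch of the map-then-replay flow follows. It assumes a failed job can be re-queued only once its MRN has been mapped to an ABHA ID. The `FailedJob` and `DeadLetterQueue` classes, their fields, and the sample identifiers are all hypothetical; a real system would sit on top of the actual queue and patient index.

```python
from dataclasses import dataclass, field


@dataclass
class FailedJob:
    job_id: str
    mrn: str          # medical record number carried by the HL7 message
    raw_payload: str  # kept verbatim so operators can inspect it
    error: str        # e.g. "Patient Not Found"


@dataclass
class DeadLetterQueue:
    failed: dict[str, FailedJob] = field(default_factory=dict)
    mrn_to_abha: dict[str, str] = field(default_factory=dict)
    requeued: list[dict] = field(default_factory=list)  # stand-in for the live queue

    def map_patient(self, mrn: str, abha_id: str) -> None:
        """Record the operator's manual MRN -> ABHA ID mapping."""
        self.mrn_to_abha[mrn] = abha_id

    def replay(self, job_ids: list[str]) -> list[str]:
        """Re-queue each job whose MRN is now mapped; return the ids that were skipped."""
        skipped = []
        for job_id in job_ids:
            job = self.failed.get(job_id)
            if job is None or job.mrn not in self.mrn_to_abha:
                skipped.append(job_id)
                continue
            self.requeued.append(
                {"payload": job.raw_payload, "abha_id": self.mrn_to_abha[job.mrn]}
            )
            del self.failed[job_id]
        return skipped


dlq = DeadLetterQueue()
dlq.failed["j1"] = FailedJob("j1", "MRN-1", "raw HL7 text 1", "Patient Not Found")
dlq.failed["j2"] = FailedJob("j2", "MRN-2", "raw HL7 text 2", "Patient Not Found")
dlq.map_patient("MRN-1", "ABHA-0001")
skipped = dlq.replay(["j1", "j2"])  # bulk replay; j2 is still unmapped
```

Accepting a list of job ids covers the bulk-replay criterion, and returning the skipped ids lets the console show which messages still need a mapping.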


UC-OPS-003: Rotate API Keys

Purpose: Cycle partner API credentials without downtime.

| Property | Value |
|----------|-------|
| Actor | Platform Engineer |
| Trigger | Quarterly rotation schedule |
| Priority | P0 |

Main Success Scenario:

1. Engineer requests new key via Admin Console
2. System generates key pair, stores hash in secrets manager
3. Marks key as `staged` and shares via secure channel
4. Partner confirms cutover via `/api/keys/activate`
5. System marks old key as `grace` for 24h then revokes
6. Audit log records rotation with ticket reference

Acceptance Criteria:

1. [ ] Supports overlapping validity windows
2. [ ] Emits alerts if partner uses revoked key
3. [ ] Keys encrypted at rest and masked in UI/logs
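
The key lifecycle can be sketched as a small state machine. This sketch makes several assumptions not stated in the use case: an `active` state between `staged` and `grace`, a single random token rather than a key pair, and SHA-256 as the stored hash. `ApiKey`, `issue_key`, `activate`, and `is_valid` are hypothetical names.

```python
import hashlib
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

GRACE_PERIOD = timedelta(hours=24)  # from step 5


@dataclass
class ApiKey:
    key_hash: str                       # only the hash is stored, never the plaintext
    state: str = "staged"               # staged -> active -> grace -> revoked
    grace_until: Optional[datetime] = None


def hash_key(plaintext: str) -> str:
    return hashlib.sha256(plaintext.encode()).hexdigest()


def issue_key() -> tuple[str, ApiKey]:
    """Generate a new key; the plaintext is returned once for secure hand-off."""
    plaintext = secrets.token_urlsafe(32)
    return plaintext, ApiKey(key_hash=hash_key(plaintext))


def activate(new: ApiKey, old: ApiKey, now: datetime) -> None:
    """Partner confirmed cutover: promote the new key, start the old key's grace window."""
    new.state = "active"
    old.state = "grace"
    old.grace_until = now + GRACE_PERIOD


def is_valid(key: ApiKey, presented: str, now: datetime) -> bool:
    """Accept active keys and grace keys inside their window; revoke expired grace keys."""
    if not secrets.compare_digest(key.key_hash, hash_key(presented)):
        return False
    if key.state == "grace" and now >= key.grace_until:
        key.state = "revoked"
    return key.state in ("active", "grace")


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
old_plain, old = issue_key()
old.state = "active"
new_plain, new = issue_key()
activate(new, old, now)
```

During the 24h grace window both keys validate, which is one way to satisfy the overlapping-validity criterion. A staged key is rejected until the partner activates it.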


UC-OPS-004: Reprocess Historical Backlog

Purpose: Replay stored raw messages through pipelines after fixes.

| Property | Value |
|----------|-------|
| Actor | Data Engineer |
| Trigger | Bug fix requiring backfill |
| Priority | P1 |

Main Success Scenario:

1. Engineer selects date range + pipeline in Admin Console
2. System queries archive (UC-ING-010) for matching payloads
3. Replays messages into selected queue with `replay=true` flag
4. Throttles to configured rate (e.g., 100 msg/min)
5. Tracks replay progress and emits metrics
6. Generates reconciliation report (processed vs failed)

Acceptance Criteria:

1. [ ] Replay jobs idempotent (no duplicate bundles)
2. [ ] Supports pause/resume
3. [ ] Produces audit artifact attachable to RCA
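
A throttled, idempotent replay loop might look like the sketch below. It assumes each archived message carries a unique `id` and that the set of already-replayed ids is persisted somewhere, so a paused job can resume without re-sending. The `replay` function, its parameters, and the sample messages are hypothetical.

```python
import time
from typing import Callable, Iterable


def replay(messages: Iterable[dict],
           publish: Callable[[dict], None],
           already_replayed: set[str],
           rate_per_min: int = 100,
           sleep: Callable[[float], None] = time.sleep) -> dict[str, int]:
    """Replay messages at a fixed rate, skipping ids that were already replayed."""
    interval = 60.0 / rate_per_min  # seconds to wait between messages
    report = {"processed": 0, "skipped": 0, "failed": 0}
    for msg in messages:
        if msg["id"] in already_replayed:
            report["skipped"] += 1  # idempotency: never send the same id twice
            continue
        try:
            publish({**msg, "replay": True})  # tag the message as a replay
        except Exception:
            report["failed"] += 1
        else:
            already_replayed.add(msg["id"])
            report["processed"] += 1
        sleep(interval)  # throttle to the configured rate
    return report


sent = []


def publish(msg: dict) -> None:
    if msg["id"] == "m3":
        raise RuntimeError("downstream rejected message")
    sent.append(msg)


archive = [{"id": "m1"}, {"id": "m2"}, {"id": "m3"}]
seen = {"m2"}  # m2 was replayed before the job was paused
report = replay(archive, publish, seen, rate_per_min=100, sleep=lambda _: None)
```

The returned counts are the raw material for the reconciliation report in step 6. Failed ids are not added to `already_replayed`, so a later run retries them.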


UC-OPS-005: Manage Feature Flags

Purpose: Toggle beta features per tenant safely.

| Property | Value |
|----------|-------|
| Actor | Product Ops |
| Trigger | Launch of new feature |
| Priority | P2 |

Main Success Scenario:

1. Product Ops opens Flags UI
2. Selects feature (e.g., "Document Summaries")
3. Chooses target segment (tenant, user role)
4. Sets rollout percentage (canary 10%)
5. System writes config to flag service (e.g., LaunchDarkly)
6. Observability dashboard tracks adoption + errors

Acceptance Criteria:

1. [ ] Supports instant rollback
2. [ ] Flag evaluations cached client-side < 5 min
3. [ ] Change log includes requester and justification
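
Percentage rollouts are commonly implemented by hashing a stable identifier into a bucket from 0 to 99. The sketch below illustrates that general technique; it is not the algorithm of LaunchDarkly or any other specific flag service, and `bucket`, `is_enabled`, and the tenant ids are hypothetical.

```python
import hashlib
from typing import Optional


def bucket(flag: str, tenant_id: str) -> int:
    """Map a (flag, tenant) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{tenant_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str,
               tenant_id: str,
               rollout_percent: int,
               allowed_tenants: Optional[set[str]] = None) -> bool:
    """Enable the flag for tenants in the target segment whose bucket is below the rollout."""
    if allowed_tenants is not None and tenant_id not in allowed_tenants:
        return False
    return bucket(flag, tenant_id) < rollout_percent


tenants = [f"tenant-{i}" for i in range(1000)]
canary = [t for t in tenants if is_enabled("document-summaries", t, 10)]
print(f"{len(canary)} of {len(tenants)} tenants in the 10% canary")
```

Because the bucket depends only on the flag and tenant, a tenant stays in or out of the canary across evaluations, raising the percentage only adds tenants, and setting it to 0 acts as an instant rollback.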


UC-OPS-006: Run Disaster Recovery Drill

Purpose: Validate RTO/RPO by simulating region failure.

| Property | Value |
|----------|-------|
| Actor | SRE Lead |
| Trigger | Bi-annual drill |
| Priority | P0 |

Main Success Scenario:

1. Initiate failover runbook (disable primary region writes)
2. Restore latest backups to standby region
3. Switch DNS/traffic to standby
4. Execute smoke tests (API, ingestion, UI)
5. Record timelines vs RTO/RPO targets
6. Issue post-drill report with gaps and action items

Acceptance Criteria:

1. [ ] Drill completes within target RTO (<= 2h)
2. [ ] Data loss <= 5 min (RPO)
3. [ ] Findings tracked as Jira tickets with owners
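
Step 5 amounts to comparing two measured intervals against the targets. The sketch below assumes recovery time runs from failover start to passing smoke tests, and that data loss is the gap between the last write captured in the restored backup and the failover start. `DrillTimeline`, `evaluate`, and the sample timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

RTO_TARGET = timedelta(hours=2)    # acceptance criterion 1
RPO_TARGET = timedelta(minutes=5)  # acceptance criterion 2


@dataclass(frozen=True)
class DrillTimeline:
    last_backed_up_write: datetime  # newest write present in the restored backup
    failover_started: datetime      # primary region writes disabled
    smoke_tests_passed: datetime    # standby confirmed healthy


def evaluate(t: DrillTimeline) -> dict:
    """Compute achieved recovery time and data-loss window, and compare to targets."""
    recovery_time = t.smoke_tests_passed - t.failover_started
    data_loss = t.failover_started - t.last_backed_up_write
    return {
        "recovery_time": recovery_time,
        "data_loss": data_loss,
        "rto_met": recovery_time <= RTO_TARGET,
        "rpo_met": data_loss <= RPO_TARGET,
    }


drill = DrillTimeline(
    last_backed_up_write=datetime(2024, 6, 1, 9, 57),
    failover_started=datetime(2024, 6, 1, 10, 0),
    smoke_tests_passed=datetime(2024, 6, 1, 11, 30),
)
result = evaluate(drill)
```

In this made-up drill, recovery takes 1h30m and the data-loss window is 3 minutes, so both targets are met.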


UC-OPS-301: Job Queue & Worker Orchestration

Purpose: Manage the end-to-end pipeline execution from audio capture to EMR push.

| Property | Value |
|----------|-------|
| Actor | Orchestrator Service |
| Trigger | New Audio Upload |
| Priority | P0 |

Main Success Scenario:

1. Receive job request with audio file
2. Push to GPU Queue for Whisper ASR
3. Upon completion, push transcript to LLM Queue for SOAP generation
4. Push structured output to EMR Queue for integration
5. Track state transitions in Redis/Postgres
6. Handle retries and dead-lettering for failed steps

Acceptance Criteria:

1. [ ] Zero data loss during handoffs
2. [ ] Supports priority queues (VIP doctors)
3. [ ] Auto-scaling of workers based on queue depth
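
The stage hand-offs, retries, and dead-lettering can be illustrated with a synchronous, in-process sketch. A real orchestrator would use separate queues and workers with state persisted in Redis/Postgres. The stage names, the three-attempt limit, and the `Job` and `run_pipeline` names are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable

STAGES = ["asr", "soap", "emr"]  # Whisper ASR -> SOAP generation -> EMR push
MAX_ATTEMPTS = 3                 # assumed retry limit, not stated in the use case


@dataclass
class Job:
    job_id: str
    payload: object
    state: str = "received"
    history: list[str] = field(default_factory=list)  # recorded state transitions


def run_pipeline(job: Job,
                 workers: dict[str, Callable[[object], object]],
                 dead_letters: list[Job]) -> Job:
    """Run each stage in order, retrying failures and dead-lettering exhausted jobs."""
    for stage in STAGES:
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                job.payload = workers[stage](job.payload)
            except Exception:
                job.history.append(f"{stage}:attempt{attempt}:failed")
                if attempt == MAX_ATTEMPTS:
                    job.state = "dead_lettered"
                    dead_letters.append(job)  # the job is kept, not dropped
                    return job
            else:
                job.state = f"{stage}_done"
                job.history.append(f"{stage}:attempt{attempt}:ok")
                break
    job.state = "completed"
    return job


calls = {"n": 0}


def flaky_asr(audio: object) -> object:
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("GPU worker timed out")  # first attempt fails
    return f"transcript({audio})"


workers = {
    "asr": flaky_asr,
    "soap": lambda transcript: f"soap({transcript})",
    "emr": lambda note: f"pushed({note})",
}
dlq: list[Job] = []
job = run_pipeline(Job("job-1", "visit.wav"), workers, dlq)
```

The stage output replaces the payload only after a worker succeeds, so a failed attempt retries with the same input.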


UC-OPS-302: Monitor Inference Time & Failures

Purpose: Track real-time performance metrics for AI models.

| Property | Value |
|----------|-------|
| Actor | Monitoring Agent |
| Trigger | Inference Completion |
| Priority | P1 |

Main Success Scenario:

1. Collect metrics: Real-Time Factor (RTF), Word Error Rate (WER) drift, Latency
2. Push to Prometheus/Grafana
3. Detect anomalies (e.g., sudden spike in timeouts)
4. Alert Ops team via PagerDuty
5. Log audio quality issues (SNR < threshold)

Acceptance Criteria:

1. [ ] Dashboards refresh < 15s
2. [ ] Alerts for RTF > 1.0
3. [ ] Granular breakdown by model version and dialect
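
Real-Time Factor is conventionally the processing time divided by the audio duration, so an RTF above 1.0 means transcription is slower than real time. The sketch below computes mean RTF per (model version, dialect) group and returns the groups over the threshold. The record fields, version labels, and dialect codes are hypothetical.

```python
from collections import defaultdict
from statistics import mean

RTF_ALERT_THRESHOLD = 1.0  # acceptance criterion 2


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def groups_to_alert(runs: list[dict],
                    threshold: float = RTF_ALERT_THRESHOLD) -> dict[tuple[str, str], float]:
    """Return mean RTF for each (model_version, dialect) group that exceeds the threshold."""
    by_group: dict[tuple[str, str], list[float]] = defaultdict(list)
    for run in runs:
        key = (run["model_version"], run["dialect"])
        by_group[key].append(real_time_factor(run["processing_s"], run["audio_s"]))
    averages = {key: mean(values) for key, values in by_group.items()}
    return {key: avg for key, avg in averages.items() if avg > threshold}


runs = [
    {"model_version": "v1", "dialect": "hi-IN", "processing_s": 30.0, "audio_s": 60.0},
    {"model_version": "v2", "dialect": "ta-IN", "processing_s": 90.0, "audio_s": 60.0},
    {"model_version": "v2", "dialect": "ta-IN", "processing_s": 120.0, "audio_s": 60.0},
]
alerts = groups_to_alert(runs)
print(alerts)  # {('v2', 'ta-IN'): 1.75}
```

Grouping before thresholding gives the per-version, per-dialect breakdown directly, so an alert names the slice that regressed rather than only the global average.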


UC-OPS-303: Human-in-the-Loop Correction

Purpose: Allow doctors to review and edit generated notes before finalization.

| Property | Value |
|----------|-------|
| Actor | Doctor / Scribe |
| Trigger | Draft Note Generated |
| Priority | P1 |

Main Success Scenario:

1. Present draft SOAP note in UI
2. Highlight low-confidence entities
3. Doctor edits text or accepts suggestions
4. Capture edits as "correction signal" for RLHF
5. Submit final version to EMR

Acceptance Criteria:

1. [ ] Edit history preserved
2. [ ] "Diff" stored for model improvement
3. [ ] One-click approval for high-confidence notes
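
One simple way to capture the stored "diff" is a line-level unified diff between the draft and the final note, using Python's standard `difflib`. This is a sketch of the capture step only; how the correction signal is later used for training is outside its scope. The `correction_diff` function and the sample note text are hypothetical.

```python
import difflib


def correction_diff(draft: str, final: str) -> list[str]:
    """Return a unified diff of the doctor's edits; empty if the draft was accepted as-is."""
    return list(
        difflib.unified_diff(
            draft.splitlines(),
            final.splitlines(),
            fromfile="draft",
            tofile="final",
            lineterm="",
        )
    )


draft = "S: Patient reports headache.\nA: Migrane.\nP: Rest and fluids."
final = "S: Patient reports headache.\nA: Migraine.\nP: Rest and fluids."
diff = correction_diff(draft, final)
print("\n".join(diff))
```

An empty diff corresponds to a one-click approval, so the same record can distinguish accepted notes from corrected ones.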