Operations Use Cases (OPS)
UC-OPS-001: Monitor Pipeline Health
Purpose: Give the DevOps team a dashboard view of pipeline health.
Main Success Scenario:
1. Prometheus scrapes /metrics endpoint.
2. Grafana dashboard displays:
- Queue Depths (HL7, OCR, ASR)
- API Latency (p95, p99)
- Error Rates (DLQ counts)
3. If Queue Depth > 100, AlertManager pages On-Call.
Acceptance Criteria:
1. [ ] Metrics updated every 15s
2. [ ] Dashboards accessible to Ops team
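The scrape-and-alert flow above can be sketched in a few lines. This is a minimal illustration, not the actual service code: the metric name `pipeline_queue_depth` and both function names are assumptions, and in a real deployment the threshold check would normally live in a Prometheus alerting rule rather than in application code.

```python
from typing import Dict, List

QUEUE_DEPTH_ALERT_THRESHOLD = 100  # from step 3 of the scenario


def render_queue_metrics(depths: Dict[str, int]) -> str:
    """Render queue depths in the Prometheus text exposition format."""
    # The metric name is hypothetical; the source does not name it.
    lines = ["# TYPE pipeline_queue_depth gauge"]
    for queue, depth in sorted(depths.items()):
        lines.append(f'pipeline_queue_depth{{queue="{queue}"}} {depth}')
    return "\n".join(lines) + "\n"


def queues_to_page(depths: Dict[str, int]) -> List[str]:
    """Return the queues whose depth is strictly above the alert threshold."""
    return sorted(q for q, d in depths.items() if d > QUEUE_DEPTH_ALERT_THRESHOLD)


if __name__ == "__main__":
    sample = {"hl7": 12, "ocr": 140, "asr": 100}
    print(render_queue_metrics(sample), end="")
    print(queues_to_page(sample))
```

Because the scenario says "Queue Depth > 100", a queue sitting at exactly 100 does not page in this sketch.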
UC-OPS-002: Review Failed Jobs
Purpose: Review and replay messages that landed in the dead-letter queue (DLQ).
Main Success Scenario:
1. User logs into Admin Console.
2. Views "Dead Letter Queue".
3. Selects a failed HL7 message.
4. Views error: "Patient Not Found".
5. Manually maps MRN to ABHA ID.
6. Clicks "Replay".
7. System re-queues message.
Acceptance Criteria:
1. [ ] Allows viewing raw payload of failed jobs
2. [ ] Supports bulk replay
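One way the map-then-replay step could work is shown below, using in-memory structures in place of a real broker. The class and field names (`FailedMessage`, `DeadLetterQueue`, `mrn_to_abha`) are hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List


@dataclass
class FailedMessage:
    message_id: str
    mrn: str
    raw_payload: str  # kept verbatim so operators can inspect it
    error: str


@dataclass
class DeadLetterQueue:
    failed: Dict[str, FailedMessage] = field(default_factory=dict)
    main_queue: Deque[dict] = field(default_factory=deque)

    def replay(self, message_ids: List[str], mrn_to_abha: Dict[str, str]) -> List[str]:
        """Re-queue messages whose MRN now has a mapping; return the replayed IDs."""
        replayed = []
        for message_id in message_ids:  # a list of IDs gives bulk replay
            msg = self.failed.get(message_id)
            if msg is None or msg.mrn not in mrn_to_abha:
                continue  # unknown ID or still unmapped: leave it in the DLQ
            self.main_queue.append({
                "message_id": msg.message_id,
                "abha_id": mrn_to_abha[msg.mrn],
                "payload": msg.raw_payload,
            })
            del self.failed[message_id]
            replayed.append(message_id)
        return replayed
```

Messages that still lack a mapping stay in the DLQ, so a bulk replay cannot silently drop them.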
UC-OPS-003: Rotate API Keys
Purpose: Cycle partner API credentials without downtime.
| Property | Value |
|---|---|
| Actor | Platform Engineer |
| Trigger | Quarterly rotation schedule |
| Priority | P0 |
Main Success Scenario:
1. Engineer requests new key via Admin Console
2. System generates key pair, stores hash in secrets manager
3. Marks key as `staged` and shares via secure channel
4. Partner confirms cutover via `/api/keys/activate`
5. System marks old key as `grace` for 24h then revokes
6. Audit log records rotation with ticket reference
Acceptance Criteria:
1. [ ] Supports overlapping validity windows
2. [ ] Emits alerts if partner uses revoked key
3. [ ] Keys encrypted at rest and masked in UI/logs
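The `staged`, `grace`, and revoked states described above suggest a small state machine. The sketch below is one possible shape, under several assumptions: a single random secret stands in for the "key pair", SHA-256 is used for the stored hash, the `active` state name is inferred rather than stated in the source, and all function names are hypothetical.

```python
import hashlib
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

GRACE_PERIOD = timedelta(hours=24)  # from step 5 of the scenario


@dataclass
class ApiKey:
    key_hash: str
    state: str = "staged"  # staged -> active -> grace -> revoked
    grace_until: Optional[datetime] = None


def issue_key() -> Tuple[str, ApiKey]:
    """Generate a secret; only its hash is stored, the plaintext goes to the partner."""
    plaintext = secrets.token_urlsafe(32)
    return plaintext, ApiKey(hashlib.sha256(plaintext.encode()).hexdigest())


def activate(new: ApiKey, old: ApiKey, now: datetime) -> None:
    """Partner confirmed cutover: promote the new key, start the old key's grace window."""
    new.state = "active"
    old.state = "grace"
    old.grace_until = now + GRACE_PERIOD


def is_accepted(key: ApiKey, now: datetime) -> bool:
    """Both active and in-grace keys authenticate, giving overlapping validity."""
    if key.state == "grace" and key.grace_until is not None and now >= key.grace_until:
        key.state = "revoked"  # a real system would also emit an alert on later use
    return key.state in ("active", "grace")


def mask(plaintext: str) -> str:
    """Show only the last four characters, for UI and log output."""
    return "*" * (len(plaintext) - 4) + plaintext[-4:]
```

Passing `now` explicitly (for example `datetime.now(timezone.utc)`) keeps the expiry logic easy to test.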
UC-OPS-004: Reprocess Historical Backlog
Purpose: Replay stored raw messages through pipelines after fixes.
| Property | Value |
|---|---|
| Actor | Data Engineer |
| Trigger | Bug fix requiring backfill |
| Priority | P1 |
Main Success Scenario:
1. Engineer selects date range + pipeline in Admin Console
2. System queries archive (UC-ING-010) for matching payloads
3. Replays messages into selected queue with `replay=true` flag
4. Throttles to configured rate (e.g., 100 msg/min)
5. Tracks replay progress and emits metrics
6. Generates reconciliation report (processed vs failed)
Acceptance Criteria:
1. [ ] Replay jobs idempotent (no duplicate bundles)
2. [ ] Supports pause/resume
3. [ ] Produces audit artifact attachable to RCA
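The throttling, idempotency, and reconciliation steps can be combined in one loop. The following is a rough sketch only: `replay_backlog`, its parameters, and the shape of the report are assumptions, and a production job would persist `already_processed` rather than hold it in memory.

```python
import time
from typing import Callable, Dict, Iterable, Set, Tuple


def replay_backlog(
    messages: Iterable[Tuple[str, str]],   # (message_id, payload) pairs from the archive
    process: Callable[[str, bool], None],  # called as process(payload, replay_flag)
    already_processed: Set[str],           # IDs already handled; gives idempotency
    rate_per_min: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Replay messages at a fixed rate and return a reconciliation report."""
    report = {"processed": 0, "skipped": 0, "failed": 0}
    interval = 60.0 / rate_per_min  # seconds between messages
    for message_id, payload in messages:
        if message_id in already_processed:
            report["skipped"] += 1  # duplicate: do not create a second bundle
            continue
        try:
            process(payload, True)  # True marks the message as a replay
        except Exception:
            report["failed"] += 1
        else:
            already_processed.add(message_id)
            report["processed"] += 1
        sleep(interval)
    return report
```

Because the set of processed IDs is passed in, a paused job can be resumed by calling the function again with the same set.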
UC-OPS-005: Manage Feature Flags
Purpose: Toggle beta features per tenant safely.
| Property | Value |
|---|---|
| Actor | Product Ops |
| Trigger | Launch of new feature |
| Priority | P2 |
Main Success Scenario:
1. Product Ops opens Flags UI
2. Selects feature (e.g., "Document Summaries")
3. Chooses target segment (tenant, user role)
4. Sets rollout percentage (canary 10%)
5. System writes config to flag service (e.g., LaunchDarkly)
6. Observability dashboard tracks adoption + errors
Acceptance Criteria:
1. [ ] Supports instant rollback
2. [ ] Flag evaluations cached client-side < 5 min
3. [ ] Change log includes requester and justification
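A percentage rollout is commonly implemented by hashing the flag and tenant into a stable bucket. The sketch below shows that generic technique; it is not a description of how LaunchDarkly or any other flag service evaluates flags internally, and the function names and flag key are hypothetical.

```python
import hashlib


def rollout_bucket(flag_key: str, tenant_id: str) -> int:
    """Map (flag, tenant) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_key}:{tenant_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_key: str, tenant_id: str, rollout_percent: int) -> bool:
    """A tenant is in the canary when its bucket falls below the percentage."""
    return rollout_bucket(flag_key, tenant_id) < rollout_percent


if __name__ == "__main__":
    tenants = [f"tenant-{i}" for i in range(1000)]
    enabled = sum(is_enabled("document-summaries", t, 10) for t in tenants)
    print(f"{enabled} of {len(tenants)} tenants in the 10% canary")
```

Since the bucket depends only on the flag and tenant, a tenant stays in or out of the canary across evaluations, and setting the percentage to 0 acts as a rollback.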
UC-OPS-006: Run Disaster Recovery Drill
Purpose: Validate RTO/RPO by simulating region failure.
| Property | Value |
|---|---|
| Actor | SRE Lead |
| Trigger | Bi-annual drill |
| Priority | P0 |
Main Success Scenario:
1. Initiate failover runbook (disable primary region writes)
2. Restore latest backups to standby region
3. Switch DNS/traffic to standby
4. Execute smoke tests (API, ingestion, UI)
5. Record timelines vs RTO/RPO targets
6. Issue post-drill report with gaps and action items
Acceptance Criteria:
1. [ ] Drill completes within target RTO (<= 2h)
2. [ ] Data loss <= 5 min (RPO)
3. [ ] Findings tracked as Jira tickets with owners
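Recording timelines against the targets comes down to two subtractions. The sketch below assumes the recovery time runs from failover start to service restoration, and that the data-loss window runs from the last write replicated to standby up to failover start; the source does not define either measurement, and the type and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

RTO_TARGET = timedelta(hours=2)    # from the acceptance criteria
RPO_TARGET = timedelta(minutes=5)  # from the acceptance criteria


@dataclass
class DrillTimeline:
    failover_started: datetime
    last_replicated_write: datetime  # newest write present in the standby region
    service_restored: datetime       # smoke tests passing on standby


def evaluate_drill(t: DrillTimeline) -> dict:
    """Compare measured recovery time and data-loss window with the targets."""
    recovery_time = t.service_restored - t.failover_started
    data_loss_window = t.failover_started - t.last_replicated_write
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= RTO_TARGET,
        "rpo_met": data_loss_window <= RPO_TARGET,
    }
```

The returned dictionary could feed the post-drill report directly, with any `False` entry becoming an action item.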
UC-OPS-301: Job Queue & Worker Orchestration
Purpose: Manage the end-to-end pipeline execution from audio capture to EMR push.
| Property | Value |
|---|---|
| Actor | Orchestrator Service |
| Trigger | New Audio Upload |
| Priority | P0 |
Main Success Scenario:
1. Receive job request with audio file
2. Push to GPU Queue for Whisper ASR
3. Upon completion, push transcript to LLM Queue for SOAP generation
4. Push structured output to EMR Queue for integration
5. Track state transitions in Redis/Postgres
6. Handle retries and dead-lettering for failed steps
Acceptance Criteria:
1. [ ] Zero data loss during handoffs
2. [ ] Supports priority queues (VIP doctors)
3. [ ] Auto-scaling of workers based on queue depth
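The stage handoffs, state tracking, and dead-lettering above can be illustrated with a single-process sketch. Real queues, Redis/Postgres persistence, and GPU workers are replaced here by plain callables and dictionaries; the stage names, `MAX_ATTEMPTS`, and `run_job` are assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

STAGES = ["asr", "llm", "emr"]  # Whisper ASR, SOAP generation, EMR push
MAX_ATTEMPTS = 3                # hypothetical retry budget per stage


def run_job(
    job_id: str,
    payload: str,
    workers: Dict[str, Callable[[str], str]],
    state: Dict[str, str],                # stands in for Redis/Postgres
    dead_letters: List[Tuple[str, str]],  # (job_id, failed_stage)
) -> Optional[str]:
    """Run a job through every stage, retrying failures and dead-lettering on exhaustion."""
    for stage in STAGES:
        for attempt in range(1, MAX_ATTEMPTS + 1):
            state[job_id] = f"{stage}:attempt-{attempt}"
            try:
                payload = workers[stage](payload)
                break  # stage succeeded, hand off to the next one
            except Exception:
                continue  # retry the same stage
        else:
            # All attempts failed: record the job and stop the pipeline.
            state[job_id] = f"{stage}:dead-lettered"
            dead_letters.append((job_id, stage))
            return None
    state[job_id] = "done"
    return payload
```

Recording the state before each attempt means a crashed orchestrator can tell which stage a job was in when it restarts.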
UC-OPS-302: Monitor Inference Time & Failures
Purpose: Track real-time performance metrics for AI models.
| Property | Value |
|---|---|
| Actor | Monitoring Agent |
| Trigger | Inference Completion |
| Priority | P1 |
Main Success Scenario:
1. Collect metrics: Real-Time Factor (RTF), Word Error Rate (WER) drift, Latency
2. Push to Prometheus/Grafana
3. Detect anomalies (e.g., sudden spike in timeouts)
4. Alert Ops team via PagerDuty
5. Log audio quality issues (SNR < threshold)
Acceptance Criteria:
1. [ ] Dashboards refresh < 15s
2. [ ] Alerts for RTF > 1.0
3. [ ] Granular breakdown by model version and dialect
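Real-Time Factor is conventionally the processing time divided by the audio duration, so an RTF above 1.0 means transcription is slower than real time. The sketch below computes it and groups it by model version and dialect; the function names and the sample labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

RTF_ALERT_THRESHOLD = 1.0  # from the acceptance criteria

Sample = Tuple[str, str, float, float]  # (model_version, dialect, processing_s, audio_s)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def mean_rtf_by_group(samples: Iterable[Sample]) -> Dict[Tuple[str, str], float]:
    """Average RTF per (model_version, dialect) pair."""
    sums: Dict[Tuple[str, str], float] = defaultdict(float)
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for model_version, dialect, processing_s, audio_s in samples:
        key = (model_version, dialect)
        sums[key] += real_time_factor(processing_s, audio_s)
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def groups_to_alert(samples: Iterable[Sample]) -> List[Tuple[str, str]]:
    """Return the groups whose mean RTF exceeds the alert threshold."""
    means = mean_rtf_by_group(samples)
    return sorted(key for key, rtf in means.items() if rtf > RTF_ALERT_THRESHOLD)
```

Averaging per group is only one option; percentiles may suit alerting better when a few long recordings dominate the mean.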
UC-OPS-303: Human-in-the-Loop Correction
Purpose: Allow doctors to review and edit generated notes before finalization.
| Property | Value |
|---|---|
| Actor | Doctor / Scribe |
| Trigger | Draft Note Generated |
| Priority | P1 |
Main Success Scenario:
1. Present draft SOAP note in UI
2. Highlight low-confidence entities
3. Doctor edits text or accepts suggestions
4. Capture edits as "correction signal" for RLHF
5. Submit final version to EMR
Acceptance Criteria:
1. [ ] Edit history preserved
2. [ ] "Diff" stored for model improvement
3. [ ] One-click approval for high-confidence notes
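The stored "diff" could be as simple as a line-level unified diff between the draft and the final note. The sketch below uses Python's standard `difflib` for that; the function name and the sample note text are illustrative assumptions, and a real correction signal might need token-level or entity-level detail instead.

```python
import difflib
from typing import List


def correction_diff(draft: str, final: str) -> List[str]:
    """Return a unified diff between the draft note and the doctor's final version."""
    return list(
        difflib.unified_diff(
            draft.splitlines(),
            final.splitlines(),
            fromfile="draft",
            tofile="final",
            lineterm="",
        )
    )


if __name__ == "__main__":
    draft = "S: cough\nA: viral URI"
    final = "S: cough for 3 days\nA: viral URI"
    print("\n".join(correction_diff(draft, final)))
```

An empty diff indicates the note was accepted unchanged, which corresponds to the one-click approval path.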