Operations Use Cases (OPS)

See Also: DevOps & SRE Implementation for CI/CD, monitoring, and infrastructure details.

UC-OPS-001: Monitor Pipeline Health

Purpose: Give the DevOps team a dashboard view of pipeline health, with alerting on queue backlog.

Main Success Scenario:

1. Prometheus scrapes the `/metrics` endpoint
2. Grafana dashboard displays queue depths (HL7, OCR, ASR), API latency (p95, p99), and error rates (DLQ counts)
3. If queue depth > 100, AlertManager pages On-Call
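
The scrape-and-alert flow above can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the metric name `pipeline_queue_depth` and both function names are assumptions, and in a real deployment the threshold check would normally live in a Prometheus alerting rule rather than in application code.

```python
QUEUE_DEPTH_ALERT_THRESHOLD = 100  # from the scenario: page when depth > 100


def render_queue_metrics(depths: dict[str, int]) -> str:
    """Render queue depths in the Prometheus text exposition format.

    The metric name `pipeline_queue_depth` is a hypothetical choice.
    """
    lines = ["# TYPE pipeline_queue_depth gauge"]
    for queue, depth in sorted(depths.items()):
        lines.append(f'pipeline_queue_depth{{queue="{queue}"}} {depth}')
    return "\n".join(lines) + "\n"


def queues_over_threshold(
    depths: dict[str, int], threshold: int = QUEUE_DEPTH_ALERT_THRESHOLD
) -> list[str]:
    """Return the queues whose depth is strictly above the threshold."""
    return sorted(q for q, d in depths.items() if d > threshold)


if __name__ == "__main__":
    sample = {"hl7": 12, "ocr": 140, "asr": 100}
    print(render_queue_metrics(sample), end="")
    print("would page for:", queues_over_threshold(sample))
```

Note that a depth of exactly 100 does not trigger a page, since the scenario states "> 100".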

Acceptance Criteria:

[ ] Metrics updated every 15s
[ ] Dashboards accessible to Ops team

UC-OPS-002: Review Failed Jobs

Purpose: Review, correct, and replay messages that landed in the dead letter queue (DLQ).

Main Success Scenario:

1. User logs into Admin Console
2. Views "Dead Letter Queue"
3. Selects a failed HL7 message
4. Views error: "Patient Not Found"
5. Manually maps MRN to ABHA ID
6. Clicks "Replay"
7. System re-queues message
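
The review-and-replay steps can be modeled with a small in-memory sketch. Everything here is illustrative: the class and field names are hypothetical, the `deque` stands in for the real message queue, and the IDs in the usage note are made up.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FailedMessage:
    message_id: str
    mrn: str
    raw_payload: str  # kept verbatim so operators can inspect it
    error: str


@dataclass
class DeadLetterQueue:
    failed: dict[str, FailedMessage] = field(default_factory=dict)
    mrn_to_abha: dict[str, str] = field(default_factory=dict)  # manual mappings
    main_queue: deque = field(default_factory=deque)  # stand-in for the real queue

    def add(self, msg: FailedMessage) -> None:
        self.failed[msg.message_id] = msg

    def map_patient(self, mrn: str, abha_id: str) -> None:
        """Record the operator's manual MRN -> ABHA ID mapping."""
        self.mrn_to_abha[mrn] = abha_id

    def replay(self, message_ids: list[str]) -> list[str]:
        """Re-queue messages whose MRN now resolves; return the IDs replayed."""
        replayed = []
        for mid in message_ids:
            msg = self.failed.get(mid)
            if msg is None or msg.mrn not in self.mrn_to_abha:
                continue  # still unresolved: leave it in the DLQ
            self.main_queue.append((self.mrn_to_abha[msg.mrn], msg.raw_payload))
            del self.failed[mid]
            replayed.append(mid)
        return replayed
```

Passing several IDs to `replay` covers the bulk-replay case; messages whose patient is still unmapped stay in the DLQ instead of failing again.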

Acceptance Criteria:

[ ] Allows viewing raw payload of failed jobs
[ ] Supports bulk replay

UC-OPS-003: Rotate API Keys

Purpose: Cycle partner API credentials without downtime.

Property	Value
Actor	Platform Engineer
Trigger	Quarterly rotation schedule
Priority	P0

Main Success Scenario:

1. Engineer requests new key via Admin Console
2. System generates key pair, stores hash in secrets manager
3. Marks key as `staged` and shares via secure channel
4. Partner confirms cutover via `/api/keys/activate`
5. System marks old key as `grace` for 24h then revokes
6. Audit log records rotation with ticket reference
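
The key lifecycle in the steps above (`staged`, then active, then `grace` for 24h, then revoked) can be sketched as a small state machine. This is a simplified sketch under stated assumptions: it uses a single random token with a SHA-256 hash rather than a key pair, the state name `active` and all function names are hypothetical, and persistence and audit logging are omitted.

```python
import hashlib
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

GRACE_PERIOD = timedelta(hours=24)  # from step 5


@dataclass
class ApiKey:
    key_hash: str  # only the hash is stored, never the plaintext
    state: str = "staged"  # staged -> active -> grace -> revoked
    grace_until: Optional[datetime] = None


def generate_key() -> tuple[str, ApiKey]:
    """Create a new staged key; the plaintext is returned once for sharing."""
    plaintext = secrets.token_urlsafe(32)
    return plaintext, ApiKey(hashlib.sha256(plaintext.encode()).hexdigest())


def activate(new_key: ApiKey, old_key: ApiKey, now: datetime) -> None:
    """Partner confirmed cutover: promote the new key, start the old key's grace window."""
    new_key.state = "active"
    old_key.state = "grace"
    old_key.grace_until = now + GRACE_PERIOD


def is_accepted(key: ApiKey, now: datetime) -> bool:
    """Active keys pass; grace keys pass until the window closes, then are revoked."""
    if key.state == "active":
        return True
    if key.state == "grace" and key.grace_until is not None:
        if now < key.grace_until:
            return True
        key.state = "revoked"  # a real system would also emit an alert here
    return False
```

Because the old key stays valid during the grace window, both keys are accepted at once, which is what gives the overlapping validity the acceptance criteria ask for.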

Acceptance Criteria:

[ ] Supports overlapping validity windows
[ ] Emits alerts if partner uses revoked key
[ ] Keys encrypted at rest and masked in UI/logs

UC-OPS-004: Reprocess Historical Backlog

Purpose: Replay stored raw messages through pipelines after fixes.

Property	Value
Actor	Data Engineer
Trigger	Bug fix requiring backfill
Priority	P1

Main Success Scenario:

1. Engineer selects date range + pipeline in Admin Console
2. System queries archive (UC-ING-010) for matching payloads
3. Replays messages into selected queue with `replay=true` flag
4. Throttles to configured rate (e.g., 100 msg/min)
5. Tracks replay progress and emits metrics
6. Generates reconciliation report (processed vs failed)
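
A throttled, idempotent replay loop along these lines could look as follows. This is a sketch, not the actual replay job: `replay_backlog`, the `enqueue` callback, and the message shape are assumptions, and the set of processed IDs stands in for whatever deduplication store the pipeline uses.

```python
import time
from typing import Callable, Iterable


def replay_backlog(
    messages: Iterable[tuple[str, str]],  # (message_id, raw_payload) from the archive
    enqueue: Callable[[dict], None],  # hypothetical hook that pushes onto the target queue
    already_processed: set[str],  # IDs seen so far, used for idempotency
    rate_per_min: int = 100,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> dict[str, int]:
    """Replay archived messages at a fixed rate and return a reconciliation report."""
    interval = 60.0 / rate_per_min  # e.g. 100 msg/min -> 0.6 s between sends
    report = {"processed": 0, "skipped": 0, "failed": 0}
    for message_id, payload in messages:
        if message_id in already_processed:
            report["skipped"] += 1  # duplicate: do not produce a second bundle
            continue
        try:
            enqueue({"id": message_id, "payload": payload, "replay": True})
            already_processed.add(message_id)
            report["processed"] += 1
        except Exception:
            report["failed"] += 1
        sleep(interval)
    return report
```

Fixed-interval spacing is the simplest throttle; a token bucket would allow short bursts but is not needed to meet a plain messages-per-minute limit.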

Acceptance Criteria:

[ ] Replay jobs idempotent (no duplicate bundles)
[ ] Supports pause/resume
[ ] Produces audit artifact attachable to RCA

UC-OPS-005: Manage Feature Flags

Purpose: Toggle beta features per tenant safely.

Property	Value
Actor	Product Ops
Trigger	Launch of new feature
Priority	P2

Main Success Scenario:

1. Product Ops opens Flags UI
2. Selects feature (e.g., "Document Summaries")
3. Chooses target segment (tenant, user role)
4. Sets rollout percentage (canary 10%)
5. System writes config to flag service (e.g., LaunchDarkly)
6. Observability dashboard tracks adoption + errors
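
Percentage rollouts such as the 10% canary in step 4 are commonly implemented by hashing a stable identifier into a bucket. The sketch below illustrates that general idea only; it is not LaunchDarkly's algorithm, and the function name and the `flag:tenant` key format are assumptions.

```python
import hashlib


def in_rollout(flag: str, tenant_id: str, percentage: int) -> bool:
    """Return True if this tenant falls inside the rollout percentage for the flag.

    Hashing flag and tenant together gives each tenant a stable bucket in 0..99,
    so the same tenant gets the same answer on every evaluation.
    """
    digest = hashlib.sha256(f"{flag}:{tenant_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percentage


if __name__ == "__main__":
    tenants = [f"tenant-{i}" for i in range(1000)]
    enabled = sum(in_rollout("document-summaries", t, 10) for t in tenants)
    print(f"{enabled} of {len(tenants)} tenants enabled at 10%")
```

A useful property of this scheme is that raising the percentage only adds tenants: anyone enabled at 10% remains enabled at 50%, and setting the percentage to 0 acts as an instant rollback.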

Acceptance Criteria:

[ ] Supports instant rollback
[ ] Flag evaluations cached client-side for < 5 min
[ ] Change log includes requester and justification

UC-OPS-006: Run Disaster Recovery Drill

Purpose: Validate RTO/RPO by simulating region failure.

Property	Value
Actor	SRE Lead
Trigger	Bi-annual drill
Priority	P0

Main Success Scenario:

1. Initiate failover runbook (disable primary region writes)
2. Restore latest backups to standby region
3. Switch DNS/traffic to standby
4. Execute smoke tests (API, ingestion, UI)
5. Record timelines vs RTO/RPO targets
6. Issue post-drill report with gaps and action items

Acceptance Criteria:

[ ] Drill completes within target RTO (<= 2h)
[ ] Data loss <= 5 min (RPO)
[ ] Findings tracked as Jira tickets with owners
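
Comparing recorded drill timelines against the targets is simple arithmetic: RTO is the time from failure to restored service, and RPO is the gap between the last replicated write and the failure. A minimal sketch, with hypothetical function and parameter names and made-up timestamps in the usage note:

```python
from datetime import datetime, timedelta

RTO_TARGET = timedelta(hours=2)  # from the acceptance criteria
RPO_TARGET = timedelta(minutes=5)  # from the acceptance criteria


def evaluate_drill(
    failure_at: datetime,
    last_replicated_write_at: datetime,
    service_restored_at: datetime,
) -> dict:
    """Compute achieved RTO/RPO for a drill and compare them with the targets."""
    rto = service_restored_at - failure_at  # how long the service was down
    rpo = failure_at - last_replicated_write_at  # how much recent data was lost
    return {
        "rto": rto,
        "rpo": rpo,
        "rto_met": rto <= RTO_TARGET,
        "rpo_met": rpo <= RPO_TARGET,
    }
```

For example, a simulated failure at 10:00 with the last replicated write at 09:57 and service restored at 11:30 gives an RTO of 1h30 and an RPO of 3 min, so both targets are met.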

UC-OPS-301: Job Queue & Worker Orchestration

Purpose: Manage the end-to-end pipeline execution from audio capture to EMR push.

Property	Value
Actor	Orchestrator Service
Trigger	New Audio Upload
Priority	P0

Main Success Scenario:

1. Receive job request with audio file
2. Push to GPU Queue for Whisper ASR
3. Upon completion, push transcript to LLM Queue for SOAP generation
4. Push structured output to EMR Queue for integration
5. Track state transitions in Redis/Postgres
6. Handle retries and dead-lettering for failed steps
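
The staged handoff with retries and dead-lettering can be sketched as a loop over the three stages. This is an illustrative, single-process model only: the handler callables, the retry limit of 3, and the list-based state log (standing in for Redis/Postgres) are all assumptions.

```python
from typing import Any, Callable

STAGES = ("asr", "llm", "emr")  # Whisper ASR -> SOAP generation -> EMR push
MAX_ATTEMPTS = 3  # assumption: the source does not specify a retry limit


def run_job(
    job_id: str,
    audio: Any,
    handlers: dict[str, Callable[[Any], Any]],  # one hypothetical worker per stage
    state_log: list[tuple[str, str, str]],  # stand-in for the Redis/Postgres state store
    dead_letter: list[dict],
) -> bool:
    """Run one job through every stage; return False if it was dead-lettered."""
    payload = audio
    for stage in STAGES:
        for attempt in range(1, MAX_ATTEMPTS + 1):
            state_log.append((job_id, stage, f"attempt_{attempt}"))
            try:
                payload = handlers[stage](payload)
                state_log.append((job_id, stage, "done"))
                break
            except Exception as exc:
                last_error = str(exc)
        else:
            # Retries exhausted: keep the stage's input so nothing is lost.
            dead_letter.append(
                {"job_id": job_id, "stage": stage, "error": last_error, "payload": payload}
            )
            state_log.append((job_id, stage, "dead_lettered"))
            return False
    return True
```

The dead-letter entry carries the input of the failed stage, so a later replay can resume from that stage rather than re-running ASR.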

Acceptance Criteria:

[ ] Zero data loss during handoffs
[ ] Supports priority queues (VIP doctors)
[ ] Auto-scaling of workers based on queue depth

UC-OPS-302: Monitor Inference Time & Failures

Purpose: Track real-time performance metrics for AI models.

Property	Value
Actor	Monitoring Agent
Trigger	Inference Completion
Priority	P1

Main Success Scenario:

1. Collect metrics: Real-Time Factor (RTF), Word Error Rate (WER) drift, Latency
2. Push to Prometheus/Grafana
3. Detect anomalies (e.g., sudden spike in timeouts)
4. Alert Ops team via PagerDuty
5. Log audio quality issues (SNR < threshold)

Acceptance Criteria:

[ ] Dashboards refresh in < 15s
[ ] Alerts for RTF > 1.0
[ ] Granular breakdown by model version and dialect
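
Real-Time Factor is processing time divided by audio duration, so an RTF above 1.0 means transcription is slower than real time. The sketch below computes RTF and groups it by model version and dialect; the function names, the sample tuple layout, and the dialect labels in the usage note are hypothetical.

```python
from collections import defaultdict
from statistics import mean

RTF_ALERT_THRESHOLD = 1.0  # from the acceptance criteria


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def groups_over_threshold(samples, threshold: float = RTF_ALERT_THRESHOLD) -> dict:
    """Return mean RTF per (model_version, dialect) for groups above the threshold.

    `samples` is an iterable of
    (model_version, dialect, processing_seconds, audio_seconds) tuples.
    """
    by_group = defaultdict(list)
    for model_version, dialect, proc, audio in samples:
        by_group[(model_version, dialect)].append(real_time_factor(proc, audio))
    averages = {group: mean(values) for group, values in by_group.items()}
    return {group: avg for group, avg in averages.items() if avg > threshold}
```

For instance, 45 s of processing for a 60 s recording is an RTF of 0.75, while 90 s for the same recording is 1.5 and would trigger the alert for that model version and dialect.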

UC-OPS-303: Human-in-the-Loop Correction

Purpose: Allow doctors to review and edit generated notes before finalization.

Property	Value
Actor	Doctor / Scribe
Trigger	Draft Note Generated
Priority	P1

Main Success Scenario:

1. Present draft SOAP note in UI
2. Highlight low-confidence entities
3. Doctor edits text or accepts suggestions
4. Capture edits as "correction signal" for RLHF
5. Submit final version to EMR
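
One way to capture the doctor's edits as a storable correction signal is a unified diff between the draft and the final note. This sketch uses Python's standard `difflib`; the function name and the `draft`/`final` labels are assumptions, and the source does not say which diff format is actually stored.

```python
import difflib


def correction_diff(draft: str, final: str) -> str:
    """Return a unified diff of draft vs final note, or "" if nothing changed."""
    diff_lines = difflib.unified_diff(
        draft.splitlines(),
        final.splitlines(),
        fromfile="draft",
        tofile="final",
        lineterm="",  # lines carry no trailing newline, so none is added to headers
    )
    return "\n".join(diff_lines)


if __name__ == "__main__":
    draft = "S: cough for 3 days\nA: viral URI"
    final = "S: cough for 5 days\nA: viral URI"
    print(correction_diff(draft, final))
```

An empty result means the note was accepted unchanged, which is itself a useful signal for the one-click approval path.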

Acceptance Criteria:

[ ] Edit history preserved
[ ] "Diff" stored for model improvement
[ ] One-click approval for high-confidence notes

Related: Security | Quality & Safety | Processing