Playbooks¶
Document Purpose: Operational runbooks for common scenarios. Each playbook includes step-by-step procedures, escalation paths, and verification steps.
1. Incident Response Playbook¶
When to Use¶
- High-priority (P0/P1) alert triggered
- Customer-reported production issue
- Security incident detected
Severity Classification¶
| Severity | Definition | Response Time | Examples |
|---|---|---|---|
| SEV-1 | Complete platform outage or data loss | 15 minutes | DB failure, all integrations down |
| SEV-2 | Major feature unavailable | 1 hour | OCR pipeline down, timeline not loading |
| SEV-3 | Partial degradation | 4 hours | Slow response, intermittent errors |
| SEV-4 | Minor issue | Next business day | UI glitch, cosmetic issues |
Response Procedure¶
Step 1: Acknowledge (Within SLA)¶
[ ] Acknowledge alert in monitoring system
[ ] Post in #incidents Slack channel: "Investigating [ALERT_NAME] - [YOUR_NAME]"
[ ] Start incident timer for SLA tracking
Step 2: Assess Impact¶
[ ] Check affected services (Dashboard: Pipeline Health)
[ ] Identify customer impact scope (single hospital vs. all)
[ ] Classify severity using table above
[ ] Update incident channel with severity and impact summary
Step 3: Mitigate¶
[ ] Apply immediate fix or workaround
[ ] If rollback needed, follow Model Rollback or Deployment Rollback playbook
[ ] If data fix needed, document and execute with verification
[ ] Update incident channel every 15 minutes for SEV-1 and every 30 minutes for SEV-2
Step 4: Communicate¶
For SEV-1/2:
[ ] Notify Customer Success for customer communication
[ ] Update status page (if public)
[ ] Send summary email to leadership
For all severities:
[ ] Keep incident channel updated with progress
Step 5: Resolve & Verify¶
[ ] Deploy permanent fix
[ ] Verify fix in production (test cases, monitoring)
[ ] Mark incident as resolved in tracking system
[ ] Send resolution summary to incident channel
Step 6: Post-Mortem (SEV-1/2 only)¶
[ ] Schedule post-mortem meeting within 48 hours
[ ] Document timeline, root cause, contributing factors
[ ] Identify action items for prevention
[ ] Share post-mortem document with team
Escalation Path¶
- On-Call Engineer (primary responder)
- Backend Lead (escalate at 15 minutes if unresolved)
- Engineering Manager (escalate at 30 minutes)
- CTO (escalate at 45 minutes for SEV-1)
- CEO (escalate at 1 hour for customer-impacting SEV-1)
2. Data Backfill Playbook¶
When to Use¶
- After bug fix affecting historical data
- After model upgrade requiring reprocessing
- Customer requests historical data reprocessing
- Use case: [OPS-004] Reprocess Historical Backlog
Pre-Backfill Checklist¶
[ ] Identify affected records (patient IDs, date range)
[ ] Estimate processing time and resource impact
[ ] Schedule during low-traffic window (2 AM - 6 AM IST)
[ ] Notify Operations team and Customer Success
[ ] Create backup of affected data
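The backup step above can be as simple as exporting the affected rows before reprocessing. A minimal sketch, assuming the records live in a PostgreSQL table named documents; the table name, filter, and connection string are placeholders for this environment:

# Export the affected rows to a timestamped CSV before reprocessing
BACKUP_FILE="backfill_backup_$(date +%Y%m%d_%H%M%S).csv"
psql "$DATABASE_URL" -c "
  COPY (SELECT * FROM documents
        WHERE document_type = 'pathology'
          AND created_at BETWEEN '2024-01-01' AND '2024-12-01')
  TO STDOUT WITH (FORMAT csv, HEADER)
" > "$BACKUP_FILE"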
Backfill Procedure¶
Step 1: Prepare¶
# Identify records to reprocess
./scripts/identify_backfill_records.sh \
  --start-date "2024-01-01" \
  --end-date "2024-12-01" \
  --criteria "document_type=pathology" \
  --output backfill_manifest.json

# Validate manifest (review record count, sample records)
./scripts/validate_manifest.sh backfill_manifest.json
Step 2: Execute¶
# Start backfill with rate limiting
./scripts/run_backfill.sh \
  --manifest backfill_manifest.json \
  --concurrency 5 \
  --rate-limit 100/min \
  --dry-run    # Remove for actual execution

# Monitor progress
./scripts/monitor_backfill.sh --job-id BACKFILL_JOB_ID
Step 3: Verify¶
[ ] Check completion status (100% processed)
[ ] Validate sample records (spot check 10-20 records; see the sketch below)
[ ] Verify no regression in data quality
[ ] Compare before/after metrics
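For the spot check, a quick sketch that samples a few records from the manifest and fetches their current status; the manifest field .records[].id and the admin API URL are assumptions, not documented interfaces:

# Spot check 10 random records from the backfill manifest
jq -r '.records[].id' backfill_manifest.json | shuf -n 10 | while read -r record_id; do
  curl -s "https://platform.internal.example/api/admin/records/$record_id" \
    | jq '{id, processing_status, last_processed_at}'
done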
Step 4: Finalize¶
[ ] Document backfill in operations log
[ ] Notify stakeholders of completion
[ ] Archive manifest for audit trail
Rollback Plan¶
[ ] If errors detected, pause backfill immediately
[ ] Restore from pre-backfill backup
[ ] Document failure reason
[ ] Fix issue and re-plan backfill
3. Hospital Onboarding Playbook¶
When to Use¶
- New customer signed contract
- New pilot approved
- Expansion to new department/site
- Use cases: [OPS-001] through [OPS-006], [INT-001] through [INT-006]
Timeline Overview¶
| Week | Phase | Activities |
|---|---|---|
| 1 | Kickoff | Project planning, stakeholder mapping, system access |
| 2-3 | Integration | Develop/configure HL7, FHIR, PACS adapters |
| 4 | Testing | Integration testing, UAT, data validation |
| 5 | Training | Clinical user training, IT admin training |
| 6 | Go-Live | Soft launch, monitoring, support |
Week 1: Kickoff¶
Day 1-2: Project Setup¶
[ ] Create project folder in shared drive
[ ] Set up Slack channel: #project-[hospital-name]
[ ] Schedule kickoff meeting with hospital stakeholders
[ ] Assign internal team (PM, Engineer, CSM)
Day 2-3: Kickoff Meeting¶
Agenda:
[ ] Introductions and roles
[ ] Review scope and success metrics
[ ] Confirm timeline milestones
[ ] Identify stakeholder contacts
[ ] Discuss integration requirements
[ ] Schedule follow-up meetings
Day 4-5: System Access & Discovery¶
[ ] Obtain VPN/network access (if required)
[ ] Document EMR/LIS/PACS system details
- Vendor and version
- Interface type (HL7 v2.x, FHIR R4, file drop)
- Endpoint URLs or file paths
[ ] Review sample data files (anonymized)
[ ] Create integration specification document
Week 2-3: Integration Development¶
HL7 Integration¶
[ ] Configure HL7 listener endpoint
[ ] Map HL7 segments to internal data model
- PID → Patient demographics
- OBR → Orders
- OBX → Results
[ ] Implement message acknowledgment (ACK)
[ ] Test with sample messages
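Before the hospital sends live traffic, the listener can be exercised with a hand-built message over MLLP. A minimal sketch; the message content, listener hostname, and port are placeholders:

# Minimal ORU^R01 test message (HL7 segments separated by carriage returns)
MSG=$'MSH|^~\\&|TESTAPP|TESTFAC|PLATFORM|HOSP|20240101120000||ORU^R01|MSG00001|P|2.5\rPID|1||MRN12345^^^HOSP^MR||DOE^JANE||19800101|F\rOBR|1|||CBC^Complete Blood Count\rOBX|1|NM|WBC^White Blood Cells||6.1|10*3/uL|4.0-11.0|N|||F'

# Wrap in MLLP framing (0x0B before the message, 0x1C 0x0D after) and send; the ACK prints to stdout
printf '\013%s\034\015' "$MSG" | nc -w 5 hl7-listener.internal.example 2575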
FHIR Integration¶
[ ] Configure FHIR client with hospital endpoint
[ ] Map FHIR resources to internal model
- Patient, Observation, DiagnosticReport, etc.
[ ] Implement webhook receiver (if push-based)
[ ] Test with sample bundles
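To confirm connectivity and resource mapping end to end, a simple read against the hospital endpoint is often enough. A sketch assuming a FHIR R4 server with bearer-token auth; the base URL, token, and identifier are placeholders:

# Search for a test patient and summarize what the server returns
curl -s \
  -H "Authorization: Bearer $FHIR_ACCESS_TOKEN" \
  -H "Accept: application/fhir+json" \
  "https://fhir.hospital.example/fhir/Patient?identifier=MRN12345" \
  | jq '{resourceType, total, first_id: .entry[0].resource.id}'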
PACS Integration¶
[ ] Configure DICOM listener or file watcher
[ ] Map DICOM metadata to patient identity
[ ] Set up image storage and viewer artifacts
[ ] Test with sample imaging studies
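If the site uses a DICOM listener (rather than a file drop), DCMTK's command-line tools give a quick connectivity and ingestion check. A sketch; the AE titles, host, port, and sample study path are placeholders:

# Verify the listener answers a C-ECHO, then push a sample study via C-STORE
echoscu -aet TEST_SCU -aec PLATFORM_SCP dicom-listener.internal.example 11112
storescu -aet TEST_SCU -aec PLATFORM_SCP dicom-listener.internal.example 11112 sample_study/*.dcm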
Week 4: Testing¶
Integration Testing¶
[ ] End-to-end message flow test (HL7/FHIR → Bundle)
[ ] PACS ingestion test (DICOM → Timeline)
[ ] Error handling test (malformed messages, timeouts)
[ ] Performance test (target: <30s ingestion latency)
User Acceptance Testing (UAT)¶
[ ] Schedule UAT sessions with clinical users
[ ] Prepare UAT test cases:
- Search for patient by name/ABHA ID
- View patient timeline
- View lab trends chart
- View imaging study
- Acknowledge clinical alert
[ ] Document UAT feedback and issues
[ ] Fix critical issues before go-live
Week 5: Training¶
Clinical User Training (2-3 hours)¶
Agenda:
[ ] Platform overview and login
[ ] Patient search and timeline navigation
[ ] Lab results and trends
[ ] Imaging viewer
[ ] Alerts and acknowledgment
[ ] Document upload and annotation
[ ] Q&A and hands-on practice
IT Admin Training (1-2 hours)¶
Agenda:
[ ] Integration monitoring dashboard
[ ] Failed job review and retry
[ ] User management (add/remove users)
[ ] Basic troubleshooting
[ ] Escalation process
Week 6: Go-Live¶
Pre-Launch Checklist¶
[ ] All UAT issues resolved
[ ] Training completed for all user groups
[ ] Integration monitoring alerts configured
[ ] Support escalation path documented
[ ] Customer Success introduced to stakeholders
Soft Launch (Days 1-3)¶
[ ] Enable access for pilot user group (5-10 users)
[ ] Monitor closely for errors and feedback
[ ] Daily check-in calls with pilot users
[ ] Fix any emerging issues immediately
Full Go-Live (Days 4-7)¶
[ ] Expand access to all target users
[ ] Monitor adoption metrics (logins, usage patterns)
[ ] Continue daily monitoring for first week
[ ] Transition to regular support model
Post-Launch¶
[ ] Week 2: Collect initial feedback, address issues
[ ] Week 4: First review meeting with stakeholders
[ ] Week 8: Formal pilot review with metrics
[ ] Ongoing: Monthly check-ins and QBRs
4. Integration Debugging Playbook¶
When to Use¶
- HL7/FHIR message ingestion failures
- PACS file drop not detected
- Data appearing in wrong patient record
- Use cases: [INT-001] through [INT-006], [ING-001] through [ING-020]
Common Issues & Resolutions¶
Issue: HL7 Messages Not Being Received¶
Diagnosis Steps:
[ ] Check HL7 listener service status
[ ] Verify network connectivity to hospital endpoint
[ ] Check firewall rules (port typically 2575)
[ ] Review listener logs for connection attempts
[ ] Test with a sample message using HAPI TestPanel
Resolution:
- If service down: Restart HL7 listener; escalate if it does not stay up after restart
- If network issue: Coordinate with hospital IT
- If firewall: Request firewall rule update
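A diagnostic sketch for the steps above, assuming the listener runs as a systemd unit named hl7-listener; the unit name, hostname, and port are placeholders:

systemctl status hl7-listener --no-pager          # is the service running?
nc -zv hl7-listener.internal.example 2575         # is the MLLP port reachable from here?
journalctl -u hl7-listener --since "1 hour ago" | grep -iE 'connect|refused|timeout' | tail -n 50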
Issue: HL7 Message Parse Errors¶
Diagnosis Steps:
[ ] Retrieve failed message from DLQ
[ ] Identify parse error (segment, field, encoding)
[ ] Compare against expected HL7 format
[ ] Check for hospital-specific message variations
Resolution:
- If format variation: Update parser mapping
- If encoding issue: Adjust character set handling
- If required field missing: Add validation handling
- Reprocess failed message after fix
Issue: Patient Identity Mismatch¶
Diagnosis Steps:
[ ] Compare patient identifiers in source message vs. bundle
[ ] Check for duplicate patient records
[ ] Verify ABHA ID normalization logic
[ ] Review MPI (Master Patient Index) matching rules
Resolution:
- If duplicate: Merge patient records (with audit)
- If identifier mismatch: Update normalization logic
- If MPI issue: Adjust matching algorithm
Issue: PACS Files Not Detected¶
Diagnosis Steps:
[ ] Check file watcher service status
[ ] Verify file drop path is accessible
[ ] Check for file permission issues
[ ] Review watcher logs for errors
Resolution:
- If service down: Restart watcher
- If path issue: Correct configuration
- If permission: Fix file/folder permissions
- Backfill any missed files
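A quick sketch for checking the drop path itself, assuming a network-mounted folder; the path is a placeholder:

DROP_PATH=/mnt/pacs-drop
mountpoint -q "$DROP_PATH" && echo "mounted" || echo "NOT mounted"   # is the share mounted?
ls -ld "$DROP_PATH"                                                  # ownership and permissions
find "$DROP_PATH" -type f -mmin -60 | head -n 20                     # files received in the last hour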
5. Model Rollback Playbook¶
When to Use¶
- OCR/ASR accuracy degradation detected
- Model producing unexpected outputs
- Performance regression after model update
- Use case: [QAS-004] Model Drift Detection
Pre-Rollback Assessment¶
[ ] Confirm issue is model-related (not data or infrastructure)
[ ] Document current model version and deployment date
[ ] Identify previous stable model version
[ ] Estimate impact scope (affected documents/audio)
Rollback Procedure¶
Step 1: Prepare Rollback¶
[ ] Retrieve previous model artifacts from storage
[ ] Verify previous model is available and tested
[ ] Notify team of pending rollback
[ ] Schedule rollback during low-traffic period (if not emergency)
Step 2: Execute Rollback¶
# Switch to previous model version
./scripts/model_deploy.sh \
  --model ocr-v1.2.3 \
  --environment production \
  --rollback

# Verify deployment
./scripts/verify_model.sh --model ocr-v1.2.3
Step 3: Verify Rollback¶
[ ] Test with sample documents/audio
[ ] Compare accuracy metrics to baseline
[ ] Monitor processing pipeline for errors
[ ] Confirm throughput returned to normal
Step 4: Reprocess Affected Data¶
[ ] Identify documents/audio processed with faulty model
[ ] Create backfill manifest for reprocessing
[ ] Execute backfill using Data Backfill Playbook
[ ] Notify stakeholders of reprocessed data
Step 5: Root Cause Analysis¶
[ ] Document what caused model degradation
[ ] Identify gaps in testing or monitoring
[ ] Update validation criteria for future deployments
[ ] Create action items to prevent recurrence
6. Disaster Recovery Drill Playbook¶
When to Use¶
- Monthly scheduled DR drill
- After significant infrastructure changes
- Regulatory audit preparation
- Use case: [OPS-006] Run Disaster Recovery Drill
DR Drill Types¶
| Type | Frequency | Scope |
|---|---|---|
| Tabletop | Monthly | Walk through procedures, no actual failover |
| Partial Failover | Quarterly | Failover non-critical components |
| Full Failover | Annually | Complete failover to DR site |
Drill Procedure (Tabletop)¶
Preparation (1 day before)¶
[ ] Notify participants (Ops, Eng, Leadership)
[ ] Prepare scenario description
[ ] Verify runbooks are up to date
[ ] Prepare checklist for walkthrough
Execution (1-2 hours)¶
[ ] Present scenario: "Primary region outage detected"
[ ] Walk through detection and alerting process
[ ] Review failover steps (read from runbook)
[ ] Discuss decision points and escalations
[ ] Identify gaps in procedures
[ ] Document questions and action items
Post-Drill¶
[ ] Send summary to participants
[ ] Create tickets for identified gaps
[ ] Update runbooks with improvements
[ ] Schedule next drill
Drill Procedure (Full Failover)¶
Pre-Drill (1 week before)¶
[ ] Schedule maintenance window (4-6 hours)
[ ] Notify customers of planned maintenance
[ ] Verify DR site is synced and ready
[ ] Prepare rollback plan
Execution¶
[ ] Put primary site in maintenance mode
[ ] Initiate failover to DR site
[ ] Verify all services functional in DR
[ ] Run smoke tests (patient search, timeline, ingestion; see the sketch after this checklist)
[ ] Measure failover time (target: <1 hour)
[ ] Document any issues encountered
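The smoke tests can be scripted so the drill produces a repeatable record. A sketch; the DR hostname and endpoint paths are placeholders:

# Hit a few representative endpoints on the DR site and record the HTTP status codes
DR_BASE="https://dr.platform.example"
for path in /health "/api/patients/search?q=smoke" /api/ingestion/status; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$DR_BASE$path")
  echo "$path -> HTTP $code"
done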
Failback¶
[ ] Re-sync primary site from DR
[ ] Initiate failback to primary
[ ] Verify all services functional
[ ] Resume normal operations
[ ] End maintenance window
Post-Drill¶
[ ] Document total downtime
[ ] Analyze issues and gaps
[ ] Update DR runbooks
[ ] Report to leadership and compliance
7. API Key Rotation Playbook¶
When to Use¶
- Scheduled key rotation (quarterly)
- Security incident requiring immediate rotation
- Employee offboarding with API access
- Use case: [OPS-003] Rotate API Keys
Rotation Procedure¶
Step 1: Generate New Keys¶
[ ] Log into API management console
[ ] Generate new API key/secret pair
[ ] Store new credentials in secrets manager
[ ] Do NOT revoke old key yet
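A sketch of the storage step, assuming AWS Secrets Manager; the secret name, region, and JSON layout are placeholders for whatever secrets manager is in use:

# Store the new key/secret pair as a new version of the existing secret
aws secretsmanager put-secret-value \
  --secret-id platform/integrations/api-key \
  --secret-string "{\"api_key\":\"$NEW_API_KEY\",\"api_secret\":\"$NEW_API_SECRET\"}" \
  --region ap-south-1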
Step 2: Update Integrations¶
[ ] Identify all services using the key
[ ] Update each service with new credentials
- HL7 listener
- FHIR client
- PACS file watcher
- Internal API clients
[ ] Deploy updated configurations
Step 3: Verify¶
[ ] Test each integration with new key
[ ] Confirm no auth failures in logs (see the sketch below)
[ ] Monitor for 24-48 hours
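A sketch for the auth-failure check; the log location and match patterns are placeholders:

# Scan recent integration logs for authentication failures after the key swap
grep -riE '401|403|unauthorized|invalid[ _-]?(api[ _-]?)?key' /var/log/platform/integrations/ | tail -n 50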
Step 4: Revoke Old Key¶
[ ] After verification period, revoke old key
[ ] Document rotation in operations log
[ ] Update key expiry tracking
Emergency Rotation (Compromised Key)¶
[ ] Immediately revoke compromised key
[ ] Generate and deploy new key (may cause brief outage)
[ ] Investigate scope of compromise
[ ] Notify security team
[ ] Update stakeholders on impact
8. Consent Revocation Processing¶
When to Use¶
- Patient requests data deletion/processing stop
- DPDP Act compliance requirement
- Use case: [SEC-003] Process Consent Revocation, [CONS-002] Revoke DPDP Consent
Processing Procedure¶
Step 1: Receive Request¶
[ ] Verify requestor identity (patient or authorized representative)
[ ] Log request in consent management system
[ ] Acknowledge receipt to requestor within 24 hours
Step 2: Identify Affected Data¶
[ ] Query all patient data using identifiers (ABHA ID, MRN)
[ ] List data categories:
- Clinical records
- Imaging studies
- Processed documents (OCR/ASR outputs)
- Audit logs (may be retained for compliance)
[ ] Document scope for processing
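A sketch of the scoping step using a hypothetical helper script; the script name, flags, and output format are illustrative, not an existing tool:

# Enumerate stored data tied to the patient identifiers in scope
./scripts/list_patient_data.sh \
  --abha-id "$ABHA_ID" \
  --mrn "$MRN" \
  --output "consent_scope_$(date +%Y%m%d).json"

# Review the data categories before executing the revocation
jq '.categories | keys' "consent_scope_$(date +%Y%m%d).json"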
Step 3: Execute Revocation¶
[ ] Mark patient consent as revoked
[ ] Stop ongoing processing for patient
[ ] Depending on request type:
- Processing stop: Disable future processing
- Data deletion: Execute deletion workflow
[ ] Retain audit logs of consent and deletion (DPDP requirement)
Step 4: Verify & Confirm¶
[ ] Verify data is no longer accessible
[ ] Verify that no further processing is triggered for the patient
[ ] Send confirmation to requestor within 30 days
[ ] Document completion in consent system
Audit Trail¶
All consent revocation actions must be logged with:
- Requestor identity (verified)
- Request timestamp
- Data scope affected
- Actions taken
- Completion timestamp
- Operator who processed request
Document Owner: Platform Operations Team
Last Updated: 2024-12-09
Next Review: After each playbook is used (update with learnings)