Playbooks¶
Document Purpose: Operational runbooks for common scenarios. Each playbook includes step-by-step procedures, escalation paths, and verification steps.
1. Incident Response Playbook¶
When to Use¶
- High-priority (P0/P1) alert triggered
- Customer-reported production issue
- Security incident detected
Severity Classification¶
| Severity | Definition | Response Time | Examples |
|---|---|---|---|
| SEV-1 | Complete platform outage or data loss | 15 minutes | DB failure, all integrations down |
| SEV-2 | Major feature unavailable | 1 hour | OCR pipeline down, timeline not loading |
| SEV-3 | Partial degradation | 4 hours | Slow response, intermittent errors |
| SEV-4 | Minor issue | Next business day | UI glitch, cosmetic issues |
Response Procedure¶
Step 1: Acknowledge (Within SLA)¶
[ ] Acknowledge alert in monitoring system
[ ] Post in #incidents Slack channel: "Investigating [ALERT_NAME] - [YOUR_NAME]"
[ ] Start incident timer for SLA tracking
Step 2: Assess Impact¶
[ ] Check affected services (Dashboard: Pipeline Health)
[ ] Identify customer impact scope (single hospital vs. all)
[ ] Classify severity using table above
[ ] Update incident channel with severity and impact summary
Step 3: Mitigate¶
[ ] Apply immediate fix or workaround
[ ] If rollback needed, follow Model Rollback or Deployment Rollback playbook
[ ] If data fix needed, document and execute with verification
[ ] Update incident channel every 15 minutes for SEV-1 and every 30 minutes for SEV-2
Step 4: Communicate¶
For SEV-1/2:
[ ] Notify Customer Success for customer communication
[ ] Update status page (if public)
[ ] Send summary email to leadership
For all severities:
[ ] Keep incident channel updated with progress
Step 5: Resolve & Verify¶
[ ] Deploy permanent fix
[ ] Verify fix in production (test cases, monitoring)
[ ] Mark incident as resolved in tracking system
[ ] Send resolution summary to incident channel
Step 6: Post-Mortem (SEV-1/2 only)¶
[ ] Schedule post-mortem meeting within 48 hours
[ ] Document timeline, root cause, contributing factors
[ ] Identify action items for prevention
[ ] Share post-mortem document with team
Escalation Path¶
- On-Call Engineer (primary responder)
- Backend Lead (escalate at 15 minutes if unresolved)
- Engineering Manager (escalate at 30 minutes)
- CTO (escalate at 45 minutes for SEV-1)
- CEO (escalate at 1 hour for customer-impacting SEV-1)
2. Data Backfill Playbook¶
When to Use¶
- After bug fix affecting historical data
- After model upgrade requiring reprocessing
- Customer requests historical data reprocessing
- Use case: [OPS-004] Reprocess Historical Backlog
Pre-Backfill Checklist¶
[ ] Identify affected records (patient IDs, date range)
[ ] Estimate processing time and resource impact
[ ] Schedule during low-traffic window (2 AM - 6 AM IST)
[ ] Notify Operations team and Customer Success
[ ] Create backup of affected data
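The backup step above can be as simple as exporting the affected rows before reprocessing. A minimal sketch, assuming the records live in a PostgreSQL table named documents; the table name, filter, and connection string are placeholders for this environment:

# Export the affected rows to a timestamped CSV before reprocessing
BACKUP_FILE="backfill_backup_$(date +%Y%m%d_%H%M%S).csv"
psql "$DATABASE_URL" -c "
  COPY (SELECT * FROM documents
        WHERE document_type = 'pathology'
          AND created_at BETWEEN '2024-01-01' AND '2024-12-01')
  TO STDOUT WITH (FORMAT csv, HEADER)
" > "$BACKUP_FILE"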
Backfill Procedure¶
Step 1: Prepare¶
# Identify records to reprocess
./scripts/identify_backfill_records.sh \
  --start-date "2024-01-01" \
  --end-date "2024-12-01" \
  --criteria "document_type=pathology" \
  --output backfill_manifest.json

# Validate manifest (review record count, sample records)
./scripts/validate_manifest.sh backfill_manifest.json
Step 2: Execute¶
# Start backfill with rate limiting
./scripts/run_backfill.sh \
  --manifest backfill_manifest.json \
  --concurrency 5 \
  --rate-limit 100/min \
  --dry-run    # Remove for actual execution

# Monitor progress
./scripts/monitor_backfill.sh --job-id BACKFILL_JOB_ID
Step 3: Verify¶
[ ] Check completion status (100% processed)
[ ] Validate sample records (spot check 10-20 records; see the sketch below)
[ ] Verify no regression in data quality
[ ] Compare before/after metrics
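For the spot check, a quick sketch that samples a few records from the manifest and fetches their current status; the manifest field .records[].id and the admin API URL are assumptions, not documented interfaces:

# Spot check 10 random records from the backfill manifest
jq -r '.records[].id' backfill_manifest.json | shuf -n 10 | while read -r record_id; do
  curl -s "https://platform.internal.example/api/admin/records/$record_id" \
    | jq '{id, processing_status, last_processed_at}'
done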
Step 4: Finalize¶
[ ] Document backfill in operations log
[ ] Notify stakeholders of completion
[ ] Archive manifest for audit trail
Rollback Plan¶
[ ] If errors detected, pause backfill immediately
[ ] Restore from pre-backfill backup
[ ] Document failure reason
[ ] Fix issue and re-plan backfill
3. Hospital Onboarding Playbook¶
When to Use¶
- New customer signed contract
- New pilot approved
- Expansion to new department/site
- Use cases: [OPS-001] through [OPS-006], [INT-001] through [INT-006]
Timeline Overview¶
| Week | Phase | Activities |
|---|---|---|
| 1 | Kickoff | Project planning, stakeholder mapping, system access |
| 2-3 | Integration | Develop/configure HL7, FHIR, PACS adapters |
| 4 | Testing | Integration testing, UAT, data validation |
| 5 | Training | Clinical user training, IT admin training |
| 6 | Go-Live | Soft launch, monitoring, support |
Week 1: Kickoff¶
Day 1-2: Project Setup¶
[ ] Create project folder in shared drive
[ ] Set up Slack channel: #project-[hospital-name]
[ ] Schedule kickoff meeting with hospital stakeholders
[ ] Assign internal team (PM, Engineer, CSM)
Day 2-3: Kickoff Meeting¶
Agenda:
[ ] Introductions and roles
[ ] Review scope and success metrics
[ ] Confirm timeline milestones
[ ] Identify stakeholder contacts
[ ] Discuss integration requirements
[ ] Schedule follow-up meetings
Day 4-5: System Access & Discovery¶
[ ] Obtain VPN/network access (if required)
[ ] Document EMR/LIS/PACS system details
- Vendor and version
- Interface type (HL7 v2.x, FHIR R4, file drop)
- Endpoint URLs or file paths
[ ] Review sample data files (anonymized)
[ ] Create integration specification document
Week 2-3: Integration Development¶
HL7 Integration¶
[ ] Configure HL7 listener endpoint
[ ] Map HL7 segments to internal data model
- PID → Patient demographics
- OBR → Orders
- OBX → Results
[ ] Implement message acknowledgment (ACK)
[ ] Test with sample messages
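Before the hospital sends live traffic, the listener can be exercised with a hand-built message over MLLP. A minimal sketch; the message content, listener hostname, and port are placeholders:

# Minimal ORU^R01 test message (HL7 segments separated by carriage returns)
MSG=$'MSH|^~\\&|TESTAPP|TESTFAC|PLATFORM|HOSP|20240101120000||ORU^R01|MSG00001|P|2.5\rPID|1||MRN12345^^^HOSP^MR||DOE^JANE||19800101|F\rOBR|1|||CBC^Complete Blood Count\rOBX|1|NM|WBC^White Blood Cells||6.1|10*3/uL|4.0-11.0|N|||F'

# Wrap in MLLP framing (0x0B before the message, 0x1C 0x0D after) and send; the ACK prints to stdout
printf '\013%s\034\015' "$MSG" | nc -w 5 hl7-listener.internal.example 2575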
FHIR Integration¶
[ ] Configure FHIR client with hospital endpoint
[ ] Map FHIR resources to internal model
- Patient, Observation, DiagnosticReport, etc.
[ ] Implement webhook receiver (if push-based)
[ ] Test with sample bundles
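To confirm connectivity and resource mapping end to end, a simple read against the hospital endpoint is often enough. A sketch assuming a FHIR R4 server with bearer-token auth; the base URL, token, and identifier are placeholders:

# Search for a test patient and summarize what the server returns
curl -s \
  -H "Authorization: Bearer $FHIR_ACCESS_TOKEN" \
  -H "Accept: application/fhir+json" \
  "https://fhir.hospital.example/fhir/Patient?identifier=MRN12345" \
  | jq '{resourceType, total, first_id: .entry[0].resource.id}'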
PACS Integration¶
[ ] Configure DICOM listener or file watcher
[ ] Map DICOM metadata to patient identity
[ ] Set up image storage and viewer artifacts
[ ] Test with sample imaging studies
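If the site uses a DICOM listener (rather than a file drop), DCMTK's command-line tools give a quick connectivity and ingestion check. A sketch; the AE titles, host, port, and sample study path are placeholders:

# Verify the listener answers a C-ECHO, then push a sample study via C-STORE
echoscu -aet TEST_SCU -aec PLATFORM_SCP dicom-listener.internal.example 11112
storescu -aet TEST_SCU -aec PLATFORM_SCP dicom-listener.internal.example 11112 sample_study/*.dcm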
Week 4: Testing¶
Integration Testing¶
[ ] End-to-end message flow test (HL7/FHIR → Bundle)
[ ] PACS ingestion test (DICOM → Timeline)
[ ] Error handling test (malformed messages, timeouts)
[ ] Performance test (target: <30s ingestion latency)
User Acceptance Testing (UAT)¶
[ ] Schedule UAT sessions with clinical users
[ ] Prepare UAT test cases:
- Search for patient by name/ABHA ID
- View patient timeline
- View lab trends chart
- View imaging study
- Acknowledge clinical alert
[ ] Document UAT feedback and issues
[ ] Fix critical issues before go-live
Week 5: Training¶
Clinical User Training (2-3 hours)¶
Agenda:
[ ] Platform overview and login
[ ] Patient search and timeline navigation
[ ] Lab results and trends
[ ] Imaging viewer
[ ] Alerts and acknowledgment
[ ] Document upload and annotation
[ ] Q&A and hands-on practice
IT Admin Training (1-2 hours)¶
Agenda:
[ ] Integration monitoring dashboard
[ ] Failed job review and retry
[ ] User management (add/remove users)
[ ] Basic troubleshooting
[ ] Escalation process
Week 6: Go-Live¶
Pre-Launch Checklist¶
[ ] All UAT issues resolved
[ ] Training completed for all user groups
[ ] Integration monitoring alerts configured
[ ] Support escalation path documented
[ ] Customer Success introduced to stakeholders
Soft Launch (Days 1-3)¶
[ ] Enable access for pilot user group (5-10 users)
[ ] Monitor closely for errors and feedback
[ ] Daily check-in calls with pilot users
[ ] Fix any emerging issues immediately
Full Go-Live (Days 4-7)¶
[ ] Expand access to all target users
[ ] Monitor adoption metrics (logins, usage patterns)
[ ] Continue daily monitoring for first week
[ ] Transition to regular support model
Post-Launch¶
[ ] Week 2: Collect initial feedback, address issues
[ ] Week 4: First review meeting with stakeholders
[ ] Week 8: Formal pilot review with metrics
[ ] Ongoing: Monthly check-ins and QBRs
4. Integration Debugging Playbook¶
When to Use¶
- HL7/FHIR message ingestion failures
- PACS file drop not detected
- Data appearing in wrong patient record
- Use cases: [INT-001] through [INT-006], [ING-001] through [ING-020]
Common Issues & Resolutions¶
Issue: HL7 Messages Not Being Received¶
Diagnosis Steps:
[ ] Check HL7 listener service status
[ ] Verify network connectivity to hospital endpoint
[ ] Check firewall rules (port typically 2575)
[ ] Review listener logs for connection attempts
[ ] Test with a sample message using HAPI TestPanel
Resolution:
- If service down: Restart HL7 listener; escalate if it does not stay up after restart
- If network issue: Coordinate with hospital IT
- If firewall: Request firewall rule update
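A diagnostic sketch for the steps above, assuming the listener runs as a systemd unit named hl7-listener; the unit name, hostname, and port are placeholders:

systemctl status hl7-listener --no-pager          # is the service running?
nc -zv hl7-listener.internal.example 2575         # is the MLLP port reachable from here?
journalctl -u hl7-listener --since "1 hour ago" | grep -iE 'connect|refused|timeout' | tail -n 50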
Issue: HL7 Message Parse Errors¶
Diagnosis Steps:
[ ] Retrieve failed message from DLQ
[ ] Identify parse error (segment, field, encoding)
[ ] Compare against expected HL7 format
[ ] Check for hospital-specific message variations
Resolution:
- If format variation: Update parser mapping
- If encoding issue: Adjust character set handling
- If required field missing: Add validation handling
- Reprocess failed message after fix
Issue: Patient Identity Mismatch¶
Diagnosis Steps:
[ ] Compare patient identifiers in source message vs. bundle
[ ] Check for duplicate patient records
[ ] Verify ABHA ID normalization logic
[ ] Review MPI (Master Patient Index) matching rules
Resolution:
- If duplicate: Merge patient records (with audit)
- If identifier mismatch: Update normalization logic
- If MPI issue: Adjust matching algorithm
Issue: PACS Files Not Detected¶
Diagnosis Steps:
[ ] Check file watcher service status
[ ] Verify file drop path is accessible
[ ] Check for file permission issues
[ ] Review watcher logs for errors
Resolution:
- If service down: Restart watcher
- If path issue: Correct configuration
- If permission: Fix file/folder permissions
- Backfill any missed files
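A quick sketch for checking the drop path itself, assuming a network-mounted folder; the path is a placeholder:

DROP_PATH=/mnt/pacs-drop
mountpoint -q "$DROP_PATH" && echo "mounted" || echo "NOT mounted"   # is the share mounted?
ls -ld "$DROP_PATH"                                                  # ownership and permissions
find "$DROP_PATH" -type f -mmin -60 | head -n 20                     # files received in the last hour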
5. Model Rollback Playbook¶
When to Use¶
- OCR/ASR accuracy degradation detected
- Model producing unexpected outputs
- Performance regression after model update
- Use case: [QAS-004] Model Drift Detection
Pre-Rollback Assessment¶
[ ] Confirm issue is model-related (not data or infrastructure)
[ ] Document current model version and deployment date
[ ] Identify previous stable model version
[ ] Estimate impact scope (affected documents/audio)
Rollback Procedure¶
Step 1: Prepare Rollback¶
[ ] Retrieve previous model artifacts from storage
[ ] Verify previous model is available and tested
[ ] Notify team of pending rollback
[ ] Schedule rollback during low-traffic period (if not emergency)
Step 2: Execute Rollback¶
# Switch to previous model version
./scripts/model_deploy.sh \
  --model ocr-v1.2.3 \
  --environment production \
  --rollback

# Verify deployment
./scripts/verify_model.sh --model ocr-v1.2.3
Step 3: Verify Rollback¶
[ ] Test with sample documents/audio
[ ] Compare accuracy metrics to baseline
[ ] Monitor processing pipeline for errors
[ ] Confirm throughput returned to normal
Step 4: Reprocess Affected Data¶
[ ] Identify documents/audio processed with faulty model
[ ] Create backfill manifest for reprocessing
[ ] Execute backfill using Data Backfill Playbook
[ ] Notify stakeholders of reprocessed data
Step 5: Root Cause Analysis¶
[ ] Document what caused model degradation
[ ] Identify gaps in testing or monitoring
[ ] Update validation criteria for future deployments
[ ] Create action items to prevent recurrence
6. Disaster Recovery Drill Playbook¶
When to Use¶
- Monthly scheduled DR drill
- After significant infrastructure changes
- Regulatory audit preparation
- Use case: [OPS-006] Run Disaster Recovery Drill
DR Drill Types¶
| Type | Frequency | Scope |
|---|---|---|
| Tabletop | Monthly | Walk through procedures, no actual failover |
| Partial Failover | Quarterly | Failover non-critical components |
| Full Failover | Annually | Complete failover to DR site |
Drill Procedure (Tabletop)¶
Preparation (1 day before)¶
[ ] Notify participants (Ops, Eng, Leadership)
[ ] Prepare scenario description
[ ] Verify runbooks are up to date
[ ] Prepare checklist for walkthrough
Execution (1-2 hours)¶
[ ] Present scenario: "Primary region outage detected"
[ ] Walk through detection and alerting process
[ ] Review failover steps (read from runbook)
[ ] Discuss decision points and escalations
[ ] Identify gaps in procedures
[ ] Document questions and action items
Post-Drill¶
[ ] Send summary to participants
[ ] Create tickets for identified gaps
[ ] Update runbooks with improvements
[ ] Schedule next drill
Drill Procedure (Full Failover)¶
Pre-Drill (1 week before)¶
[ ] Schedule maintenance window (4-6 hours)
[ ] Notify customers of planned maintenance
[ ] Verify DR site is synced and ready
[ ] Prepare rollback plan
Execution¶
[ ] Put primary site in maintenance mode
[ ] Initiate failover to DR site
[ ] Verify all services functional in DR
[ ] Run smoke tests (patient search, timeline, ingestion; see the sketch after this checklist)
[ ] Measure failover time (target: <1 hour)
[ ] Document any issues encountered
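The smoke tests can be scripted so the drill produces a repeatable record. A sketch; the DR hostname and endpoint paths are placeholders:

# Hit a few representative endpoints on the DR site and record the HTTP status codes
DR_BASE="https://dr.platform.example"
for path in /health "/api/patients/search?q=smoke" /api/ingestion/status; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$DR_BASE$path")
  echo "$path -> HTTP $code"
done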
Failback¶
[ ] Re-sync primary site from DR
[ ] Initiate failback to primary
[ ] Verify all services functional
[ ] Resume normal operations
[ ] End maintenance window
Post-Drill¶
[ ] Document total downtime
[ ] Analyze issues and gaps
[ ] Update DR runbooks
[ ] Report to leadership and compliance
7. API Key Rotation Playbook¶
When to Use¶
- Scheduled key rotation (quarterly)
- Security incident requiring immediate rotation
- Employee offboarding with API access
- Use case: [OPS-003] Rotate API Keys
Rotation Procedure¶
Step 1: Generate New Keys¶
[ ] Log into API management console
[ ] Generate new API key/secret pair
[ ] Store new credentials in secrets manager
[ ] Do NOT revoke old key yet
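A sketch of the storage step, assuming AWS Secrets Manager; the secret name, region, and JSON layout are placeholders for whatever secrets manager is in use:

# Store the new key/secret pair as a new version of the existing secret
aws secretsmanager put-secret-value \
  --secret-id platform/integrations/api-key \
  --secret-string "{\"api_key\":\"$NEW_API_KEY\",\"api_secret\":\"$NEW_API_SECRET\"}" \
  --region ap-south-1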
Step 2: Update Integrations¶
[ ] Identify all services using the key
[ ] Update each service with new credentials
- HL7 listener
- FHIR client
- PACS file watcher
- Internal API clients
[ ] Deploy updated configurations
Step 3: Verify¶
[ ] Test each integration with new key
[ ] Confirm no auth failures in logs (see the sketch below)
[ ] Monitor for 24-48 hours
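A sketch for the auth-failure check; the log location and match patterns are placeholders:

# Scan recent integration logs for authentication failures after the key swap
grep -riE '401|403|unauthorized|invalid[ _-]?(api[ _-]?)?key' /var/log/platform/integrations/ | tail -n 50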
Step 4: Revoke Old Key¶
[ ] After verification period, revoke old key
[ ] Document rotation in operations log
[ ] Update key expiry tracking
Emergency Rotation (Compromised Key)¶
[ ] Immediately revoke compromised key
[ ] Generate and deploy new key (may cause brief outage)
[ ] Investigate scope of compromise
[ ] Notify security team
[ ] Update stakeholders on impact
8. Consent Revocation Processing¶
When to Use¶
- Patient requests data deletion/processing stop
- DPDP Act compliance requirement
- Use case: [SEC-003] Process Consent Revocation, [CONS-002] Revoke DPDP Consent
Processing Procedure¶
Step 1: Receive Request¶
[ ] Verify requestor identity (patient or authorized representative)
[ ] Log request in consent management system
[ ] Acknowledge receipt to requestor within 24 hours
Step 2: Identify Affected Data¶
[ ] Query all patient data using identifiers (ABHA ID, MRN)
[ ] List data categories:
- Clinical records
- Imaging studies
- Processed documents (OCR/ASR outputs)
- Audit logs (may be retained for compliance)
[ ] Document scope for processing
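A sketch of the scoping step using a hypothetical helper script; the script name, flags, and output format are illustrative, not an existing tool:

# Enumerate stored data tied to the patient identifiers in scope
./scripts/list_patient_data.sh \
  --abha-id "$ABHA_ID" \
  --mrn "$MRN" \
  --output "consent_scope_$(date +%Y%m%d).json"

# Review the data categories before executing the revocation
jq '.categories | keys' "consent_scope_$(date +%Y%m%d).json"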
Step 3: Execute Revocation¶
[ ] Mark patient consent as revoked
[ ] Stop ongoing processing for patient
[ ] Depending on request type:
- Processing stop: Disable future processing
- Data deletion: Execute deletion workflow
[ ] Retain audit logs of consent and deletion (DPDP requirement)
Step 4: Verify & Confirm¶
[ ] Verify data is no longer accessible
[ ] Verify that no further processing is triggered for the patient
[ ] Send confirmation to requestor within 30 days
[ ] Document completion in consent system
Audit Trail¶
All consent revocation actions must be logged with:
- Requestor identity (verified)
- Request timestamp
- Data scope affected
- Actions taken
- Completion timestamp
- Operator who processed request
Document Owner: Platform Operations Team
Last Updated: 2024-12-09
Next Review: After each playbook is used (update with learnings)