
Playbooks

Document Purpose: This document provides operational runbooks for common scenarios. Each playbook includes step-by-step procedures, escalation paths, and verification steps.


1. Incident Response Playbook

When to Use

  • P0/P1 alert triggered
  • Customer-reported production issue
  • Security incident detected

Severity Classification

Severity | Definition | Response Time | Examples
--- | --- | --- | ---
SEV-1 | Complete platform outage or data loss | 15 minutes | DB failure, all integrations down
SEV-2 | Major feature unavailable | 1 hour | OCR pipeline down, timeline not loading
SEV-3 | Partial degradation | 4 hours | Slow response, intermittent errors
SEV-4 | Minor issue | Next business day | UI glitch, cosmetic issues

Response Procedure

Step 1: Acknowledge (Within SLA)

[ ] Acknowledge alert in monitoring system
[ ] Post in #incidents Slack channel: "Investigating [ALERT_NAME] - [YOUR_NAME]"
[ ] Start incident timer for SLA tracking
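The Slack post above can also be sent from a terminal. A minimal sketch, assuming an incoming webhook is configured for #incidents; SLACK_WEBHOOK_URL, ALERT_NAME, and YOUR_NAME are placeholders, not part of our standard tooling:

# Post the acknowledgement to #incidents via a Slack incoming webhook
# (SLACK_WEBHOOK_URL, ALERT_NAME, and YOUR_NAME are placeholders)
curl -sS -X POST "$SLACK_WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"Investigating ${ALERT_NAME} - ${YOUR_NAME}\"}"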

Step 2: Assess Impact

[ ] Check affected services (Dashboard: Pipeline Health)
[ ] Identify customer impact scope (single hospital vs. all)
[ ] Classify severity using table above
[ ] Update incident channel with severity and impact summary

Step 3: Mitigate

[ ] Apply immediate fix or workaround
[ ] If rollback needed, follow Model Rollback or Deployment Rollback playbook
[ ] If data fix needed, document and execute with verification
[ ] Update incident channel every 15 minutes for SEV-1 and every 30 minutes for SEV-2

Step 4: Communicate

For SEV-1/2:
[ ] Notify Customer Success for customer communication
[ ] Update status page (if public)
[ ] Send summary email to leadership

For all severities:
[ ] Keep incident channel updated with progress

Step 5: Resolve & Verify

[ ] Deploy permanent fix
[ ] Verify fix in production (test cases, monitoring)
[ ] Mark incident as resolved in tracking system
[ ] Send resolution summary to incident channel

Step 6: Post-Mortem (SEV-1/2 only)

[ ] Schedule post-mortem meeting within 48 hours
[ ] Document timeline, root cause, contributing factors
[ ] Identify action items for prevention
[ ] Share post-mortem document with team

Escalation Path

  1. On-Call Engineer (Primary)
  2. Backend Lead (15 min escalation)
  3. Engineering Manager (30 min escalation)
  4. CTO (45 min escalation for SEV-1)
  5. CEO (1 hour for customer-impacting SEV-1)

2. Data Backfill Playbook

When to Use

  • After bug fix affecting historical data
  • After model upgrade requiring reprocessing
  • Customer requests historical data reprocessing
  • Use case: [OPS-004] Reprocess Historical Backlog

Pre-Backfill Checklist

[ ] Identify affected records (patient IDs, date range)
[ ] Estimate processing time and resource impact
[ ] Schedule during low-traffic window (2 AM - 6 AM IST)
[ ] Notify Operations team and Customer Success
[ ] Create backup of affected data
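A minimal backup sketch for the last item, assuming the affected records live in a Postgres table; DATABASE_URL and the table name are placeholders to adapt:

# Snapshot the affected table before starting the backfill
# (assumes Postgres; DATABASE_URL and the table name are placeholders)
pg_dump --table=documents \
  --format=custom \
  --file="pre_backfill_$(date +%Y%m%d).dump" \
  "$DATABASE_URL"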

Backfill Procedure

Step 1: Prepare

# Identify records to reprocess
./scripts/identify_backfill_records.sh \
  --start-date "2024-01-01" \
  --end-date "2024-12-01" \
  --criteria "document_type=pathology" \
  --output backfill_manifest.json

# Validate manifest (review record count, sample records)
./scripts/validate_manifest.sh backfill_manifest.json

Step 2: Execute

# Start backfill with rate limiting
./scripts/run_backfill.sh \
  --manifest backfill_manifest.json \
  --concurrency 5 \
  --rate-limit 100/min \
  --dry-run # Remove for actual execution

# Monitor progress
./scripts/monitor_backfill.sh --job-id BACKFILL_JOB_ID

Step 3: Verify

[ ] Check completion status (100% processed)
[ ] Validate sample records (spot check 10-20 records)
[ ] Verify no regression in data quality
[ ] Compare before/after metrics
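One way to pull a spot-check sample, assuming the manifest is JSON with a top-level "records" array; the key names are assumptions about the manifest schema, so adjust the jq path to the real layout:

# Grab 20 record IDs from the manifest to use as the spot-check list
# (the "records" / "record_id" keys are assumptions about the manifest schema)
jq -r '.records[:20][] | .record_id' backfill_manifest.json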

Step 4: Finalize

[ ] Document backfill in operations log
[ ] Notify stakeholders of completion
[ ] Archive manifest for audit trail

Rollback Plan

[ ] If errors detected, pause backfill immediately
[ ] Restore from pre-backfill backup
[ ] Document failure reason
[ ] Fix issue and re-plan backfill

3. Hospital Onboarding Playbook

When to Use

  • New customer signed contract
  • New pilot approved
  • Expansion to new department/site
  • Use cases: [OPS-001] through [OPS-006], [INT-001] through [INT-006]

Timeline Overview

Week | Phase | Activities
--- | --- | ---
1 | Kickoff | Project planning, stakeholder mapping, system access
2-3 | Integration | Develop/configure HL7, FHIR, PACS adapters
4 | Testing | Integration testing, UAT, data validation
5 | Training | Clinical user training, IT admin training
6 | Go-Live | Soft launch, monitoring, support

Week 1: Kickoff

Day 1-2: Project Setup

[ ] Create project folder in shared drive
[ ] Set up Slack channel: #project-[hospital-name]
[ ] Schedule kickoff meeting with hospital stakeholders
[ ] Assign internal team (PM, Engineer, CSM)

Day 2-3: Kickoff Meeting

Agenda:
[ ] Introductions and roles
[ ] Review scope and success metrics
[ ] Confirm timeline milestones
[ ] Identify stakeholder contacts
[ ] Discuss integration requirements
[ ] Schedule follow-up meetings

Day 4-5: System Access & Discovery

[ ] Obtain VPN/network access (if required)
[ ] Document EMR/LIS/PACS system details

  - Vendor and version
  - Interface type (HL7 v2.x, FHIR R4, file drop)
  - Endpoint URLs or file paths
[ ] Review sample data files (anonymized)
[ ] Create integration specification document

Week 2-3: Integration Development

HL7 Integration

[ ] Configure HL7 listener endpoint
[ ] Map HL7 segments to internal data model

  - PID → Patient demographics
  - OBR → Orders
  - OBX → Results
[ ] Implement message acknowledgment (ACK)
[ ] Test with sample messages
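For a quick end-to-end check from the command line, a sample message can be pushed over MLLP with netcat. The listener hostname below is a placeholder and the message is a throwaway ADT example:

# Send a minimal ADT^A01 over MLLP (0x0B start block, 0x1C 0x0D end block)
# (hl7-listener.internal is a placeholder for the actual listener host)
printf '\x0bMSH|^~\\&|TEST|TEST|PLATFORM|PLATFORM|20240101120000||ADT^A01|MSG00001|P|2.5\rPID|1||12345^^^HOSP^MR||DOE^JOHN\r\x1c\x0d' \
  | nc -w 5 hl7-listener.internal 2575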

FHIR Integration

[ ] Configure FHIR client with hospital endpoint
[ ] Map FHIR resources to internal model

  - Patient, Observation, DiagnosticReport, etc.
[ ] Implement webhook receiver (if push-based)
[ ] Test with sample bundles
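Two quick curl checks can cover both directions; the base URL, token, and webhook URL below are placeholders:

# Sanity-check connectivity and auth against the hospital FHIR endpoint
# (base URL and token are placeholders)
curl -sS "https://fhir.hospital.example/fhir/Patient?_count=1" \
  -H "Accept: application/fhir+json" \
  -H "Authorization: Bearer $FHIR_TOKEN"

# If push-based, POST a sample bundle to our webhook receiver (URL is a placeholder)
curl -sS -X POST "https://ingest.platform.internal/fhir/webhook" \
  -H "Content-Type: application/fhir+json" \
  -d @sample_bundle.json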

PACS Integration

[ ] Configure DICOM listener or file watcher
[ ] Map DICOM metadata to patient identity
[ ] Set up image storage and viewer artifacts
[ ] Test with sample imaging studies
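If DCMTK is available, a sample study can be pushed to the DICOM listener with storescu; the AE titles, hostname, and port are placeholders:

# C-STORE a sample study to the DICOM listener using DCMTK's storescu
# (AE titles, hostname, and port are placeholders)
storescu -aet TEST_SCU -aec PLATFORM_SCP dicom-listener.internal 11112 sample_study/*.dcm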

Week 4: Testing

Integration Testing

[ ] End-to-end message flow test (HL7/FHIR → Bundle)
[ ] PACS ingestion test (DICOM → Timeline)
[ ] Error handling test (malformed messages, timeouts)
[ ] Performance test (target: <30s ingestion latency)

User Acceptance Testing (UAT)

[ ] Schedule UAT sessions with clinical users
[ ] Prepare UAT test cases:

  - Search for patient by name/ABHA ID
  - View patient timeline
  - View lab trends chart
  - View imaging study
  - Acknowledge clinical alert
[ ] Document UAT feedback and issues
[ ] Fix critical issues before go-live

Week 5: Training

Clinical User Training (2-3 hours)

Agenda:
[ ] Platform overview and login
[ ] Patient search and timeline navigation
[ ] Lab results and trends
[ ] Imaging viewer
[ ] Alerts and acknowledgment
[ ] Document upload and annotation
[ ] Q&A and hands-on practice

IT Admin Training (1-2 hours)

Agenda:
[ ] Integration monitoring dashboard
[ ] Failed job review and retry
[ ] User management (add/remove users)
[ ] Basic troubleshooting
[ ] Escalation process

Week 6: Go-Live

Pre-Launch Checklist

[ ] All UAT issues resolved
[ ] Training completed for all user groups
[ ] Integration monitoring alerts configured
[ ] Support escalation path documented
[ ] Customer Success introduced to stakeholders

Soft Launch (Days 1-3)

[ ] Enable access for pilot user group (5-10 users)
[ ] Monitor closely for errors and feedback
[ ] Daily check-in calls with pilot users
[ ] Fix any emerging issues immediately

Full Go-Live (Days 4-7)

[ ] Expand access to all target users
[ ] Monitor adoption metrics (logins, usage patterns)
[ ] Continue daily monitoring for first week
[ ] Transition to regular support model

Post-Launch

[ ] Week 2: Collect initial feedback, address issues
[ ] Week 4: First review meeting with stakeholders
[ ] Week 8: Formal pilot review with metrics
[ ] Ongoing: Monthly check-ins and QBRs

4. Integration Debugging Playbook

When to Use

  • HL7/FHIR message ingestion failures
  • PACS file drop not detected
  • Data appearing in wrong patient record
  • Use cases: [INT-001] through [INT-006], [ING-001] through [ING-020]

Common Issues & Resolutions

Issue: HL7 Messages Not Being Received

Diagnosis Steps:
[ ] Check HL7 listener service status
[ ] Verify network connectivity to hospital endpoint
[ ] Check firewall rules (port typically 2575)
[ ] Review listener logs for connection attempts
[ ] Test with sample message using HAPI Test Panel
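The first four checks usually reduce to a handful of commands; the service name, hostname, and log unit below are placeholders for the real deployment:

# Quick checks for the diagnosis steps above (names are placeholders)
systemctl status hl7-listener                          # listener service status
ss -tlnp | grep 2575                                   # confirm the MLLP port is bound locally
nc -zv hospital-gateway.internal 2575                  # reachability of the hospital endpoint
journalctl -u hl7-listener --since "1 hour ago" | grep -i connect   # recent connection attempts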

Resolution:

- If service down: Restart HL7 listener, alert if not auto-recovered
- If network issue: Coordinate with hospital IT
- If firewall: Request firewall rule update

Issue: HL7 Message Parse Errors

Diagnosis Steps:
[ ] Retrieve failed message from DLQ
[ ] Identify parse error (segment, field, encoding)
[ ] Compare against expected HL7 format
[ ] Check for hospital-specific message variations

Resolution:

- If format variation: Update parser mapping
- If encoding issue: Adjust character set handling
- If required field missing: Add validation handling
- Reprocess failed message after fix

Issue: Patient Identity Mismatch

Diagnosis Steps:
[ ] Compare patient identifiers in source message vs. bundle
[ ] Check for duplicate patient records
[ ] Verify ABHA ID normalization logic
[ ] Review MPI (Master Patient Index) matching rules

Resolution:

- If duplicate: Merge patient records (with audit)
- If identifier mismatch: Update normalization logic
- If MPI issue: Adjust matching algorithm

Issue: PACS Files Not Detected

Diagnosis Steps:
[ ] Check file watcher service status
[ ] Verify file drop path is accessible
[ ] Check for file permission issues
[ ] Review watcher logs for errors
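These checks are quick from a shell; the watcher service name and drop path are placeholders, and inotifywait requires inotify-tools:

# Quick checks for the diagnosis steps above (service name and path are placeholders)
systemctl status pacs-file-watcher                     # watcher service status
ls -ld /mnt/pacs-drop && namei -l /mnt/pacs-drop       # path exists; permissions along the path
inotifywait -t 30 /mnt/pacs-drop                       # wait up to 30s for a new file event (inotify-tools)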

Resolution:

- If service down: Restart watcher
- If path issue: Correct configuration
- If permission: Fix file/folder permissions
- Backfill any missed files

5. Model Rollback Playbook

When to Use

  • OCR/ASR accuracy degradation detected
  • Model producing unexpected outputs
  • Performance regression after model update
  • Use case: [QAS-004] Model Drift Detection

Pre-Rollback Assessment

[ ] Confirm issue is model-related (not data or infrastructure)
[ ] Document current model version and deployment date
[ ] Identify previous stable model version
[ ] Estimate impact scope (affected documents/audio)

Rollback Procedure

Step 1: Prepare Rollback

[ ] Retrieve previous model artifacts from storage
[ ] Verify previous model is available and tested
[ ] Notify team of pending rollback
[ ] Schedule rollback during low-traffic period (if not emergency)

Step 2: Execute Rollback

# Switch to previous model version
./scripts/model_deploy.sh \
  --model ocr-v1.2.3 \
  --environment production \
  --rollback

# Verify deployment
./scripts/verify_model.sh --model ocr-v1.2.3

Step 3: Verify Rollback

[ ] Test with sample documents/audio
[ ] Compare accuracy metrics to baseline
[ ] Monitor processing pipeline for errors
[ ] Confirm throughput returned to normal

Step 4: Reprocess Affected Data

[ ] Identify documents/audio processed with faulty model
[ ] Create backfill manifest for reprocessing
[ ] Execute backfill using Data Backfill Playbook
[ ] Notify stakeholders of reprocessed data
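The manifest can be built with the same tooling as the Data Backfill Playbook; the "model_version" criteria key and version string below are assumptions, so confirm the real filter name before running:

# Build a manifest of records processed by the faulty model version
# (the "model_version" criteria key and version string are assumptions)
./scripts/identify_backfill_records.sh \
  --criteria "model_version=ocr-v1.3.0" \
  --output rollback_reprocess_manifest.json
./scripts/validate_manifest.sh rollback_reprocess_manifest.json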

Step 5: Root Cause Analysis

[ ] Document what caused model degradation
[ ] Identify gaps in testing or monitoring
[ ] Update validation criteria for future deployments
[ ] Create action items to prevent recurrence

6. Disaster Recovery Drill Playbook

When to Use

  • Monthly scheduled DR drill
  • After significant infrastructure changes
  • Regulatory audit preparation
  • Use case: [OPS-006] Run Disaster Recovery Drill

DR Drill Types

Type | Frequency | Scope
--- | --- | ---
Tabletop | Monthly | Walk through procedures, no actual failover
Partial Failover | Quarterly | Fail over non-critical components
Full Failover | Annually | Complete failover to DR site

Drill Procedure (Tabletop)

Preparation (1 day before)

[ ] Notify participants (Ops, Eng, Leadership)
[ ] Prepare scenario description
[ ] Verify runbooks are up to date
[ ] Prepare checklist for walkthrough

Execution (1-2 hours)

[ ] Present scenario: "Primary region outage detected"
[ ] Walk through detection and alerting process
[ ] Review failover steps (read from runbook)
[ ] Discuss decision points and escalations
[ ] Identify gaps in procedures
[ ] Document questions and action items

Post-Drill

[ ] Send summary to participants
[ ] Create tickets for identified gaps
[ ] Update runbooks with improvements
[ ] Schedule next drill

Drill Procedure (Full Failover)

Pre-Drill (1 week before)

[ ] Schedule maintenance window (4-6 hours)
[ ] Notify customers of planned maintenance
[ ] Verify DR site is synced and ready
[ ] Prepare rollback plan

Execution

[ ] Put primary site in maintenance mode
[ ] Initiate failover to DR site
[ ] Verify all services functional in DR
[ ] Run smoke tests (patient search, timeline, ingestion)
[ ] Measure failover time (target: <1 hour)
[ ] Document any issues encountered

Failback

[ ] Re-sync primary site from DR
[ ] Initiate failback to primary
[ ] Verify all services functional
[ ] Resume normal operations
[ ] End maintenance window

Post-Drill

[ ] Document total downtime
[ ] Analyze issues and gaps
[ ] Update DR runbooks
[ ] Report to leadership and compliance

7. API Key Rotation Playbook

When to Use

  • Scheduled key rotation (quarterly)
  • Security incident requiring immediate rotation
  • Employee offboarding with API access
  • Use case: [OPS-003] Rotate API Keys

Rotation Procedure

Step 1: Generate New Keys

[ ] Log into API management console
[ ] Generate new API key/secret pair
[ ] Store new credentials in secrets manager
[ ] Do NOT revoke old key yet
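For example, if the secrets manager is AWS Secrets Manager, storing the new credential is a one-liner; the secret name is a placeholder:

# Store the new credential (assumes AWS Secrets Manager; secret name is a placeholder)
aws secretsmanager put-secret-value \
  --secret-id prod/platform/api-key \
  --secret-string "$NEW_API_KEY"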

Step 2: Update Integrations

[ ] Identify all services using the key
[ ] Update each service with new credentials

  - HL7 listener
  - FHIR client
  - PACS file watcher
  - Internal API clients
[ ] Deploy updated configurations

Step 3: Verify

[ ] Test each integration with new key
[ ] Confirm no auth failures in logs
[ ] Monitor for 24-48 hours
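A minimal verification sketch, assuming a bearer-token API and journald logs; the endpoint, health path, and service unit are placeholders:

# Smoke-test one client with the new key, then scan recent logs for auth failures
# (endpoint and service name are placeholders)
curl -sS -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer $NEW_API_KEY" \
  https://api.platform.internal/v1/health
journalctl -u hl7-listener --since "24 hours ago" | grep -Ei '401|403|unauthorized'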

Step 4: Revoke Old Key

[ ] After verification period, revoke old key
[ ] Document rotation in operations log
[ ] Update key expiry tracking

Emergency Rotation (Compromised Key)

[ ] Immediately revoke compromised key
[ ] Generate and deploy new key (may cause brief outage)
[ ] Investigate scope of compromise
[ ] Notify security team
[ ] Update stakeholders on impact

8. Consent Revocation Playbook

When to Use

  • Patient requests data deletion/processing stop
  • DPDP Act compliance requirement
  • Use case: [SEC-003] Process Consent Revocation, [CONS-002] Revoke DPDP Consent

Processing Procedure

Step 1: Receive Request

[ ] Verify requestor identity (patient or authorized representative)
[ ] Log request in consent management system
[ ] Acknowledge receipt to requestor within 24 hours

Step 2: Identify Affected Data

[ ] Query all patient data using identifiers (ABHA ID, MRN)
[ ] List data categories:

  - Clinical records
  - Imaging studies
  - Processed documents (OCR/ASR outputs)
  - Audit logs (may be retained for compliance)
[ ] Document scope for processing
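An illustrative scoping query, assuming the clinical records sit in a Postgres table; the table and column names are placeholders for the real schema:

# List records tied to the patient identifiers (schema names are assumptions)
psql "$DATABASE_URL" -c \
  "SELECT id, record_type, created_at FROM clinical_records
   WHERE abha_id = '<ABHA_ID>' OR mrn = '<MRN>';"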

Step 3: Execute Revocation

[ ] Mark patient consent as revoked
[ ] Stop ongoing processing for patient
[ ] Depending on request type:

  - Processing stop: Disable future processing
  - Data deletion: Execute deletion workflow
[ ] Retain audit logs of consent and deletion (DPDP requirement)

Step 4: Verify & Confirm

[ ] Verify data is no longer accessible
[ ] Verify no future processing occurs
[ ] Send confirmation to requestor within 30 days
[ ] Document completion in consent system

Audit Trail

All consent revocation actions must be logged with:

  • Requestor identity (verified)
  • Request timestamp
  • Data scope affected
  • Actions taken
  • Completion timestamp
  • Operator who processed request

Document Owner: Platform Operations Team
Last Updated: 2024-12-09
Next Review: After each playbook is used (update with learnings)