OCR Pipeline

Technology Stack: See MCP OCR Servers for engine trade-off analysis.

Engines: Tesseract, EasyOCR, PaddleOCR, Surya, Docling, Chandra OCR

Indexing: LlamaIndex for RAG pipelines

Protocol: MCP (Model Context Protocol) for orchestration

UC-PROC-001: Queue OCR Job

Purpose: Accept file upload and schedule processing.

Property	Value
Actor	API Server
Trigger	`POST /api/upload/document`
Priority	P1

Main Success Scenario:

1. Validate request (file size at most 10 MB, type PDF/PNG/JPG)
2. Generate unique `jobId` (UUID)
3. Upload file to S3 bucket `raw/`
   Key: `patients/{abhaId}/docs/{jobId}.pdf`

4. Create Job record in DB: `status=queued`
5. Push message to `ocr-queue`
   Payload: `{ jobId, s3Key, patientId }`

6. Return HTTP 202 Accepted with `jobId`

Alternative Flows:

Alt-1: File Size Exceeded

- File > 10MB limit
- Return HTTP 413 Payload Too Large
- Log rejection with file size

Alt-2: Invalid Format

- File type not PDF/PNG/JPG
- Return HTTP 415 Unsupported Media Type
- Include supported types in error response
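
The main scenario and both alternative flows can be sketched as a single handler. This is a minimal sketch under stated assumptions, not the service's actual code: `queue_ocr_job`, `MAX_BYTES`, `ALLOWED_TYPES`, and the in-memory `store`, `jobs`, and `queue` stand-ins for S3, the DB, and `ocr-queue` are hypothetical names, and reading "10MB" as an inclusive 10 MiB limit is a guess.

```python
import uuid

MAX_BYTES = 10 * 1024 * 1024  # assumption: "10MB" means 10 MiB, inclusive
ALLOWED_TYPES = {"application/pdf", "image/png", "image/jpeg"}


def queue_ocr_job(data, content_type, abha_id, patient_id, store, jobs, queue):
    """Validate, store, record, and enqueue an upload; return (status, body)."""
    if len(data) > MAX_BYTES:
        # Alt-1: file size exceeded
        return 413, {"error": "Payload Too Large", "size": len(data)}
    if content_type not in ALLOWED_TYPES:
        # Alt-2: invalid format, with the supported types in the response
        return 415, {"error": "Unsupported Media Type",
                     "supported": sorted(ALLOWED_TYPES)}

    job_id = str(uuid.uuid4())
    s3_key = f"patients/{abha_id}/docs/{job_id}.pdf"

    store[s3_key] = data                  # stand-in for the S3 upload
    jobs[job_id] = {"status": "queued"}   # stand-in for the DB Job record
    # The message is pushed only after the file and the Job record exist.
    queue.append({"jobId": job_id, "s3Key": s3_key, "patientId": patient_id})
    return 202, {"jobId": job_id}
```

Writing to the store before appending to the queue mirrors the acceptance criterion that the file is stored before queuing, so a worker never dequeues a job whose file is missing.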

Observability:

Metric: `ocr_jobs_queued_total`, `ocr_upload_size_bytes`
Log: `{"event": "ocr_queued", "jobId": "abc123", "patientId": "..."}`

Acceptance Criteria:

[ ] Returns 202 immediately (async)
[ ] File safely stored in S3 before queuing

UC-PROC-002: Detect Document Language

Purpose: Determine if English or Hindi engine is needed.

Property	Value
Actor	OCR Worker
Trigger	Job in `ocr-queue`
Priority	P1

Main Success Scenario:

1. Dequeue job and download file from S3
2. Check if `language` was provided in API request
3. If not, extract text from page 1 using a lightweight tool (`pdftotext`)
4. Run `langdetect` on sample text
5. If `hi` detected -> Set model `hin`
6. If `en` detected -> Set model `eng`
7. Pass to Execution Step
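
The selection logic in the steps above, including the fallback to English, might look like the sketch below. `choose_model`, `LANG_TO_MODEL`, and `DEFAULT_MODEL` are hypothetical names. The detector is injected as a callable so the sketch runs without extra packages, and the optional `requested` value is assumed to use the same two-letter codes as the detector.

```python
LANG_TO_MODEL = {"hi": "hin", "en": "eng"}  # detector code -> Tesseract model
DEFAULT_MODEL = "eng"


def choose_model(requested, sample_text, detect):
    """Return the Tesseract model name for a job.

    `requested` is the optional language from the API request (assumed to
    be a two-letter code); `detect` is any callable mapping text to a code.
    """
    if requested:
        return LANG_TO_MODEL.get(requested, DEFAULT_MODEL)
    if not sample_text.strip():
        return DEFAULT_MODEL      # nothing to detect, e.g. a scanned image
    try:
        code = detect(sample_text)
    except Exception:
        return DEFAULT_MODEL      # detector could not decide
    return LANG_TO_MODEL.get(code, DEFAULT_MODEL)
```

With the `langdetect` package installed, its `detect` function can be passed as the `detect` argument.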

Alternative Flows:

Alt-1: Detection Failed

- Text extraction yields empty string (scanned image)
- Default to `eng` (English)
- Log warning: `{"event": "lang_detect_fail", "jobId": "..."}`

Observability:

Metric: `ocr_lang_detected_total{lang="hi"}`, `ocr_lang_detect_failures`
Log: `{"event": "lang_detected", "jobId": "abc123", "lang": "hi", "confidence": 0.92}`

Acceptance Criteria:

[ ] Correctly identifies Hindi vs English documents
[ ] Defaults to English if detection fails

UC-PROC-003: Execute Tesseract Engine

Purpose: Run the core OCR binary to extract text.

Property	Value
Actor	OCR Worker
Trigger	Language detected
Priority	P1

Main Success Scenario:

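1. Construct Tesseract command. Tesseract reads image files, not PDFs, so PDF pages must be rasterized first (e.g., with `pdftoppm`); the trailing `txt tsv` config names produce both `output.txt` and `output.tsv`
   `tesseract page.png output -l {lang} --oem 3 --psm 1 txt tsv`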

2. Spawn child process
3. Wait for process exit (Timeout: 60s)
4. Read `output.txt` (extracted text)
5. Read `output.tsv` (confidence data)
6. Pass results to Processing Step
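
Steps 1 to 3 can be sketched with the standard `subprocess` module. The helper names `build_tesseract_cmd` and `run_with_timeout` are hypothetical. The trailing `txt tsv` config names are an addition to the command shown above, included so that both `output.txt` and `output.tsv` are written, and the input is assumed to be an image file that Tesseract can read.

```python
import subprocess


def build_tesseract_cmd(input_path, output_base, lang):
    """Build the argument list; `txt tsv` requests both output files."""
    return ["tesseract", input_path, output_base,
            "-l", lang, "--oem", "3", "--psm", "1", "txt", "tsv"]


def run_with_timeout(cmd, timeout_s=60):
    """Run `cmd`; return (exit_code, stderr), with exit_code None on timeout."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so no process is left running after the timeout.
        return None, "timed out"
    return proc.returncode, proc.stderr
```

Passing the command as a list rather than a shell string avoids quoting problems with file paths.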

Alternative Flows:

Alt-1: Tesseract Crash

- Process exit code != 0
- Log stderr
- Retry job (up to 3 times)
- If still failing, move to DLQ
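
The retry-then-DLQ flow can be sketched as below. `run_with_retries` is a hypothetical helper, the DLQ is modeled as a plain list, and reading "up to 3 times" as one initial attempt plus three retries is an assumption.

```python
def run_with_retries(job, execute, dlq, max_retries=3):
    """Call `execute(job)` until it reports exit code 0.

    `execute` returns (exit_code, stderr). After the initial attempt plus
    `max_retries` retries, the job is appended to `dlq`.
    """
    errors = []
    for _ in range(1 + max_retries):
        exit_code, stderr = execute(job)
        if exit_code == 0:
            return True
        errors.append(stderr)   # stand-in for logging stderr
    dlq.append({"job": job, "errors": errors})
    return False
```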

Observability:

Metric: `ocr_execution_duration_seconds`, `ocr_crashes_total`
Log: `{"event": "tesseract_complete", "jobId": "abc123", "pages": 3, "duration_ms": 4500}`

Acceptance Criteria:

[ ] Uses correct language model
[ ] Enforces timeout and kills the child process so no hung processes remain

UC-PROC-004: Process OCR Output

Purpose: Clean text, calculate quality, and update bundle.

Property	Value
Actor	OCR Worker
Trigger	OCR execution complete
Priority	P1

Main Success Scenario:

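1. Calculate average word confidence from the TSV `conf` column (Tesseract reports 0-100, with -1 for non-word rows; divide by 100 to get the 0-1 scale used below)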
2. If Confidence < 0.7:
   - Mark `needsReview = true`
   - Trigger "Low Quality" alert
3. Construct `Document` object
   - `extractedText`: (Cleaned string)
   - `confidence`: (average from step 1, e.g., 0.85)
   - `ocrEngine`: "tesseract-5.3"
4. Append to Patient Bundle `documents` array
5. Update Job status to `completed`
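
Steps 1 to 3 can be sketched as below. Tesseract's TSV output reports `conf` per word on a 0-100 scale and uses -1 for rows that are not words, so the sketch skips those rows and divides by 100 before comparing against 0.7. `average_confidence`, `build_document`, and `REVIEW_THRESHOLD` are hypothetical names, and the whitespace-only cleanup is a placeholder for whatever cleaning the pipeline actually applies.

```python
import csv
import io

REVIEW_THRESHOLD = 0.7


def average_confidence(tsv_text):
    """Mean word confidence on a 0-1 scale, or 0.0 when no words were found."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    scores = []
    for row in reader:
        conf = float(row["conf"])
        # Skip page/block/line rows (conf == -1) and empty text cells.
        if conf >= 0 and (row.get("text") or "").strip():
            scores.append(conf / 100.0)
    return sum(scores) / len(scores) if scores else 0.0


def build_document(text, tsv_text, engine):
    """Assemble the Document fields named in step 3."""
    confidence = average_confidence(tsv_text)
    return {
        "extractedText": " ".join(text.split()),  # placeholder cleanup
        "confidence": round(confidence, 2),
        "ocrEngine": engine,
        "needsReview": confidence < REVIEW_THRESHOLD,
    }
```

Returning 0.0 for a page with no recognized words is a design choice made here so that empty results are flagged for review rather than passed through.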

Alternative Flows:

Alt-1: Bundle Update Failure

- Patient bundle locked or storage unavailable
- Retry with exponential backoff (max 5 attempts)
- Move to `ocr_bundle_update_dlq` if still failing
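
The backoff behaviour in Alt-1 can be sketched as below. `update_with_backoff` is a hypothetical helper, the 0.5 s base delay is an assumed value, the DLQ is modeled as a list, and any exception from `update` is treated as a retryable failure, which a real worker would likely narrow.

```python
import time


def update_with_backoff(update, item, dlq, max_attempts=5, base_delay=0.5,
                        sleep=time.sleep):
    """Try `update(item)` up to `max_attempts` times, doubling the delay."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            update(item)
            return True
        except Exception as exc:  # assumption: every failure is retryable
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, 4 s
    # Stand-in for moving the message to `ocr_bundle_update_dlq`.
    dlq.append({"item": item, "error": str(last_error)})
    return False
```

The `sleep` parameter is injectable so the delays can be checked in tests without waiting.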

Observability:

Metric: `ocr_avg_confidence`, `ocr_low_quality_docs_total`
Log: `{"event": "ocr_processed", "jobId": "abc123", "confidence": 0.85, "needsReview": false}`

Acceptance Criteria:

[ ] Flags low-confidence documents
[ ] Updates bundle atomically

UC-PROC-007: Classify Document Type

Purpose: Label OCR'd documents (e.g., Discharge Summary, Pathology Report) for routing and UX.

Property	Value
Actor	Document Intelligence Worker
Trigger	OCR text available
Priority	P1

Main Success Scenario:

1. Compute TF-IDF features or transformer embeddings from OCR text
2. Run multi-class classifier (ONNX) with labels configured per customer
3. Assign top class if probability > 0.7, else mark as `unknown`
4. Persist `documentType` on the Job record
5. Emit routing event e.g., `document.type=pathology`
6. Update Patient Bundle document entry with `type` and `confidence`
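
The thresholding in step 3 can be sketched as below, assuming the classifier returns one raw score (logit) per label. The function names and example labels are hypothetical, and the ONNX inference call itself is left out.

```python
import math

THRESHOLD = 0.7


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels):
    """Return (documentType, confidence); `unknown` at or below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] > THRESHOLD:
        return labels[best], probs[best]
    return "unknown", probs[best]
```

For example, `classify([0.0, 4.0, 0.0], ["discharge_summary", "pathology", "prescription"])` selects `pathology`, while three equal scores yield `unknown` because no class exceeds 0.7.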

Acceptance Criteria:

[ ] Model artifacts versioned and rollback-capable
[ ] Supports customer-specific overrides (force label)
[ ] Records confusion matrix metrics nightly

UC-PROC-008: Redact Sensitive Entities

Purpose: Automatically mask identifiers (phone, address) before exposing text externally.

Property	Value
Actor	PII Redaction Worker
Trigger	Document classified as shareable
Priority	P1

Main Success Scenario:

1. Run NER model (spaCy/HF) over OCR text
2. Identify entities of types PERSON, PHONE, ADDRESS, MRN
3. Replace spans with tags (e.g., "[REDACTED_PHONE]")
4. Store original text in encrypted blob store
5. Save redacted text for UI/API consumption
6. Attach audit trail (who accessed, when redacted)

Alternative Flows:

Alt-1: Entity Confidence Low

- Average entity confidence < 0.6
- Flag document `needsManualReview`
- Notify Compliance queue
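
Span replacement (step 3) and the Alt-1 confidence check can be sketched as below. The entity format (dicts with `start`, `end`, `label`, `score`) is an assumption about what the NER step returns, spans are assumed not to overlap, and `redact` and `needs_manual_review` are hypothetical names.

```python
def redact(text, entities):
    """Replace each entity span with a `[REDACTED_<LABEL>]` tag."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        tag = f"[REDACTED_{ent['label']}]"
        out = out[:ent["start"]] + tag + out[ent["end"]:]
    return out


def needs_manual_review(entities, threshold=0.6):
    """True when the average entity confidence falls below the threshold."""
    if not entities:
        return False  # assumption: nothing detected means nothing to flag
    average = sum(e["score"] for e in entities) / len(entities)
    return average < threshold
```

Whether a document with no detected entities should also go to manual review is a policy question that this sketch does not settle.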

Acceptance Criteria:

[ ] Redaction latency < 1s for 5-page doc
[ ] Masked text is irreversible without key management approval
[ ] Supports export of both redacted and original with proper scopes

UC-PROC-009: Extract Structured Fields

Purpose: Convert semi-structured documents into discrete key-value fields.

Property	Value
Actor	Structured Extraction Worker
Trigger	Redacted text available
Priority	P1

Main Success Scenario:

1. Apply regex + ML hybrid templates (e.g., BRAT) for sections like "Diagnosis" or "Plan"
2. Capture `key`, `value`, `confidence` triplets
3. Map known keys to canonical schema (e.g., "Chemo Regimen" -> `treatment.regimen`)
4. Persist extracted fields in `documents[].structuredFields`
5. Emit metrics per field (fill rate, confidence)
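
The regex half of steps 1 to 3 can be sketched as below; the ML half is omitted. Apart from the `Chemo Regimen` to `treatment.regimen` mapping given above, the canonical key names, the `FIELD_RE` pattern, and the fixed confidence assigned to regex matches are illustrative assumptions.

```python
import re

# Raw heading (lower-cased) -> canonical schema key. Only the first entry
# comes from the text above; the others are hypothetical.
CANONICAL_KEYS = {
    "chemo regimen": "treatment.regimen",
    "diagnosis": "diagnosis",
    "plan": "plan",
}

# Matches "Heading: value" on a single line.
FIELD_RE = re.compile(
    r"^(?P<key>[A-Za-z][A-Za-z ]*?)[ \t]*:[ \t]*(?P<value>.+)$", re.MULTILINE)


def extract_fields(text, regex_confidence=0.9):
    """Return key/value/confidence triplets for headings with a known mapping."""
    fields = []
    for match in FIELD_RE.finditer(text):
        canonical = CANONICAL_KEYS.get(match.group("key").strip().lower())
        if canonical is None:
            continue  # unmapped heading: left for manual entry or the ML path
        fields.append({
            "key": canonical,
            "value": match.group("value").strip(),
            "confidence": regex_confidence,  # placeholder score
            "offset": match.start(),         # character offset as provenance
        })
    return fields
```

Because each line is matched on its own, the result does not depend on the order in which sections appear.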

Acceptance Criteria:

[ ] Supports section-order independent detection
[ ] Captures provenance pointer (page + coordinates)
[ ] Allows fallback to manual entry if extraction confidence < threshold

UC-PROC-010: Summarize Document Content

Purpose: Provide concise physician-ready summary paragraphs.

Property	Value
Actor	Summarization Service
Trigger	Structured fields saved
Priority	P2

Main Success Scenario:

1. Chunk OCR text into <= 2k tokens with semantic overlap
2. Prompt LLM (Azure OpenAI or on-prem) with instructions + safety rails
3. Collect summary bullets (max 5) and risk statements
4. Run toxicity / hallucination guardrail (QA prompt)
5. Store summary on document and push notification to UI
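
Step 1 can be sketched with a fixed-size sliding window. Note the substitution: this uses a fixed token overlap rather than the semantic overlap named above, and it works on any pre-tokenized list, so the tokenizer is left to the caller. `chunk_tokens` and the 200-token overlap are hypothetical choices.

```python
def chunk_tokens(tokens, max_tokens=2000, overlap=200):
    """Split `tokens` into windows of at most `max_tokens` items.

    Consecutive windows share `overlap` tokens (fixed, not semantic).
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and below max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks
```

A whitespace split such as `chunk_tokens(text.split())` only approximates model tokens; the LLM's own tokenizer would give exact counts.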

Acceptance Criteria:

[ ] Summaries capped at 750 characters
[ ] Redacted text only (no PHI leakage)
[ ] Regenerate endpoint available for clinicians