OCR Pipeline
Technology Stack: See MCP OCR Servers for engine trade-off analysis.
- Engines: Tesseract, EasyOCR, PaddleOCR, Surya, Docling, Chandra OCR
- Indexing: LlamaIndex for RAG pipelines
- Protocol: MCP (Model Context Protocol) for orchestration
UC-PROC-001: Queue OCR Job
Purpose: Accept file upload and schedule processing.
| Property | Value |
|---|---|
| Actor | API Server |
| Trigger | POST /api/upload/document |
| Priority | P1 |
Main Success Scenario:
1. Validate request (file size within the 10MB limit, type PDF/PNG/JPG)
2. Generate unique `jobId` (UUID)
3. Upload file to S3 bucket `raw/`
   Key: `patients/{abhaId}/docs/{jobId}.pdf`
4. Create Job record in DB: `status=queued`
5. Push message to `ocr-queue`
   Payload: `{ jobId, s3Key, patientId }`
6. Return HTTP 202 Accepted with `jobId`
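The scenario above can be sketched in a few lines. This is a minimal illustration, not the service's actual code: the `store`, `jobs`, and `queue` arguments are in-memory stand-ins for S3, the database, and `ocr-queue`, the MIME-type set is an assumed reading of "PDF/PNG/JPG", and using the ABHA ID as `patientId` is an assumption.

```python
import uuid
from dataclasses import dataclass

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumption: "10MB" is read as 10 MiB
SUPPORTED_TYPES = {"application/pdf", "image/png", "image/jpeg"}  # assumed MIME types


@dataclass
class UploadResult:
    status: int  # HTTP status code to return
    body: dict   # JSON response body


def queue_ocr_job(data: bytes, content_type: str, abha_id: str,
                  store: dict, jobs: dict, queue: list) -> UploadResult:
    """Validate, store, record, and enqueue one upload (steps 1-6)."""
    if len(data) > MAX_UPLOAD_BYTES:
        return UploadResult(413, {"error": "payload_too_large"})
    if content_type not in SUPPORTED_TYPES:
        return UploadResult(415, {"error": "unsupported_media_type",
                                  "supported": sorted(SUPPORTED_TYPES)})

    job_id = str(uuid.uuid4())
    s3_key = f"patients/{abha_id}/docs/{job_id}.pdf"
    store[s3_key] = data                 # stand-in for the S3 upload
    jobs[job_id] = {"status": "queued"}  # stand-in for the DB insert
    # The message is pushed only after the file is stored, so a worker never
    # dequeues a job whose file is missing.
    queue.append({"jobId": job_id, "s3Key": s3_key, "patientId": abha_id})
    return UploadResult(202, {"jobId": job_id})
```

Storing before queuing is what the second acceptance criterion asks for; the function returns as soon as the message is pushed and does no OCR work itself.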
Alternative Flows:
Alt-1: File Size Exceeded
- File > 10MB limit
- Return HTTP 413 Payload Too Large
- Log rejection with file size
Alt-2: Invalid Format
- File type not PDF/PNG/JPG
- Return HTTP 415 Unsupported Media Type
- Include supported types in error response
Observability:
- Metric: ocr_jobs_queued_total, ocr_upload_size_bytes
- Log: {"event": "ocr_queued", "jobId": "abc123", "patientId": "..."}
Acceptance Criteria:
1. [ ] Returns 202 immediately (async)
2. [ ] File safely stored in S3 before queuing
UC-PROC-002: Detect Document Language
Purpose: Determine whether the English or Hindi language model is needed.
| Property | Value |
|---|---|
| Actor | OCR Worker |
| Trigger | Job in ocr-queue |
| Priority | P1 |
Main Success Scenario:
1. Dequeue job and download file from S3
2. Check if `language` was provided in API request
3. If not, extract text from Page 1 using a lightweight tool (pdftotext)
4. Run `langdetect` on sample text
5. If `hi` detected -> Set model `hin`
6. If `en` detected -> Set model `eng`
7. Pass to Execution Step
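The selection logic, including the empty-text fallback, might look like the sketch below. The detector is passed in as a callable so the example runs without third-party packages; `langdetect.detect` has the same shape (a string in, an ISO 639-1 code out). Mapping any other detected language to `eng` is an assumption, since the scenario only covers `hi` and `en`.

```python
from typing import Callable, Optional

LANG_TO_MODEL = {"hi": "hin", "en": "eng"}
DEFAULT_MODEL = "eng"


def select_model(requested: Optional[str], sample_text: str,
                 detect: Callable[[str], str]) -> str:
    """Return the Tesseract model name for a job."""
    if requested in LANG_TO_MODEL:   # step 2: language supplied in the API request
        return LANG_TO_MODEL[requested]
    if not sample_text.strip():      # no extractable text (e.g. scanned image)
        return DEFAULT_MODEL
    try:
        code = detect(sample_text)   # step 4
    except Exception:                # detector could not decide
        return DEFAULT_MODEL
    # Assumption: languages other than hi/en fall back to English.
    return LANG_TO_MODEL.get(code, DEFAULT_MODEL)
```

With `langdetect` installed, a call would look like `select_model(None, page_text, langdetect.detect)`.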
Alternative Flows:
Alt-1: Detection Failed
- Text extraction yields empty string (scanned image)
- Default to `eng` (English)
- Log warning: `{"event": "lang_detect_fail", "jobId": "..."}`
Observability:
- Metric: ocr_lang_detected_total{lang="hi"}, ocr_lang_detect_failures
- Log: {"event": "lang_detected", "jobId": "abc123", "lang": "hi", "confidence": 0.92}
Acceptance Criteria:
1. [ ] Correctly identifies Hindi vs English documents
2. [ ] Defaults to English if detection fails
UC-PROC-003: Execute Tesseract Engine
Purpose: Run the core OCR binary to extract text.
| Property | Value |
|---|---|
| Actor | OCR Worker |
| Trigger | Language detected |
| Priority | P1 |
Main Success Scenario:
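1. Construct Tesseract command
   `tesseract input.png output -l {lang} --oem 3 --psm 1 txt tsv`
   (The Tesseract CLI reads images, not PDFs, so PDF pages must be rasterized first; the trailing `txt tsv` config names make it write both `output.txt` and `output.tsv`.)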
2. Spawn child process
3. Wait for process exit (Timeout: 60s)
4. Read `output.txt` (extracted text)
5. Read `output.tsv` (confidence data)
6. Pass results to Processing Step
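A minimal sketch of steps 1 to 5 using Python's `subprocess` module follows. It rests on two assumptions: the input is an already rasterized page image (the Tesseract CLI does not read PDFs), and the `txt` and `tsv` config names are appended so that both output files are written. The function names are illustrative.

```python
import subprocess
from pathlib import Path

TIMEOUT_SECONDS = 60


def build_command(image_path: str, output_base: str, lang: str) -> list[str]:
    """Argument list for one Tesseract run; no shell is involved."""
    return ["tesseract", image_path, output_base,
            "-l", lang, "--oem", "3", "--psm", "1",
            "txt", "tsv"]  # config names: write <output_base>.txt and <output_base>.tsv


def run_tesseract(image_path: str, output_base: str, lang: str) -> tuple[str, str]:
    """Run Tesseract and return (text, tsv). Raises on crash or timeout."""
    # subprocess.run kills the child when the timeout expires and then raises
    # TimeoutExpired; check=True raises CalledProcessError on a non-zero exit,
    # with stderr available on the exception for logging.
    subprocess.run(build_command(image_path, output_base, lang),
                   capture_output=True, text=True,
                   timeout=TIMEOUT_SECONDS, check=True)
    text = Path(output_base + ".txt").read_text(encoding="utf-8")
    tsv = Path(output_base + ".tsv").read_text(encoding="utf-8")
    return text, tsv
```

Passing an argument list rather than a shell string keeps file names with spaces or shell metacharacters from altering the command.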
Alternative Flows:
Alt-1: Tesseract Crash
- Process exit code != 0
- Log stderr
- Retry job (up to 3 times)
- If still failing, move to DLQ
Observability:
- Metric: ocr_execution_duration_seconds, ocr_crashes_total
- Log: {"event": "tesseract_complete", "jobId": "abc123", "pages": 3, "duration_ms": 4500}
Acceptance Criteria:
1. [ ] Uses correct language model
2. [ ] Enforces timeout to prevent zombie processes
UC-PROC-004: Process OCR Output
Purpose: Clean text, calculate quality, and update bundle.
| Property | Value |
|---|---|
| Actor | OCR Worker |
| Trigger | OCR execution complete |
| Priority | P1 |
Main Success Scenario:
1. Calculate Average Confidence from TSV data (Tesseract reports 0-100 per word; normalize to 0-1)
2. If Confidence < 0.7:
   - Mark `needsReview = true`
   - Trigger "Low Quality" alert
3. Construct `Document` object
   - `extractedText`: (Cleaned string)
   - `confidence`: Average Confidence (e.g., 0.85)
   - `ocrEngine`: "tesseract-5.3"
4. Append to Patient Bundle `documents` array
5. Update Job status to `completed`
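As a rough sketch of steps 1 to 3, the average can be computed from the `conf` column of Tesseract's TSV output. Tesseract reports word confidence on a 0-100 scale and writes `-1` on layout rows, so the example skips those rows and divides by 100 before applying the 0.7 threshold. The whitespace-collapsing "cleaning" and the rounding to two decimals are placeholder assumptions.

```python
import csv
import io

REVIEW_THRESHOLD = 0.7


def average_confidence(tsv_text: str) -> float:
    """Mean word confidence on a 0-1 scale; 0.0 if no words were recognized."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    scores = []
    for row in reader:
        try:
            conf = float(row.get("conf") or "")
        except ValueError:
            continue  # malformed or truncated row
        # conf is -1 on page/block/line rows; only word rows carry a score.
        if conf >= 0 and (row.get("text") or "").strip():
            scores.append(conf / 100.0)
    return sum(scores) / len(scores) if scores else 0.0


def build_document(text: str, tsv_text: str) -> dict:
    """Assemble the Document fields named in step 3."""
    confidence = round(average_confidence(tsv_text), 2)
    return {
        "extractedText": " ".join(text.split()),  # placeholder cleaning step
        "confidence": confidence,
        "ocrEngine": "tesseract-5.3",
        "needsReview": confidence < REVIEW_THRESHOLD,
    }
```

A document with no recognized words scores 0.0 and is therefore flagged for review, which seems the safer default.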
Alternative Flows:
Alt-1: Bundle Update Failure
- Patient bundle locked or storage unavailable
- Retry with exponential backoff (max 5 attempts)
- Move to `ocr_bundle_update_dlq` if still failing
Observability:
- Metric: ocr_avg_confidence, ocr_low_quality_docs_total
- Log: {"event": "ocr_processed", "jobId": "abc123", "confidence": 0.85, "needsReview": false}
Acceptance Criteria:
1. [ ] Flags low-confidence documents
2. [ ] Updates bundle atomically
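The retry policy in Alt-1 (exponential backoff, at most 5 attempts, then the dead-letter queue) could be sketched as below. The base delay of 0.5s and the 8s cap are assumed values, and `update` and `dead_letter` are hypothetical callables standing in for the bundle write and the push to `ocr_bundle_update_dlq`.

```python
import time
from typing import Callable


def backoff_delays(max_attempts: int = 5, base: float = 0.5,
                   cap: float = 8.0) -> list[float]:
    """Delay before each retry: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(max_attempts - 1)]


def update_with_backoff(update: Callable[[], None],
                        dead_letter: Callable[[], None],
                        max_attempts: int = 5,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try `update` up to max_attempts times; dead-letter the job if all fail."""
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        try:
            update()
            return True
        except Exception:
            if attempt < len(delays):  # no sleep after the final attempt
                sleep(delays[attempt])
    dead_letter()
    return False
```

The `sleep` parameter exists only so that tests can run without waiting; production code would leave it at the default.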
UC-PROC-007: Classify Document Type
Purpose: Label OCR'd documents (e.g., Discharge Summary, Pathology Report) for routing and UX.
| Property | Value |
|---|---|
| Actor | Document Intelligence Worker |
| Trigger | OCR text available |
| Priority | P1 |
Main Success Scenario:
1. Load TF-IDF features or transformer embeddings from OCR text
2. Run multi-class classifier (ONNX) with labels configured per customer
3. Assign top class if probability > 0.7, else mark as `unknown`
4. Persist `documentType` on the Job record
5. Emit routing event (e.g., `document.type=pathology`)
6. Update Patient Bundle document entry with `type` and `confidence`
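The thresholding in step 3 and the event string in step 5 are simple enough to show directly. The sketch assumes the classifier has already produced a label-to-probability mapping; running the ONNX model itself is out of scope here, and the function names are illustrative.

```python
def assign_label(probabilities: dict[str, float],
                 threshold: float = 0.7) -> tuple[str, float]:
    """Return (documentType, confidence); 'unknown' unless the top class exceeds the threshold."""
    if not probabilities:
        return "unknown", 0.0
    label, prob = max(probabilities.items(), key=lambda item: item[1])
    # Strictly greater than, matching "probability > 0.7" in step 3.
    return (label, prob) if prob > threshold else ("unknown", prob)


def routing_event(label: str) -> str:
    """Event string in the form used by step 5."""
    return f"document.type={label}"
```

Keeping the top probability even for `unknown` results lets the bundle entry record how close the classifier came.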
Acceptance Criteria:
1. [ ] Model artifacts versioned and rollback-capable
2. [ ] Supports customer-specific overrides (force label)
3. [ ] Records confusion matrix metrics nightly
UC-PROC-008: Redact Sensitive Entities
Purpose: Automatically mask identifiers (phone, address) before exposing text externally.
| Property | Value |
|---|---|
| Actor | PII Redaction Worker |
| Trigger | Document classified as shareable |
| Priority | P1 |
Main Success Scenario:
1. Run NER model (spaCy/HF) over OCR text
2. Identify entities of types PERSON, PHONE, ADDRESS, MRN
3. Replace spans with tags (e.g., "[REDACTED_PHONE]")
4. Store original text in encrypted blob store
5. Save redacted text for UI/API consumption
6. Attach audit trail (who accessed, when redacted)
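Step 3 (span replacement) and the low-confidence check can be illustrated without committing to a particular NER library. In this sketch the `Entity` tuple is a hypothetical container for whatever the spaCy or Hugging Face model returns, and spans are assumed not to overlap.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int    # character offset, inclusive
    end: int      # character offset, exclusive
    label: str    # e.g. PERSON, PHONE, ADDRESS, MRN
    score: float  # model confidence, 0-1


def redact(text: str, entities: list[Entity]) -> str:
    """Replace each entity span with a [REDACTED_<LABEL>] tag."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        out = out[:ent.start] + f"[REDACTED_{ent.label}]" + out[ent.end:]
    return out


def needs_manual_review(entities: list[Entity], threshold: float = 0.6) -> bool:
    """True when the average entity confidence falls below the threshold."""
    if not entities:
        return False
    return sum(e.score for e in entities) / len(entities) < threshold
```

Because the tag carries no part of the original value, the redacted string alone cannot be reversed; recovery depends on the separately stored encrypted original.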
Alternative Flows:
Alt-1: Entity Confidence Low
- Average entity confidence < 0.6
- Flag document `needsManualReview`
- Notify Compliance queue
Acceptance Criteria:
1. [ ] Redaction latency < 1s for 5-page doc
2. [ ] Masked text is irreversible without key management approval
3. [ ] Supports export of both redacted and original with proper scopes
UC-PROC-009: Extract Structured Fields
Purpose: Convert semi-structured documents into discrete key-value fields.
| Property | Value |
|---|---|
| Actor | Structured Extraction Worker |
| Trigger | Redacted text available |
| Priority | P1 |
Main Success Scenario:
1. Apply regex + ML hybrid templates (e.g., BRAT) for sections like "Diagnosis" or "Plan"
2. Capture `key`, `value`, `confidence` triplets
3. Map known keys to canonical schema (e.g., "Chemo Regimen" -> `treatment.regimen`)
4. Persist extracted fields in `documents[].structuredFields`
5. Emit metrics per field (fill rate, confidence)
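The regex half of the hybrid approach might look like the following sketch, which covers steps 2 and 3 for simple `Label: value` lines. The ML half is omitted. Apart from `treatment.regimen`, the canonical paths are hypothetical, and assigning a confidence of 1.0 to rule matches is an assumption.

```python
import re

# Label -> canonical schema path. Only "treatment.regimen" comes from the
# spec above; "diagnosis.primary" is a hypothetical example.
CANONICAL_KEYS = {
    "chemo regimen": "treatment.regimen",
    "diagnosis": "diagnosis.primary",
}

# One "Label: value" pair per line; spaces and tabs around both are trimmed.
FIELD_PATTERN = re.compile(
    r"^[ \t]*([A-Za-z][A-Za-z ]*?)[ \t]*:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def extract_fields(text: str, rule_confidence: float = 1.0) -> list[dict]:
    """Return key/value/confidence triplets for labels with a canonical mapping."""
    fields = []
    for match in FIELD_PATTERN.finditer(text):
        key = CANONICAL_KEYS.get(match.group(1).lower())
        if key is None:
            continue  # unmapped label; left for the ML pass or manual entry
        fields.append({"key": key, "value": match.group(2),
                       "confidence": rule_confidence})
    return fields
```

Since each line is matched on its own label, the result does not depend on the order in which sections appear.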
Acceptance Criteria:
1. [ ] Supports section-order independent detection
2. [ ] Captures provenance pointer (page + coordinates)
3. [ ] Allows fallback to manual entry if extraction confidence < threshold
UC-PROC-010: Summarize Document Content
Purpose: Provide concise physician-ready summary paragraphs.
| Property | Value |
|---|---|
| Actor | Summarization Service |
| Trigger | Structured fields saved |
| Priority | P2 |
Main Success Scenario:
1. Chunk OCR text into <= 2k tokens with semantic overlap
2. Prompt LLM (Azure OpenAI or on-prem) with instructions + safety rails
3. Collect summary bullets (max 5) and risk statements
4. Run toxicity / hallucination guardrail (QA prompt)
5. Store summary on document and push notification to UI
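Step 1 can be approximated with a sliding window. Note the simplifications: this sketch uses a fixed-size token overlap rather than the semantic overlap the step describes, the overlap of 200 tokens is an assumed value, and the caller is expected to supply tokens from the same tokenizer the LLM uses (whitespace splitting only approximates that).

```python
def chunk_tokens(tokens: list[str], max_tokens: int = 2000,
                 overlap: int = 200) -> list[list[str]]:
    """Split tokens into windows of at most max_tokens, each sharing
    `overlap` tokens with the previous window."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks
```

For example, `chunk_tokens(text.split())` yields windows that can each be sent to the LLM in step 2.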
Acceptance Criteria:
1. [ ] Summaries capped at 750 characters
2. [ ] Redacted text only (no PHI leakage)
3. [ ] Regenerate endpoint available for clinicians