
OCR Pipeline

Technology Stack: See MCP OCR Servers for engine trade-off analysis.

  • Engines: Tesseract, EasyOCR, PaddleOCR, Surya, Docling, Chandra OCR
  • Indexing: LlamaIndex for RAG pipelines
  • Protocol: MCP (Model Context Protocol) for orchestration

UC-PROC-001: Queue OCR Job

Purpose: Accept file upload and schedule processing.

| Property | Value |
|----------|-------|
| Actor | API Server |
| Trigger | `POST /api/upload/document` |
| Priority | P1 |

Main Success Scenario:

1. Validate request (file size at most 10 MB, type PDF/PNG/JPG)
2. Generate unique `jobId` (UUID)
3. Upload file to S3 bucket `raw/`
   Key: `patients/{abhaId}/docs/{jobId}.pdf`
4. Create Job record in DB: `status=queued`
5. Push message to `ocr-queue`
   Payload: `{ jobId, s3Key, patientId }`
6. Return HTTP 202 Accepted with `jobId`
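
The validation and payload steps above could be sketched roughly as follows. This is only an illustration: the function names, the MIME-type set, and the reading of "10MB" as 10 MiB are assumptions, and the S3 upload, DB write, and queue push are left out. Treating a file of exactly 10 MB as accepted is also an assumption.

```python
import json
import uuid

MAX_BYTES = 10 * 1024 * 1024  # assumption: "10MB" is read as 10 MiB
ALLOWED_TYPES = {"application/pdf", "image/png", "image/jpeg"}  # assumed MIME types


def validate_upload(size_bytes: int, content_type: str) -> int:
    """Return the HTTP status the endpoint would answer with."""
    if size_bytes > MAX_BYTES:
        return 413  # Payload Too Large
    if content_type not in ALLOWED_TYPES:
        return 415  # Unsupported Media Type
    return 202  # Accepted: processing continues asynchronously


def build_job(abha_id: str, patient_id: str) -> dict:
    """Build the S3 key and queue payload for a validated upload."""
    job_id = str(uuid.uuid4())
    s3_key = f"patients/{abha_id}/docs/{job_id}.pdf"
    return {"jobId": job_id, "s3Key": s3_key, "patientId": patient_id}


def to_queue_message(job: dict) -> str:
    """Serialize the payload pushed to `ocr-queue`."""
    return json.dumps(job)
```

A caller would run `validate_upload` first and call `build_job` only when it returns 202, so that nothing is queued for a rejected file.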

Alternative Flows:

Alt-1: File Size Exceeded
   - File > 10MB limit
   - Return HTTP 413 Payload Too Large
   - Log rejection with file size

Alt-2: Invalid Format
   - File type not PDF/PNG/JPG
   - Return HTTP 415 Unsupported Media Type
   - Include supported types in error response

Observability:
   - Metric: `ocr_jobs_queued_total`, `ocr_upload_size_bytes`
   - Log: `{"event": "ocr_queued", "jobId": "abc123", "patientId": "..."}`

Acceptance Criteria:

1. [ ] Returns 202 immediately (async)
2. [ ] File safely stored in S3 before queuing


UC-PROC-002: Detect Document Language

Purpose: Determine if English or Hindi engine is needed.

| Property | Value |
|----------|-------|
| Actor | OCR Worker |
| Trigger | Job in `ocr-queue` |
| Priority | P1 |

Main Success Scenario:

1. Dequeue job and download file from S3
2. Check if `language` was provided in API request
3. If not, extract text from page 1 using a lightweight tool (`pdftotext`)
4. Run `langdetect` on sample text
5. If `hi` detected -> Set model `hin`
6. If `en` detected -> Set model `eng`
7. Pass to Execution Step
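
The selection logic above, including the empty-text fallback, might look like the sketch below. The detector is passed in as a callable so the example stays self-contained; in a real worker it could be `langdetect.detect`. The function name `choose_model` is hypothetical, and mapping any language other than `hi` or `en` to `eng` is an assumption, since the scenario does not say what happens in that case.

```python
from typing import Callable, Optional

# langdetect-style ISO 639-1 codes mapped to Tesseract model names.
MODEL_BY_LANG = {"hi": "hin", "en": "eng"}
DEFAULT_MODEL = "eng"


def choose_model(requested: Optional[str], sample_text: str,
                 detect: Callable[[str], str]) -> str:
    """Pick the Tesseract model for a job."""
    if requested:
        # A language supplied in the API request skips detection.
        return MODEL_BY_LANG.get(requested, DEFAULT_MODEL)
    if not sample_text.strip():
        # Scanned image: text extraction returned nothing to detect on.
        return DEFAULT_MODEL
    try:
        code = detect(sample_text)
    except Exception:
        # Detectors may raise on unusable input; fall back to English.
        return DEFAULT_MODEL
    # Assumption: unsupported languages also fall back to English.
    return MODEL_BY_LANG.get(code, DEFAULT_MODEL)
```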

Alternative Flows:

Alt-1: Detection Failed
   - Text extraction yields empty string (scanned image)
   - Default to `eng` (English)
   - Log warning: `{"event": "lang_detect_fail", "jobId": "..."}`

Observability:
   - Metric: `ocr_lang_detected_total{lang="hi"}`, `ocr_lang_detect_failures`
   - Log: `{"event": "lang_detected", "jobId": "abc123", "lang": "hi", "confidence": 0.92}`

Acceptance Criteria:

1. [ ] Correctly identifies Hindi vs English documents
2. [ ] Defaults to English if detection fails


UC-PROC-003: Execute Tesseract Engine

Purpose: Run the core OCR binary to extract text.

| Property | Value |
|----------|-------|
| Actor | OCR Worker |
| Trigger | Language detected |
| Priority | P1 |

Main Success Scenario:

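1. Construct Tesseract command (Tesseract reads image files, not PDFs, so PDF pages must be rasterized first; the trailing `txt tsv` requests both output files)
   `tesseract input.png output -l {lang} --oem 3 --psm 1 txt tsv`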
2. Spawn child process
3. Wait for process exit (Timeout: 60s)
4. Read `output.txt` (extracted text)
5. Read `output.tsv` (confidence data)
6. Pass results to Processing Step
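
A minimal sketch of the spawn-and-wait steps, using Python's `subprocess` module. The helper names are hypothetical. The sketch assumes the input has already been converted to an image file, and it appends the `txt tsv` config names because Tesseract otherwise writes only the `.txt` output.

```python
import subprocess
from typing import List


def build_command(image_path: str, output_base: str, lang: str) -> List[str]:
    """Assemble the Tesseract argv; `txt tsv` requests both output files."""
    return ["tesseract", image_path, output_base,
            "-l", lang, "--oem", "3", "--psm", "1", "txt", "tsv"]


def run_with_timeout(argv: List[str],
                     timeout_s: float = 60.0) -> subprocess.CompletedProcess:
    """Run a child process and wait for it to exit.

    subprocess.run kills the child and raises subprocess.TimeoutExpired
    if it has not exited within timeout_s seconds.
    """
    return subprocess.run(argv, capture_output=True, text=True,
                          timeout=timeout_s)
```

Passing the command as a list instead of a shell string avoids quoting problems if a file name contains spaces.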

Alternative Flows:

Alt-1: Tesseract Crash
   - Process exit code != 0
   - Log stderr
   - Retry job (up to 3 times)
   - If still failing, move to DLQ
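
The retry rule could be expressed as the small loop below. It reads "up to 3 times" as one initial attempt plus three retries, which is an interpretation. `run_once` and `send_to_dlq` are hypothetical callables standing in for the Tesseract invocation and the dead-letter queue client.

```python
from typing import Callable


def process_with_retries(run_once: Callable[[], int],
                         send_to_dlq: Callable[[], None],
                         max_retries: int = 3) -> bool:
    """Run the job, retrying on a non-zero exit code.

    Returns True on success; after the last retry fails, the job is
    handed to the dead-letter queue and False is returned.
    """
    attempts = 1 + max_retries  # first attempt plus the allowed retries
    for _ in range(attempts):
        if run_once() == 0:
            return True
    send_to_dlq()
    return False
```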

Observability:
   - Metric: `ocr_execution_duration_seconds`, `ocr_crashes_total`
   - Log: `{"event": "tesseract_complete", "jobId": "abc123", "pages": 3, "duration_ms": 4500}`

Acceptance Criteria:

1. [ ] Uses correct language model
2. [ ] Enforces timeout to prevent hung or runaway processes


UC-PROC-004: Process OCR Output

Purpose: Clean text, calculate quality, and update bundle.

| Property | Value |
|----------|-------|
| Actor | OCR Worker |
| Trigger | OCR execution complete |
| Priority | P1 |

Main Success Scenario:

1. Calculate Average Confidence from TSV data (Tesseract reports per-word confidence on a 0-100 scale; normalize to 0-1)
2. If Confidence < 0.7:
   - Mark `needsReview = true`
   - Trigger "Low Quality" alert
3. Construct `Document` object
   - `extractedText`: (Cleaned string)
   - `confidence`: (Average confidence, e.g., 0.85)
   - `ocrEngine`: "tesseract-5.3"
4. Append to Patient Bundle `documents` array
5. Update Job status to `completed`
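
The confidence calculation and `Document` construction could be sketched as below. Tesseract's TSV output has a `conf` column on a 0-100 scale, with `-1` on rows that are not words, so the sketch skips those rows and divides by 100 before comparing against 0.7. The function names and the choice to return 0.0 for a page with no words are assumptions. Text cleaning is reduced to `strip()` here.

```python
import csv
import io

REVIEW_THRESHOLD = 0.7


def average_confidence(tsv_text: str) -> float:
    """Mean word confidence on a 0-1 scale; 0.0 when no words were found."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    scores = []
    for row in reader:
        raw = row.get("conf")
        if raw is None:
            continue  # malformed row with missing columns
        conf = float(raw)
        # conf is -1 for page/block/line rows; keep only real words.
        if conf >= 0 and (row.get("text") or "").strip():
            scores.append(conf / 100.0)
    return sum(scores) / len(scores) if scores else 0.0


def build_document(text: str, tsv_text: str,
                   engine: str = "tesseract-5.3") -> dict:
    """Assemble the Document entry appended to the Patient Bundle."""
    confidence = average_confidence(tsv_text)
    return {
        "extractedText": text.strip(),
        "confidence": round(confidence, 2),
        "ocrEngine": engine,
        "needsReview": confidence < REVIEW_THRESHOLD,
    }
```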

Alternative Flows:

Alt-1: Bundle Update Failure
   - Patient bundle locked or storage unavailable
   - Retry with exponential backoff (max 5 attempts)
   - Move to `ocr_bundle_update_dlq` if still failing
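
One way to sketch the backoff behaviour is shown below. The base delay of 0.5 s and the 8 s cap are illustrative values that the flow does not specify. `update` and `send_to_dlq` are hypothetical callables, and the sketch assumes `update` raises an exception when the bundle is locked or storage is unavailable.

```python
import time
from typing import Callable, List


def backoff_delays(max_attempts: int = 5, base_s: float = 0.5,
                   cap_s: float = 8.0) -> List[float]:
    """Delays slept between attempts: base * 2**n, capped at cap_s."""
    return [min(cap_s, base_s * 2 ** n) for n in range(max_attempts - 1)]


def update_with_backoff(update: Callable[[], None],
                        send_to_dlq: Callable[[], None],
                        max_attempts: int = 5,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try the bundle update up to max_attempts times, then dead-letter it."""
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        try:
            update()
            return True
        except Exception:
            # Sleep only if another attempt will follow.
            if attempt < len(delays):
                sleep(delays[attempt])
    send_to_dlq()  # e.g. publish to ocr_bundle_update_dlq
    return False
```

The `sleep` parameter is injectable so the loop can be exercised in tests without waiting.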

Observability:
   - Metric: `ocr_avg_confidence`, `ocr_low_quality_docs_total`
   - Log: `{"event": "ocr_processed", "jobId": "abc123", "confidence": 0.85, "needsReview": false}`

Acceptance Criteria:

1. [ ] Flags low-confidence documents
2. [ ] Updates bundle atomically


UC-PROC-007: Classify Document Type

Purpose: Label OCR'd documents (e.g., Discharge Summary, Pathology Report) for routing and UX.

| Property | Value |
|----------|-------|
| Actor | Document Intelligence Worker |
| Trigger | OCR text available |
| Priority | P1 |

Main Success Scenario:

1. Load TF-IDF features or transformer embeddings from OCR text
2. Run multi-class classifier (ONNX) with labels configured per customer
3. Assign top class if probability > 0.7, else mark as `unknown`
4. Persist `documentType` on the Job record
5. Emit routing event, e.g., `document.type=pathology`
6. Update Patient Bundle document entry with `type` and `confidence`
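
The thresholding rule in step 3 can be illustrated with a few lines. The classifier itself (ONNX inference) is out of scope here. The sketch assumes it yields a mapping from label to probability, and `assign_label` is a hypothetical name.

```python
from typing import Dict, Tuple


def assign_label(probabilities: Dict[str, float],
                 threshold: float = 0.7) -> Tuple[str, float]:
    """Return (documentType, confidence) for one classified document."""
    if not probabilities:
        return "unknown", 0.0
    label, prob = max(probabilities.items(), key=lambda kv: kv[1])
    # The top class is assigned only when it clears the threshold.
    return (label, prob) if prob > threshold else ("unknown", prob)
```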

Acceptance Criteria:

1. [ ] Model artifacts versioned and rollback-capable
2. [ ] Supports customer-specific overrides (force label)
3. [ ] Records confusion matrix metrics nightly


UC-PROC-008: Redact Sensitive Entities

Purpose: Automatically mask identifiers (phone, address) before exposing text externally.

| Property | Value |
|----------|-------|
| Actor | PII Redaction Worker |
| Trigger | Document classified as shareable |
| Priority | P1 |

Main Success Scenario:

1. Run NER model (spaCy/HF) over OCR text
2. Identify entities of types PERSON, PHONE, ADDRESS, MRN
3. Replace spans with tags (e.g., "[REDACTED_PHONE]")
4. Store original text in encrypted blob store
5. Save redacted text for UI/API consumption
6. Attach audit trail (who accessed, when redacted)
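
The span-replacement step (step 3) might be sketched as follows. The NER model is not run here. The sketch takes its output as a list of character spans, with a hypothetical `Entity` tuple, and assumes the spans do not overlap.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    """Hypothetical NER output: a character span and its entity type."""
    start: int
    end: int
    label: str


REDACTED_TYPES = {"PERSON", "PHONE", "ADDRESS", "MRN"}


def redact(text: str, entities: List[Entity]) -> str:
    """Replace each sensitive span with a tag such as [REDACTED_PHONE]."""
    out = text
    # Work from the end of the text so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        if ent.label in REDACTED_TYPES:
            out = out[:ent.start] + f"[REDACTED_{ent.label}]" + out[ent.end:]
    return out
```

Replacing from the last span backwards avoids recomputing offsets after each substitution, since the tags rarely have the same length as the text they replace.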

Alternative Flows:

Alt-1: Entity Confidence Low
   - Average entity confidence < 0.6
   - Flag document `needsManualReview`
   - Notify Compliance queue

Acceptance Criteria:

1. [ ] Redaction latency < 1s for 5-page doc
2. [ ] Masked text is irreversible without key management approval
3. [ ] Supports export of both redacted and original with proper scopes


UC-PROC-009: Extract Structured Fields

Purpose: Convert semi-structured documents into discrete key-value fields.

| Property | Value |
|----------|-------|
| Actor | Structured Extraction Worker |
| Trigger | Redacted text available |
| Priority | P1 |

Main Success Scenario:

1. Apply regex + ML hybrid templates (e.g., BRAT) for sections like "Diagnosis" or "Plan"
2. Capture `key`, `value`, `confidence` triplets
3. Map known keys to canonical schema (e.g., "Chemo Regimen" -> `treatment.regimen`)
4. Persist extracted fields in `documents[].structuredFields`
5. Emit metrics per field (fill rate, confidence)
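
As a rough illustration of the regex half of this flow, the sketch below pulls `Key: value` lines out of the text and maps known keys to the canonical schema. The line pattern, the fixed confidence of 1.0 for regex matches, and the function name are assumptions. Only the `Chemo Regimen` mapping comes from the scenario above, and the ML component is not shown.

```python
import re
from typing import Dict, List

# Known raw keys (lower-cased) mapped to canonical schema paths.
CANONICAL_KEYS = {"chemo regimen": "treatment.regimen"}

# Assumed layout: one "Key: value" pair per line.
LINE_PATTERN = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?)\s*:\s*(.+?)\s*$",
                          re.MULTILINE)


def extract_fields(text: str, regex_confidence: float = 1.0) -> List[Dict]:
    """Return key/value/confidence triplets found in the text."""
    fields = []
    for match in LINE_PATTERN.finditer(text):
        raw_key, value = match.group(1), match.group(2)
        # Unknown keys are kept as-is rather than dropped.
        key = CANONICAL_KEYS.get(raw_key.lower(), raw_key)
        fields.append({"key": key, "value": value,
                       "confidence": regex_confidence})
    return fields
```

Because every line is matched independently, the result does not depend on the order in which sections appear in the document.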

Acceptance Criteria:

1. [ ] Supports section-order independent detection
2. [ ] Captures provenance pointer (page + coordinates)
3. [ ] Allows fallback to manual entry if extraction confidence < threshold


UC-PROC-010: Summarize Document Content

Purpose: Provide concise physician-ready summary paragraphs.

| Property | Value |
|----------|-------|
| Actor | Summarization Service |
| Trigger | Structured fields saved |
| Priority | P2 |

Main Success Scenario:

1. Chunk the redacted OCR text into segments of <= 2k tokens with semantic overlap
2. Prompt LLM (Azure OpenAI or on-prem) with instructions + safety rails
3. Collect summary bullets (max 5) and risk statements
4. Run toxicity / hallucination guardrail (QA prompt)
5. Store summary on document and push notification to UI
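
The chunking in step 1 could be approximated as below. This sketch uses a fixed-size sliding overlap, not a semantic one, and it operates on a pre-tokenized list. The 200-token overlap and the function name are assumptions, and a real service would count tokens with the LLM's own tokenizer.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence, max_tokens: int = 2000,
                 overlap: int = 200) -> List[list]:
    """Split tokens into windows of at most max_tokens, sharing `overlap`
    tokens between consecutive windows."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end
    return chunks
```

For a quick approximation, `chunk_tokens(text.split())` treats whitespace-separated words as tokens.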

Acceptance Criteria:

1. [ ] Summaries capped at 750 characters
2. [ ] Redacted text only (no PHI leakage)
3. [ ] Regenerate endpoint available for clinicians