Machine Learning Use Cases (ML)

UC-ML-001a: Curate Training Dataset

Purpose: Select and prepare high-quality data for fine-tuning.

| Property | Value |
|----------|-------|
| Actor | Data Engineer |
| Trigger | Weekly Schedule |
| Priority | P2 |

Main Success Scenario:

1. Query "Golden Dataset" (UC-ML-003) for approved corrections
2. Filter by audio quality (SNR > 20 dB) and transcript length
3. Split into Train/Val/Test sets (80/10/10)
4. Convert to HuggingFace Dataset format
5. Version control the dataset artifact (DVC)

Acceptance Criteria:

1. [ ] Zero PII leakage in training set
2. [ ] Balanced distribution of dialects
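
Steps 2 and 3 of the scenario above could be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Sample` record, the `keep` and `split_80_10_10` helpers, the transcript length bounds, and the fixed seed are all assumptions; only the SNR > 20 dB threshold and the 80/10/10 ratio come from the use case.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical record layout for one approved correction.
    audio_path: str
    transcript: str
    snr_db: float


MIN_SNR_DB = 20.0              # from the use case: SNR > 20 dB
MIN_WORDS, MAX_WORDS = 3, 400  # assumed bounds; the use case gives none


def keep(sample: Sample) -> bool:
    """Step 2: keep samples that pass the SNR and transcript length filters."""
    n_words = len(sample.transcript.split())
    return sample.snr_db > MIN_SNR_DB and MIN_WORDS <= n_words <= MAX_WORDS


def split_80_10_10(samples, seed=42):
    """Step 3: shuffle with a fixed seed, then cut into train/val/test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }
```

A fixed seed keeps the split reproducible across reruns. A plain random split does not guarantee the balanced dialect distribution in the acceptance criteria; a stratified split per dialect would be one way to address that.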


UC-ML-001b: Execute Fine-tuning Run

Purpose: Run the compute-intensive training job.

| Property | Value |
|----------|-------|
| Actor | ML Ops Pipeline |
| Trigger | Dataset Versioned |
| Priority | P2 |

Main Success Scenario:

1. Provision GPU cluster (e.g., A100s)
2. Load base model (Whisper/Llama) and new dataset
3. Execute LoRA/QLoRA training loop
4. Log metrics (Loss, WER) to MLflow
5. Save model checkpoints to Model Registry

Acceptance Criteria:

1. [ ] Auto-shutdown of GPUs after completion
2. [ ] Alert on gradient explosion or NaN loss
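
The two acceptance criteria above can be illustrated with a small guard around the training loop. This is a sketch under stated assumptions: `check_step`, `run_training`, `TrainingDivergence`, the `shutdown` callback, and the gradient norm limit of `1e3` are hypothetical names and values, and the loop consumes precomputed `(loss, grad_norm)` pairs rather than running a real LoRA/QLoRA step.

```python
import math


class TrainingDivergence(RuntimeError):
    """Raised when the loss or gradient norm indicates a diverged run."""


def check_step(loss: float, grad_norm: float, max_grad_norm: float = 1e3) -> None:
    # A NaN or infinite loss means the run cannot recover on its own.
    if not math.isfinite(loss):
        raise TrainingDivergence(f"non-finite loss: {loss}")
    # The limit is an assumed value; tune it per model and optimizer.
    if not math.isfinite(grad_norm) or grad_norm > max_grad_norm:
        raise TrainingDivergence(f"gradient explosion: norm={grad_norm}")


def run_training(steps, shutdown) -> int:
    """Consume (loss, grad_norm) pairs; always release GPUs afterwards."""
    completed = 0
    try:
        for loss, grad_norm in steps:
            check_step(loss, grad_norm)
            completed += 1
        return completed
    finally:
        # Runs on success and on failure, so GPUs never stay provisioned.
        shutdown()
```

Placing `shutdown()` in a `finally` block means the cluster is released whether the job completes or raises, and the raised `TrainingDivergence` gives the alerting layer a single exception type to watch for.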


UC-ML-002: Dialect Evaluation & Benchmarking

Purpose: Measure model performance across specific Indian languages/dialects.

| Property | Value |
|----------|-------|
| Actor | QA / ML Team |
| Trigger | New Model Candidate |
| Priority | P1 |

Main Success Scenario:

1. Load benchmark datasets (Hindi, Telugu, Tamil, etc.)
2. Run inference with candidate model
3. Compute WER, CER (Character Error Rate) per dialect
4. Generate a report comparing results against the baseline
5. Flag regressions > 1%

Acceptance Criteria:

1. [ ] Covers top 5 target languages
2. [ ] Includes medical-specific vocabulary test
3. [ ] Automated pass/fail gates
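
Steps 3 and 5 of the scenario above could look roughly like this. WER and CER are computed here as Levenshtein edit distance divided by reference length, which is the standard definition. The helper names are hypothetical, references are assumed to be non-empty, and the sketch reads "regressions > 1%" as an absolute increase of more than 0.01 in error rate, which is an assumption because the use case does not say whether the threshold is absolute or relative.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate for one utterance; reference must be non-empty."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate over Unicode code points."""
    return edit_distance(reference, hypothesis) / len(reference)


def regressions(baseline: dict, candidate: dict, tolerance: float = 0.01) -> dict:
    """Return dialects whose error rate rose by more than `tolerance`."""
    return {
        dialect: candidate[dialect] - baseline[dialect]
        for dialect in baseline
        if candidate[dialect] - baseline[dialect] > tolerance
    }
```

For example, `wer("the patient has fever", "the patient has a fever")` is one inserted word over four reference words. An automated gate could fail the candidate whenever `regressions(...)` returns a non-empty result. Note that `cer` counts code points, which for Indic scripts is not the same as counting user-perceived characters.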


UC-ML-003: Continuous Quality Feedback Loop

Purpose: Improve models using doctor corrections.

| Property | Value |
|----------|-------|
| Actor | System |
| Trigger | Doctor edits note |
| Priority | P2 |

Main Success Scenario:

1. Capture "Diff" between generated note and final signed note
2. Anonymize and store as (Input, Correction) pair
3. Aggregate corrections by category (Hallucination, Missed Entity)
4. Add high-quality pairs to "Golden Dataset"
5. Trigger fine-tuning (UC-ML-001a, then UC-ML-001b) when the dataset grows by 10%

Acceptance Criteria:

1. [ ] Strict PII scrubbing before adding to training set
2. [ ] Doctor opt-in/opt-out for data usage
3. [ ] Quality filter to exclude bad edits
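
Steps 1 and 5 of the scenario above can be sketched with the standard library's `difflib`. The function names are hypothetical, the diff is taken at word level as one possible granularity, and measuring the 10% growth against the dataset size at the last fine-tuning run is an assumption, since the use case does not name the baseline. PII scrubbing is deliberately out of scope here and would have to run before any pair is stored.

```python
import difflib


def word_diff(generated: str, signed: str):
    """Step 1: list word-level changes between generated and signed notes."""
    gen, fin = generated.split(), signed.split()
    matcher = difflib.SequenceMatcher(a=gen, b=fin, autojunk=False)
    return [
        (tag, " ".join(gen[i1:i2]), " ".join(fin[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replace / insert / delete operations
    ]


def should_trigger_finetune(size_at_last_run: int, current_size: int,
                            growth: float = 0.10) -> bool:
    """Step 5: fire once the dataset has grown by at least `growth`."""
    if size_at_last_run <= 0:
        # Assumed behavior for the first run: any data triggers training.
        return current_size > 0
    return (current_size - size_at_last_run) / size_at_last_run >= growth
```

Each tuple from `word_diff` holds the operation, the generated text, and the doctor's correction, which maps onto the (Input, Correction) pair in step 2 and gives the categorization in step 3 something concrete to classify.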