Machine Learning Use Cases (ML)¶

UC-ML-001a: Curate Training Dataset¶

Purpose: Select and prepare high-quality data for fine-tuning.

Property	Value
Actor	Data Engineer
Trigger	Weekly Schedule
Priority	P2

Main Success Scenario:

1. Query "Golden Dataset" (UC-ML-003) for approved corrections
2. Filter for audio quality (SNR > 20dB) and transcript length
3. Split into Train/Val/Test sets (80/10/10)
4. Convert to HuggingFace Dataset format
5. Version control the dataset artifact (DVC)

Acceptance Criteria:

[ ] Zero PII leakage in training set
[ ] Balanced distribution of dialects

UC-ML-001b: Execute Fine-tuning Run¶

Purpose: Run the compute-intensive training job.

Property	Value
Actor	ML Ops Pipeline
Trigger	Dataset Versioned
Priority	P2

Main Success Scenario:

1. Provision GPU cluster (e.g., A100s)
2. Load base model (Whisper/Llama) and new dataset
3. Execute LoRA/QLoRA training loop
4. Log metrics (Loss, WER) to MLflow
5. Save model checkpoints to Model Registry

Acceptance Criteria:

[ ] Auto-shutdown of GPUs after completion
[ ] Alert on gradient explosion or NaN loss

UC-ML-002: Dialect Evaluation & Benchmarking¶

Purpose: Measure model performance across specific Indian languages/dialects.

Property	Value
Actor	QA / ML Team
Trigger	New Model Candidate
Priority	P1

Main Success Scenario:

1. Load benchmark datasets (Hindi, Telugu, Tamil, etc.)
2. Run inference with candidate model
3. Compute WER, CER (Character Error Rate) per dialect
4. Generate report comparing vs baseline
5. Flag regressions > 1%

Acceptance Criteria:

[ ] Covers top 5 target languages
[ ] Includes medical-specific vocabulary test
[ ] Automated pass/fail gates

UC-ML-003: Continuous Quality Feedback Loop¶

Purpose: Improve models using doctor corrections.

Property	Value
Actor	System
Trigger	Doctor edits note
Priority	P2

Main Success Scenario:

1. Capture "Diff" between generated note and final signed note
2. Anonymize and store as (Input, Correction) pair
3. Aggregate corrections by category (Hallucination, Missed Entity)
4. Add high-quality pairs to "Golden Dataset"
5. Trigger fine-tuning (UC-ML-001) when dataset grows by 10%

Acceptance Criteria:

[ ] Strict PII scrubbing before adding to training set
[ ] Doctor opt-in/opt-out for data usage
[ ] Quality filter to exclude bad edits

Related: Processing | Quality & Safety | Operations