The OCR path defaults to English, but OpenMed ships PII detection for 12 languages, and scanned clinical documents arrive in many of them. OM-045 does not address OCR-language coverage. This is a small, self-contained enhancement: let callers select OCR languages and align them with the downstream PII language model, with a sensible default.
Scope
Extend openmed/multimodal/ocr.py to accept a languages parameter passed to the Tesseract/PaddleOCR adapter (Tesseract lang codes, PaddleOCR lang names).
Add a mapping from OpenMed PII language codes (en, fr, de, it, es, nl, hi, te, pt, ar, ja, tr) to the corresponding OCR engine language identifiers so redact_document(lang=) configures OCR and detection together.
Document the per-engine language-pack install requirement (Tesseract traineddata, PaddleOCR model download) in the extra notes.
Default to English when no language is specified to preserve current behavior.
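The mapping and the English default described above could be sketched roughly as follows. This is only an illustration: the names TESSERACT_LANG, DEFAULT_LANGUAGES, and resolve_tesseract_lang are hypothetical and do not come from openmed/multimodal/ocr.py. The values are the standard Tesseract traineddata names, and Tesseract accepts several languages joined with "+". A PaddleOCR table would follow the same pattern, with names taken from PaddleOCR's own language list.

```python
from typing import Optional, Sequence

# Hypothetical table: OpenMed PII language code -> Tesseract traineddata name.
TESSERACT_LANG = {
    "en": "eng", "fr": "fra", "de": "deu", "it": "ita",
    "es": "spa", "nl": "nld", "hi": "hin", "te": "tel",
    "pt": "por", "ar": "ara", "ja": "jpn", "tr": "tur",
}

# Assumed default, preserving the current English-only behavior.
DEFAULT_LANGUAGES = ("en",)


def resolve_tesseract_lang(languages: Optional[Sequence[str]] = None) -> str:
    """Turn OpenMed language codes into a Tesseract lang string such as 'eng+fra'."""
    codes = list(languages) if languages else list(DEFAULT_LANGUAGES)
    unknown = [code for code in codes if code not in TESSERACT_LANG]
    if unknown:
        # Fail early rather than silently falling back to English.
        raise ValueError(f"unsupported OCR language(s): {unknown}")
    return "+".join(TESSERACT_LANG[code] for code in codes)
```

Raising on an unknown code is one possible design choice; falling back to English with a warning would be another, at the cost of silently poor OCR on non-English scans.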
Acceptance criteria
ocr(..., languages=['fr']) passes the correct lang identifier to each adapter (verified via the fake engine recording the call).
The OpenMed-lang -> OCR-lang map covers all 12 wired PII languages.
Default behavior remains English when unspecified.
Test suite is green: .venv/bin/python -m pytest tests/ -q
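A test along the lines of the first and third criteria might look like the sketch below. Everything here is an assumption made for illustration: FakeEngine, its recognize method, the trimmed _LANG table, and the engine argument of ocr are hypothetical stand-ins, not the actual adapter interface in openmed/multimodal/ocr.py.

```python
from typing import List, Optional, Sequence

# Trimmed, hypothetical mapping; the real one would cover all 12 languages.
_LANG = {"en": "eng", "fr": "fra"}


class FakeEngine:
    """Stand-in OCR engine that records the lang identifier it receives."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def recognize(self, image: bytes, lang: str) -> str:
        self.calls.append(lang)
        return ""


def ocr(image: bytes, engine: FakeEngine,
        languages: Optional[Sequence[str]] = None) -> str:
    """Hypothetical ocr() that forwards the mapped language to the engine."""
    codes = list(languages) if languages else ["en"]
    return engine.recognize(image, lang="+".join(_LANG[c] for c in codes))


def test_french_is_forwarded() -> None:
    engine = FakeEngine()
    ocr(b"", engine, languages=["fr"])
    assert engine.calls == ["fra"]


def test_default_is_english() -> None:
    engine = FakeEngine()
    ocr(b"", engine)
    assert engine.calls == ["eng"]
```

Recording calls on the fake keeps the test independent of any installed Tesseract or PaddleOCR language packs.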
Out of scope
Adding new PII model languages (language theme).
Bundling traineddata/model files in the wheel.
Pixel redaction mechanics.
Files
openmed/multimodal/ocr.py
tests/unit/multimodal/test_ocr_languages.py
Task: OM-150 · Milestone: Backlog · Priority: P3 · Size: S
Depends on: — · Blocks: —
Roadmap: Coverage map THEME D STILL MISSING (6): OCR-language coverage beyond English; sec 5.8
Spec: PLANS/V2/EXECUTION/tasks/OM-150.md