Add OCR language-pack configuration for non-English clinical scans #315

@maziyarpanahi

Summary

The OCR path defaults to English, but OpenMed ships PII detection for 12 languages, and scanned clinical documents arrive in many of them. OM-045 does not address OCR-language coverage. This is a small, self-contained enhancement: let callers select OCR languages and keep them aligned with the downstream PII language model, defaulting to English when nothing is specified.

Scope

  • Extend openmed/multimodal/ocr.py to accept a languages parameter passed to the Tesseract/PaddleOCR adapter (Tesseract lang codes, PaddleOCR lang names).
  • Add a mapping from OpenMed PII language codes (en, fr, de, it, es, nl, hi, te, pt, ar, ja, tr) to the corresponding OCR engine language identifiers so redact_document(lang=) configures OCR and detection together.
  • Document the per-engine language-pack install requirement (Tesseract traineddata, PaddleOCR model download) in the extra notes.
  • Default to English when no language is specified to preserve current behavior.
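
The mapping and default described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `OCR_LANG_MAP`, `DEFAULT_LANGUAGES`, `resolve_ocr_languages`, and the engine keys `"tesseract"` / `"paddleocr"` are hypothetical names. The Tesseract codes are the standard traineddata names; the PaddleOCR names are assumptions that should be checked against the installed PaddleOCR version.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical map: OpenMed PII language code -> per-engine OCR identifier.
# Tesseract values are traineddata names; PaddleOCR values are assumed
# and should be verified against the PaddleOCR release in use.
OCR_LANG_MAP: Dict[str, Dict[str, str]] = {
    "en": {"tesseract": "eng", "paddleocr": "en"},
    "fr": {"tesseract": "fra", "paddleocr": "fr"},
    "de": {"tesseract": "deu", "paddleocr": "german"},
    "it": {"tesseract": "ita", "paddleocr": "it"},
    "es": {"tesseract": "spa", "paddleocr": "es"},
    "nl": {"tesseract": "nld", "paddleocr": "nl"},
    "hi": {"tesseract": "hin", "paddleocr": "hi"},
    "te": {"tesseract": "tel", "paddleocr": "te"},
    "pt": {"tesseract": "por", "paddleocr": "pt"},
    "ar": {"tesseract": "ara", "paddleocr": "ar"},
    "ja": {"tesseract": "jpn", "paddleocr": "japan"},
    "tr": {"tesseract": "tur", "paddleocr": "tr"},
}

# English stays the default so existing callers see no behavior change.
DEFAULT_LANGUAGES = ("en",)


def resolve_ocr_languages(
    engine: str, languages: Optional[Sequence[str]] = None
) -> List[str]:
    """Translate OpenMed language codes into identifiers for one OCR engine."""
    codes = list(languages) if languages else list(DEFAULT_LANGUAGES)
    resolved = []
    for code in codes:
        try:
            resolved.append(OCR_LANG_MAP[code][engine])
        except KeyError:
            raise ValueError(
                f"No OCR language mapping for language {code!r} and engine {engine!r}"
            ) from None
    return resolved
```

A Tesseract adapter would typically join the resolved codes with `+` (for example `"+".join(resolve_ocr_languages("tesseract", ["en", "fr"]))` gives `"eng+fra"`). A PaddleOCR adapter would likely need one model per language, so it may only use the first resolved entry; that choice is left open here.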

Acceptance criteria

  • ocr(..., languages=['fr']) passes the correct lang identifier to each adapter (verified via the fake engine recording the call).
  • The OpenMed-lang -> OCR-lang map covers all 12 wired PII languages.
  • Default behavior remains English when unspecified.
  • The test suite is green: .venv/bin/python -m pytest tests/ -q
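
The first and third criteria could be checked with a recording fake along these lines. This is only a sketch: `FakeEngine`, its `recognize` method, the `TESSERACT_LANGS` subset, and the stand-in `ocr` signature are all hypothetical, since the real adapter interface in openmed/multimodal/ocr.py is not shown here.

```python
from typing import List, Optional, Sequence

# Hypothetical two-entry subset of the OpenMed -> Tesseract map, for the test only.
TESSERACT_LANGS = {"en": "eng", "fr": "fra"}


class FakeEngine:
    """Test double that records the language identifiers it receives."""

    def __init__(self) -> None:
        self.calls: List[List[str]] = []

    def recognize(self, image: bytes, langs: List[str]) -> str:
        self.calls.append(langs)
        return ""  # no real OCR is performed


def ocr(
    image: bytes, engine: FakeEngine, languages: Optional[Sequence[str]] = None
) -> str:
    """Stand-in for the real ocr(); falls back to English when unspecified."""
    codes = list(languages) if languages else ["en"]
    return engine.recognize(image, [TESSERACT_LANGS[c] for c in codes])


def test_french_is_forwarded() -> None:
    engine = FakeEngine()
    ocr(b"", engine, languages=["fr"])
    assert engine.calls == [["fra"]]


def test_default_is_english() -> None:
    engine = FakeEngine()
    ocr(b"", engine)
    assert engine.calls == [["eng"]]
```

Because the fake records each call, the tests assert on what the adapter was asked to do rather than on OCR output, so no traineddata or model download is needed to run them.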

Out of scope

  • Adding new PII model languages (language theme).
  • Bundling traineddata/model files in the wheel.
  • Pixel redaction mechanics.

Files

  • openmed/multimodal/ocr.py
  • tests/unit/multimodal/test_ocr_languages.py

Task: OM-150 · Milestone: Backlog · Priority: P3 · Size: S
Depends on: — · Blocks: —
Roadmap: Coverage map THEME D STILL MISSING (6): OCR-language coverage beyond English; sec 5.8
Spec: PLANS/V2/EXECUTION/tasks/OM-150.md

    Labels

    P3 (Strategic) · good first issue (Good for newcomers) · help wanted (Extra attention is needed) · improvement (Hardening / refactor of existing code) · roadmap-v2 (OpenMed V2 roadmap backlog)
