Skip to content

feat: declare the [multimodal] extra and ingest/redact contract#555

Merged
maziyarpanahi merged 1 commit into
maziyarpanahi:masterfrom
pardeep-singh:pardeep/issue223-multimodal-extra
Jun 22, 2026
Merged

feat: declare the [multimodal] extra and ingest/redact contract#555
maziyarpanahi merged 1 commit into
maziyarpanahi:masterfrom
pardeep-singh:pardeep/issue223-multimodal-extra

Conversation

@pardeep-singh

Copy link
Copy Markdown
Contributor

Implements #223 (OM-058): the single, stable foundation the multimodal ingestion line builds on — a declared [multimodal] pip extra and a small shared ingest/redact contract — so each later per-format PR registers a handler instead of re-wiring dependencies and the text<->offset surface.

This is foundation-only: no per-format parsers and no OCR wiring (those are #226 and later). It unblocks OM-060/061/062/064/082.

Changes

  • pyproject.toml — new multimodal optional-dependency extra (pdfplumber, python-docx, Pillow), kept out of the base install.
  • openmed/multimodal/base.py
    • ExtractedDocument — normalized plain text plus a char-offset -> source-location map (SourceSpan with page, char range, optional bbox, and per-block metadata). Helpers: from_blocks(...), location_at(offset), text_for(span).
    • Lazy extension -> handler registry via register_handler(extensions, handler), so later format tasks register without editing the dispatcher.
    • redact_document(path, *, policy=None, models=None) dispatcher that routes by file extension.
  • Import isolationbase.py has no module-level heavy imports, and multimodal ships its own lightweight exceptions.py (reusing openmed.ners exception would have pulled in transformers/torch). Importing openmed or openmed.multimodal never imports pdfplumber/python-docx/Pillow. redact_document raises a clear MissingDependencyError naming the multimodal extra when deps are absent, and UnsupportedDocumentError for unregistered file types.
  • openmed/multimodal/__init__.py — re-exports the public contract.
  • scripts/release/check_license_policy.py — added reviewed license entries for the three new deps and added HPND (Pillows actual OSI-approved permissive license) to the allowed-marker list. Flagging this since it broadens the global permissive allowlist; happy to adjust if youd prefer a narrower handling.

Acceptance criteria

  • openmed.multimodal.redact_document and ExtractedDocument are exposed (re-exported from the package).
  • Importing openmed does not import pdfplumber/python-docx/Pillow (asserted via a subprocess sys.modules inspection, so it is robust even when the extra is installed).
  • redact_document raises a clear, ImportError-style message naming the multimodal extra when deps are missing.
  • ExtractedDocument round-trips text<->offset map for a synthetic fixture.

Testing

  • New multimodal tests pass; full unit suite passes (.venv/bin/python -m pytest tests/ -q -> 1625 passed, 12 skipped).
  • The only 2 failures in the full run are pre-existing, network-gated integration tests (tests/integration/test_sentence_detection_real.py, which download dslim/bert-base-NER); they fail identically on master offline and are unrelated to this change.
  • ruff check and ruff format clean on the changed files.

Out of scope

  • Per-format parsers (PDF/DOCX/HTML/DICOM/image).
  • OCR engine wiring (separate adapter task).
  • The [structured] extra.

Closes #223

Establishes the shared foundation for the multimodal ingestion line (OM-058)
so per-format ingesters can register without re-wiring dependencies or the
text<->offset surface.

- pyproject.toml: add a 'multimodal' optional-dependency extra (pdfplumber,
  python-docx, Pillow) kept out of the base install.
- openmed/multimodal/base.py: ExtractedDocument (normalized text + a
  char-offset->source-location map via SourceSpan + per-block metadata) with
  from_blocks/location_at/text_for; a lazy extension->handler registry
  (register_handler); and a redact_document(path, *, policy=None, models=None)
  dispatcher that routes by file extension.
- Import isolation: base.py has no module-level heavy imports, and multimodal
  ships its own lightweight exceptions module so importing openmed (core) or
  openmed.multimodal never pulls in pdfplumber/python-docx/Pillow (or
  transformers/torch). redact_document raises a clear MissingDependencyError
  naming the multimodal extra when the deps are absent, and
  UnsupportedDocumentError for unregistered file types.
- openmed/multimodal/__init__.py: re-export the public contract.
- scripts/release/check_license_policy.py: add reviewed license entries for
  pdfplumber/python-docx/Pillow and allow HPND (Pillow's license) in the
  permissive marker list.
- tests: import-isolation via subprocess sys.modules inspection, the
  missing-extra error, ExtractedDocument text<->offset round-trip, and
  registry dispatch incl. unsupported extensions.

No per-format parsers or OCR wiring (those are follow-up tasks).

Closes maziyarpanahi#223

@maziyarpanahi maziyarpanahi left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @pardeep-singh. I reviewed this against #223 / OM-058.

The implementation matches the foundation scope I would want for this issue: the [multimodal] extra is declared without changing the base install, ExtractedDocument and SourceSpan provide the shared text/offset/source-location contract, handlers register lazily by extension, redact_document() keeps per-format parsing out of this PR, and openmed.multimodal remains lightweight enough to import without pulling in pdfplumber/python-docx/Pillow.

I did not need to change the branch code. I copied the labels from #223 onto the PR.

Verification on the current PR checkout:

  • PYTHONPATH=/private/tmp/openmed-pr-555 /Users/maziyar/Developer/openmed/.venv/bin/python -m pytest tests/unit/multimodal tests/unit/release/test_license_policy.py -q -> 19 passed
  • /Users/maziyar/Developer/openmed/.venv/bin/ruff check openmed/multimodal pyproject.toml scripts/release/check_license_policy.py tests/unit/multimodal tests/unit/release/test_license_policy.py -> passed
  • /Users/maziyar/Developer/openmed/.venv/bin/ruff format --check openmed/multimodal scripts/release/check_license_policy.py tests/unit/multimodal tests/unit/release/test_license_policy.py -> passed

Hosted CI is green and the branch is clean/mergeable. This is ready to merge from code review.

@pardeep-singh

Copy link
Copy Markdown
Contributor Author

@maziyarpanahi Lets merge it then 😃

@maziyarpanahi

Copy link
Copy Markdown
Owner

@maziyarpanahi Lets merge it then 😃

Sorry for the delay. I have frozen the main branch to prepare the v1.6.0 release. This will go to the next release. Will be able to merge in a few hours. Sorry again, we will do much more frequent releases from now on so nobody's blocked

@maziyarpanahi

Copy link
Copy Markdown
Owner

actually, I will freeze the v1.6.0 in a separate branch so we can merge the rest into main branch. I am merging this now.

Thank you so much!

@maziyarpanahi maziyarpanahi merged commit 96d9e7d into maziyarpanahi:master Jun 22, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature New capability P1 High roadmap-v2 OpenMed V2 roadmap backlog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Declare the [multimodal] optional-dependency extra and ingest/redact contract

2 participants