feat: declare the [multimodal] extra and ingest/redact contract#555
Conversation
Establishes the shared foundation for the multimodal ingestion line (OM-058) so per-format ingesters can register without re-wiring dependencies or the text<->offset surface. - pyproject.toml: add a 'multimodal' optional-dependency extra (pdfplumber, python-docx, Pillow) kept out of the base install. - openmed/multimodal/base.py: ExtractedDocument (normalized text + a char-offset->source-location map via SourceSpan + per-block metadata) with from_blocks/location_at/text_for; a lazy extension->handler registry (register_handler); and a redact_document(path, *, policy=None, models=None) dispatcher that routes by file extension. - Import isolation: base.py has no module-level heavy imports, and multimodal ships its own lightweight exceptions module so importing openmed (core) or openmed.multimodal never pulls in pdfplumber/python-docx/Pillow (or transformers/torch). redact_document raises a clear MissingDependencyError naming the multimodal extra when the deps are absent, and UnsupportedDocumentError for unregistered file types. - openmed/multimodal/__init__.py: re-export the public contract. - scripts/release/check_license_policy.py: add reviewed license entries for pdfplumber/python-docx/Pillow and allow HPND (Pillow's license) in the permissive marker list. - tests: import-isolation via subprocess sys.modules inspection, the missing-extra error, ExtractedDocument text<->offset round-trip, and registry dispatch incl. unsupported extensions. No per-format parsers or OCR wiring (those are follow-up tasks). Closes maziyarpanahi#223
maziyarpanahi
left a comment
There was a problem hiding this comment.
Thank you @pardeep-singh. I reviewed this against #223 / OM-058.
The implementation matches the foundation scope I would want for this issue: the [multimodal] extra is declared without changing the base install, ExtractedDocument and SourceSpan provide the shared text/offset/source-location contract, handlers register lazily by extension, redact_document() keeps per-format parsing out of this PR, and openmed.multimodal remains lightweight enough to import without pulling in pdfplumber/python-docx/Pillow.
I did not need to change the branch code. I copied the labels from #223 onto the PR.
Verification on the current PR checkout:
PYTHONPATH=/private/tmp/openmed-pr-555 /Users/maziyar/Developer/openmed/.venv/bin/python -m pytest tests/unit/multimodal tests/unit/release/test_license_policy.py -q-> 19 passed/Users/maziyar/Developer/openmed/.venv/bin/ruff check openmed/multimodal pyproject.toml scripts/release/check_license_policy.py tests/unit/multimodal tests/unit/release/test_license_policy.py-> passed/Users/maziyar/Developer/openmed/.venv/bin/ruff format --check openmed/multimodal scripts/release/check_license_policy.py tests/unit/multimodal tests/unit/release/test_license_policy.py-> passed
Hosted CI is green and the branch is clean/mergeable. This is ready to merge from code review.
|
@maziyarpanahi Lets merge it then 😃 |
Sorry for the delay. I have frozen the main branch to prepare the v1.6.0 release. This will go to the next release. Will be able to merge in a few hours. Sorry again, we will do much more frequent releases from now on so nobody's blocked |
|
actually, I will freeze the v1.6.0 in a separate branch so we can merge the rest into main branch. I am merging this now. Thank you so much! |
Implements #223 (OM-058): the single, stable foundation the multimodal ingestion line builds on — a declared
[multimodal]pip extra and a small shared ingest/redact contract — so each later per-format PR registers a handler instead of re-wiring dependencies and the text<->offset surface.This is foundation-only: no per-format parsers and no OCR wiring (those are #226 and later). It unblocks OM-060/061/062/064/082.
Changes
pyproject.toml— newmultimodaloptional-dependency extra (pdfplumber,python-docx,Pillow), kept out of the base install.openmed/multimodal/base.pyExtractedDocument— normalized plain text plus a char-offset -> source-location map (SourceSpanwith page, char range, optional bbox, and per-block metadata). Helpers:from_blocks(...),location_at(offset),text_for(span).register_handler(extensions, handler), so later format tasks register without editing the dispatcher.redact_document(path, *, policy=None, models=None)dispatcher that routes by file extension.base.pyhas no module-level heavy imports, andmultimodalships its own lightweightexceptions.py(reusingopenmed.ners exception would have pulled in transformers/torch). Importingopenmedoropenmed.multimodalnever imports pdfplumber/python-docx/Pillow.redact_documentraises a clearMissingDependencyErrornaming themultimodalextra when deps are absent, andUnsupportedDocumentErrorfor unregistered file types.openmed/multimodal/__init__.py— re-exports the public contract.scripts/release/check_license_policy.py— added reviewed license entries for the three new deps and added HPND (Pillows actual OSI-approved permissive license) to the allowed-marker list. Flagging this since it broadens the global permissive allowlist; happy to adjust if youd prefer a narrower handling.Acceptance criteria
openmed.multimodal.redact_documentandExtractedDocumentare exposed (re-exported from the package).openmeddoes not import pdfplumber/python-docx/Pillow (asserted via a subprocesssys.modulesinspection, so it is robust even when the extra is installed).redact_documentraises a clear, ImportError-style message naming themultimodalextra when deps are missing.ExtractedDocumentround-trips text<->offset map for a synthetic fixture.Testing
.venv/bin/python -m pytest tests/ -q-> 1625 passed, 12 skipped).tests/integration/test_sentence_detection_real.py, which downloaddslim/bert-base-NER); they fail identically onmasteroffline and are unrelated to this change.ruff checkandruff formatclean on the changed files.Out of scope
[structured]extra.Closes #223