Add various improvements for CHUV applications #111

Merged
fabnemEPFL merged 24 commits into EPFLiGHT:master from MichelDucartier:master
Jun 23, 2025

@MichelDucartier

Collaborator

This pull request extends the post-processing pipeline, updates dependencies, and improves document processing. It adds two post-processing modules (Translator and MetaData Infusor), reworks the DOCX processor, and extends the RAG retriever. The most important changes are summarized by category below:

Post-Processing Enhancements:

  • Added a TranslatorPostProcessor that translates text, with a configurable target language, confidence threshold, and set of constrained languages (src/mmore/process/post_processor/translator/base.py, src/mmore/process/post_processor/translator/__init__.py).
  • Introduced the MetaDataInfusor post-processor, which injects metadata into text at a configurable position (beginning or end) (src/mmore/process/post_processor/metafuse/base.py, src/mmore/process/post_processor/metafuse/__init__.py).
  • Added a FileNamer tagger that extracts file names from metadata (src/mmore/process/post_processor/tagger/file_namer.py, src/mmore/process/post_processor/tagger/__init__.py).
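
The two post-processors above can be illustrated with a minimal, standard-library-only sketch. It is not the code from this pull request: the `Detection` class and the functions `should_translate` and `infuse_metadata` are hypothetical names, and the sketch assumes the confidence threshold applies to the language-detection score.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical language-detector result: a language code and a score in [0, 1]."""

    lang: str
    confidence: float


def should_translate(detection: Detection, target_lang: str, threshold: float) -> bool:
    """Translate only when the text is confidently detected as a non-target language."""
    if detection.lang == target_lang:
        return False  # already in the target language, nothing to do
    return detection.confidence >= threshold


def infuse_metadata(text: str, metadata: dict, keys: list, position: str = "beginning") -> str:
    """Render the selected metadata keys as "key: value" lines and attach them to the text."""
    if position not in ("beginning", "end"):
        raise ValueError("position must be 'beginning' or 'end'")
    lines = [f"{key}: {metadata[key]}" for key in keys if key in metadata]
    if not lines:
        return text  # no requested key is present, leave the text untouched
    block = "\n".join(lines)
    if position == "beginning":
        return f"{block}\n\n{text}"
    return f"{text}\n\n{block}"
```

For example, `infuse_metadata("Body", {"file_name": "a.docx"}, ["file_name"], "end")` appends the line `file_name: a.docx` after the body. The actual processors may format metadata differently.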

Dependency Updates:

  • Downgraded langchain-huggingface to version 0.2.0 and added new dependencies: langid, mammoth, argostranslate, and sentence-transformers (pyproject.toml).

DOCX Processor Improvements:

  • Replaced the existing DOCX processing logic with mammoth, which converts DOCX to HTML; added support for extracting and saving embedded images; and improved metadata handling (src/mmore/process/processors/docx_processor.py).
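
To make the image-extraction step concrete, here is a sketch that deliberately does not use mammoth. It relies only on the fact that a DOCX file is a ZIP archive whose embedded images are stored under `word/media/`. The function name `extract_docx_images` is hypothetical, and the processor in this pull request may save images in a different way.

```python
import zipfile
from pathlib import Path


def extract_docx_images(docx_path, output_dir):
    """Copy every file stored under word/media/ in the DOCX archive into output_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Embedded images live in word/media/; skip directory entries.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = output_dir / Path(name).name
                target.write_bytes(archive.read(name))
                saved.append(target)
    return saved
```

The function returns the list of written paths, so a caller could record them in the document metadata alongside the converted text.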

RAG Retriever Enhancements:

  • Added support for filtering search results with an optional expr parameter and expanded the output fields to include all document attributes (src/mmore/rag/retriever.py).
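
The description does not specify the syntax of `expr`. Assuming it is a boolean filter string of the kind used by vector stores such as Milvus (for example `lang == "en" and year == 2024`), a caller might assemble it from a dictionary as sketched below. The helper `build_expr` and the field names are hypothetical and are not part of the retriever.

```python
def build_expr(filters: dict) -> str:
    """Join equality conditions into one boolean filter string (assumed syntax)."""
    clauses = []
    for field, value in filters.items():
        if isinstance(value, str):
            # Quote strings and escape backslashes and embedded double quotes.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            clauses.append(f'{field} == "{escaped}"')
        else:
            clauses.append(f"{field} == {value}")
    return " and ".join(clauses)
```

An empty dictionary yields an empty string, which a caller could treat as "no filter" before passing the value on as `expr`.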

Miscellaneous Updates:

  • Updated the process function to allow skipping already processed files via a new skip_already_processed flag (src/mmore/run_process.py).
  • Corrected the documents_path in config.yaml to reflect the updated post-processor output path (examples/index/config.yaml).
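
The effect of a flag like `skip_already_processed` can be sketched as a simple filter over the input files. How the pipeline actually records which files were processed is not stated here, so the function `files_to_process` and the `processed_names` set are assumptions made for illustration.

```python
from pathlib import Path


def files_to_process(input_files, processed_names, skip_already_processed=True):
    """Return the inputs that still need processing, compared by file name."""
    if not skip_already_processed:
        return list(input_files)  # flag disabled: reprocess everything
    return [f for f in input_files if Path(f).name not in processed_names]
```

With the flag disabled, every input is returned, which corresponds to reprocessing all files on each run.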

@MichelDucartier MichelDucartier self-assigned this Jun 21, 2025
@MichelDucartier MichelDucartier linked an issue Jun 21, 2025 that may be closed by this pull request
Comment thread src/mmore/process/post_processor/translator/base.py Outdated
Comment thread src/mmore/process/post_processor/translator/base.py Outdated
Comment thread src/mmore/process/processors/docx_processor.py Outdated
@fabnemEPFL fabnemEPFL merged commit 275f692 into EPFLiGHT:master Jun 23, 2025
4 checks passed
JCHAVEROT pushed a commit to JCHAVEROT/mmore that referenced this pull request Jun 19, 2026
* Solve conflicts

* Add translator + update dependencies

* Modify docx_processor and md_processor

* Fix imports

* Fix deps for rag inference

* Some trailing endline stuff (windows is weird)

* Add metafuse

* Add metafuser

* Add expr in parameters of retriever

* Reset dispatcher

* Reset run_rag

* Reset some things

* Clean

* Fix toml

* Reset again

* Add better docx processor and fix metafuse

* Add examples for translating postprocessor

* Downgrade langchain-huggingface

* Add documentation

* Fix langchain huggingface version

* Ruff linting

* Ruff formatting

* Clean

* fixed typing

---------

Co-authored-by: Zhang Michael <Michael.Zhang@chuv.ch>
Co-authored-by: fabnemEPFL <fabrice.nemo@epfl.ch>
Development

Successfully merging this pull request may close these issues.

Add a translator in the post-processor
