Add various improvements for CHUV applications #111
Merged
fabnemEPFL reviewed Jun 22, 2025
JCHAVEROT pushed a commit to JCHAVEROT/mmore that referenced this pull request Jun 19, 2026
* Solve conflicts
* Add translator + update dependencies
* Modify docx_processor and md_processor
* Fix imports
* Fix deps for rag inference
* Some trailing endline stuff (windows is weird)
* Add metafuse
* Add metafuser
* Add expr in parameters of retriever
* Reset dispatcher
* Reset run_rag
* Reset some things
* Clean
* Fix toml
* Reset again
* Add better docx processor and fix metafuse
* Add examples for translating postprocessor
* Downgrade langchain-huggingface
* Add documentation
* Fix langchain huggingface version
* Ruff linting
* Ruff formatting
* Clean
* fixed typing
---------
Co-authored-by: Zhang Michael <Michael.Zhang@chuv.ch>
Co-authored-by: fabnemEPFL <fabrice.nemo@epfl.ch>
This pull request introduces significant enhancements and updates to the post-processing pipeline, dependency management, and document processing capabilities. Key changes include the addition of new post-processing modules (Translator and MetaData Infusor), updates to dependencies, improvements to the DOCX processor, and expanded functionality in the RAG retriever. Below is a categorized summary of the most important changes:
Post-Processing Enhancements:
* Added `TranslatorPostProcessor` to support translation of text with a configurable target language, confidence thresholds, and constrained languages (`src/mmore/process/post_processor/translator/base.py`, `src/mmore/process/post_processor/translator/__init__.py`).
* Added the `MetaDataInfusor` post-processor to inject metadata into text at configurable positions (beginning or end) (`src/mmore/process/post_processor/metafuse/base.py`, `src/mmore/process/post_processor/metafuse/__init__.py`).
* Added the `FileNamer` tagger for extracting file names from metadata (`src/mmore/process/post_processor/tagger/file_namer.py`, `src/mmore/process/post_processor/tagger/__init__.py`).

Dependency Updates:
* Changed `langchain-huggingface` to version `0.2.0` and added new dependencies: `langid`, `mammoth`, `argostranslate`, and `sentence-transformers` (`pyproject.toml`).

DOCX Processor Improvements:
* Adopted `mammoth` for converting DOCX to HTML, added support for extracting and saving embedded images, and improved metadata handling (`src/mmore/process/processors/docx_processor.py`).

RAG Retriever Enhancements:
* Added an `expr` parameter and expanded output fields to include all document attributes (`src/mmore/rag/retriever.py`).

Miscellaneous Updates:
* Updated the `process` function to allow skipping already processed files via a new `skip_already_processed` flag (`src/mmore/run_process.py`).
* Updated `documents_path` in `config.yaml` to reflect the updated post-processor output path (`examples/index/config.yaml`).
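To illustrate what injecting metadata at a configurable position (beginning or end) can look like, here is a minimal sketch. It is not the `MetaDataInfusor` implementation from this PR: the function name `infuse_metadata`, its parameters, and the `key: value` rendering are all assumptions made for illustration.

```python
from typing import Dict


def infuse_metadata(text: str, metadata: Dict[str, str], position: str = "beginning") -> str:
    """Return `text` with a rendered metadata block attached at `position`.

    Hypothetical helper: the name, signature and output format are
    illustrative and are not taken from the mmore code base.
    """
    if position not in ("beginning", "end"):
        raise ValueError(f"position must be 'beginning' or 'end', got {position!r}")

    # Render each metadata entry as a "key: value" line, in insertion order.
    block = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    if not block:
        return text  # nothing to inject

    # Separate the metadata block from the body with one blank line.
    if position == "beginning":
        return f"{block}\n\n{text}"
    return f"{text}\n\n{block}"


if __name__ == "__main__":
    print(infuse_metadata("Report body.", {"file_name": "report.docx"}, position="end"))
```

Rejecting unknown positions up front, rather than silently defaulting, is a design choice of this sketch: it makes a mistyped configuration value fail early instead of placing metadata in an unexpected location.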