Commit fb45b76

Merge branch 'google-drive' of https://github.com/Matthew2357/mmore into google-drive
2 parents 6ff1e36 + c7e92c0 commit fb45b76

44 files changed

Lines changed: 1518 additions & 714 deletions

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+name: ci
+
+on:
+  push:
+  workflow_dispatch:
+
+jobs:
+  docker:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Docker meta
+        id: meta
+        uses: docker/metadata-action@v5
+        with:
+          # list of Docker images to use as base name for tags
+          images: |
+            androz2091/swiss-ai-mmore
+          # generate Docker tags based on the following events/attributes
+          tags: |
+            type=schedule
+            type=ref,event=branch
+            type=ref,event=pr
+            type=semver,pattern={{version}}
+            type=semver,pattern={{major}}.{{minor}}
+            type=semver,pattern={{major}}
+            type=sha
+      -
+        name: Set up QEMU
+        uses: docker/setup-qemu-action@v3
+      -
+        name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+      -
+        name: Login to Docker Hub
+        uses: docker/login-action@v3
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_TOKEN }}
+      -
+        name: Build and push
+        uses: docker/build-push-action@v6
+        with:
+          push: true
+          tags: ${{ steps.meta.outputs.tags }}
+          labels: ${{ steps.meta.outputs.labels }}
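
The tag rules in this workflow can be read roughly as in the Python sketch below. It is an approximation of the default behaviour of docker/metadata-action for these rule types, not the action's own code: `expand_tags` and its parameters are hypothetical names, pre-release versions are not handled, and the automatic `latest` tag and tag ordering are ignored. Since the workflow triggers only on `push` and `workflow_dispatch`, the `type=schedule` and `type=ref,event=pr` rules are assumed to yield nothing here and are left out.

```python
import re

IMAGE = "androz2091/swiss-ai-mmore"  # image name taken from the workflow


def expand_tags(ref, sha):
    """Approximate the image tags for a push or manual run (hypothetical helper)."""
    tags = []
    if ref.startswith("refs/heads/"):
        # type=ref,event=branch: the branch name, with "/" replaced by "-"
        tags.append(ref[len("refs/heads/"):].replace("/", "-"))
    match = re.fullmatch(r"refs/tags/v?(\d+)\.(\d+)\.(\d+)", ref)
    if match:
        major, minor, patch = match.groups()
        tags.append(f"{major}.{minor}.{patch}")  # pattern={{version}}
        tags.append(f"{major}.{minor}")          # pattern={{major}}.{{minor}}
        tags.append(major)                       # pattern={{major}}
    tags.append(f"sha-{sha[:7]}")  # type=sha: "sha-" plus the short commit hash
    return [f"{IMAGE}:{tag}" for tag in tags]


# The full hash below is made up; only the first seven characters matter.
print(expand_tags("refs/heads/google-drive", "fb45b76" + "0" * 33))
```

Under these assumptions, a push to a branch gets a branch tag and a `sha-` tag, while a push of a version tag such as `v1.2.3` gets the three semver tags plus the `sha-` tag.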

.github/workflows/tests.yml

Lines changed: 6 additions & 0 deletions
@@ -29,6 +29,12 @@ jobs:
           pip install -e '.[rag,dev]' # or custom setup
           pip install pytest # if not in requirements.txt
 
+      - name: Show installed cohere and langchain-cohere versions
+        run: |
+          pip show cohere || echo "Cohere not installed"
+          pip show langchain-cohere || echo "Langchain-cohere not installed"
+
+
       - name: Run tests
         run: |
           pytest
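
The same version check can also be done from inside Python, for example in a test fixture. The sketch below uses the standard-library `importlib.metadata`; the helper name `installed_version` is hypothetical and is not part of this commit.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Mirror the two `pip show ... || echo ...` lines from the workflow step.
for name in ("cohere", "langchain-cohere"):
    found = installed_version(name)
    print(f"{name}: {found if found is not None else 'not installed'}")
```

Unlike `pip show`, this reports only the version string, which is usually enough when debugging a dependency mismatch in CI.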

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -10,4 +10,4 @@ repos:
         args: [
           --fix, # auto-fix lint + style issues
           --unsafe-fixes, # allows formatting & import sorting
-        ]
+        ]

README.md

Lines changed: 6 additions & 6 deletions
@@ -43,7 +43,7 @@ To install the package simply run:
 pip install mmore
 ```
 
-> :warning: This is a big package with a lot of dependencies, so we recommend to use `uv` to handle `pip` installations. [Check our tutorial on uv](./docs/uv.md).
+> :warning: This is a big package with a lot of dependencies, so we recommend to use `uv` to handle `pip` installations. [Check our tutorial on uv](https://github.com/swiss-ai/mmore/blob/master/docs/uv.md).
 
 ### Minimal Example
 
@@ -90,22 +90,22 @@ To launch the MMORE pipeline, follow the specialised instructions in the docs.
 1. **:page_facing_up: Input Documents**
    Upload your multimodal documents (PDFs, videos, spreadsheets, and m(m)ore) into the pipeline.
 
-2. [**:mag: Process**](./docs/process.md)
+2. [**:mag: Process**](https://github.com/swiss-ai/mmore/blob/master/docs/process.md)
    Extracts and standardizes text, metadata, and multimedia content from diverse file formats. Easily extensible! You can add your own processors to handle new file types.
    *Supports fast processing for specific types.*
 
-3. [**:file_folder: Index**](./docs/index.md)
-   Organizes extracted data into a **hybrid retrieval-ready Vector Store DB**, combining dense and sparse indexing through [Milvus](https://milvus.io/). Your vector DB can also be remotely hosted and then you only have to provide a standard API. There is also an [HTTP Index API](./docs/index_api.md) for adding new files on the fly with HTTP requests.
+3. [**:file_folder: Index**](https://github.com/swiss-ai/mmore/blob/master/docs/index.md)
+   Organizes extracted data into a **hybrid retrieval-ready Vector Store DB**, combining dense and sparse indexing through [Milvus](https://milvus.io/). Your vector DB can also be remotely hosted and then you only have to provide a standard API. There is also an [HTTP Index API](https://github.com/swiss-ai/mmore/blob/master/docs/index_api.md) for adding new files on the fly with HTTP requests.
 
-4. [**:robot: RAG**](./docs/rag.md)
+4. [**:robot: RAG**](https://github.com/swiss-ai/mmore/blob/master/docs/rag.md)
    Use the indexed documents inside a **Retrieval-Augmented Generation (RAG) system** that provides a [LangChain](https://www.langchain.com/) interface. Plug in any LLM with a compatible interface or add new ones through an easy-to-use interface.
    *Supports API hosting or local inference.*
 
 5. **:tada: Evaluation**
    *Coming soon*
    An easy way to evaluate the performance of your RAG system using Ragas.
 
-See [the `/docs` directory](./docs) for additional details on each modules and hands-on tutorials on parts of the pipeline.
+See [the `/docs` directory](https://github.com/swiss-ai/mmore/blob/master/docs) for additional details on each modules and hands-on tutorials on parts of the pipeline.
 
 
 #### :construction: Supported File Types

docs/index_api.md

Lines changed: 1 addition & 1 deletion
@@ -173,7 +173,7 @@ Returns the file with binary content.
 - File types supported:
 
 ```
-.pdf, .docx, .pptx, .md, .txt, .xlsx, .xls, .csv, .mp4, .avi, .mov, .mkv, .mp3, .wav, .aac, .eml, .html
+.pdf, .docx, .pptx, .md, .txt, .xlsx, .xls, .csv, .mp4, .avi, .mov, .mkv, .mp3, .wav, .aac, .eml, .html, .htm
 ```
 
 

examples/index/config.yaml

Lines changed: 1 addition & 1 deletion
@@ -9,4 +9,4 @@ indexer:
     uri: ./proc_demo.db
     name: my_db
 collection_name: my_docs
-documents_path: 'examples/process/outputs/merged/final_pp.jsonl'
+documents_path: 'examples/postprocessor/outputs/merged/final_pp.jsonl'

examples/postprocessor/config.yaml

Lines changed: 19 additions & 3 deletions
@@ -1,7 +1,23 @@
 pp_modules:
-  - type: chunker
+  - type: file_namer
+  - type: chunker
     args:
       chunking_strategy: sentence
+  - type: translator
+    args:
+      target_language: en
+      attachment_tag: <attachment>
+      confidence_threshold: 0.7
+      constrained_languages:
+        - fr
+        - en
+  - type: metafuse
+    args:
+      metadata_keys:
+        - file_name
+      content_template: Content from {file_name}
+      position: beginning
+
 output:
-  output_path: examples/process/outputs/merged/
-  save_each_step: True
+  output_path: examples/postprocessor/outputs/merged/
+  save_each_step: True
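
One plausible reading of the new `metafuse` entry is that it renders `content_template` with the listed metadata values and attaches the result to each sample's text at the given `position`. The sketch below illustrates that reading only. The function `fuse_metadata`, the newline separator, and the handling of any `position` other than `beginning` are assumptions, not mmore's implementation.

```python
def fuse_metadata(text, metadata, metadata_keys, content_template, position="beginning"):
    """Attach a rendered metadata line to `text` (hypothetical helper)."""
    # Only the configured keys are exposed to the template; missing keys become "".
    values = {key: metadata.get(key, "") for key in metadata_keys}
    header = content_template.format(**values)
    if position == "beginning":
        return f"{header}\n{text}"
    # Assumption: any other position appends the rendered line at the end.
    return f"{text}\n{header}"


fused = fuse_metadata(
    text="Quarterly results improved.",
    metadata={"file_name": "report.pdf"},
    metadata_keys=["file_name"],
    content_template="Content from {file_name}",
)
print(fused)
```

If this reading is right, `file_namer` has to run before `metafuse` so that `file_name` is present in the metadata, which matches the module order in the config.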

examples/rag/config.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 rag:
   llm:
     llm_name: OpenMeditron/meditron3-8b
-    max_new_tokens: 250
+    max_new_tokens: 1200
   retriever:
     db:
       uri: ./proc_demo.db

examples/rag/config_api.yaml

Lines changed: 2 additions & 2 deletions
@@ -2,8 +2,8 @@
 rag:
   # LLM Config
   llm:
-    llm_name: "TinyLlama/TinyLlama-1.1B-Chat-v1.0" # "epfl-llm/meditron-70b" # "gpt-4o-mini" # Anything supported
-    max_new_tokens: 100
+    llm_name: Qwen/Qwen3-8B # "epfl-llm/meditron-70b" # "gpt-4o-mini" # Anything supported
+    max_new_tokens: 1200
     temperature: 0.8
   # Retriever Config
   retriever:

examples/retriever_api/config.yaml

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+db:
+  uri: ./proc_demo.db
+  name: my_db
+hybrid_search_weight: 0.5
+k: 5
+collection_name: my_docs
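
To give a feel for what `hybrid_search_weight` and `k` might control, the sketch below blends dense and sparse relevance scores with a single weight and keeps the `k` best documents. It is an illustration under stated assumptions, not mmore's retriever: `hybrid_top_k` is a hypothetical helper, and whether the weight favours the dense or the sparse side in mmore is not stated in this commit (the sketch applies it to the dense score).

```python
def hybrid_top_k(dense_scores, sparse_scores, weight=0.5, k=5):
    """Rank document ids by a weighted sum of dense and sparse scores (sketch)."""
    doc_ids = set(dense_scores) | set(sparse_scores)
    combined = {
        doc_id: weight * dense_scores.get(doc_id, 0.0)
        + (1.0 - weight) * sparse_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest combined score first; ties are broken by document id.
    return sorted(combined, key=lambda doc_id: (-combined[doc_id], doc_id))[:k]


dense = {"a": 1.0, "b": 0.0, "c": 0.5}
sparse = {"a": 0.0, "b": 0.25, "c": 0.5}
ranking = hybrid_top_k(dense, sparse, weight=0.5, k=2)
print(ranking)
```

With `weight=0.5`, as in the config, both signals count equally. Moving the weight toward 0 or 1 would reduce the ranking to a single signal.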
