
Commit 57b9730

Merge branch 'cherry2' of github.com:fabnemEPFL/mmore into cherry2
2 parents 34a2343 + d34f54f commit 57b9730

21 files changed

Lines changed: 263 additions & 63 deletions

.github/dependabot.yml

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# https://docs.github.com/en/code-security/dependabot/working-with-dependabot/dependabot-options-reference#package-ecosystem-
+version: 2
+updates:
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      interval: "monthly"
+  - package-ecosystem: "pip"
+    directory: "/"
+    schedule:
+      interval: "weekly"

.github/workflows/publish.yml

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+name: 📦 Publish Python Package
+on:
+  release:
+    types: [published]
+permissions:
+  contents: read
+jobs:
+  release-build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.x"
+      - name: Build release distributions
+        run: |
+          python -m pip install build
+          python -m build
+      - name: Upload distributions
+        uses: actions/upload-artifact@v4
+        with:
+          name: release-dists
+          path: dist/
+  pypi-publish:
+    runs-on: ubuntu-latest
+    needs:
+      - release-build
+    permissions:
+      id-token: write
+    # Dedicated environments with protections for publishing are strongly recommended.
+    environment:
+      name: pypi
+      url: https://pypi.org/p/mmore
+    steps:
+      - name: Retrieve release distributions
+        uses: actions/download-artifact@v4
+        with:
+          name: release-dists
+          path: dist/
+      - name: Publish release distributions to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1

.github/workflows/ruff.yml

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+name: 🧹 Ruff linter checks
+on:
+  push:
+    branches:
+      - master
+  pull_request:
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/ruff-action@v3
+      - run: ruff check
+      - run: ruff format --check

.github/workflows/tests.yml

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+name: 🧪 PyTest unit tests
+
+on:
+  push:
+    branches:
+      - master
+  pull_request:
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e '.[rag,dev]' # or custom setup
+          pip install pytest # if not in requirements.txt
+
+      - name: Run tests
+        run: |
+          pytest
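
Note on the `strategy.matrix` block in tests.yml: GitHub Actions runs the `test` job once per matrix combination, so this workflow should produce three jobs, one per Python version. The sketch below mimics that expansion in plain Python. `expand_matrix` is a hypothetical helper written for this note; it is not part of GitHub Actions or of mmore.

```python
from itertools import product


def expand_matrix(matrix: dict[str, list[str]]) -> list[dict[str, str]]:
    """Return one dict per combination of matrix values (hypothetical helper)."""
    keys = list(matrix)
    # The Cartesian product over all value lists gives every job variant.
    return [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]


# Matrix copied from tests.yml in this commit.
jobs = expand_matrix({"python-version": ["3.10", "3.11", "3.12"]})
for job in jobs:
    print(f"test (python-version={job['python-version']})")
```

The versions are quoted in the workflow for a reason: an unquoted `3.10` in YAML is read as the number 3.1, which would select the wrong interpreter.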

.pre-commit-config.yaml

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+repos:
+  - repo: https://github.com/PyCQA/isort
+    rev: 6.0.1
+    hooks:
+      - id: isort
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.11.13
+    hooks:
+      - id: ruff
+        args: [
+          --fix, # auto-fix lint + style issues
+          --unsafe-fixes, # allows formatting & import sorting
+        ]

Dockerfile

Lines changed: 11 additions & 4 deletions
@@ -32,12 +32,17 @@ RUN apt-get update && \
 dpkg-reconfigure --frontend noninteractive tzdata && \
 apt-get clean && rm -rf /var/lib/apt/lists/*
 
-# Copy the project into the image
-ADD . /app
+# Create a non-root user
+RUN useradd -m -u 1000 mmoreuser
 
-# Sync the project into a new environment, using the frozen lockfile
+# Set up working directory and permissions
+RUN mkdir -p /app && chown -R mmoreuser:mmoreuser /app
 WORKDIR /app
 
+# Copy the project into the image and set ownership
+ADD . /app
+RUN chown -R mmoreuser:mmoreuser /app
+
 # Define the build argument with a default value of an empty string (optional)
 COPY pyproject.toml ./
 # RUN uv sync --frozen ${UV_ARGUMENTS}
@@ -51,5 +56,7 @@ RUN uv pip install -e . --system
 
 ENV DASK_DISTRIBUTED__WORKER__DAEMON=False
 
-ENTRYPOINT /bin/bash
+USER mmoreuser
+
+ENTRYPOINT ["/bin/bash"]
 
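
Note on the last Dockerfile hunk: it switches `ENTRYPOINT` from shell form to exec form. According to the Dockerfile reference, the shell form is run through `/bin/sh -c`, while the exec form is executed directly, so the entrypoint becomes PID 1 and receives signals and any arguments passed to `docker run`. The sketch below is a simplified model of that difference. `entrypoint_argv` is a hypothetical helper written for this note, not a Docker API.

```python
import json


def entrypoint_argv(instruction: str) -> list[str]:
    """Return the argv Docker would start for an ENTRYPOINT line (simplified model)."""
    value = instruction.removeprefix("ENTRYPOINT").strip()
    if value.startswith("["):
        # Exec form: a JSON array, executed directly without a shell.
        return json.loads(value)
    # Shell form: the string is handed to /bin/sh -c (the default shell on Linux images).
    return ["/bin/sh", "-c", value]


print(entrypoint_argv("ENTRYPOINT /bin/bash"))      # line removed by this commit
print(entrypoint_argv('ENTRYPOINT ["/bin/bash"]'))  # line added by this commit
```

For an interactive `bash` entrypoint the visible behavior is similar either way; the exec form mainly removes the extra `sh` wrapper process.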

README.md

Lines changed: 28 additions & 10 deletions
@@ -1,4 +1,4 @@
-<h1 align="center">
+<h1 align="center">
 
 ![image](https://github.com/user-attachments/assets/502e2c7e-1200-498a-9ebd-10a27ed48ab6)
 
@@ -33,12 +33,14 @@ sudo apt install -y ffmpeg libsm6 libxext6 chromium-browser libnss3 \
 libpango-1.0-0 libpangoft2-1.0-0 weasyprint
 ```
 
+:warning: **On Ubuntu 24.04, replace `libasound2` with `libasound2t64`. You may also need to add the repository for Ubuntu 20.04 focal to have access to a few of the sources (e.g. create `/etc/apt/sources.list.d/mmore.list` with the contents `deb http://cz.archive.ubuntu.com/ubuntu focal main universe`).**
+
 #### Step 1 – Install MMORE
 
 To install the package simply run:
 
 ```bash
-pip install -e .
+pip install mmore
 ```
 
 > :warning: This is a big package with a lot of dependencies, so we recommend to use `uv` to handle `pip` installations. [Check our tutorial on uv](./docs/uv.md).
@@ -62,7 +64,7 @@ python -m mmore rag --config-file examples/rag/config.yaml
 You can also use our package in python code as shown here:
 
 ```python
-from mmore.process.processors.pdf_processor import PDFProcessor
+from mmore.process.processors.pdf_processor import PDFProcessor
 from mmore.process.processors.base import ProcessorConfig
 from mmore.type import MultimodalSample
 
@@ -85,28 +87,28 @@ To launch the MMORE pipeline, follow the specialised instructions in the docs.
 ![The MMORE pipelines archicture](https://github.com/user-attachments/assets/0cd61466-1680-43ed-9d55-7bd483a04a09)
 
 
-1. **:page_facing_up: Input Documents**
+1. **:page_facing_up: Input Documents**
 Upload your multimodal documents (PDFs, videos, spreadsheets, and m(m)ore) into the pipeline.
 
-2. [**:mag: Process**](./docs/process.md)
+2. [**:mag: Process**](./docs/process.md)
 Extracts and standardizes text, metadata, and multimedia content from diverse file formats. Easily extensible! You can add your own processors to handle new file types.
 *Supports fast processing for specific types.*
 
-3. [**:file_folder: Index**](./docs/index.md)
+3. [**:file_folder: Index**](./docs/index.md)
 Organizes extracted data into a **hybrid retrieval-ready Vector Store DB**, combining dense and sparse indexing through [Milvus](https://milvus.io/). Your vector DB can also be remotely hosted and then you only have to provide a standard API. There is also an [HTTP Index API](./docs/index_api.md) for adding new files on the fly with HTTP requests.
 
-4. [**:robot: RAG**](./docs/rag.md)
+4. [**:robot: RAG**](./docs/rag.md)
 Use the indexed documents inside a **Retrieval-Augmented Generation (RAG) system** that provides a [LangChain](https://www.langchain.com/) interface. Plug in any LLM with a compatible interface or add new ones through an easy-to-use interface.
 *Supports API hosting or local inference.*
 
-5. **:tada: Evaluation**
+5. **:tada: Evaluation**
 *Coming soon*
 An easy way to evaluate the performance of your RAG system using Ragas.
 
 See [the `/docs` directory](./docs) for additional details on each modules and hands-on tutorials on parts of the pipeline.
 
 
-#### :construction: Supported File Types
+#### :construction: Supported File Types
 
 | **Category** | **File Types** | **Supported Device** | **Fast Mode** |
 |--------------------|------------------------------------------|--------------------------| --------------------------|
@@ -123,9 +125,25 @@ We welcome contributions to improve the current state of the pipeline, feel free
 - Open an issue to report a bug or ask for a new feature
 - Open a pull request to fix a bug or add a new feature
 - You can find ongoing new features and bugs in the [Issues]
-
+
 Don't hesitate to star the project :star: if you find it interesting! (you would be our star).
 
+### To make sure your code is pretty, this repo has a `pre-commit` configuration file that runs linters (`isort`, `black`)
+
+1. Install pre-commit if you haven't already
+
+`pip install pre-commit`
+
+2. Set up the git hook scripts
+
+`pre-commit install`
+
+3. Run the checks manually (optional but good before first commit)
+
+`pre-commit run --all-files`
+
+We also use `pyright` to type-check the code base, please make sure your Pull Requests are type-checked.
+
 ## License
 
 This project is licensed under the Apache 2.0 License, see the [LICENSE :mortar_board:](LICENSE) file for details.

docs/installation.md

Lines changed: 33 additions & 1 deletion
@@ -80,7 +80,39 @@ sudo docker build --build-arg PLATFORM=cpu -t mmore .
 ##### Step 3: Start an interactive session
 
 ```bash
-sudo docker run -it -v ./examples:/app/examples mmore
+sudo docker run --gpus all -it -v ./examples:/app/examples -v ./.cache:/mmoreuser/.cache mmore
 ```
 
+For CPU-only platforms:
+```bash
+sudo docker run -it -v ./examples:/app/examples -v ./.cache:/mmoreuser/.cache mmore
+```
+
+> [!WARNING]
+> You may need the Nvidia toolkit so the containers can access your GPUs.
+> Read [this tutorial](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) if something breaks here!
+>
+> Configure the production repository:
+>
+> ```sh
+> curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
+> && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
+> sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
+> sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
+> ```
+>
+> ```sh
+> sudo apt update
+> sudo apt install -y nvidia-container-toolkit
+> ```
+>
+> Modify the Docker daemon to use Nvidia:
+>
+> ```sh
+> sudo nvidia-ctk runtime configure --runtime=docker
+> sudo systemctl restart docker
+> ```
+>
+> You can now use `docker run --gpus all`!
+
 *Note:* The `examples` folder is mapped to `/app/examples` inside the container, corresponding to the default path in `examples/process/config.yaml`.
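
Note on the two `docker run` variants in docs/installation.md: they differ only in the `--gpus all` flag. A minimal sketch of assembling either command from one place, for example in a launcher script, follows. `docker_run_command` is a hypothetical helper written for this note; the mount paths are copied verbatim from the diff.

```python
import shlex


def docker_run_command(gpu: bool, image: str = "mmore") -> list[str]:
    """Assemble the `docker run` argv shown in docs/installation.md (hypothetical helper)."""
    cmd = ["sudo", "docker", "run"]
    if gpu:
        # Needs the NVIDIA Container Toolkit installed on the host.
        cmd += ["--gpus", "all"]
    cmd += [
        "-it",
        "-v", "./examples:/app/examples",
        "-v", "./.cache:/mmoreuser/.cache",  # mount target as written in the docs
        image,
    ]
    return cmd


print(shlex.join(docker_run_command(gpu=True)))
print(shlex.join(docker_run_command(gpu=False)))
```

The Dockerfile in this commit creates the user with `useradd -m`, which by default places the home directory under `/home/mmoreuser`. This diff does not confirm that `/mmoreuser/.cache` is the directory the tools inside the container actually read, so the mount target is worth verifying before relying on it.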
