Commit 78b7039

Merge branch 'feat/document-citations' of https://github.com/rachedkko/mmore into feat/document-citations
# Conflicts:
#	src/mmore/run_index_api.py
#	src/mmore/run_retriever.py
2 parents b687a9d + 6565253 commit 78b7039

111 files changed

Lines changed: 17569 additions & 5605 deletions

.dockerignore

Lines changed: 1 addition & 2 deletions
@@ -1,3 +1,2 @@
 .venv
-tests
-test_data
+.git
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+{
+  "include": ["../../src/mmore/colvision", "../../src/mmore/tui"],
+  "exclude": ["../../build", "../../dist", "**/__pycache__", "*.egg-info"],
+  "executionEnvironments": [
+    {
+      "root": "../../src/mmore/colvision",
+      "extraPaths": ["../../src"]
+    },
+    {
+      "root": "../../src/mmore/tui",
+      "extraPaths": ["../../src"]
+    }
+  ],
+  "pythonVersion": "3.11",
+  "typeCheckingMode": "standard"
+}
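
The relative paths in this config only make sense relative to the directory that holds the file. The sketch below is a rough illustration, assuming this is the `.github/pyright/pyrightconfig.colvision.json` file that the Pyright workflow passes to `--project`, and assuming Pyright resolves relative `include`, `root`, and `extraPaths` entries against the config file's directory. `CONFIG_DIR` and `resolve_from_config` are illustrative names, not part of the project.

```python
import posixpath

# Assumption: the config lives in the directory named in the workflow's --project argument.
CONFIG_DIR = ".github/pyright"


def resolve_from_config(entry: str, config_dir: str = CONFIG_DIR) -> str:
    """Return the repository-relative path that a relative config entry points to."""
    return posixpath.normpath(posixpath.join(config_dir, entry))


if __name__ == "__main__":
    # Entries copied from the config's "include" and "extraPaths" fields.
    for entry in ("../../src/mmore/colvision", "../../src/mmore/tui", "../../src"):
        print(f"{entry} -> {resolve_from_config(entry)}")
```

Under those assumptions, each `../../` prefix climbs out of `.github/pyright/` back to the repository root, so the checked packages are `src/mmore/colvision` and `src/mmore/tui`, with `src` as the extra import root.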

.github/workflows/pyright.yml

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
name: 📐 Pyright type checks
2+
on:
3+
push:
4+
branches:
5+
- master
6+
pull_request:
7+
workflow_dispatch:
8+
9+
jobs:
10+
pyright:
11+
name: "Pyright"
12+
runs-on: ubuntu-latest
13+
steps:
14+
- uses: actions/checkout@v6
15+
16+
- name: Set up Python 3.11
17+
uses: actions/setup-python@v6
18+
with:
19+
python-version: "3.11"
20+
21+
- name: Install uv
22+
run: pipx install uv
23+
24+
- name: Install base dependencies
25+
run: uv venv .venv && source .venv/bin/activate && uv pip install -e ".[process,index,rag,api,tui,cpu,dev,websearch]"
26+
27+
- name: Run Pyright (base)
28+
continue-on-error: true
29+
run: source .venv/bin/activate && pyright --project pyrightconfig.json
30+
31+
- name: Install colvision dependencies
32+
run: uv venv .venv-colvision && source .venv-colvision/bin/activate && uv pip install -e ".[colvision,tui,cpu,dev]"
33+
34+
- name: Run Pyright (colvision)
35+
continue-on-error: true
36+
run: source .venv-colvision/bin/activate && pyright --project .github/pyright/pyrightconfig.colvision.json

.github/workflows/tests.yml

Lines changed: 19 additions & 6 deletions
@@ -11,25 +11,27 @@ jobs:
     runs-on: ubuntu-latest
 
     strategy:
+      fail-fast: false
       matrix:
         python-version: ["3.11", "3.12"]
 
+    name: test (py${{ matrix.python-version }})
+
     steps:
       - name: Checkout code
         uses: actions/checkout@v6
 
-      - name: Install uv and create venv
-        run: |
-          pipx install uv
-          uv venv .venv
+      - name: Install uv
+        run: pipx install uv
 
       - name: Set up Python ${{ matrix.python-version }}
         uses: actions/setup-python@v6
         with:
           python-version: ${{ matrix.python-version }}
 
-      - name: Install dependencies (using uv)
+      - name: Install dependencies - process (using uv)
         run: |
+          uv venv .venv
           source .venv/bin/activate
           uv pip install -e ".[process,index,rag,api,cpu,dev,websearch]"
 
@@ -39,7 +41,18 @@ jobs:
           uv pip show cohere || echo "Cohere not installed"
           uv pip show langchain-cohere || echo "Langchain-cohere not installed"
 
-      - name: Run tests
+      - name: Run tests - process
         run: |
           source .venv/bin/activate
           pytest
+
+      - name: Install dependencies - colvision
+        run: |
+          uv venv .venv-colvision
+          source .venv-colvision/bin/activate
+          uv pip install -e ".[colvision,cpu,dev]"
+
+      - name: Run tests - colvision
+        run: |
+          source .venv-colvision/bin/activate
+          pytest tests/test_colvision.py

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -114,6 +114,7 @@ venv.bak/
 # Milvus DB
 db/
 *.db
+*.db.lock
 
 # Project files
 tmp/

README.md

Lines changed: 20 additions & 2 deletions
@@ -16,7 +16,7 @@
 
 MMORE is an open-source, end-to-end pipeline to ingest, process, index, and retrieve knowledge from heterogeneous files: PDFs, Office docs, spreadsheets, emails, images, audio, video, and web pages. It standardizes content into a unified multimodal format, supports distributed CPU/GPU processing, and provides hybrid dense+sparse retrieval with an integrated RAG service (CLI, APIs).
 
-👉 Read the paper for more details (OpenReview): [MMORE: Massive Multimodal Open RAG & Extraction](https://openreview.net/forum?id=6j1HjfIdKn)
+👉 Read the paper for more details (arXiv): [MMORE: Massive Multimodal Open RAG & Extraction](https://arxiv.org/abs/2509.11937)
 
 
 ### Documentation
@@ -66,6 +66,8 @@ brew install cairo pango gdk-pixbuf libffi
 uv pip install weasyprint
 ```
 
+You can also run MMORE on Windows by following our [Windows setup notes](docs/source/getting_started/windows.md).
+
 #### Step 1 – Install MMORE
 
 Dependencies are split by pipeline stage. Install only what you need:
@@ -103,6 +105,22 @@ uv pip install "mmore[process,cpu]"
 
 > :warning: **Check the instructions for contributors directly at [`docs/for_devs.md`](./docs/for_devs.md)**
 
+### Interactive TUI
+
+Prefer a guided experience over editing YAML by hand? Install the `tui` extra and launch the interactive Terminal UI:
+
+```bash
+uv sync --extra tui
+mmore tui
+```
+
+From the launcher you can:
+
+- run any stage (process / postprocess / index / rag / chat) interactively,
+- chain the full pipeline (process → postprocess → index → chat),
+- generate stage YAML configs through a guided wizard,
+- pick from existing example configs without leaving the terminal.
+
 ### Minimal Example
 
 You can use our predefined CLI commands to execute parts of the pipeline. Note that you might need to prepend `python -m` to the command if the package does not properly create bash aliases.
@@ -142,7 +160,7 @@ MultimodalSample.to_jsonl(out_file, result_pdf)
 
 To launch the MMORE pipeline, follow the specialised instructions in the docs.
 
-![The MMORE pipelines architecture](https://github.com/user-attachments/assets/0cd61466-1680-43ed-9d55-7bd483a04a09)
+![The MMORE pipelines architecture](docs/source/doc_images/pipeline_mmore+.png)
 
 
 1. **:page_facing_up: Input Documents**

docker/arch/Dockerfile

Lines changed: 18 additions & 13 deletions
@@ -3,17 +3,15 @@ ARG UV_OVERRIDE
 
 # --- Base images ---
 FROM archlinux:latest AS gpu
-ENV UV_ARGUMENTS="--extra all,cu126"
-RUN echo "Using GPU image"
+ENV SYNC_ARGS="--extra all --extra cu126"
 
 FROM archlinux:latest AS cpu
-ENV UV_ARGUMENTS="--extra all,cpu"
-RUN echo "Using CPU-only image"
+ENV SYNC_ARGS="--extra all --extra cpu"
 
 # --- Build ---
 FROM ${DEVICE:-gpu} AS build
 ARG UV_OVERRIDE
-ENV UV_ARGUMENTS=${UV_OVERRIDE:-$UV_ARGUMENTS}
+ENV SYNC_ARGS=${UV_OVERRIDE:-$SYNC_ARGS}
 RUN echo "DisableSandbox" >> /etc/pacman.conf
 
 ARG USER_UID=1000
@@ -22,15 +20,22 @@ ARG USER_GID=1000
 # Install system dependencies
 RUN pacman -Syu --noconfirm && \
     pacman -S --noconfirm --needed \
-    base-devel python uv \
-    tzdata curl ffmpeg \
+    base-devel \
+    tzdata curl git ffmpeg \
     libsm libxext nss \
     libxi libxrandr libxcomposite libxcursor libxdamage libxfixes libxrender \
     alsa-lib atk gtk3 libreoffice-fresh libjpeg-turbo pango && \
     ln -fs /usr/share/zoneinfo/Europe/Zurich /etc/localtime && \
     echo "Europe/Zurich" > /etc/timezone && \
     pacman -Sc --noconfirm
 
+# Install uv
+COPY --from=ghcr.io/astral-sh/uv:0.7.13 /uv /uvx /bin/
+
+ENV UV_PYTHON=3.11 \
+    UV_PROJECT_ENVIRONMENT=/app/.venv \
+    UV_LINK_MODE=copy
+
 # Create non-root user
 RUN groupadd --gid ${USER_GID} mmoreuser \
     && useradd --uid ${USER_UID} --gid ${USER_GID} -m mmoreuser
@@ -39,14 +44,14 @@ WORKDIR /app
 RUN chown mmoreuser:mmoreuser /app
 USER mmoreuser
 
-# Set up Python virtual environment
-ENV VIRTUAL_ENV=/app/.venv
-RUN uv venv --python 3.11 .venv \
-    && uv pip install --no-cache weasyprint
-
 # Install dependencies (cached unless uv.lock changes)
 COPY --chown=mmoreuser:mmoreuser pyproject.toml uv.lock /app/
-RUN uv sync --frozen --no-install-project ${UV_ARGUMENTS}
+RUN --mount=type=cache,target=/home/mmoreuser/.cache/uv,uid=${USER_UID},gid=${USER_GID} \
+    uv sync --frozen --no-install-project ${SYNC_ARGS}
+
+# weasyprint needs to be installed using pip
+RUN --mount=type=cache,target=/home/mmoreuser/.cache/uv,uid=${USER_UID},gid=${USER_GID} \
+    uv pip install weasyprint
 
 # Install mmore from local source code
 COPY --chown=mmoreuser:mmoreuser . /app
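
Two things change in how the extras reach `uv sync` here: the comma-joined `--extra all,cu126` becomes repeated `--extra` flags, and the resolved value is stored in `SYNC_ARGS` via `${UV_OVERRIDE:-$SYNC_ARGS}`. The Python sketch below only models that logic for illustration; it assumes Docker's `:-` substitution follows POSIX shell semantics (fall back when the variable is unset or empty). `extras_to_args` and `resolve_sync_args` are hypothetical helper names, not part of the project or of uv.

```python
from typing import List, Optional


def extras_to_args(extras: List[str]) -> List[str]:
    """Expand extra names into repeated `--extra NAME` flags, one pair per extra."""
    args: List[str] = []
    for name in extras:
        args += ["--extra", name]
    return args


def resolve_sync_args(override: Optional[str], default: str) -> str:
    """Model `${UV_OVERRIDE:-$SYNC_ARGS}`: keep the default unless an override is given.

    Assumption: an unset (None) or empty override both fall back to the default.
    """
    return override if override else default


if __name__ == "__main__":
    # Default for the CPU base stage, built from the extras named in the Dockerfile.
    cpu_default = " ".join(extras_to_args(["all", "cpu"]))
    print(resolve_sync_args(None, cpu_default))
    # Override value taken from the README example for custom extras.
    print(resolve_sync_args("--extra process --extra rag --extra cpu", cpu_default))
```

The point of the fallback is that the `gpu` and `cpu` base stages each carry their own default, while a single `UV_OVERRIDE` build argument can replace either one without editing the Dockerfile.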

docker/arch/README.md

Lines changed: 2 additions & 2 deletions
@@ -19,9 +19,9 @@ CPU-only:
 sudo docker build -f docker/arch/Dockerfile --build-arg DEVICE=cpu -t mmore:arch-cpu .
 ```
 
-Custom extras (overrides the default `--extra all,cu126` or `--extra all,cpu`):
+Custom extras (overrides the default `--extra all --extra cu126` or `--extra all --extra cpu`):
 ```bash
-sudo docker build -f docker/arch/Dockerfile --build-arg UV_OVERRIDE="--extra all,cu126" -t mmore:arch .
+sudo docker build -f docker/arch/Dockerfile --build-arg UV_OVERRIDE="--extra process --extra rag --extra cpu" -t mmore:arch .
 ```
 
 Custom user UID/GID (e.g. for RCP):

docker/leap/Dockerfile

Lines changed: 18 additions & 13 deletions
@@ -3,43 +3,48 @@ ARG UV_OVERRIDE
 
 # --- Base images ---
 FROM opensuse/leap:15.6 AS gpu
-ENV UV_ARGUMENTS="--extra all,cu126"
-RUN echo "Using GPU image"
+ENV SYNC_ARGS="--extra all --extra cu126"
 
 FROM opensuse/leap:15.6 AS cpu
-ENV UV_ARGUMENTS="--extra all,cpu"
-RUN echo "Using CPU-only image"
+ENV SYNC_ARGS="--extra all --extra cpu"
 
 # --- Build ---
 FROM ${DEVICE:-gpu} AS build
 ARG UV_OVERRIDE
-ENV UV_ARGUMENTS=${UV_OVERRIDE:-$UV_ARGUMENTS}
+ENV SYNC_ARGS=${UV_OVERRIDE:-$SYNC_ARGS}
 
 # Install system dependencies
 RUN zypper --non-interactive refresh && \
     zypper --non-interactive install -y \
-    python311 python311-pip \
-    nano curl ffmpeg \
+    nano curl git ffmpeg \
     libSM6 libXext6 mozilla-nss \
     libXi6 libXrandr2 libXcomposite1 libXcursor1 libXdamage1 libXfixes3 libXrender1 \
     libasound2 libatk-1_0-0 gtk3 libreoffice libjpeg8-devel libpango-1_0-0 && \
     ln -fs /usr/share/zoneinfo/Europe/Zurich /etc/localtime && \
     echo "Europe/Zurich" > /etc/timezone && \
     zypper clean --all
 
-WORKDIR /app
+# Install uv
+COPY --from=ghcr.io/astral-sh/uv:0.7.13 /uv /uvx /bin/
+
+ENV UV_PYTHON=3.11 \
+    UV_PROJECT_ENVIRONMENT=/app/.venv \
+    UV_LINK_MODE=copy
 
-# Set up Python virtual environment
-RUN python3.11 -m venv .venv \
-    && .venv/bin/pip install --no-cache-dir uv weasyprint
+WORKDIR /app
 
 # Install dependencies (cached unless uv.lock changes)
 COPY pyproject.toml uv.lock /app/
-RUN .venv/bin/uv sync --frozen --no-install-project ${UV_ARGUMENTS}
+RUN --mount=type=cache,target=/root/.cache/uv \
+    uv sync --frozen --no-install-project ${SYNC_ARGS}
+
+# weasyprint needs to be installed using pip
+RUN --mount=type=cache,target=/root/.cache/uv \
+    uv pip install weasyprint
 
 # Install mmore from local source code
 COPY . /app
-RUN .venv/bin/uv pip install --no-cache --no-deps -e .
+RUN uv pip install --no-cache --no-deps -e .
 
 # --- Runtime ---
 ENV PATH="/app/.venv/bin:$PATH"

docker/leap/README.md

Lines changed: 2 additions & 2 deletions
@@ -19,9 +19,9 @@ CPU-only:
 sudo docker build -f docker/leap/Dockerfile --build-arg DEVICE=cpu -t mmore:leap-cpu .
 ```
 
-Custom extras (overrides the default `--extra all,cu126` or `--extra all,cpu`):
+Custom extras (overrides the default `--extra all --extra cu126` or `--extra all --extra cpu`):
 ```bash
-sudo docker build -f docker/leap/Dockerfile --build-arg UV_OVERRIDE="--extra all,cu126" -t mmore:leap .
+sudo docker build -f docker/leap/Dockerfile --build-arg UV_OVERRIDE="--extra process --extra rag --extra cpu" -t mmore:leap .
 ```
 
 ## Run