
Commit 7794813

committed
Merge remote-tracking branch 'origin/v2' into feat/agent-framework-integration
2 parents 5a10969 + 04d6eea commit 7794813

24 files changed

Lines changed: 11081 additions & 4946 deletions

.github/workflows/pyright.yml

Lines changed: 12 additions & 0 deletions
@@ -32,3 +32,15 @@ jobs:
3232
run: |
3333
source .venv/bin/activate
3434
pyright
35+
36+
- name: Install dependencies - colvision
37+
run: |
38+
uv venv .venv-colvision
39+
source .venv-colvision/bin/activate
40+
uv pip install -e ".[colvision,cpu,dev]"
41+
42+
- name: Run Pyright - colvision
43+
continue-on-error: true
44+
run: |
45+
source .venv-colvision/bin/activate
46+
pyright src/mmore/colvision

.github/workflows/tests.yml

Lines changed: 20 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -11,25 +11,27 @@ jobs:
1111
runs-on: ubuntu-latest
1212

1313
strategy:
14+
fail-fast: false
1415
matrix:
1516
python-version: ["3.11", "3.12"]
1617

18+
name: test (py${{ matrix.python-version }})
19+
1720
steps:
1821
- name: Checkout code
1922
uses: actions/checkout@v6
2023

21-
- name: Install uv and create venv
22-
run: |
23-
pipx install uv
24-
uv venv .venv
24+
- name: Install uv
25+
run: pipx install uv
2526

2627
- name: Set up Python ${{ matrix.python-version }}
2728
uses: actions/setup-python@v6
2829
with:
2930
python-version: ${{ matrix.python-version }}
3031

31-
- name: Install dependencies (using uv)
32+
- name: Install dependencies - process (using uv)
3233
run: |
34+
uv venv .venv
3335
source .venv/bin/activate
3436
uv pip install -e ".[process,index,rag,api,cpu,dev,websearch,privacy]"
3537
@@ -39,7 +41,18 @@ jobs:
3941
uv pip show cohere || echo "Cohere not installed"
4042
uv pip show langchain-cohere || echo "Langchain-cohere not installed"
4143
42-
- name: Run tests
44+
- name: Run tests - process
4345
run: |
4446
source .venv/bin/activate
45-
pytest
47+
pytest --ignore=tests/test_colvision.py
48+
49+
- name: Install dependencies - colvision
50+
run: |
51+
uv venv .venv-colvision
52+
source .venv-colvision/bin/activate
53+
uv pip install -e ".[colvision,cpu,dev]"
54+
55+
- name: Run tests - colvision
56+
run: |
57+
source .venv-colvision/bin/activate
58+
pytest tests/test_colvision.py

.gitignore

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -131,4 +131,5 @@ test*.sh
131131
examples/outputs
132132
outputs/
133133

134-
paper/
134+
paper/
135+
chitchat/
Lines changed: 76 additions & 64 deletions
@@ -1,10 +1,49 @@
1-
# 🖼️ ColPali Integration
1+
# 🖼️ ColVision Integration
22

3-
## Overview
3+
PDF retrieval pipeline using ColVision embeddings, stored in Milvus.
44

5-
This module provides a complete pipeline for processing PDF documents with ColPali embeddings, storing them in a Milvus vector database, and performing semantic search.
5+
## Installation
66

7-
It is designed for efficient document retrieval and RAG applications.
7+
The `[colvision]` extra is mutually exclusive with `[process]` — use a dedicated venv.
8+
9+
```bash
10+
uv sync --extra colvision
11+
```
12+
13+
## Supported Models
14+
15+
| Model | `model_name` |
16+
|---|---|
17+
| ColPali v1.3 | `vidore/colpali-v1.3` |
18+
| ColQwen2 v1.0 | `vidore/colqwen2-v1.0` |
19+
| ColQwen2.5 v0.2 | `vidore/colqwen2.5-v0.2` |
20+
| ColGemma3 | `Cognitive-Lab/ColNetraEmbed` |
21+
| ColSmol 256M | `vidore/colSmol-256M` |
22+
| ColSmol 500M | `vidore/colSmol-500M` |
23+
24+
All models are installed with the single `[colvision]` extra.
25+
26+
The model/processor class is auto-detected from `model_name`, and the embedding dimension is inferred at every stage (from the loaded model at `process` / `retrieve` time, from the parquet contents at `index` time).
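As an illustration of how a class could be resolved from `model_name`, here is a minimal sketch. The function `resolve_model_family` and its lookup table are hypothetical and are not taken from `model_utils.py`; the family labels simply mirror the table above.

```python
# Hypothetical sketch: not the actual logic in model_utils.py.
# Entries are ordered so that more specific substrings are tested first
# ("colqwen2.5" must win over "colqwen2").
_FAMILY_BY_SUBSTRING = [
    ("colqwen2.5", "ColQwen2.5"),
    ("colqwen2", "ColQwen2"),
    ("colpali", "ColPali"),
    ("colnetraembed", "ColGemma3"),
    ("colsmol", "ColSmol"),
]


def resolve_model_family(model_name: str) -> str:
    """Return the model family label for a Hugging Face model id."""
    lowered = model_name.lower()
    for needle, family in _FAMILY_BY_SUBSTRING:
        if needle in lowered:
            return family
    raise ValueError(f"Unsupported ColVision model: {model_name!r}")


print(resolve_model_family("vidore/colqwen2.5-v0.2"))  # ColQwen2.5
```

Matching on substrings keeps the config to a single `model_name` field, at the cost of needing an explicit ordering when one name is a prefix of another.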
27+
28+
## Choosing a Model
29+
30+
Set `model_name` in the YAML config, or override it via the `-m` / `--model` CLI flag on the `process` and `retrieve` commands.
31+
32+
The pipeline runs in three steps: `process`, then `index`, then `retrieve`. If you override the model with
33+
`-m` / `--model`, pass the same value to both `process` and `retrieve`:
34+
35+
```bash
36+
# 1. Process PDFs into embeddings
37+
python3 -m mmore colvision process --config-file examples/colvision/config_process.yml -m vidore/colqwen2.5-v0.2
38+
39+
# 2. Index the embeddings into Milvus (no model needed here)
40+
python3 -m mmore colvision index --config-file examples/colvision/config_index.yml
41+
42+
# 3. Retrieve with the same model used at processing time
43+
python3 -m mmore colvision retrieve --config-file examples/colvision/config_retrieval.yml -m vidore/colqwen2.5-v0.2
44+
```
45+
46+
> **Important:** use the same model for `process` and `retrieve`. Embeddings from different models are not comparable, so mixing models produces incorrect results.
847
948
## 🧭 Architecture
1049

@@ -17,28 +56,32 @@ The system consists of three main components:
1756
## 📁 File Structure
1857

1958
```
20-
src/mmore/colpali/
21-
├── milvuscolpali.py # Milvus database management
59+
src/mmore/colvision/
60+
├── model_utils.py # Model/processor class resolution
61+
├── milvuscolvision.py # Milvus database management
2262
├── run_index.py # Indexing pipeline
23-
├── run_process.py # PDF processing pipeline
63+
├── run_process.py # PDF processing pipeline
2464
├── run_retriever.py # Search and retrieval API
25-
└── retriever.py # ColPaliRetriever class for RAG integration
65+
└── retriever.py # ColVisionRetriever class for RAG integration
2666
```
2767

2868
## 🚀 Quick Start
2969

3070
### 1. Process PDFs into embeddings
3171

3272
```bash
33-
python3 -m mmore colpali process --config-file examples/colpali/config_process.yml
73+
python3 -m mmore colvision process --config-file examples/colvision/config_process.yml
74+
75+
# Or override the model from the command line
76+
python3 -m mmore colvision process --config-file examples/colvision/config_process.yml --model vidore/colqwen2.5-v0.2
3477
```
3578

3679
**Example config (`config_process.yml`):**
3780
```yaml
3881
data_path:
3982
- 'examples/sample_data/pdf'
4083
output_path: "./output"
41-
model_name: "vidore/colpali-v1.3"
84+
model_name: "vidore/colqwen2.5-v0.2"
4285
skip_already_processed: true
4386
num_workers: 5
4487
batch_size: 8
@@ -47,7 +90,7 @@ batch_size: 8
4790
### 2. Index embeddings into Milvus
4891
4992
```bash
50-
python3 -m mmore colpali index --config-file examples/colpali/config_index.yml
93+
python3 -m mmore colvision index --config-file examples/colvision/config_index.yml
5194
```
5295

5396
**Example config (`config_index.yml`):**
@@ -57,7 +100,6 @@ milvus:
57100
db_path: ./output/milvus_data.db
58101
collection_name: pdf_pages
59102
create_collection: true
60-
dim: 128
61103
metric_type: IP
62104
```
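
Since the index config carries no `dim` key, the dimension has to come from the stored embeddings themselves. A minimal sketch of that idea follows, using plain Python lists in place of parquet rows. `infer_embedding_dim` is a hypothetical helper, and the layout (each page holding a list of equal-length token vectors) is an assumption rather than something confirmed by the source.

```python
from typing import Sequence


def infer_embedding_dim(pages: Sequence[Sequence[Sequence[float]]]) -> int:
    """Infer the shared vector length from multi-vector page embeddings."""
    # Collect the length of every token vector across all pages.
    dims = {len(vector) for page in pages for vector in page}
    if not dims:
        raise ValueError("No embeddings found; cannot infer the dimension.")
    if len(dims) > 1:
        raise ValueError(f"Inconsistent embedding dimensions: {sorted(dims)}")
    return dims.pop()


# Toy data: two pages, each with a few 4-dimensional token vectors.
pages = [
    [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],
    [[0.9, 1.0, 1.1, 1.2]],
]
print(infer_embedding_dim(pages))  # 4
```

Failing loudly on mixed dimensions would also catch the case where embeddings from two different models end up in the same output directory.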
63105
@@ -66,47 +108,31 @@ milvus:
66108
#### Retrieval Server Mode
67109
```bash
68110
# Start the retrieval API server
69-
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval.yml
111+
python3 -m mmore colvision retrieve --config-file examples/colvision/config_retrieval.yml
70112
```
71113

72114
Or with a custom host and port:
73115
```bash
74-
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval.yml --host 0.0.0.0 --port 8001
116+
python3 -m mmore colvision retrieve --config-file examples/colvision/config_retrieval.yml --host 0.0.0.0 --port 8001
75117
```
76118

77119
**Example config (`config_retrieval.yml`):**
78120
```yaml
79-
db_path: "./milvus_data"
121+
db_path: "./output/milvus_data.db"
80122
collection_name: "pdf_pages"
81-
model_name: "vidore/colpali-v1.3"
123+
model_name: "vidore/colqwen2.5-v0.2"
82124
top_k: 3
83-
dim: 128
84-
max_workers: 16
85125
metric_type: "IP"
126+
max_workers: 16
86127
text_parquet_path: "./output/pdf_page_text.parquet"
87128
```
88129
89-
#### Single Query Mode
90-
```bash
91-
# Run retrieval for a single query defined in the config file
92-
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval_single.yml
93-
```
94-
95-
**Example config (`config_retrieval_single.yml`):**
96-
```yaml
97-
mode: "single"
98-
db_path: "./milvus_data"
99-
collection_name: "pdf_pages"
100-
model_name: "vidore/colpali-v1.3"
101-
query: "What may lead to dysbiosis and inflammation?"
102-
top_k: 5
103-
```
104130
Host and port are specified via CLI flags (`--host` and `--port`), not in the config file.
105131

106132
#### Batch Mode
107133
```bash
108134
# Process queries from file
109-
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval.yml --input-file queries.jsonl --output-file results.json
135+
python3 -m mmore colvision retrieve --config-file examples/colvision/config_retrieval.yml --input-file queries.jsonl --output-file results.json
110136
```
111137

112138
**Example queries file (`queries.jsonl`):**
@@ -119,20 +145,9 @@ Each line should be a JSON-encoded string (one query per line):
119145

120146
Each line must be a valid JSON string, including quotes, since the file is parsed line by line with `json.loads()`.
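
A short sketch of producing and reading such a file with the standard library. The helper names `write_queries` and `read_queries` are illustrative and not part of the package; only `json.dumps()` and `json.loads()` are doing the real work.

```python
import json
import tempfile
from pathlib import Path


def write_queries(path: Path, queries: list[str]) -> None:
    """Write one JSON-encoded string (quotes included) per line."""
    with path.open("w", encoding="utf-8") as f:
        for query in queries:
            f.write(json.dumps(query) + "\n")


def read_queries(path: Path) -> list[str]:
    """Parse each non-empty line with json.loads()."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round-trip a query through a temporary queries.jsonl file.
with tempfile.TemporaryDirectory() as tmp:
    queries_path = Path(tmp) / "queries.jsonl"
    write_queries(queries_path, ["What may lead to dysbiosis and inflammation?"])
    raw_line = queries_path.read_text(encoding="utf-8").splitlines()[0]
    loaded = read_queries(queries_path)

print(raw_line)
print(loaded)
```

Using `json.dumps()` rather than writing the raw text ensures that quotes and newlines inside a query are escaped, so each query stays on one line.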
121147

122-
**Example config (`config_retrieval.yml`):**
123-
```yaml
124-
db_path: "./milvus_data"
125-
collection_name: "pdf_pages"
126-
model_name: "vidore/colpali-v1.3"
127-
top_k: 5
128-
dim: 128
129-
max_workers: 16
130-
text_parquet_path: "./output/pdf_page_text.parquet"
131-
```
132-
133148
## 🔧 Core Components
134149

135-
### MilvusColpaliManager
150+
### MilvusColvisionManager
136151
- manages local Milvus database operations
137152
- handles collection creation and indexing
138153
- provides efficient batch insertion
@@ -146,14 +161,14 @@ text_parquet_path: "./output/pdf_page_text.parquet"
146161

147162
### PDF Processor
148163
- converts PDF pages to images
149-
- generates ColPali embeddings
164+
- generates ColVision embeddings
150165
- handles parallel processing
151166
- supports stop-and-resume workflows for large datasets
152167

153168
**Processing Flow:**
154169
1. Crawl PDF files from specified directories
155170
2. Convert each page to high-resolution PNG
156-
3. Generate embeddings using ColPali model
171+
3. Generate embeddings using the configured model
157172
4. Store results in Parquet format
158173

159174
### Retriever
@@ -193,28 +208,25 @@ curl -X POST "http://localhost:8001/v1/retrieve" \
193208

194209
### RAG Pipeline Integration
195210
```python
196-
from mmore.colpali.retriever import ColPaliRetriever, ColPaliRetrieverConfig
197-
from mmore.rag.pipeline import RAGPipeline, RAGConfig
211+
from mmore.colvision.retriever import ColVisionRetriever, ColVisionRetrieverConfig
198212
199-
# Create ColPali retriever with text support
200-
colpali_config = ColPaliRetrieverConfig(
213+
config = ColVisionRetrieverConfig(
201214
db_path="./output/milvus_data.db",
202215
collection_name="pdf_pages",
203-
model_name="vidore/colpali-v1.3",
216+
model_name="vidore/colqwen2.5-v0.2",
204217
text_parquet_path="./output/pdf_page_text.parquet",
205218
top_k=3,
206-
dim=128,
207219
max_workers=16,
208220
metric_type="IP",
209221
)
210-
colpali_retriever = ColPaliRetriever.from_config(colpali_config)
222+
retriever = ColVisionRetriever.from_config(config)
211223
212224
# Use with RAG pipeline (requires LLM config)
213-
# rag_config = RAGConfig(retriever=colpali_retriever, ...)
225+
# rag_config = RAGConfig(retriever=retriever, ...)
214226
# rag_pipeline = RAGPipeline.from_config(rag_config)
215227
```
216228

217-
The `ColPaliRetriever` is a LangChain-compatible `BaseRetriever` that returns `Document` objects with:
229+
The `ColVisionRetriever` is a LangChain-compatible `BaseRetriever` that returns `Document` objects with:
218230
- `page_content`: the text content from the PDF page, if `text_parquet_path` is provided
219231
- `metadata`: contains `pdf_name`, `pdf_path`, `page_number`, `rank`, and `similarity` score
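
To show how those fields might be consumed, here is a minimal sketch that formats one result for display. It uses a plain string and dict as stand-ins for a `Document`'s `page_content` and `metadata`; the `format_hit` helper and the sample values are hypothetical.

```python
def format_hit(page_content: str, metadata: dict) -> str:
    """Render one retrieved page as a single summary line."""
    # page_content may be empty when text_parquet_path is not provided.
    snippet = page_content[:80] if page_content else "<no text>"
    return (
        f"#{metadata['rank']} {metadata['pdf_name']} "
        f"p.{metadata['page_number']} "
        f"(similarity={metadata['similarity']:.3f}): {snippet}"
    )


# Toy values using the metadata keys listed above.
sample_metadata = {
    "pdf_name": "report.pdf",
    "pdf_path": "docs/report.pdf",
    "page_number": 4,
    "rank": 1,
    "similarity": 0.8731,
}
print(format_hit("Toy page text", sample_metadata))
```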
220232

@@ -282,13 +294,13 @@ The `ColPaliRetriever` is a LangChain-compatible `BaseRetriever` that returns `D
282294
### Complete Workflow
283295
```bash
284296
# 1. Process all PDFs in a directory
285-
python3 -m mmore colpali process --config-file examples/colpali/config_process.yml
297+
python3 -m mmore colvision process --config-file examples/colvision/config_process.yml
286298
287299
# 2. Index the embeddings
288-
python3 -m mmore colpali index --config-file examples/colpali/config_index.yml
300+
python3 -m mmore colvision index --config-file examples/colvision/config_index.yml
289301
290302
# 3. Start the API server
291-
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval.yml
303+
python3 -m mmore colvision retrieve --config-file examples/colvision/config_retrieval.yml
292304
293305
# 4. Query the system
294306
curl -X POST "http://localhost:8001/v1/retrieve" \
@@ -299,13 +311,13 @@ curl -X POST "http://localhost:8001/v1/retrieve" \
299311
### Alternative: Batch processing
300312
```bash
301313
# 1. Process PDFs (same as above)
302-
python3 -m mmore colpali process --config-file examples/colpali/config_process.yml
314+
python3 -m mmore colvision process --config-file examples/colvision/config_process.yml
303315
304316
# 2. Index embeddings (same as above)
305-
python3 -m mmore colpali index --config-file examples/colpali/config_index.yml
317+
python3 -m mmore colvision index --config-file examples/colvision/config_index.yml
306318
307319
# 3. Run batch retrieval
308-
python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieval.yml \
320+
python3 -m mmore colvision retrieve --config-file examples/colvision/config_retrieval.yml \
309321
--input-file queries.jsonl \
310322
--output-file results.json
311323
```
@@ -319,7 +331,7 @@ python3 -m mmore colpali retrieve --config-file examples/colpali/config_retrieva
319331
### For better accuracy
320332
- use a higher DPI for PDF conversion (the default is 200)
321333
- increase `top_k` in retrieval to inspect more candidate pages
322-
- consider using larger ColPali models if available
334+
- consider using more recent ColVision models (ColQwen2.5, ColGemma3)
323335

324336
### For production
325337
- run Milvus in distributed mode for larger datasets
