Commit dc1fe6c

feat(process): add MistralOCRProcessor as alternative PDF backend
Introduce a hosted-OCR backend (Mistral OCR) selectable at runtime, leaving the existing Marker/Surya pipeline as the default. Selection is driven by a new dispatcher_config.pdf_backend flag (marker|mistral) that is exported to the MMORE_PDF_BACKEND env var so both PDF processors can disambiguate in their accepts() method.

- src/mmore/process/processors/mistral_ocr_processor.py: new processor that calls Mistral OCR via the mistralai SDK, returns markdown + optional images, and produces the same MultimodalSample shape as the existing PDFProcessor.
- src/mmore/process/processors/pdf_processor.py: accepts() now defers when MMORE_PDF_BACKEND=mistral.
- src/mmore/process/dispatcher.py: DispatcherConfig.pdf_backend field propagated from YAML.
- production-config/process/config.yaml: documents the new flag and adds a MistralOCRProcessor section.
- pyproject.toml: adds mistralai>=2.4 in the `process` extra.
- docs/mistral_ocr_cost_estimate.md: budget rationale for adopting Mistral OCR on a 1000-PDF corpus.
1 parent 1ac412e commit dc1fe6c

7 files changed

Lines changed: 290 additions & 1 deletion

docs/mistral_ocr_cost_estimate.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+# Budget estimate: Mistral OCR for mmore benchmark
+
+**Goal:** request lab access to a Mistral API key in order to integrate
+`mistral-ocr-latest` as an alternative backend for mmore's `PDFProcessor`, and
+benchmark its extraction quality against the current pipeline (Marker/Surya) on
+a **1000-PDF** corpus.
+
+## Mistral OCR pricing
+
+Source: Mistral AI public pricing (announced March 2025).
+
+| Mode     | Price              | Latency           | Recommended usage           |
+| -------- | ------------------ | ----------------- | --------------------------- |
+| Standard | $1.00 / 1000 pages | near real-time    | dev, debug, small volumes   |
+| Batch    | $0.50 / 1000 pages | a few hours       | offline benchmarks, bulk    |
+
+Assumed conversion: **1 USD ≈ 0.90 CHF** (to be re-confirmed at purchase time).
+
+## Estimates for 1000 PDFs
+
+Cost scales with the number of **pages**, not the number of files. A few
+profiles depending on corpus shape:
+
+| Corpus profile               | Pages/PDF | Total pages | Standard cost | Standard (CHF) | Batch (CHF) |
+| ---------------------------- | --------- | ----------- | ------------- | -------------- | ----------- |
+| Slides, short memos          | 5         | 5,000       | $5            | **~4.5**       | ~2.3        |
+| Mid-sized reports / papers   | 15        | 15,000      | $15           | **~13.5**      | ~6.8        |
+| Long reports (WHO-style)     | 30        | 30,000      | $30           | **~27**        | ~13.5       |
+| Books, theses                | 50        | 50,000      | $50           | **~45**        | ~22.5       |
+| Very long (200-page theses)  | 200       | 200,000     | $200          | **~180**       | ~90         |
+
+## Recommendation for the lab
+
+Target corpus: documents like those in **examples/who/** (WHO reports,
+guidelines), typically 20–80 pages.
+
+- **Central estimate:** ~30 pages × 1000 PDFs = 30,000 pages → **~27 CHF** standard, **~14 CHF** batch.
+- **Recommended budget (×3 margin for re-runs, debug, ablations):** **80–100 CHF**.
+
+## Caveats worth flagging
+
+1. **Per-page billing**, not per-document: measure actual corpus size before purchase.
+2. **Frequent re-runs** during benchmarking (prompt tweaks, hyperparams, sample size): provision ×2 to ×3.
+3. **Floating USD/CHF rate**: re-check on the day of purchase.
+4. **No cost on the current pipeline side** (Marker/Surya runs on lab GPUs, RCP/CSCS).
+
+## Measuring the exact page count before purchase
+
+```bash
+find <corpus_dir> -name "*.pdf" -exec pdfinfo {} \; \
+  | grep "^Pages" | awk '{s+=$2} END {print "Total pages:", s}'
+```
+
+Estimated cost = `total_pages * 0.001 * 0.9` CHF in standard mode,
+or `total_pages * 0.0005 * 0.9` CHF in batch mode.
+
+## Summary / request
+
+> To integrate and benchmark Mistral OCR (`mistral-ocr-latest`) as an
+> alternative backend for mmore's PDFProcessor on a 1000-document corpus
+> (~30,000 pages estimated), I'm requesting access to a Mistral API key with a
+> projected budget of **~100 CHF** covering the initial run plus 2–3 bench
+> iterations (prompt changes, ablations on sub-corpora).
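
Review note: the per-page formula in the estimate can be sketched in Python as below. This is a minimal illustration and not part of the commit. The function name `estimate_cost_chf`, its parameters, and the `margin` multiplier are hypothetical; the prices and the 0.90 USD-to-CHF rate are the assumptions stated in the document itself.

```python
# Prices and exchange rate are the assumptions stated in the estimate,
# not values fetched from any API.
USD_PER_PAGE = {"standard": 1.00 / 1000, "batch": 0.50 / 1000}
USD_TO_CHF = 0.90


def estimate_cost_chf(total_pages: int, mode: str = "standard", margin: float = 1.0) -> float:
    """Return the estimated cost in CHF for OCR-ing `total_pages` pages."""
    if mode not in USD_PER_PAGE:
        raise ValueError(f"unknown mode: {mode!r}")
    return total_pages * USD_PER_PAGE[mode] * USD_TO_CHF * margin


# Central estimate from the document: 1000 PDFs at ~30 pages each.
pages = 1000 * 30
print(round(estimate_cost_chf(pages, "standard"), 2))
print(round(estimate_cost_chf(pages, "batch"), 2))
print(round(estimate_cost_chf(pages, "standard", margin=3.0), 2))
```

With the ×3 margin the standard-mode figure comes to about 81 CHF, which is consistent with the 80–100 CHF budget recommended in the estimate.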

production-config/process/config.yaml

Lines changed: 6 additions & 0 deletions
@@ -5,6 +5,9 @@ dispatcher_config:
   distributed: false
   extract_images: true
   scheduler_file: $ROOT_OUT_DIR/scheduler-file.json #put absolute path!
+  # PDF backend: "marker" (default, local GPU via Marker/Surya) or "mistral"
+  # (hosted Mistral OCR API, requires MISTRAL_API_KEY env var).
+  pdf_backend: marker
   process_batch_sizes:
     - URLProcessor: 40
     - DOCXProcessor: 100
@@ -24,6 +27,9 @@ dispatcher_config:
       - sample_rate: 10
       - batch_size: 4
 
+    MistralOCRProcessor:
+      - mistral_ocr_model: "mistral-ocr-latest"
+
     PDFProcessor:
       - PDFTEXT_CPU_WORKERS: 0
       - DETECTOR_BATCH_SIZE: 1

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -54,6 +54,7 @@ process = [
     "PyMuPDF",
     "marker-pdf>=1.6",
     "surya-ocr>=0.8.3",
+    "mistralai>=2.4", # enables MistralOCRProcessor (hosted OCR backend)
     "moviepy>=2.0",
     "mammoth>=1.8",
     "markdownify>=0.12",

src/mmore/process/dispatcher.py

Lines changed: 4 additions & 0 deletions
@@ -74,9 +74,12 @@ class DispatcherConfig:
     process_batch_sizes: Optional[List[Dict[str, float]]] = None
     batch_multiplier: int = 1
     extract_images: bool = False
+    pdf_backend: Optional[str] = None
 
     def __post_init__(self):
         os.makedirs(self.output_path, exist_ok=True)
+        if self.pdf_backend:
+            os.environ["MMORE_PDF_BACKEND"] = self.pdf_backend.lower()
 
     @staticmethod
     def from_dict(config: Dict) -> "DispatcherConfig":
@@ -90,6 +93,7 @@ def from_dict(config: Dict) -> "DispatcherConfig":
             process_batch_sizes=config.get("process_batch_sizes"),
             batch_multiplier=config.get("batch_multiplier", 1),
             extract_images=config.get("extract_images", False),
+            pdf_backend=config.get("pdf_backend"),
         )
 
     @staticmethod
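
Review note: the propagation path in this hunk (YAML value, then `from_dict`, then `__post_init__`, then the environment variable) can be sketched in isolation. `MiniDispatcherConfig` is a hypothetical, trimmed-down stand-in that keeps only the new field; the real `DispatcherConfig` carries more fields and also creates the output directory.

```python
import os
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MiniDispatcherConfig:
    """Hypothetical stand-in for DispatcherConfig, reduced to the new field."""

    pdf_backend: Optional[str] = None

    def __post_init__(self) -> None:
        # Same logic as the diff: export the choice, lowercased, only when set.
        if self.pdf_backend:
            os.environ["MMORE_PDF_BACKEND"] = self.pdf_backend.lower()

    @staticmethod
    def from_dict(config: Dict) -> "MiniDispatcherConfig":
        return MiniDispatcherConfig(pdf_backend=config.get("pdf_backend"))


# Start from a clean environment for the demonstration.
os.environ.pop("MMORE_PDF_BACKEND", None)

# Flag absent from the config: the env var is left untouched.
MiniDispatcherConfig.from_dict({})
unset_value = os.environ.get("MMORE_PDF_BACKEND")

# Flag present: the value is normalised to lowercase before export.
MiniDispatcherConfig.from_dict({"pdf_backend": "Mistral"})
exported_value = os.environ.get("MMORE_PDF_BACKEND")

print(unset_value, exported_value)
```

One consequence of writing the variable only when the flag is set: a value exported earlier in the same process or shell stays in effect when a later config omits `pdf_backend`.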
src/mmore/process/processors/mistral_ocr_processor.py

Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
+import base64
+import io
+import logging
+import os
+import re
+from dataclasses import dataclass, field
+from typing import Any, Dict, List, Tuple
+
+from PIL import Image
+
+from ...type import DocumentMetadata, FileDescriptor, MultimodalSample
+from .base import Processor, ProcessorConfig
+
+logger = logging.getLogger(__name__)
+
+# Env var that selects the PDF backend. When set to "mistral", MistralOCRProcessor
+# accepts .pdf files and the default PDFProcessor steps aside.
+PDF_BACKEND_ENV = "MMORE_PDF_BACKEND"
+MISTRAL_BACKEND = "mistral"
+
+IMG_REGEX = r"!\[[^\]]*\]\([^)]+\)"
+
+
+@dataclass
+class MistralOCRMetadata(DocumentMetadata):
+    paragraph_starts: List[Tuple[int, int, int]] = field(default_factory=list)
+    backend: str = "mistral-ocr"
+    model: str = "mistral-ocr-latest"
+
+    def to_dict(self) -> Dict[str, Any]:
+        metadata = super().to_dict()
+        if self.paragraph_starts:
+            metadata["paragraph_starts"] = self.paragraph_starts
+        metadata["backend"] = self.backend
+        metadata["model"] = self.model
+        return metadata
+
+
+class MistralOCRProcessor(Processor):
+    """PDF processor backed by Mistral's hosted OCR endpoint.
+
+    Activated by setting MMORE_PDF_BACKEND=mistral. Requires MISTRAL_API_KEY.
+    """
+
+    def __init__(self, config=None):
+        super().__init__(config=config or ProcessorConfig())
+        self._client = None
+        self._model = (
+            self.config.custom_config.get("mistral_ocr_model", "mistral-ocr-latest")
+            if config is not None
+            else "mistral-ocr-latest"
+        )
+
+    @classmethod
+    def accepts(cls, file: FileDescriptor) -> bool:
+        if os.environ.get(PDF_BACKEND_ENV, "").lower() != MISTRAL_BACKEND:
+            return False
+        return file.file_extension.lower() == ".pdf"
+
+    def _get_client(self):
+        if self._client is not None:
+            return self._client
+        try:
+            from mistralai import Mistral
+        except ImportError as e:
+            raise ImportError(
+                "mistralai SDK is required for MistralOCRProcessor. "
+                "Install with `pip install mistralai`."
+            ) from e
+        api_key = os.environ.get("MISTRAL_API_KEY")
+        if not api_key:
+            raise RuntimeError(
+                "MISTRAL_API_KEY env var is not set. Required for MistralOCRProcessor."
+            )
+        self._client = Mistral(api_key=api_key)
+        return self._client
+
+    def process(self, file_path: str) -> MultimodalSample:
+        client = self._get_client()
+
+        with open(file_path, "rb") as fh:
+            pdf_bytes = fh.read()
+        encoded = base64.b64encode(pdf_bytes).decode("utf-8")
+
+        extract_images = self.config.custom_config.get("extract_images", True)
+
+        response = client.ocr.process(
+            model=self._model,
+            document={
+                "type": "document_url",
+                "document_url": f"data:application/pdf;base64,{encoded}",
+            },
+            include_image_base64=extract_images,
+        )
+
+        pages = getattr(response, "pages", None) or []
+        page_texts: List[Tuple[int, str]] = []
+        images: List[Image.Image] = []
+
+        for page_idx, page in enumerate(pages):
+            md = getattr(page, "markdown", "") or ""
+            if extract_images:
+                for img in getattr(page, "images", []) or []:
+                    b64 = getattr(img, "image_base64", None)
+                    if not b64:
+                        continue
+                    try:
+                        raw = base64.b64decode(b64.split(",", 1)[-1])
+                        images.append(Image.open(io.BytesIO(raw)).convert("RGB"))
+                    except Exception as e:
+                        logger.warning(
+                            f"Could not decode image on page {page_idx} of {file_path}: {e}"
+                        )
+            md = re.sub(IMG_REGEX, "<attachment>", md)
+            page_texts.append((page_idx, md))
+
+        paragraph_starts, full_text = self._build_pagination(page_texts)
+
+        metadata = MistralOCRMetadata(
+            file_path=file_path,
+            paragraph_starts=paragraph_starts,
+            model=self._model,
+        )
+        return self.create_sample([full_text], images, metadata)
+
+    @staticmethod
+    def _build_pagination(
+        page_texts: List[Tuple[int, str]],
+    ) -> Tuple[List[Tuple[int, int, int]], str]:
+        paragraph_starts: List[Tuple[int, int, int]] = []
+        current_position = 0
+        parts: List[str] = []
+        for page_id, page_content in page_texts:
+            para_idx = 0
+            offset_in_page = 0
+            for segment in page_content.split("\n\n"):
+                if segment.strip():
+                    paragraph_starts.append(
+                        (current_position + offset_in_page, page_id, para_idx)
+                    )
+                    para_idx += 1
+                offset_in_page += len(segment) + 2
+            parts.append(page_content)
+            current_position += len(page_content)
+        paragraph_starts.append((current_position, -1, -1))
+        return paragraph_starts, "".join(parts)
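
Review note: to make the offset bookkeeping in `_build_pagination` easier to check, here is a standalone copy of that logic run on a two-page input. The function name `build_pagination` and the sample pages are illustrative only; the body mirrors the method in the diff, with comments added.

```python
from typing import List, Tuple


def build_pagination(
    page_texts: List[Tuple[int, str]],
) -> Tuple[List[Tuple[int, int, int]], str]:
    """Standalone copy of the _build_pagination logic from the diff."""
    paragraph_starts: List[Tuple[int, int, int]] = []
    current_position = 0
    parts: List[str] = []
    for page_id, page_content in page_texts:
        para_idx = 0
        offset_in_page = 0
        for segment in page_content.split("\n\n"):
            if segment.strip():
                paragraph_starts.append(
                    (current_position + offset_in_page, page_id, para_idx)
                )
                para_idx += 1
            # +2 accounts for the "\n\n" separator consumed by split().
            offset_in_page += len(segment) + 2
        parts.append(page_content)
        current_position += len(page_content)
    # Sentinel entry marking the end of the concatenated text.
    paragraph_starts.append((current_position, -1, -1))
    return paragraph_starts, "".join(parts)


pages = [(0, "Title\n\nFirst paragraph."), (1, "Second page.\n\nEnd.")]
starts, text = build_pagination(pages)
for offset, page_id, para_idx in starts:
    # Show the first characters found at each recorded offset.
    print(offset, page_id, para_idx, repr(text[offset:offset + 5]))
```

For this input the recorded entries are `(0, 0, 0)`, `(7, 0, 1)`, `(23, 1, 0)`, `(37, 1, 1)` and the sentinel `(41, -1, -1)`. Pages are joined without a separator, so the last paragraph of one page runs directly into the first paragraph of the next in the returned text (`"...paragraph.Second page..."`); whether that matches the existing PDFProcessor output is worth confirming.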

src/mmore/process/processors/pdf_processor.py

Lines changed: 3 additions & 0 deletions
@@ -1,5 +1,6 @@
 import io
 import logging
+import os
 import re
 from dataclasses import dataclass, field
 from multiprocessing import Manager, Process, set_start_method
@@ -41,6 +42,8 @@ def __init__(self, config=None):
 
     @classmethod
     def accepts(cls, file: FileDescriptor) -> bool:
+        if os.environ.get("MMORE_PDF_BACKEND", "").lower() == "mistral":
+            return False
         return file.file_extension.lower() == ".pdf"
 
     @staticmethod
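
Review note: the two `accepts()` methods are meant to be mutually exclusive for `.pdf` files. The sketch below checks that property for a few values of the environment variable. `marker_accepts` and `mistral_accepts` are hypothetical free functions that mirror the two class methods and take a bare extension string instead of a `FileDescriptor`.

```python
import os

PDF_BACKEND_ENV = "MMORE_PDF_BACKEND"


def marker_accepts(extension: str) -> bool:
    """Mirrors PDFProcessor.accepts() after this commit."""
    if os.environ.get(PDF_BACKEND_ENV, "").lower() == "mistral":
        return False
    return extension.lower() == ".pdf"


def mistral_accepts(extension: str) -> bool:
    """Mirrors MistralOCRProcessor.accepts()."""
    if os.environ.get(PDF_BACKEND_ENV, "").lower() != "mistral":
        return False
    return extension.lower() == ".pdf"


results = {}
for value in (None, "marker", "mistral", "MISTRAL", "typo"):
    if value is None:
        os.environ.pop(PDF_BACKEND_ENV, None)
    else:
        os.environ[PDF_BACKEND_ENV] = value
    # Each entry is (marker accepts?, mistral accepts?) for a PDF file.
    results[value] = (marker_accepts(".pdf"), mistral_accepts(".PDF"))
    print(value, results[value])
```

Exactly one processor claims the file in every case. Note that an unrecognised value such as `"typo"` falls back to the Marker pipeline without any warning, since only the literal `mistral` is compared.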

uv.lock

Lines changed: 67 additions & 1 deletion
Some generated files are not rendered by default.