feat(process): add MistralOCRProcessor as alternative PDF backend

perrin-arthur · perrin-arthur · commit dc1fe6c8c3be · 2026-05-19T15:44:57.000+02:00
Introduce a hosted OCR backend (Mistral OCR) that is selectable at runtime;
the existing Marker/Surya pipeline remains the default. Selection is driven
by a new dispatcher_config.pdf_backend flag (marker|mistral), which is
exported to the MMORE_PDF_BACKEND env var so that each PDF processor can
decide in its accepts() method whether it handles .pdf files.
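
The selection mechanism can be sketched as follows. This is a minimal,
self-contained illustration, not the code in this commit: MarkerPDF,
MistralPDF, active_pdf_backend and pick_processor are hypothetical stand-ins,
and only the MMORE_PDF_BACKEND variable and the accepts() checks mirror the
diff.

```python
import os

PDF_BACKEND_ENV = "MMORE_PDF_BACKEND"


def active_pdf_backend() -> str:
    """Return the selected backend; unset or empty falls back to "marker"."""
    return os.environ.get(PDF_BACKEND_ENV, "").lower() or "marker"


class MarkerPDF:  # hypothetical stand-in for PDFProcessor
    @classmethod
    def accepts(cls, extension: str) -> bool:
        # Step aside only when the Mistral backend is explicitly selected.
        if active_pdf_backend() == "mistral":
            return False
        return extension.lower() == ".pdf"


class MistralPDF:  # hypothetical stand-in for MistralOCRProcessor
    @classmethod
    def accepts(cls, extension: str) -> bool:
        # Claim .pdf files only when the Mistral backend is selected.
        if active_pdf_backend() != "mistral":
            return False
        return extension.lower() == ".pdf"


def pick_processor(extension: str):
    """Return the name of the first stand-in that accepts the extension."""
    for proc in (MarkerPDF, MistralPDF):
        if proc.accepts(extension):
            return proc.__name__
    return None
```

Because the two checks are mutually exclusive, exactly one stand-in claims a
.pdf file for any value of the variable; an unrecognised value falls through
to the Marker-style processor.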

- src/mmore/process/processors/mistral_ocr_processor.py: new processor
  that calls Mistral OCR via the mistralai SDK, returns markdown +
  optional images, and produces the same MultimodalSample shape as the
  existing PDFProcessor.
- src/mmore/process/processors/pdf_processor.py: accepts() now defers
  when MMORE_PDF_BACKEND=mistral.
- src/mmore/process/dispatcher.py: new DispatcherConfig.pdf_backend field,
  read from YAML and exported to MMORE_PDF_BACKEND in __post_init__().
- production-config/process/config.yaml: documents the new flag and adds
  a MistralOCRProcessor section.
- pyproject.toml: adds mistralai>=2.4 in the `process` extra.
- docs/mistral_ocr_cost_estimate.md: budget rationale for adopting
  Mistral OCR on a 1000-PDF corpus.
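
For reference, the arithmetic behind the budget note can be written as a small
helper. This is a sketch under the assumptions stated in the note ($1.00 per
1000 pages in standard mode, $0.50 in batch mode, 1 USD ≈ 0.90 CHF);
estimate_ocr_cost_chf is a hypothetical name and is not part of this commit.

```python
def estimate_ocr_cost_chf(
    total_pages: int,
    batch: bool = False,
    usd_to_chf: float = 0.90,  # assumed rate; re-check at purchase time
) -> float:
    """Estimate the Mistral OCR cost in CHF for a given page count."""
    # Prices assumed from the budget note: $1.00 (standard) or $0.50 (batch)
    # per 1000 pages.
    usd_per_page = 0.0005 if batch else 0.001
    return total_pages * usd_per_page * usd_to_chf
```

With the central estimate of 30,000 pages this gives about 27 CHF in standard
mode and about 13.5 CHF in batch mode.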
diff --git a/docs/mistral_ocr_cost_estimate.md b/docs/mistral_ocr_cost_estimate.md
@@ -0,0 +1,63 @@
+# Budget estimate — Mistral OCR for mmore benchmark
+
+**Goal:** request lab access to a Mistral API key in order to integrate
+`mistral-ocr-latest` as an alternative backend for mmore's `PDFProcessor`, and
+benchmark its extraction quality against the current pipeline (Marker/Surya) on
+a **1000-PDF** corpus.
+
+## Mistral OCR pricing
+
+Source: Mistral AI public pricing (announced March 2025).
+
+| Mode     | Price              | Latency           | Recommended usage           |
+| -------- | ------------------ | ----------------- | --------------------------- |
+| Standard | $1.00 / 1000 pages | near real-time    | dev, debug, small volumes   |
+| Batch    | $0.50 / 1000 pages | a few hours       | offline benchmarks, bulk    |
+
+Assumed conversion: **1 USD ≈ 0.90 CHF** (to be re-confirmed at purchase time).
+
+## Estimates for 1000 PDFs
+
+Cost scales with the number of **pages**, not the number of files. A few
+profiles depending on corpus shape:
+
+| Corpus profile               | Pages/PDF | Total pages | Standard (USD) | Standard (CHF) | Batch (CHF) |
+| ---------------------------- | --------- | ----------- | -------------- | -------------- | ----------- |
+| Slides, short memos          | 5         | 5,000       | $5             | **~4.5**       | ~2.3        |
+| Mid-sized reports / papers   | 15        | 15,000      | $15            | **~13.5**      | ~6.8        |
+| Long reports (WHO-style)     | 30        | 30,000      | $30            | **~27**        | ~13.5       |
+| Books, theses                | 50        | 50,000      | $50            | **~45**        | ~22.5       |
+| Very long (200-page theses)  | 200       | 200,000     | $200           | **~180**       | ~90         |
+
+## Recommendation for the lab
+
+Target corpus: documents like those in **examples/who/** (WHO reports,
+guidelines), typically 20–80 pages.
+
+- **Central estimate:** ~30 pages × 1000 PDFs = 30,000 pages → **~27 CHF** standard, **~13.5 CHF** batch.
+- **Recommended budget (×3 margin for re-runs, debug, ablations):** **80–100 CHF**.
+
+## Caveats worth flagging
+
+1. **Per-page billing**, not per-document — measure actual corpus size before purchase.
+2. **Frequent re-runs** during benchmarking (prompt tweaks, hyperparams, sample size): provision ×2 to ×3.
+3. **Floating USD/CHF rate** — re-check on the day of purchase.
+4. **No cost on the current pipeline side** (Marker/Surya runs on lab GPUs, RCP/CSCS).
+
+## Measuring the exact page count before purchase
+
+```bash
+find <corpus_dir> -name "*.pdf" -exec pdfinfo {} \; \
+  | grep "^Pages" | awk '{s+=$2} END {print "Total pages:", s}'
+```
+
+Estimated cost = `total_pages * 0.001 * 0.9` CHF in standard mode,
+or `total_pages * 0.0005 * 0.9` CHF in batch mode.
+
+## Summary / request
+
+> To integrate and benchmark Mistral OCR (`mistral-ocr-latest`) as an
+> alternative backend for mmore's PDFProcessor on a 1000-document corpus
+> (~30,000 pages estimated), I'm requesting access to a Mistral API key with a
+> projected budget of **~100 CHF** covering the initial run plus 2–3 benchmark
+> iterations (prompt changes, ablations on sub-corpora).
diff --git a/production-config/process/config.yaml b/production-config/process/config.yaml
@@ -5,6 +5,9 @@ dispatcher_config:
   distributed: false
   extract_images: true
   scheduler_file: $ROOT_OUT_DIR/scheduler-file.json #put absolute path!
+  # PDF backend: "marker" (default, local GPU via Marker/Surya) or "mistral"
+  # (hosted Mistral OCR API, requires MISTRAL_API_KEY env var).
+  pdf_backend: marker
   process_batch_sizes:
     - URLProcessor: 40
     - DOCXProcessor: 100
@@ -24,6 +27,9 @@ dispatcher_config:
       - sample_rate: 10
       - batch_size: 4
 
+    MistralOCRProcessor:
+      - mistral_ocr_model: "mistral-ocr-latest"
+
     PDFProcessor:
       - PDFTEXT_CPU_WORKERS: 0
       - DETECTOR_BATCH_SIZE: 1
diff --git a/pyproject.toml b/pyproject.toml
@@ -54,6 +54,7 @@ process = [
     "PyMuPDF",
     "marker-pdf>=1.6",
     "surya-ocr>=0.8.3",
+    "mistralai>=2.4",  # enables MistralOCRProcessor (hosted OCR backend)
     "moviepy>=2.0",
     "mammoth>=1.8",
     "markdownify>=0.12",
diff --git a/src/mmore/process/dispatcher.py b/src/mmore/process/dispatcher.py
@@ -74,9 +74,12 @@ class DispatcherConfig:
     process_batch_sizes: Optional[List[Dict[str, float]]] = None
     batch_multiplier: int = 1
     extract_images: bool = False
+    pdf_backend: Optional[str] = None
 
     def __post_init__(self):
         os.makedirs(self.output_path, exist_ok=True)
+        if self.pdf_backend:
+            os.environ["MMORE_PDF_BACKEND"] = self.pdf_backend.lower()
 
     @staticmethod
     def from_dict(config: Dict) -> "DispatcherConfig":
@@ -90,6 +93,7 @@ def from_dict(config: Dict) -> "DispatcherConfig":
             process_batch_sizes=config.get("process_batch_sizes"),
             batch_multiplier=config.get("batch_multiplier", 1),
             extract_images=config.get("extract_images", False),
+            pdf_backend=config.get("pdf_backend"),
         )
 
     @staticmethod
diff --git a/src/mmore/process/processors/mistral_ocr_processor.py b/src/mmore/process/processors/mistral_ocr_processor.py
@@ -0,0 +1,146 @@
+import base64
+import io
+import logging
+import os
+import re
+from dataclasses import dataclass, field
+from typing import Any, Dict, List, Tuple
+
+from PIL import Image
+
+from ...type import DocumentMetadata, FileDescriptor, MultimodalSample
+from .base import Processor, ProcessorConfig
+
+logger = logging.getLogger(__name__)
+
+# Env var that selects the PDF backend. When set to "mistral", MistralOCRProcessor
+# accepts .pdf files and the default PDFProcessor steps aside.
+PDF_BACKEND_ENV = "MMORE_PDF_BACKEND"
+MISTRAL_BACKEND = "mistral"
+
+IMG_REGEX = r"!\[[^\]]*\]\([^)]+\)"
+
+
+@dataclass
+class MistralOCRMetadata(DocumentMetadata):
+    paragraph_starts: List[Tuple[int, int, int]] = field(default_factory=list)
+    backend: str = "mistral-ocr"
+    model: str = "mistral-ocr-latest"
+
+    def to_dict(self) -> Dict[str, Any]:
+        metadata = super().to_dict()
+        if self.paragraph_starts:
+            metadata["paragraph_starts"] = self.paragraph_starts
+        metadata["backend"] = self.backend
+        metadata["model"] = self.model
+        return metadata
+
+
+class MistralOCRProcessor(Processor):
+    """PDF processor backed by Mistral's hosted OCR endpoint.
+
+    Activated by setting MMORE_PDF_BACKEND=mistral. Requires MISTRAL_API_KEY.
+    """
+
+    def __init__(self, config=None):
+        super().__init__(config=config or ProcessorConfig())
+        self._client = None
+        self._model = (
+            self.config.custom_config.get("mistral_ocr_model", "mistral-ocr-latest")
+            if config is not None
+            else "mistral-ocr-latest"
+        )
+
+    @classmethod
+    def accepts(cls, file: FileDescriptor) -> bool:
+        if os.environ.get(PDF_BACKEND_ENV, "").lower() != MISTRAL_BACKEND:
+            return False
+        return file.file_extension.lower() == ".pdf"
+
+    def _get_client(self):
+        if self._client is not None:
+            return self._client
+        try:
+            from mistralai import Mistral
+        except ImportError as e:
+            raise ImportError(
+                "mistralai SDK is required for MistralOCRProcessor. "
+                "Install with `pip install mistralai`."
+            ) from e
+        api_key = os.environ.get("MISTRAL_API_KEY")
+        if not api_key:
+            raise RuntimeError(
+                "MISTRAL_API_KEY env var is not set. Required for MistralOCRProcessor."
+            )
+        self._client = Mistral(api_key=api_key)
+        return self._client
+
+    def process(self, file_path: str) -> MultimodalSample:
+        client = self._get_client()
+
+        with open(file_path, "rb") as fh:
+            pdf_bytes = fh.read()
+        encoded = base64.b64encode(pdf_bytes).decode("utf-8")
+
+        extract_images = self.config.custom_config.get("extract_images", True)
+
+        response = client.ocr.process(
+            model=self._model,
+            document={
+                "type": "document_url",
+                "document_url": f"data:application/pdf;base64,{encoded}",
+            },
+            include_image_base64=extract_images,
+        )
+
+        pages = getattr(response, "pages", None) or []
+        page_texts: List[Tuple[int, str]] = []
+        images: List[Image.Image] = []
+
+        for page_idx, page in enumerate(pages):
+            md = getattr(page, "markdown", "") or ""
+            if extract_images:
+                for img in getattr(page, "images", []) or []:
+                    b64 = getattr(img, "image_base64", None)
+                    if not b64:
+                        continue
+                    try:
+                        raw = base64.b64decode(b64.split(",", 1)[-1])
+                        images.append(Image.open(io.BytesIO(raw)).convert("RGB"))
+                    except Exception as e:
+                        logger.warning(
+                            f"Could not decode image on page {page_idx} of {file_path}: {e}"
+                        )
+            md = re.sub(IMG_REGEX, "<attachment>", md)
+            page_texts.append((page_idx, md))
+
+        paragraph_starts, full_text = self._build_pagination(page_texts)
+
+        metadata = MistralOCRMetadata(
+            file_path=file_path,
+            paragraph_starts=paragraph_starts,
+            model=self._model,
+        )
+        return self.create_sample([full_text], images, metadata)
+
+    @staticmethod
+    def _build_pagination(
+        page_texts: List[Tuple[int, str]],
+    ) -> Tuple[List[Tuple[int, int, int]], str]:
+        paragraph_starts: List[Tuple[int, int, int]] = []
+        current_position = 0
+        parts: List[str] = []
+        for page_id, page_content in page_texts:
+            para_idx = 0
+            offset_in_page = 0
+            for segment in page_content.split("\n\n"):
+                if segment.strip():
+                    paragraph_starts.append(
+                        (current_position + offset_in_page, page_id, para_idx)
+                    )
+                    para_idx += 1
+                offset_in_page += len(segment) + 2
+            parts.append(page_content)
+            current_position += len(page_content)
+        paragraph_starts.append((current_position, -1, -1))
+        return paragraph_starts, "".join(parts)
diff --git a/src/mmore/process/processors/pdf_processor.py b/src/mmore/process/processors/pdf_processor.py
@@ -1,5 +1,6 @@
 import io
 import logging
+import os
 import re
 from dataclasses import dataclass, field
 from multiprocessing import Manager, Process, set_start_method
@@ -41,6 +42,8 @@ def __init__(self, config=None):
 
     @classmethod
     def accepts(cls, file: FileDescriptor) -> bool:
+        if os.environ.get("MMORE_PDF_BACKEND", "").lower() == "mistral":
+            return False
         return file.file_extension.lower() == ".pdf"
 
     @staticmethod
diff --git a/uv.lock b/uv.lock