OCR model evaluation toolkit. VLM-as-judge with per-dataset leaderboards on Hugging Face Hub.
> Historical decisions, smoke tests, and completed phase details are in [ARCHIVE.md](ARCHIVE.md).

> PoC release plan and checklist are in [POC-CHECKLIST.md](POC-CHECKLIST.md).
## What This Project Does
Lets anyone answer: **"Which OCR model works best for MY documents?"**
Rankings change by document type — manuscript cards, printed books, historical texts, tables all produce different winners. This tool creates per-collection leaderboards.
## Current State (2026-03-02)
**233 tests passing**, ruff clean. Full pipeline works with smart defaults:
```
ocr-bench run <input-ds> <output-repo> --max-samples 50
ocr-bench judge <output-repo>
ocr-bench view <output-repo>-results
```
Smart defaults: auto-detects PR + main branch configs, auto-derives results repo name, adaptive stopping on by default. Zero flags needed for the common case.
- `pip install ocr-bench[viewer]` — FastAPI + uvicorn + jinja2 (the web viewer)
## Next Steps
### Immediate — Publication + polish
- [x] Incremental judge mode: `--save-results` to an existing repo loads comparisons, skips judged pairs, judges only new ones, merges, and refits BT-MLE. `--full-rejudge` to override.
- [x] Switch default judge to Qwen3.5-35B-A3B (fastest, zero parse failures, same cluster rankings as 122B and 27B)
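The skip-judged-pairs logic above can be sketched as follows (illustrative only; `pairs_to_judge` and its argument shapes are hypothetical, not the toolkit's actual API):

```python
def pairs_to_judge(all_pairs, existing_comparisons):
    """Return only the matchups not already judged, treating (a, b)
    and (b, a) as the same pair, so a re-run judges only new models."""
    done = {frozenset(p) for p in existing_comparisons}
    return [p for p in all_pairs if frozenset(p) not in done]
```

New comparisons would then be appended to the loaded set before refitting BT-MLE, so prior judge calls are never repeated.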
### Known limitation — row alignment across PRs
`load_config_dataset()` merges configs by positional index — no alignment key. Safe if all model runs use the same `--seed`/`--max-samples` and the source dataset doesn't change. Future: add content hash column for validation.
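A content-hash guard of the kind proposed could look like this (a sketch under assumptions; the function names and column handling are hypothetical, not existing code):

```python
import hashlib

def row_hash(image_bytes: bytes) -> str:
    """Stable fingerprint of a source row, computed from its image
    bytes and stored as an extra column in each model's config."""
    return hashlib.sha256(image_bytes).hexdigest()[:16]

def check_alignment(configs: dict) -> None:
    """`configs` maps config name -> list of row hashes. Raises if any
    config disagrees with the first, i.e. a positional merge would
    silently pair outputs that came from different source rows."""
    (ref_name, ref_hashes), *rest = configs.items()
    for name, hashes in rest:
        if hashes != ref_hashes:
            raise ValueError(f"config {name!r} is misaligned with {ref_name!r}")
```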
### Phase 4: Blog + Visibility
-[ ] "There Is No Best OCR Model" blog post
- [ ] Cross-link repo, viewer, Hub datasets, blog
### Phase 5: Customization
- [ ] Judge prompt presets for GLAM document types
- [ ] Custom prompt and ignore list support
- [ ] Define leaderboard dataset schema
- [x] Adaptive stopping (on by default, `--no-adaptive` to opt out): run batches, compute BT-MLE + CIs, stop when adjacent-rank CIs don't overlap (ranking is statistically resolved). Avoids wasting judge calls when rankings are already clear.
- [x] Judge prompt: hallucination penalized more than commentary. Markdown formatting neutral. Blank page descriptions no longer lose to nonsense output.
- [ ] Judge comparison: run the same dataset through different judges (e.g. Kimi K2.5 vs Qwen3.5-397B), compare BT-MLE ratings + CIs to see where judges agree/disagree. Overlapping CIs = single judge is fine; non-overlapping = jury mode adds value. Test on diverse document types — jury may only matter for ambiguous collections (e.g. index cards where everything ties).
- [ ] `--focus-pairs` for human validation: prioritize showing pairs with overlapping CIs in the vote UI, since those are the only ones where human input changes the ranking.
- [ ] Blank page filtering: skip comparisons where neither model produced meaningful text.
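The adjacent-rank CI check behind adaptive stopping can be sketched as below (a sketch only; the real implementation and its data shapes may differ):

```python
def ranking_resolved(ratings, cis):
    """True when no two adjacently-ranked models have overlapping
    confidence intervals, i.e. the ordering is statistically settled.
    `ratings` is a list of BT-MLE strengths; `cis` the matching
    (lower, upper) interval per model."""
    order = sorted(range(len(ratings)), key=lambda i: ratings[i], reverse=True)
    for a, b in zip(order, order[1:]):
        lo_a, _ = cis[a]      # lower bound of the higher-ranked model
        _, hi_b = cis[b]      # upper bound of the next model down
        if lo_a <= hi_b:      # intervals overlap -> keep judging
            return False
    return True
```

After each batch the loop would refit BT-MLE, recompute CIs, and stop as soon as this returns `True`.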
### If project gets traction
- [ ] Consolidate OCR model scripts into this repo + hub-sync
- [ ] CI/smoke tests
- [x] Britannica E2E (50 samples, 4 models, 2 judges)
- [x] Large-scale run: full BPL card catalog (453K images) with GLM-OCR — 21.9 hrs, $39 on L40S
- [ ] Large-scale runs (full Britannica, NLS index cards)
- [ ] CER/WER metrics alongside VLM judge
- [ ] `bench` command: single `ocr-bench bench <input-dataset>` chains run → judge → view
## Tooling
- **uv** for project management and running scripts
- **ruff** for linting and formatting
- Release process documented in [RELEASING.md](RELEASING.md)
## Development

```bash
uv sync --dev --extra viewer
uv run ruff check src/ tests/
uv run pytest tests/ -x -q
```

Branch protection is on — all changes go through PRs with CI checks.

## Technical Reference

### Judge Models

- **Qwen3.5-35B-A3B (`novita:Qwen/Qwen3.5-35B-A3B`)** — **default**. Zero parse failures, ~21 comps/min via Inference Providers (HF token only). Same cluster rankings as 122B and 27B. Best speed/quality tradeoff.
- **Qwen3.5-122B-A10B (`novita:Qwen/Qwen3.5-122B-A10B`)** — zero parse failures, ~19 comps/min. Slightly more separation between clusters. Good for authoritative runs.
- **Qwen3.5-27B (`novita:Qwen/Qwen3.5-27B`)** — zero parse failures, ~12 comps/min. Dense model, slower than MoE alternatives. Same clusters.
- **Kimi K2.5 (`novita:moonshotai/Kimi-K2.5`)** — best human agreement but 2-5% parse failures from degeneration. No longer default.
- **Qwen3-VL-235B (`novita:Qwen/Qwen3-VL-235B-A22B-Instruct`)** — zero parse failures, ~6 comps/min. Good but slower, and Novita disconnects on long runs.
- **Qwen3-VL-30B-A3B (offline vLLM)** — best offline judge
- **7B/8B** — biased toward verbose output; not recommended as primary
### Core Benchmark Models
| Model | Size | Best on |
|-------|------|---------|
| DeepSeek-OCR | 4B | Most consistent across datasets |
| FireRed-OCR | 2.1B | Mid-pack on Britannica (#3), good on clean printed text, loses on degraded/stamps |
| dots.ocr | 1.7B | Worst on Britannica (1-2% win rate) |
## Key design decisions

- **Row alignment across configs is positional only** — `load_config_dataset()` merges by index. Safe if all model runs use the same `--seed`/`--max-samples` and the source dataset doesn't change. Future: add content hash column.
- **Blank page filtering** not yet implemented — wastes judge calls when neither model produced meaningful text.

### Key Findings

1. **No single best OCR model** — rankings shuffle by document type
2. **DeepSeek-OCR most consistent** — #1 on UFO (diverse docs), but #3 on BPL and Britannica
3. **LightOnOCR-2 best on BPL** — #1 (Kimi) or #2 (Qwen3-VL) on card catalogs; #1 on Britannica (35B judge), #2 (235B judge)
4. **GLM-OCR best on Britannica (235B)** — #1 (1779 ELO), but #2 with 35B judge. Strong on 18th-century printed text.
5. **Document type > model size** — 0.9B beats 4B on some collections
6. **Judge model size matters** — 170B closest to human rankings
7. **Judges agree on clusters, swap within** — Kimi K2.5 and Qwen3-VL-235B produce the same top-2/bottom-2 groupings on BPL but swap adjacent models. CIs overlap between judges, confirming fine ordering is noise. Same pattern on Britannica: 235B and 35B agree on clusters but swap #1/#2.
9. **dots.ocr struggles on historical printed text** — 1-2% win rate on Britannica vs 55% on BPL. Model-dataset fit matters enormously.
10. **FireRed-OCR mid-pack on Britannica** — #3 (1551 ELO, 35B judge). Loses to GLM/LightOn on degraded text (garbling) and gets penalised for aggressive markdown formatting (`# headings`, `**bold**` on everything). Beats DeepSeek-OCR (52% vs 39% win rate).
11. **Qwen3.5-35B-A3B is the best default judge** — assessed 35B-A3B, 27B, and 122B-A10B on an identical 96-comparison Britannica benchmark. All three: zero parse failures, same cluster rankings (top 3: FireRed/LightOn/GLM, then DeepSeek, then dots). 35B-A3B fastest (4:35 vs 7:55 vs 4:57). Now the default.
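The BT-MLE ratings cited throughout can be fit with the standard minorization-maximization updates for the Bradley-Terry model; a minimal sketch (not the toolkit's code — the toolkit also reports CIs, which this omits):

```python
def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths p_i from a win matrix, where
    wins[i][j] is how often model i beat model j. Uses the classic
    MM update: p_i <- W_i / sum_j (n_ij / (p_i + p_j))."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for model i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new.append(w_i / denom if denom else p[i])
        total = sum(new)
        p = [x * n / total for x in new]  # normalize to mean 1
    return p
```

Larger `p_i` means a stronger model; strengths are only identified up to scale, hence the normalization step.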
## Connections

- **uv-scripts/ocr on Hub**: OCR model scripts stay there for now
- **FineCorpus**: OCR quality = training data quality
- **NLS**: Index cards as flagship benchmark dataset

## Roadmap

- Blog post: "There Is No Best OCR Model"
- Judge prompt presets for GLAM document types
- Custom prompt and ignore list support
- Judge comparison across different judge models
- `--focus-pairs`: prioritize overlapping-CI pairs in validation UI
- CER/WER metrics alongside VLM judge
- `bench` command: single `ocr-bench bench <input-dataset>` chains run → judge → view