There is currently no single best OCR model. Rankings change depending on your documents. Manuscript cards, printed books, historical texts all produce different winners.
ocr-bench allows you to create per-collection leaderboards using a VLM-as-judge approach, so you can find what works best for your documents rather than relying on generic benchmarks. You can validate the VLM's judgement with human votes, and share results via the Hugging Face Hub.
The underlying OCR inference scripts (written as uv scripts) are available at uv-scripts/ocr. Most of them use vLLM for efficient GPU inference and are designed to run on a single consumer GPU (e.g. a 24GB 3090/4090). The ocr-bench package orchestrates running these models at scale on the Hub and judging their outputs with a VLM. If you just want to run some OCR models on your data without the judging and leaderboard steps, you can run the scripts directly.
Generic OCR benchmarks tell you which model wins on average. But if you're digitising 18th-century encyclopaedias, that average doesn't help — the best model for your documents might be the worst on someone else's. Inspired by Datalab's Benchmarks + Evals approach — pairwise VLM-as-judge with Bradley-Terry scoring on your own documents — ocr-bench brings this idea to the Hugging Face Hub as an open-source, self-serve tool.
ocr-bench lets you run the same set of OCR models on a sample of your collection, then uses a vision-language model to judge which produces the best transcription for each document. The result is a leaderboard specific to your data.
| Model | BPL card catalog | Britannica 1771 |
|---|---|---|
| GLM-OCR (0.9B) | #2 (1535) | #1 (1787) |
| LightOnOCR-2 (1B) | #1 (1559) | #2 (1780) |
| FireRed-OCR (2.1B) | — | #3 (1551) |
| DeepSeek-OCR (4B) | #4 (1452) | #4 (1437) |
| dots.ocr (1.7B) | #3 (1453) | #5 (945) |
Rankings shift between collections: the top two models swap places, and dots.ocr falls from #3 on the BPL cards to last place on Britannica.
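To make the comparison concrete, here is a small sketch that re-ranks only the models evaluated on both collections, using the ratings copied from the table above. The dictionary layout and the helper name `ranks_among` are illustrative assumptions, not part of ocr-bench.

```python
# Ratings copied from the table above. FireRed-OCR has no BPL result,
# so it is left out of the cross-collection comparison.
ratings = {
    "GLM-OCR": {"bpl": 1535, "britannica": 1787},
    "LightOnOCR-2": {"bpl": 1559, "britannica": 1780},
    "FireRed-OCR": {"britannica": 1551},
    "DeepSeek-OCR": {"bpl": 1452, "britannica": 1437},
    "dots.ocr": {"bpl": 1453, "britannica": 945},
}


def ranks_among(models, collection):
    """Rank the given models (1 = best) by their rating on one collection."""
    ordered = sorted(models, key=lambda m: ratings[m][collection], reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


# Only models with a rating on both collections can be compared fairly.
shared = [m for m, r in ratings.items() if "bpl" in r and "britannica" in r]
bpl_ranks = ranks_among(shared, "bpl")
britannica_ranks = ranks_among(shared, "britannica")

for model in shared:
    print(f"{model:13s} BPL #{bpl_ranks[model]} -> Britannica #{britannica_ranks[model]}")
```

Restricted to the four shared models, no model keeps the same position across the two collections.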
Try the live viewer — browse the Britannica 1771 leaderboard, compare OCR outputs side-by-side, and vote on quality yourself.
The entire evaluation loop lives on the Hugging Face Hub:
- Your dataset on the Hub (images + optional ground truth)
- OCR models run via HF Jobs → outputs written as PRs on a Hub dataset
- VLM judge via HF Inference Providers — only needs an HF token
- Results published to a Hub dataset (leaderboard + pairwise comparisons)
- Viewer as a HF Space for browsing and human validation
No local GPU required. Everything is shareable via Hub URLs.
uv pip install ocr-bench[viewer]
# 1. Run OCR models on your dataset
ocr-bench run <input-dataset> <output-repo> --max-samples 50
# 2. Judge outputs pairwise with a VLM
ocr-bench judge <output-repo>
# 3. Browse results + validate
ocr-bench view <output-repo>-results

ocr-bench run launches OCR models on your dataset via HF Jobs. Each model writes its output as a PR on the same Hub dataset, keeping everything together without merge conflicts.
ocr-bench judge runs pairwise comparisons using a VLM judge (default: Qwen3.5-35B-A3B via HF Inference Providers). For each document, the judge sees the original image and two OCR outputs (anonymised as A/B) and picks the better transcription. The results are fit to a Bradley-Terry model and reported as Elo-style ratings with bootstrap 95% confidence intervals. Adaptive stopping halts early when rankings are statistically resolved.
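The scoring step can be illustrated with a minimal, dependency-free sketch. It is not the ocr-bench implementation: the function names, the MM (minorisation-maximisation) fitting loop, the small `prior` pseudo-count, and the 1500/400 Elo scale are all assumptions chosen for illustration, and the comparison data is a toy example.

```python
import math
import random
from collections import defaultdict


def fit_bradley_terry(comparisons, models, iters=100, prior=0.1):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via MM updates.

    `prior` adds a small pseudo-count of wins in both directions for every
    pair, so a model with zero wins in a resample keeps a finite strength.
    """
    wins = defaultdict(float)
    for a in models:
        for b in models:
            if a != b:
                wins[(a, b)] += prior
    for winner, loser in comparisons:
        wins[(winner, loser)] += 1.0

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(wins[(i, j)] for j in models if j != i)
            denom = sum(
                (wins[(i, j)] + wins[(j, i)]) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = total_wins / denom
        # Normalise so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(models)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}
    return strength


def to_elo(strength, base=1500.0, scale=400.0):
    """Map strengths to an Elo-style scale (base and scale are assumptions)."""
    return {m: base + scale * math.log10(s) for m, s in strength.items()}


def bootstrap_ci(comparisons, models, n_boot=200, seed=0):
    """Percentile 95% intervals from refitting on resampled comparisons."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n_boot):
        resample = rng.choices(comparisons, k=len(comparisons))
        for m, rating in to_elo(fit_bradley_terry(resample, models)).items():
            samples[m].append(rating)
    intervals = {}
    for m, values in samples.items():
        values.sort()
        low = values[int(0.025 * (n_boot - 1))]
        high = values[int(0.975 * (n_boot - 1))]
        intervals[m] = (low, high)
    return intervals


# Toy data: each tuple is (winner, loser) for one judged document pair.
models = ["A", "B", "C"]
comparisons = (
    [("A", "B")] * 8 + [("B", "A")] * 2
    + [("B", "C")] * 7 + [("C", "B")] * 3
    + [("A", "C")] * 9 + [("C", "A")] * 1
)
elo = to_elo(fit_bradley_terry(comparisons, models))
intervals = bootstrap_ci(comparisons, models)
for m in sorted(models, key=elo.get, reverse=True):
    low, high = intervals[m]
    print(f"{m}: {elo[m]:.0f} (95% CI {low:.0f} to {high:.0f})")
```

Because the strengths are normalised to a geometric mean of 1, the ratings average to the base value. Overlapping intervals between two models suggest that their order is not yet resolved, which is the kind of signal an adaptive stopping rule could use.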
ocr-bench view serves a local web viewer with a leaderboard, comparison browser, and human validation. Vote on comparisons to cross-check the automated judge with human judgement.
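One way to cross-check the judge against human votes is to compute raw agreement and Cohen's kappa over the comparisons that have both verdicts. The sketch below assumes each verdict is a simple label such as "A" or "B". The function name and the sample votes are hypothetical and do not reflect how ocr-bench stores or reports votes.

```python
from collections import Counter


def agreement(judge, human):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq, human_freq = Counter(judge), Counter(human)
    # Agreement expected by chance, given each rater's label frequencies.
    expected = sum(
        (judge_freq[label] / n) * (human_freq[label] / n)
        for label in set(judge) | set(human)
    )
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical verdicts for eight comparisons (which output was preferred).
judge_votes = ["A", "A", "B", "B", "A", "B", "A", "B"]
human_votes = ["A", "A", "B", "A", "A", "B", "B", "B"]

observed, kappa = agreement(judge_votes, human_votes)
print(f"raw agreement: {observed:.2f}, Cohen's kappa: {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is a more conservative summary when one label dominates.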
ocr-bench ships with 5 OCR models ready to run:
| Model | Size | Best for | Notes |
|---|---|---|---|
| glm-ocr | 0.9B | Historical printed text | Top performer on Britannica |
| lighton-ocr-2 | 1B | Card catalogs, manuscripts | Top performer on BPL |
| firered-ocr | 2.1B | Clean printed text | Mid-pack on degraded docs |
| deepseek-ocr | 4B | Diverse documents | Most consistent across types |
| dots-ocr | 1.7B | General | Struggles on historical text |
All model scripts are available at uv-scripts/ocr on the Hub.
By default all 5 run. To pick specific models:
ocr-bench run <dataset> <output> --models glm-ocr lighton-ocr-2

Browse these on the Hub:
- davanstrien/ocr-bench-britannica-results-qwen35 — Encyclopaedia Britannica 1771, 5 models, 50 samples
- davanstrien/bpl-ocr-bench-results — Boston Public Library card catalog, 4 models, 150 samples
- Live viewer: Britannica leaderboard with Elo chart and comparison browser
uv pip install ocr-bench # Core (run + judge)
uv pip install ocr-bench[viewer] # With web UI

Requires Python >= 3.11 and an HF token.
Working proof of concept. The core pipeline (run → judge → view) is functional. Not polished production software — expect rough edges. This is an early-stage project to explore the idea of VLM-judged OCR leaderboards, and gather feedback on the concept and implementation!

