
Commit a6281dc

davanstrien and claude authored
Cite Datalab as inspiration, prominent viewer link (#12)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 77ae196 commit a6281dc

2 files changed: +5 -1 lines changed

CLAUDE.md

Lines changed: 2 additions & 0 deletions
@@ -4,6 +4,8 @@ OCR model evaluation toolkit. Answers: **"Which OCR model works best for MY docu
 
 Rankings change by document type — the best model for manuscript cards is different from the best for printed books or historical texts. This tool creates per-collection leaderboards using pairwise VLM-as-judge comparisons, so users can find what works for their specific documents.
 
+Inspired by [Datalab's Benchmarks + Evals](https://www.datalab.to/blog/datalab-benchmarks-evals) — pairwise VLM-as-judge with Bradley-Terry scoring per document class — but as an open-source, Hub-native tool anyone can run on their own collections.
+
 **Pipeline**: `run` (launch OCR models via HF Jobs) → `judge` (pairwise VLM comparison → Bradley-Terry ELO) → `view` (leaderboard + human validation). Everything lives on the Hugging Face Hub — no local GPU needed.
 
 ## Architecture
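The Bradley-Terry scoring the pipeline line refers to aggregates pairwise judge verdicts into per-model strengths. As a rough illustration only (a minimal sketch of the standard Bradley-Terry MM fit, not ocr-bench's actual code; both function names are hypothetical), strengths can be fit from a win-count matrix and then mapped onto an ELO-like scale:

```python
import math

def bradley_terry(wins, n_models, iters=200):
    """Fit Bradley-Terry strengths with the classic MM update.

    wins[i][j] = number of pairwise comparisons model i won against model j.
    """
    p = [1.0] * n_models
    for _ in range(iters):
        new_p = []
        for i in range(n_models):
            # total wins of i over everyone
            num = sum(wins[i][j] for j in range(n_models) if j != i)
            # games against each opponent, weighted by current strengths
            den = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                      for j in range(n_models) if j != i)
            new_p.append(num / den if den else p[i])
        # strengths are only identified up to scale: normalize geometric mean to 1
        g = math.exp(sum(math.log(x) for x in new_p) / n_models)
        p = [x / g for x in new_p]
    return p

def to_elo(p, base=1500, scale=400):
    # map multiplicative Bradley-Terry strengths onto an additive ELO-like scale
    return [base + scale * math.log10(x) for x in p]
```

Under this model, the probability that model i beats model j is `p[i] / (p[i] + p[j])`, so with a 7-3 head-to-head record the fitted strengths put model i's win probability at 0.7.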

README.md

Lines changed: 3 additions & 1 deletion
@@ -8,7 +8,7 @@ The underlying OCR model inference uv scripts are available at [uv-scripts/ocr](
 
 ## Why?
 
-Generic OCR benchmarks tell you which model wins _on average_. But if you're digitising 18th-century encyclopaedias, that average doesn't help — the best model for your documents might be the worst on someone else's.
+Generic OCR benchmarks tell you which model wins _on average_. But if you're digitising 18th-century encyclopaedias, that average doesn't help — the best model for your documents might be the worst on someone else's. Inspired by [Datalab's Benchmarks + Evals](https://www.datalab.to/blog/datalab-benchmarks-evals) approach — pairwise VLM-as-judge with Bradley-Terry scoring on your own documents — ocr-bench brings this idea to the Hugging Face Hub as an open-source, self-serve tool.
 
 ocr-bench lets you run the same set of OCR models on a sample of _your_ collection, then uses a vision-language model to judge which produces the best transcription for each document. The result is a leaderboard specific to your data.
 

@@ -24,6 +24,8 @@ Rankings can flip completely between collections.
 
 ![ELO vs Parameter Count — smaller models can win on the right documents](assets/elo-scatter.png)
 
+**[Try the live viewer](https://huggingface.co/spaces/davanstrien/ocr-bench-britannica-results-qwen35-viewer)** — browse the Britannica 1771 leaderboard, compare OCR outputs side-by-side, and vote on quality yourself.
+
 ## Hub-native by design
 
 The entire evaluation loop lives on the Hugging Face Hub:
