Bibby AI detects 91.4% of LaTeX errors before they silently break your paper.
13 points ahead of OpenAI Prism. 30 points ahead of Overleaf.
Bibby AI is an AI-native LaTeX editor with real-time error detection, smart citation search, and live PDF preview. No plugins. No copy-paste. Everything in one place.
LaTeXBench-500 is the first standardised benchmark for LaTeX compilation error detection and one-click repair, introduced in our arXiv paper.
| Tool | Detection Accuracy (DA%) | Fix Accuracy (FA%) | Pre-compilation? |
|---|---|---|---|
| 🥇 Bibby AI | 91.4% | 83.7% | ✅ Yes |
| 🥈 OpenAI Prism | 78.3% | 64.1% | Partial |
| 🥉 Overleaf (native) | 61.2% | N/A (no auto-fix) | ❌ No |
| Error Category | Count | Bibby DA% | Prism DA% | Overleaf DA% |
|---|---|---|---|---|
| Undefined control sequences | 112 | 94.6% | 81.2% | 68.3% |
| Math mode errors | 98 | 92.8% | 79.4% | 63.1% |
| Table & figure errors | 86 | 90.1% | 77.9% | 59.4% |
| Reference errors | 79 | 91.2% | 78.8% | 61.7% |
| Package conflicts | 74 | 88.4% | 74.6% | 54.2% |
| Encoding & font errors | 51 | 87.3% | 72.1% | 52.8% |
| Total / Average | 500 | 91.4% | 78.3% | 61.2% |
DA% = Detection Accuracy: correct identification of both the error type and its location
FA% = Fix Accuracy: the suggested fix produces a clean, semantically correct compilation
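To make the two definitions concrete, here is a minimal sketch of how they could be computed from per-item results. It is not the repository's `evaluation/metrics.py`: the `Outcome` record, its field names, and the choice to measure FA% over all items (rather than only detected ones) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Outcome:
    """One benchmark item scored for one tool (hypothetical schema)."""
    true_category: str
    true_file: str
    true_line: int
    pred_category: Optional[str]  # None if the tool reported nothing
    pred_file: Optional[str]
    pred_line: Optional[int]
    fix_ok: bool  # fix compiled cleanly and was judged semantically correct


def is_detected(o: Outcome) -> bool:
    """An item counts for DA% only if category AND location both match."""
    return (
        o.pred_category == o.true_category
        and o.pred_file == o.true_file
        and o.pred_line == o.true_line
    )


def detection_accuracy(outcomes: List[Outcome]) -> float:
    """Percentage of items with the correct error type and location."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(is_detected(o) for o in outcomes) / len(outcomes)


def fix_accuracy(outcomes: List[Outcome]) -> float:
    """Percentage of items with an accepted fix (assumed: over all items)."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(o.fix_ok for o in outcomes) / len(outcomes)
```

Under this sketch, a prediction with the right category but the wrong line number does not count towards DA%, which matches the "type AND location" wording above.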
500 authentic LaTeX compilation errors drawn from real-world arXiv preprints, across 6 error categories, each with:
- Ground-truth error location (file + line number)
- Error category label
- Verified correct fix
- Compilation validation (before and after fix)
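As a rough illustration of what one annotated entry could look like in code, the sketch below parses a JSON record with the four pieces of information listed above. The field names, the JSON layout, and the 1-indexed line convention are assumptions, not the repository's actual `ground_truth/` format; only the six category names come from the results table.

```python
import json
from dataclasses import dataclass

# The six categories listed in the per-category results table.
CATEGORIES = {
    "Undefined control sequences",
    "Math mode errors",
    "Table & figure errors",
    "Reference errors",
    "Package conflicts",
    "Encoding & font errors",
}


@dataclass(frozen=True)
class GroundTruthEntry:
    """Hypothetical shape of one annotated benchmark item."""
    file: str            # ground-truth error location: file
    line: int            # ground-truth error location: line (assumed 1-indexed)
    category: str        # error category label
    verified_fix: str    # the verified correct fix, as replacement text
    log_before_fix: str  # path to the compilation log before the fix
    log_after_fix: str   # path to the compilation log after the fix


def parse_entry(raw: str) -> GroundTruthEntry:
    """Parse and sanity-check one JSON-encoded entry."""
    entry = GroundTruthEntry(**json.loads(raw))
    if entry.category not in CATEGORIES:
        raise ValueError(f"unknown category: {entry.category!r}")
    if entry.line < 1:
        raise ValueError("line numbers are assumed to be 1-indexed")
    return entry
```

Rejecting unknown categories at load time keeps per-category totals consistent with the taxonomy.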
All errors were silently failing, i.e., the document compiled without crashing but produced incorrect output. This is the hardest and most practically relevant class of LaTeX errors.
bibby-latex-benchmark/
├── assets/
│   ├── bibby-mascot.png              ← Bibby mascot
│   └── bibby-editor-screenshot.png   ← Editor UI
├── benchmark/
│   ├── corpus/                       ← 500 LaTeX documents
│   ├── ground_truth/                 ← Annotated error locations
│   └── error_categories.md           ← Full taxonomy
├── evaluation/
│   ├── metrics.py                    ← DA% and FA% calculation
│   ├── run_benchmark.py              ← Main runner
│   └── results/                      ← Raw results per tool
├── analysis/
│   ├── figures/                      ← All paper figures (reproducible)
│   └── notebooks/                    ← Jupyter analysis notebooks
├── BENCHMARK.md                      ← How to run on a new tool
├── CONTRIBUTING.md
└── README.md
pip install -r requirements.txt
# Requires: Python 3.10+, latexmk, biber

python evaluation/run_benchmark.py \
  --tool bibby \
  --corpus benchmark/corpus/ \
  --output evaluation/results/my_run/
# --tool options: bibby | prism | overleaf | custom

python evaluation/metrics.py \
  --results evaluation/results/my_run/ \
  --ground-truth benchmark/ground_truth/

jupyter notebook analysis/notebooks/paper_figures.ipynb

Three architectural reasons Bibby AI's error detection is fundamentally different:
1. AST-grounded localisation
Bibby maintains a live Abstract Syntax Tree of your document. When compiler logs point to line 847, Bibby traces back through the AST to find the actual source, which is often 20 lines earlier. Other tools trust the log line number blindly.
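The idea of tracing back from the reported line can be illustrated with a deliberately small toy. The sketch below is not Bibby's implementation and builds no real AST: it only tracks `\begin`/`\end` pairs on a stack and, when it meets a mismatched `\end`, returns the line of the innermost environment that is still open. The function name `locate_origin` is hypothetical.

```python
import re

# Matches \begin{name} and \end{name}; captures the kind and the name.
ENV_TOKEN = re.compile(r"\\(begin|end)\{([^}]*)\}")


def locate_origin(source: str, reported_line: int) -> int:
    """Return the line where the problem most likely started.

    Scans environments up to the line reported by the compiler log. On a
    mismatched end-of-environment, returns the line of the innermost
    environment that was opened but never closed. Falls back to the
    reported line when nothing suspicious is found.
    """
    stack = []  # entries are (environment_name, line_number)
    for lineno, text in enumerate(source.splitlines(), start=1):
        if lineno > reported_line:
            break
        text = text.split("%", 1)[0]  # naive comment strip; ignores escaped %
        for kind, name in ENV_TOKEN.findall(text):
            if kind == "begin":
                stack.append((name, lineno))
            elif stack and stack[-1][0] == name:
                stack.pop()
            elif stack:
                # The closing tag does not match the innermost open
                # environment, so that earlier opening is the likely culprit.
                return stack[-1][1]
    return reported_line
```

For a document that opens `align` on line 2 and never closes it, a log error reported at `\end{document}` on line 5 is traced back to line 2. A real implementation would need a full parser to handle verbatim blocks, macros that open environments, and multi-file projects.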
2. Package-aware reasoning
Bibby's error model is conditioned on curated documentation for 2,000+ LaTeX packages. When \pgfplotsset fails, Bibby knows whether you're missing a \usetikzlibrary call vs. using a deprecated option, not just that something broke.
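A heavily simplified sketch of package-aware checking is a lookup from commands to the package that provides them. The three-entry table below is illustrative only and stands in for the curated documentation described above; `missing_packages` is a hypothetical helper, not part of Bibby or of this repository.

```python
import re
from typing import Dict

# Tiny illustrative table: command name -> package that provides it.
REQUIRED_PACKAGE = {
    "includegraphics": "graphicx",
    "toprule": "booktabs",
    "pgfplotsset": "pgfplots",
}

# \usepackage[options]{a,b,c}; the options part is optional.
USEPACKAGE = re.compile(r"\\usepackage(?:\[[^\]]*\])?\{([^}]*)\}")
COMMAND = re.compile(r"\\([A-Za-z]+)")


def missing_packages(source: str) -> Dict[str, str]:
    """Map each used command to its package if that package is not loaded."""
    loaded = set()
    for group in USEPACKAGE.findall(source):
        loaded.update(name.strip() for name in group.split(","))

    missing = {}
    for cmd in COMMAND.findall(source):
        pkg = REQUIRED_PACKAGE.get(cmd)
        if pkg is not None and pkg not in loaded:
            missing[cmd] = pkg
    return missing
```

A document that loads `booktabs` but calls `\pgfplotsset` would yield `{"pgfplotsset": "pgfplots"}`. The sketch ignores packages loaded indirectly by a class or by another package, which is exactly the kind of knowledge a curated model would need to add.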
3. Validated fix generation
Every suggested fix is compiled and validated before being shown to you. Bibby never surfaces a fix that doesn't actually work.
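The compile-before-showing step can be sketched as a filter over candidate fixes. This is an assumption about the general pattern, not Bibby's code: `first_validated_fix` and `compiles_with_latexmk` are hypothetical names, and a zero exit status from `latexmk` only shows that the document compiled, not that the output is semantically correct.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Optional


def compiles_with_latexmk(source: str) -> bool:
    """Compile the source in a scratch directory; True if latexmk exits with 0.

    Requires latexmk on PATH; raises FileNotFoundError otherwise.
    """
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "main.tex").write_text(source, encoding="utf-8")
        result = subprocess.run(
            ["latexmk", "-pdf", "-interaction=nonstopmode",
             "-halt-on-error", "main.tex"],
            cwd=tmp,
            capture_output=True,
            timeout=120,
        )
    return result.returncode == 0


def first_validated_fix(
    candidates: Iterable[str],
    compiles: Callable[[str], bool] = compiles_with_latexmk,
) -> Optional[str]:
    """Return the first candidate document that compiles, or None."""
    for candidate in candidates:
        if compiles(candidate):
            return candidate
    return None
```

Passing the compile check in as a parameter keeps the filter testable without a TeX installation, for example `first_validated_fix(candidates, compiles=my_fake_check)`.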
Bibby AI is used by researchers at:
| Institution | Use Case |
|---|---|
| Simons Foundation | Mathematical research papers |
| Allen Institute | Neuroscience & biology publications |
| Yale University | Academic dissertation writing |
Try Bibby AI free at trybibby.com
No credit card. No installation. Open in your browser and start writing.
If you use LaTeXBench-500 in your research, please cite:
@misc{jain2026bibby,
  title         = {Bibby AI --- AI LaTeX Editor writing assistant for researchers
                   vs Overleaf Alternative vs OpenAI Prism},
  author        = {Jain, Nilesh and others},
  year          = {2026},
  eprint        = {2602.16432},
  archivePrefix = {arXiv},
  primaryClass  = {cs.DL},
  url           = {https://arxiv.org/abs/2602.16432}
}

We welcome:
- New tool evaluations: run the benchmark on any tool and submit results via PR
- Additional error categories: open an issue to propose new LaTeX error types
- Corpus extensions: more arXiv-derived documents with ground-truth annotations
See CONTRIBUTING.md for guidelines.
Benchmark code & evaluation scripts: MIT License
Corpus documents: Derived from arXiv papers under their respective CC licenses
Paper: CC BY 4.0