Call for parser submissions: DocFailBench v0.1 Combined Public RC #1

@Travor278

DocFailBench v0.1 Combined Public RC is now open for external parser submissions.

DocFailBench is a failure-oriented benchmark for PDF-to-Markdown, OCR, and VLM document parsers. Instead of reporting only page-level similarity, it checks small executable facts: table cells, formulas, reading order, captions, page furniture, and optional bbox grounding.

Target release

Please use the frozen combined public RC unless you have a specific reason to target a smaller subset:

Current baseline snapshot

Parser          Passed  Failed  Score
Marker             621     256  0.7081
PyMuPDF bbox       612     265  0.6978
Docling            599     278  0.6830
PyMuPDF plain      589     288  0.6716
Qwen-VL API        559     318  0.6374
MinerU             496     381  0.5656
PaddleOCR          334     543  0.3808
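
For readers comparing their own numbers against the snapshot: every row above sums to 877 checks, and each Score matches passed / (passed + failed) rounded to four decimals. That formula is inferred from the table, not taken from the DocFailBench source, and the function name `pass_rate` is hypothetical. A minimal sketch under that assumption:

```python
def pass_rate(passed: int, failed: int) -> float:
    """Fraction of checks passed, rounded to four decimals.

    Assumption: the Score column is passed / (passed + failed).
    This is inferred from the published table, not from DocFailBench code.
    """
    total = passed + failed
    if total == 0:
        raise ValueError("no checks were evaluated")
    return round(passed / total, 4)


# Values copied from the baseline snapshot table.
marker_score = pass_rate(621, 256)
paddleocr_score = pass_rate(334, 543)
print(marker_score, paddleocr_score)
```

If a result JSON reports a score that disagrees with this ratio, the scoring is probably weighted or filtered in a way this sketch does not capture.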

What to submit

A useful submission should include:

  • parser name and version,
  • installation notes or environment file,
  • exact command used to generate predictions,
  • prediction JSON,
  • result JSON from docfailbench.cli evaluate,
  • hardware/OS/runtime metadata,
  • model/API name and run date for hosted or moving-target parsers,
  • optional raw Markdown outputs for failed cases.
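
The OS and runtime part of the metadata can be gathered with the Python standard library. The sketch below is one possible way to do it; the function name `collect_run_metadata` and the JSON key names are hypothetical, since this issue does not prescribe a metadata schema. GPU and other hardware details are not covered and would need to be added by hand.

```python
import json
import platform
from datetime import date


def collect_run_metadata(parser_name: str, parser_version: str) -> dict:
    """Collect basic OS/runtime facts to attach to a submission.

    The key names are illustrative only; DocFailBench does not
    define a required metadata format in this issue.
    """
    return {
        "parser": parser_name,
        "parser_version": parser_version,
        "os": platform.system(),            # e.g. "Linux", "Windows", "Darwin"
        "os_release": platform.release(),
        "machine": platform.machine(),      # CPU architecture string
        "python_version": platform.python_version(),
        "run_date": date.today().isoformat(),
    }


# Placeholder parser name and version; replace with your own.
metadata = collect_run_metadata("your_parser", "0.0.0")
print(json.dumps(metadata, indent=2))
```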

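Recommended evaluation command (the trailing backticks are PowerShell line continuations; in bash or zsh, use a backslash instead):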

python -m docfailbench.cli evaluate `
  --cases data/releases/docfailbench_v0_1_combined_public_rc_cases.json `
  --predictions path/to/your_predictions.json `
  --out runs/submissions/YOUR_PARSER/combined_public_rc_results.json
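
To sidestep shell-specific line continuations, the same invocation can be assembled as an argument list in Python. This is a sketch, not part of DocFailBench: the helper `build_evaluate_command` is hypothetical, and it uses only the `--cases`, `--predictions`, and `--out` flags shown in the command above.

```python
import sys
from pathlib import Path


def build_evaluate_command(cases: Path, predictions: Path, out: Path) -> list[str]:
    """Build the argv list for the evaluate command shown in this issue.

    Only flags that appear in the issue text are used. sys.executable
    points at the current interpreter, so the module is looked up in
    the active environment.
    """
    return [
        sys.executable, "-m", "docfailbench.cli", "evaluate",
        "--cases", str(cases),
        "--predictions", str(predictions),
        "--out", str(out),
    ]


command = build_evaluate_command(
    Path("data/releases/docfailbench_v0_1_combined_public_rc_cases.json"),
    Path("path/to/your_predictions.json"),
    Path("runs/submissions/YOUR_PARSER/combined_public_rc_results.json"),
)
print(" ".join(command))
# To actually run it: import subprocess; subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting entirely, which helps when paths contain spaces.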

If you add an adapter, start from examples/parser_manifest.json and run:

python -m docfailbench.cli baseline `
  --manifest examples/parser_manifest.json `
  --parser your_parser `
  --cases data/releases/docfailbench_v0_1_combined_public_rc_cases.json `
  --out runs/submissions/your_parser/predictions.json `
  --results runs/submissions/your_parser/results.json `
  --html runs/submissions/your_parser/report.html

Full guide: https://github.com/Travor278/DocFailBench/blob/main/docs/submitting-parser-results.md

Review policy

Results can be listed in the README when:

  • they target a frozen case file,
  • predictions cover all target cases,
  • parser version and run command are clear,
  • no private PDFs, API keys, or proprietary raw outputs are included,
  • hosted API results include endpoint family, requested model, and run date.
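
The coverage requirement can be checked before submitting by comparing case identifiers. The sketch below works on plain collections of IDs, because the layout of the case and prediction JSON files is not described in this issue; how you extract the IDs from each file is left open, and the function names and example IDs are hypothetical.

```python
from typing import Iterable


def missing_case_ids(case_ids: Iterable[str], predicted_ids: Iterable[str]) -> list[str]:
    """Return target case IDs that have no prediction, sorted for stable output."""
    return sorted(set(case_ids) - set(predicted_ids))


def unexpected_case_ids(case_ids: Iterable[str], predicted_ids: Iterable[str]) -> list[str]:
    """Return predicted IDs that are not in the target case file."""
    return sorted(set(predicted_ids) - set(case_ids))


# Made-up IDs for illustration; real IDs come from the frozen case file.
target = ["case-001", "case-002", "case-003"]
predicted = ["case-001", "case-003", "case-999"]

missing = missing_case_ids(target, predicted)
extra = unexpected_case_ids(target, predicted)
print("missing:", missing)
print("unexpected:", extra)
```

An empty `missing` list is what the "predictions cover all target cases" condition asks for; a non-empty `extra` list usually means the predictions were generated against a different case file.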

Maintainers may mark entries as unverified until reproduced locally.

If you maintain a PDF parser, table extractor, OCR system, or VLM document parser, please try the benchmark and post your results here or open a PR. The failures are the point: they tell us exactly which facts broke.

Labels: benchmark, community, submission