DocFailBench now has a frozen combined public release candidate:
DocFailBench-v0.1-combined-public-rc. This file tracks source policy and
future acquisition work, especially more non-government public PDFs for the next
community release.
- Prefer sources with stable URLs, clear redistribution terms, and established community use.
- Store source URL, license, page number, checksum, and acquisition date for every imported PDF.
- Keep public pages small and auditable; one PDF can contribute several page-level cases, but every assertion must point to visible page evidence.
- Do not publish private contracts, invoices, student work, or company reports.
- If license terms are uncertain, keep only metadata and download instructions.
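The provenance fields in the policy above (source URL, license, page number, checksum, acquisition date) could be captured at import time roughly as follows. This is a minimal sketch, not the repository's actual tooling: `SourceRecord`, `sha256_of_file`, and `make_record` are hypothetical names, and the field layout is an assumption.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass(frozen=True)
class SourceRecord:
    """Provenance for one imported page (hypothetical field names)."""
    source_url: str
    license: str
    page: int
    sha256: str
    acquired_on: str  # ISO 8601 date string


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_record(pdf_path: Path, source_url: str, license_name: str,
                page: int, acquired_on: date) -> SourceRecord:
    """Build a provenance record for one page of a downloaded PDF."""
    if page < 1:
        # Assumption of this sketch: page numbers are 1-based.
        raise ValueError("page numbers are 1-based in this sketch")
    return SourceRecord(
        source_url=source_url,
        license=license_name,
        page=page,
        sha256=sha256_of_file(pdf_path),
        acquired_on=acquired_on.isoformat(),
    )
```

Calling `dataclasses.asdict` on the result gives a plain dict that can be written to a manifest. When license terms are uncertain, the same record can be stored without the PDF itself, which matches the metadata-only rule above.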
| Source | Why It Is Useful | License / Risk | Best Assertion Types | Repo Policy |
|---|---|---|---|---|
| NIST Technical Series / CSRC reports | Stable government technical reports with numbered sections, tables, appendices, and references. | Low risk for many federal publications; still check report-specific notices and cite source. | reading_order, tables, footers, numbered sections. | High-priority direct inclusion candidate. |
| BLS reports and news-release PDFs | Real economic reports with dense statistical tables, footnotes, and repeated headers. | Low risk for many BLS publications; cite source and check document notices. | Dense tables, table footnotes, header/footer pollution. | High-priority direct inclusion candidate. |
| IRS forms and instructions | Real forms with boxes, field labels, tables, checkboxes, and instruction pages. | U.S. federal government works are generally public domain, but verify form-specific notices. | Forms, checkboxes, table cells, field labels. | High-priority direct inclusion candidate. |
| GovInfo / GPO official PDFs | Official long documents with multi-column legal text, numbered sections, and page furniture. | Usually low risk for U.S. government publications; exclude third-party attachments if any. | reading_order, layout, table checks, header/footer pollution. | Good direct inclusion source for selected pages. |
| DocLayNet | Community-known layout benchmark with 80,863 human-annotated pages and single-page PDF extras. | Official repo license is CDLA-Permissive-1.0; preserve attribution and license text. | reading_order, element_grounded, layout-region checks, captions. | Strong source, but avoid pulling the full 35+ GB dataset into the repo. |
| PubTables-1M | Standard table-structure benchmark with nearly one million annotated tables and PDF/image coordinates. | Official code repo is MIT; dataset is hosted through Microsoft Research Open Data / Hugging Face. Verify dataset terms before bundling files. | table_grid_cell, table_cell_exists, table_shape. | Use as table target source; keep URLs/checksums. |
| PMC Open Access Subset | Large public biomedical PDF corpus with license metadata per article and official retrieval services. | Licenses vary by article; PMC groups terms into commercial, non-commercial, and other. Only bundle CC BY/CC0 or similarly permissive pages. | Tables, figures, captions, formulas, reading order. | Good candidate for redistribution if article license allows it. |
| ACL Anthology papers | Stable NLP paper PDFs; post-2016 ACL materials are CC BY 4.0 according to ACL Anthology FAQ. | Check each paper page and avoid third-party materials not covered by ACL policy. | Chinese-English mixed papers, formulas, references, double-column reading order. | Already used for two real papers; expand with license metadata. |
| arXiv papers | Stable PDFs and broad coverage of formulas/tables. | arXiv records can use different licenses; do not assume redistribution. | formula_contains, reading_order, table checks. | Prefer source URLs and fetch script unless license is permissive. |
| OpenStax textbooks | High-quality textbook PDFs with openly licensed content. | OpenStax publishes under Creative Commons licenses; many books are CC BY, some are CC BY-NC-SA. Verify book-specific notice. | Dense textbook prose, formulas, figures, tables. | Strong candidate for direct inclusion or scripted fetch. |
| SEC EDGAR filings | Real annual reports and financial statements with dense tables. | Medium-high redistribution risk: public SEC access does not automatically make issuer PDFs freely redistributable. | Financial table cells, footers, section reading order. | Prefer source URLs and checksums, not bundled PDFs. |
| EU / UN public reports | Stable, professional reports with multilingual tables. | Licenses vary by institution and document. | Multilingual tables, footers, captions. | Metadata first; include only permissive documents. |
| OpenReview / conference PDFs | Modern ML paper layouts with figures and tables. | Licenses vary; check page terms. | Paper reading order, captions, tables, formulas. | Metadata first unless license is explicit. |
Aim for 20-40 additional non-government public-real pages before promoting the next community release beyond the current combined RC:
| Domain | Pages | Preferred Sources |
|---|---|---|
| Scientific tables | 6-10 | PubTables-1M / PMC OA |
| Formula-heavy textbook or paper | 4-8 | OpenStax / arXiv permissive / OpenIntro |
| Chinese-English papers | 4-8 | ACL Anthology / arXiv permissive |
| Financial/statistical dense tables | 4-8 | BLS first, SEC EDGAR metadata-first |
| Public statistical / financial style reports | 4-6 | BLS first; SEC EDGAR metadata-first |
Government forms and technical/legal reports are already represented in the public-real RC through IRS, NIST, and GovInfo pages. Add more government pages only for balancing, not as the next expansion emphasis.
Each imported public case should include:
- document.source_url
- document.license
- document.sha256
- document.page
- profile.document_type
- profile.layout
- profile.source_kind, one of real_public, synthetic, private_local
- Review note linking the assertion to page evidence.
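A check for the required case fields listed above might look like the sketch below. It assumes each case is a nested dict keyed by the dotted paths; the `review_note` key, `lookup`, and `validate_case` are hypothetical names chosen for illustration and may not match the repository's schema.

```python
REQUIRED_FIELDS = (
    "document.source_url",
    "document.license",
    "document.sha256",
    "document.page",
    "profile.document_type",
    "profile.layout",
    "profile.source_kind",
    "review_note",  # hypothetical key for the review note
)
ALLOWED_SOURCE_KINDS = {"real_public", "synthetic", "private_local"}


def lookup(case: dict, dotted: str):
    """Follow a dotted path through nested dicts; return None if any key is absent."""
    node = case
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def validate_case(case: dict) -> list[str]:
    """Return a list of problems; an empty list means the case passes this check."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = lookup(case, field)
        if value is None or value == "":
            problems.append(f"missing {field}")
    kind = lookup(case, "profile.source_kind")
    if kind is not None and kind not in ALLOWED_SOURCE_KINDS:
        problems.append(f"unknown profile.source_kind: {kind!r}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a reviewer see everything wrong with a case in a single pass.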
The frozen diagnostic candidate has 54 cases / 506 assertions:
- 4 real public-PDF cases / 46 assertions
- 35 synthetic or placeholder cases / 360 assertions
- 15 controlled synthetic Stage6 Batch2 cases / 100 assertions
This remains useful for controlled regression testing, but the combined public RC is the recommended target for community parser comparisons.
The first public-real expansion is frozen as
DocFailBench-v0.1-public-real-rc under data/releases/:
- 7 official public PDFs downloaded and checksummed
- 20 pages with strict-reviewed accepted assertions
- 168 public-real main assertions after a second structural-enrichment pass
- 3 secondary page-furniture hygiene checks excluded from the main score
- 117 added checks covering forms, tables, section order, limited grid cells, figure/caption binding, and page-furniture hygiene
Primary sources:
- NIST AI Risk Management Framework
- NIST SP 800-53 Rev. 5
- IRS Form 1040 and Schedules A/C/D for tax year 2024
- GovInfo CFR 2024 Title 1 Volume 1
The 7-parser baseline has been run on the accepted public pages (168 main assertions per parser):
| Parser | Passed | Failed | Score |
|---|---|---|---|
| PyMuPDF4LLM plain | 114 | 54 | 0.6786 |
| PyMuPDF4LLM bbox | 114 | 54 | 0.6786 |
| Qwen-VL API (qwen-vl-ocr-latest, run 2026-05) | 91 | 77 | 0.5417 |
| Marker | 87 | 81 | 0.5179 |
| Docling | 77 | 91 | 0.4583 |
| MinerU | 67 | 101 | 0.3988 |
| PaddleOCR | 44 | 124 | 0.2619 |
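The Score column is consistent with passed / (passed + failed), rounded to four decimals. The sketch below recomputes it from the Passed and Failed counts in the table; the `score` helper and the `BASELINE` dict are illustrative names, not part of the benchmark code.

```python
def score(passed: int, failed: int) -> float:
    """Fraction of assertions passed, rounded to four decimal places."""
    total = passed + failed
    if total == 0:
        raise ValueError("no assertions evaluated")
    return round(passed / total, 4)


# (passed, failed) pairs copied from the baseline table above.
BASELINE = {
    "PyMuPDF4LLM plain": (114, 54),
    "PyMuPDF4LLM bbox": (114, 54),
    "Qwen-VL API": (91, 77),
    "Marker": (87, 81),
    "Docling": (77, 91),
    "MinerU": (67, 101),
    "PaddleOCR": (44, 124),
}

for parser, (passed, failed) in BASELINE.items():
    print(f"{parser}: {score(passed, failed):.4f}")
```

Every row sums to 168, the number of public-real main assertions, so the scores are directly comparable across parsers.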
The next release after the combined public RC should continue expanding non-government public pages, drawing on PMC OA, OpenStax, PubTables-1M, ACL Anthology, DocLayNet, BLS, and permissively licensed arXiv or OpenReview PDFs.
DocFailBench-v0.1-combined-public-rc folds the public-real RC, the Stage7
non-government structural RC, and the Stage8 second-reviewed subset into one
recommended community target:
- 116 cases / 877 assertions
- 7 cached parser baselines
- profile labels preserved for public_real_rc, non_gov_stage7_structural, and non_gov_stage8_reviewed
- source and artifact manifests under data/releases/