This directory is the benchmark home for PCR. It contains synthetic fixtures, expected coverage, a runner, and generated reports.

- generate_synthetic_benchmark.py: regenerates benchmark fixtures under this directory.
- benchmark_manifest.json: declares benchmark cases, expected tools, expected checks, and minimum/maximum finding counts.
- run_benchmark.py: runs the suite and writes BENCHMARK_REPORT.md, reports/pcr.benchmark_summary.json, and reports/pcr.benchmark_summary.md.
- inputs/: single-file, project, image, code, provenance, and network fixtures.
- corpus/: local cross-manuscript corpus fixture.
- reports/: benchmark outputs from PCR CLIs.
- BENCHMARK_REPORT.md: latest human-readable benchmark summary at the benchmark root.

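As a rough illustration of how a manifest entry with expected tools, expected checks, and minimum/maximum finding counts could be evaluated, here is a minimal sketch. The key names (`cases`, `expected_tools`, `expected_checks`, `min_findings`, `max_findings`) and the `evaluate_case` helper are assumptions made for this example; the real benchmark_manifest.json schema and the logic in run_benchmark.py may differ.

```python
import json

# Hypothetical manifest content; the real schema may use different keys.
MANIFEST_TEXT = """
{
  "cases": [
    {
      "id": "example_case",
      "expected_tools": ["tool_a"],
      "expected_checks": ["check_a"],
      "min_findings": 1,
      "max_findings": 5
    }
  ]
}
"""


def evaluate_case(case, observed_tools, observed_checks, finding_count):
    """Return a list of problems; an empty list means the case passed."""
    problems = []

    # Every expected tool and check must appear in the observed output.
    missing_tools = sorted(set(case["expected_tools"]) - set(observed_tools))
    if missing_tools:
        problems.append(f"missing tools: {missing_tools}")

    missing_checks = sorted(set(case["expected_checks"]) - set(observed_checks))
    if missing_checks:
        problems.append(f"missing checks: {missing_checks}")

    # The finding count must fall inside the declared inclusive bounds.
    if finding_count < case["min_findings"]:
        problems.append(f"too few findings: {finding_count}")
    if finding_count > case["max_findings"]:
        problems.append(f"too many findings: {finding_count}")

    return problems


manifest = json.loads(MANIFEST_TEXT)
case = manifest["cases"][0]
print(evaluate_case(case, ["tool_a"], ["check_a"], 3))
print(evaluate_case(case, ["tool_a"], [], 9))
```

Bounding the finding count on both sides lets a case flag a detector that has gone silent as well as one that has become noisy.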
From the repo root:

    python3 benchmark/run_benchmark.py

To skip external Crossref/OpenAlex/PubPeer/NCBI calls:

    python3 benchmark/run_benchmark.py --no-network

To regenerate fixtures before running:

    python3 benchmark/run_benchmark.py --regenerate

The suite covers raw data rules, including digit distribution, high-similarity rows/columns, column relationships, rare categories, and ordinal concentration; summary-stat crosscheck; R scrutiny; R statcheck; R rsprite2; p-value collection checks; reference parsing; external metadata lookup; citation claim extraction; papermill light/network signals; image duplicate/copy-move/metadata review; code scan/rerun; unsupported code recording; data trace crosscheck; provenance record/verify; and local corpus screening.
Network coverage uses inputs/project_external and expects evidence from Crossref, OpenAlex, PubPeer, and NCBI. Network failures should be interpreted separately from detector regressions because external APIs can be unavailable or rate-limited, may require credentials, or may return changed metadata.
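One way to keep the two failure classes apart when reading results is to bucket each failure by whether it came from an external source. The sketch below assumes a hypothetical failure record with a `source` field; that record shape and the `split_failures` helper are illustrative only and are not taken from the PCR report format.

```python
# External services named in the text; matching is case-insensitive.
NETWORK_SOURCES = {"crossref", "openalex", "pubpeer", "ncbi"}


def split_failures(failures):
    """Split failure records into (network, detector) lists.

    Each record is assumed to be a dict with a hypothetical "source" key.
    """
    network, detector = [], []
    for failure in failures:
        source = failure.get("source", "").lower()
        if source in NETWORK_SOURCES:
            network.append(failure)
        else:
            detector.append(failure)
    return network, detector


# Made-up example records, for illustration only.
example_failures = [
    {"source": "Crossref", "error": "timeout"},
    {"source": "digit_distribution", "error": "expected finding missing"},
]
network, detector = split_failures(example_failures)
print(len(network), len(detector))
```

Only the detector bucket should be read as a possible regression; the network bucket is a candidate for a retry or for a run with `--no-network`.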
Findings are risk signals, not misconduct verdicts. info records are operational status, dependency status, skip reasons, or coverage notes. Weak-signal capabilities such as image forensics, raw-table digit distribution/column relationships, and papermill similarity should be evaluated on whether they surface useful review leads, not on whether they prove a conclusion.
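When tallying results, the distinction above suggests counting info records separately from findings. The following is a minimal sketch under the assumption that each record is a dict with hypothetical `severity` and `check` keys; the actual PCR record fields may be named differently.

```python
from collections import Counter


def summarize(records):
    """Count findings and info records separately.

    Assumes hypothetical "severity" and "check" keys on each record.
    """
    findings = [r for r in records if r.get("severity") != "info"]
    info = [r for r in records if r.get("severity") == "info"]
    by_check = Counter(r.get("check", "unknown") for r in findings)
    return {
        "finding_count": len(findings),
        "info_count": len(info),
        "findings_by_check": dict(by_check),
    }


# Made-up example records, for illustration only.
example_records = [
    {"severity": "info", "check": "dependency_status"},
    {"severity": "review", "check": "image_duplicate"},
    {"severity": "review", "check": "image_duplicate"},
]
summary = summarize(example_records)
print(summary)
```

Keeping info records out of the finding count prevents a skipped or unavailable dependency from being read as a risk signal.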