This benchmark ran 13 test cases: 13 passed and 0 failed.
- Benchmark root: benchmark
- Network tests: Not executed (--no-network used).
- Total risk signals: 66
- Total run/info records: 47
Conclusion: Automated benchmarks cover the core detection pipeline consistently. Deterministic mathematical checks, hash provenance, and project-level reconciliation checks are reliable engineering regression indicators. Image checks, the weak signals from raw-table digit distributions and inter-column relationships, and paper mill/cross-manuscript similarity are suited to measuring whether review leads are surfaced; they are not strong conclusion indicators.
- Raw data: Covers duplicate or highly similar rows and columns, fixed steps, high-frequency values, missing values concentrated by group, terminal digit distribution, inter-column relationships, and anomalies in non-continuous variables; clean controls produce 0 risk signals.
- Summary statistics: Covers SE/SD/N, CI, percent/count, p/t/df, p-value domain, and R scrutiny/SPRITE feasibility checks.
- In-text statistics: Covers R statcheck p-value consistency checks on APA/NHST expressions.
- Literature & network: Covers DOI/PMID parsing, Crossref/OpenAlex/PubPeer/NCBI metadata queries, and citation claim extraction.
- Images: Covers image discovery, internal duplicates, local copy-move, metadata quality, and Western blot/gel review checklist.
- Code & project: Covers Python/R script reruns, Stata/SPSS/SAS read-only prompts, cross-material data reconciliation, project manifest, provenance version chain, and local corpus screening.
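
To illustrate the kind of deterministic check listed under summary statistics, the sketch below tests whether a reported SE agrees with the reported SD and N, and whether a reported percentage can arise from an integer count. The function names, the tolerance rule, and the default decimal places are hypothetical; this is not the benchmark's own implementation, and it ignores rounding in the reported SD itself.

```python
import math


def se_consistent(sd: float, n: int, reported_se: float, decimals: int = 2) -> bool:
    """Return True if reported_se matches sd / sqrt(n) within rounding tolerance."""
    expected = sd / math.sqrt(n)
    # Allow half a unit in the last reported decimal place, plus a small float margin.
    tol = 0.5 * 10 ** (-decimals) + 1e-12
    return abs(expected - reported_se) <= tol


def percent_consistent(reported_pct: float, n: int, decimals: int = 1) -> bool:
    """Return True if some integer count k in [0, n] rounds to reported_pct."""
    tol = 0.5 * 10 ** (-decimals) + 1e-12
    return any(abs(100.0 * k / n - reported_pct) <= tol for k in range(n + 1))


if __name__ == "__main__":
    print(se_consistent(sd=10.0, n=25, reported_se=2.0))
    print(percent_consistent(reported_pct=35.0, n=30))
```

Checks of this form have explicit arithmetic rules, so the same input always gives the same verdict.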
| Tier | Tools / Capabilities | Benchmark Interpretation |
|---|---|---|
| More Reliable | crosscheck, p_value_distribution, data_trace_crosscheck, provenance_hash, provenance_chain_verify | Mathematics, domain, or hash rules are explicit; suitable as regression thresholds. |
| Moderately Reliable | raw_data_rules, r_statcheck, r_scrutiny, r_rsprite2, code_rerun_execute | Sensitive to input format, column names, R package versions, and script dependencies; suitable as coverage and primary anomaly capture indicators. |
| Weak Signal | raw_data_rules digit distribution/inter-column relationship/non-continuous variable shape signals, image duplicate/copy-move, papermill_light_signals, papermill_network_signals | Only indicate that human review leads were generated; higher false positive/negative risk. |

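
Hash rules sit in the more reliable tier because their outcome is fully determined by file contents. The following is a minimal sketch of such a check, assuming a manifest that maps relative file names to recorded SHA-256 digests. The manifest shape and the function names are hypothetical and do not describe how provenance_hash or provenance_chain_verify are implemented.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(manifest: dict[str, str], root: Path) -> list[str]:
    """Return names whose file is missing or whose digest differs from the manifest."""
    changed = []
    for name, recorded in sorted(manifest.items()):
        target = root / name
        if not target.is_file() or sha256_of_file(target) != recorded:
            changed.append(name)
    return changed
```

An empty result means every recorded file is present and byte-identical to the recorded version, which is why a check like this can serve as a regression threshold.
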
Network tests: Not executed (--no-network used).

| Case | Type | Pass | Seconds | Risk Signals | Info | Missing Tools | Missing Checks |
|---|---|---|---:|---:|---:|---|---|
| raw_suspicious | single_run | Yes | 1.284 | 16 | 0 | | |
| raw_clean_control | single_run | Yes | 1.147 | 0 | 0 | | |
| summary_suspicious | single_run | Yes | 2.279 | 17 | 2 | | |
| p_values_suspicious | single_run | Yes | 1.039 | 2 | 0 | | |
| apa_stats_suspicious | single_run | Yes | 2.156 | 2 | 0 | | |
| paper_refs_and_claims_offline | single_run | Yes | 1.072 | 0 | 4 | | |
| analysis_suspicious | single_run | Yes | 1.42 | 1 | 1 | | |
| analysis_manual_unsupported | single_run | Yes | 1.054 | 0 | 3 | | |
| figures_project | project | Yes | 1.201 | 11 | 13 | | |
| project_full | project | Yes | 2.471 | 12 | 19 | | |
| corpus_screen | corpus | Yes | 2.104 | 4 | 0 | | |
| provenance_change | provenance_change | Yes | 2.072 | 1 | 5 | | |
| external_refs_online | project_network | Yes | 0.0 | 0 | 0 | | |

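
The reported totals can be reconciled against the per-case table with simple arithmetic. The sketch below copies the Risk Signals and Info columns from that table and sums them; the variable names are illustrative only.

```python
# (case, risk_signals, info_records) copied from the per-case table
rows = [
    ("raw_suspicious", 16, 0),
    ("raw_clean_control", 0, 0),
    ("summary_suspicious", 17, 2),
    ("p_values_suspicious", 2, 0),
    ("apa_stats_suspicious", 2, 0),
    ("paper_refs_and_claims_offline", 0, 4),
    ("analysis_suspicious", 1, 1),
    ("analysis_manual_unsupported", 0, 3),
    ("figures_project", 11, 13),
    ("project_full", 12, 19),
    ("corpus_screen", 4, 0),
    ("provenance_change", 1, 5),
    ("external_refs_online", 0, 0),
]

total_cases = len(rows)
total_risk = sum(risk for _, risk, _ in rows)
total_info = sum(info for _, _, info in rows)

print(total_cases, total_risk, total_info)  # 13 66 47
```

The sums match the 13 cases, 66 risk signals, and 47 run/info records stated in the summary.
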
Cases per tool:

- citation_claim_check: 2 cases
- code_rerun_audit: 4 cases
- code_rerun_execute: 4 cases
- crosscheck: 2 cases
- data_trace_crosscheck: 2 cases
- image_copy_move_internal: 2 cases
- image_duplicate_internal: 2 cases
- image_extract: 2 cases
- image_metadata_audit: 2 cases
- p_value_distribution: 1 case
- papermill_light_signals: 2 cases
- papermill_network_signals: 3 cases
- project_audit: 1 case
- provenance_chain_verify: 3 cases
- provenance_hash: 2 cases
- r_rsprite2: 1 case
- r_scrutiny: 2 cases
- r_statcheck: 1 case
- raw_data_rules: 1 case
- reference_audit: 2 cases
- western_blot_review_list: 2 cases

Output per case:

- raw_suspicious: Merged report generated: benchmark/reports/pcr.raw_suspicious.md
- raw_clean_control: Merged report generated: benchmark/reports/pcr.raw_clean_control.md
- summary_suspicious: Merged report generated: benchmark/reports/pcr.summary_suspicious.md
- p_values_suspicious: Merged report generated: benchmark/reports/pcr.p_values_suspicious.md
- apa_stats_suspicious: Merged report generated: benchmark/reports/pcr.apa_stats_suspicious.md
- paper_refs_and_claims_offline: Merged report generated: benchmark/reports/pcr.paper_refs_and_claims_offline.md
- analysis_suspicious: Merged report generated: benchmark/reports/pcr.analysis_suspicious.md
- analysis_manual_unsupported: Merged report generated: benchmark/reports/pcr.analysis_manual_unsupported.md
- figures_project: Project audit report generated: benchmark/reports/pcr.figures_project.md
- project_full: Project audit report generated: benchmark/reports/pcr.project_full.md
- corpus_screen: Local corpus index generated: benchmark/reports/pcr.corpus_index.json | Local corpus screening report generated: benchmark/reports/pcr.corpus_screen.md
- provenance_change: } | }
- external_refs_online: network case skipped by --no-network

The high/medium/low levels in this report are benchmark risk signals, not conclusions of academic misconduct, fabrication, or fraud. Info records capture run statuses, dependency states, skip reasons, or coverage notes; they do not count toward risk conclusions.
Network test cases depend on the real-time availability, certificate chains, credentials, and rate limiting of Crossref, OpenAlex, PubPeer, and NCBI. If a network case fails, check the HTTP, SSL, and rate-limit information in its evidence before concluding that the detector has regressed.
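
A first-pass triage of a failed network case could look like the following sketch. It assumes the evidence is available as free text and uses a hypothetical list of markers; neither the pattern nor the function name comes from the benchmark, and a match only suggests checking the network path first.

```python
import re

# Hypothetical markers of transport-level problems; extend as needed.
NETWORK_MARKERS = re.compile(
    r"\bssl\b|certificate|timed out|timeout|"
    r"connection (?:reset|refused)|rate[- ]?limit|\bhttp (?:429|5\d\d)\b",
    re.IGNORECASE,
)


def looks_like_network_failure(evidence: str) -> bool:
    """Return True if the evidence text mentions an HTTP, SSL, or rate-limit problem."""
    return bool(NETWORK_MARKERS.search(evidence))


if __name__ == "__main__":
    for text in (
        "HTTP 429 Too Many Requests",
        "SSL: CERTIFICATE_VERIFY_FAILED",
        "expected 2 risk signals, got 0",
    ):
        print(text, "->", looks_like_network_failure(text))
```

Failures that match none of the markers are the ones worth investigating as possible detector regressions.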
Weak-signal tools only surface directions for human review. Final review should return to the original data, scripts, image source files, literature metadata, and audit logs.