pre-check-research (pcr)

Catch data problems before reviewers do. pcr is a local-first, deterministic CLI toolkit for pre-submission research package screening. It routes manuscripts, data files, statistics, images, code, references, and provenance records through reproducible checks, then produces structured findings that a human reviewer can verify and explain.

pcr is built for serious review workflows: it helps you find issues early, document what was checked, preserve cautious language, and decide what needs expert follow-up before a manuscript, report, or case file moves forward. It produces review leads — not misconduct verdicts.

Where pcr Fits

Use pcr when a research package needs a documented pre-submission or pre-decision screen before it moves to a journal, supervisor, editor, institution, collaborator, or public record.

It is designed for review workflows where the reviewer has source materials available: manuscript text, source tables, raw or summary data, analysis scripts, figures, references, and provenance context. The goal is to surface reproducible review leads early enough that a human expert can verify, resolve, or explain them.

Common fits:

Before submission: researchers and PIs check tables, data, figures, references, and scripts before sending a manuscript out.
Before acceptance or sign-off: labs, editorial offices, and institutions run a consistent screening checklist across packages.
During triage: reviewers separate concrete, reproducible signals from speculation, missing material, dependency gaps, and extraction artifacts.
With AI agents: agent workflows consume route decisions and schema-bound findings instead of guessing which checks apply.

The Problem

Research integrity screening is time-consuming, error-prone, and often happens too late: after submission, during peer review, or post-publication. Journals, labs, and institutions need reliable, reproducible pre-submission checks, but existing tools are scattered across languages (R, Python), lack a unified output format, and require expert judgment to match the right tool to the right material.

What pcr Solves

pcr solves the routing, execution, and reporting problem:

Capability	How it works
Deterministic routing	Classifies input by shape (table columns, text patterns, file type), then selects applicable tools via explicit rules — no AI guessing
Unified output	Every tool emits the same JSON schema; findings merge into a single Markdown report
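Local-first privacy	All computation runs on your machine; file contents are never uploaded. Optional reference lookups send only public identifiers (DOI, PMID) and can be disabled with --no-external-lookups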
Clear boundaries	Findings are risk signals for human review, never misconduct conclusions
Multi-material projects	Audit entire research packages: data + manuscript + code + figures + references
Graceful degradation	Missing R packages or optional dependencies become `info` records, not failures

What You Get

A run produces two complementary outputs:

Markdown report for humans: findings grouped by tool, evidence, possible normal explanations, review steps, confidence, and limitations.
JSON report for systems: schema-bound findings that can be merged, archived, diffed, or handed to an AI-assisted review workflow.

The report is designed to support a defensible review process. It does not replace source-material verification, subject-matter judgment, or independent statistical review.

Review Boundaries

Findings are risk signals, not misconduct conclusions.
Missing tools, missing dependencies, skipped checks, and unsupported material are recorded as level: info.
PDF/DOCX extraction can introduce recognition errors; important findings should be verified against source CSV/XLSX, original figures, and manuscript tables.
XLSX extraction uses deterministic layout-aware splitting for visually arranged source-data sheets, but Excel borders and blank space are still presentation cues rather than formal data semantics.
Python/R scripts can be rerun in a temporary copy; Stata/SPSS/SAS scripts are scanned read-only and flagged for controlled manual rerun.
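
The rerun boundary in the last item can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not pcr's implementation: the function name `rerun_in_copy`, the returned keys, and the 60-second default are hypothetical. It copies the project to a temporary directory, runs one Python script there with a timeout and a nearly empty environment, and captures the output.

```python
import os
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def rerun_in_copy(project_dir: Path, script: str, timeout_s: float = 60.0) -> dict:
    """Run `script` inside a throwaway copy of `project_dir` (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "project"
        shutil.copytree(project_dir, work)  # the original files are never touched
        # Minimal environment: only PATH is passed on. On Windows more
        # variables (for example SYSTEMROOT) may be needed.
        env = {"PATH": os.environ.get("PATH", "")}
        try:
            proc = subprocess.run(
                [sys.executable, script],
                cwd=work,
                env=env,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout", "returncode": None, "stdout": "", "stderr": ""}
        return {
            "status": "completed",
            "returncode": proc.returncode,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }
```

A temporary copy and a timeout limit accidental damage to the source package, but they do not isolate untrusted code from the rest of the machine.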

Who Is This For

Researchers & PIs: screen your own manuscript package before submission and fix issues while there is still time.
Journal editorial offices: standardize pre-acceptance screening across submissions with reproducible, documented checks.
Research integrity officers: triage cases with explainable leads rather than starting from scratch.
Peer reviewers: supplement review with automated consistency checks.
Methods educators: teach statistical error detection with concrete, runnable examples.
AI agent pipelines: use structured JSON route decisions and findings instead of prompt-only tool selection.

Supported Checks

Category	Tools	What it catches
Raw data	`raw_data_rules`	Duplicate rows/columns, terminal digit anomalies, Benford violations, arithmetic sequences, inter-column linear transforms, high-frequency fill values, missing-by-group patterns
Summary statistics	`crosscheck`, `scrutiny`	SE/SD/N math consistency, CI centering and span, percent/count back-calculation, t/F/chi-square p-value verification, GRIM/GRIMMER/DEBIT feasibility
P-values	`p_value_collection`	Domain validity (p outside [0,1]), just-significant clustering
Statistical text	`statcheck`	APA/NHST in-text statistic vs reported p-value consistency
Images	`image_audit`	Internal duplicates (aHash/dHash/pHash), rotated/flipped copies, copy-move triage, western blot/gel review
References	`reference_audit`	DOI/PMID parsing, Crossref/OpenAlex/PubPeer/NCBI metadata queries, citation claim extraction
Code	`code_audit`, sandbox	Pattern scanning (hardcoded paths, exclusion clues), Python/R script rerun with output capture
Corpus	`corpus_signals`	Cross-manuscript text similarity (simhash, Jaccard), reference overlap, papermill phrase signals
Provenance	`provenance`	SHA-256 file hashing, append-only JSONL ledger, verify/diff change detection
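
To make one of the feasibility checks in the table concrete, here is a minimal Python sketch of the GRIM idea: a mean of n integer responses must equal some integer sum divided by n, so only certain rounded means are possible. In pcr this check is listed under `scrutiny`; the function below is an independent illustration, and its name and signature are hypothetical. The mean is passed as a string so the number of reported decimals is preserved.

```python
from decimal import Decimal, ROUND_FLOOR, ROUND_HALF_DOWN, ROUND_HALF_UP


def grim_consistent(reported_mean: str, n: int) -> bool:
    """Return True if some integer sum over n integer values rounds to the reported mean."""
    if n <= 0:
        raise ValueError("n must be positive")
    target = Decimal(reported_mean)
    # Smallest reported unit, e.g. 0.01 for "5.19".
    quantum = Decimal(1).scaleb(target.as_tuple().exponent)
    # Only the integer sums adjacent to mean * n can round back to the mean.
    lowest_sum = int((target * n).to_integral_value(rounding=ROUND_FLOOR))
    for total in (lowest_sum, lowest_sum + 1):
        exact_mean = Decimal(total) / Decimal(n)
        # Accept either half-rounding direction, since papers differ.
        for mode in (ROUND_HALF_UP, ROUND_HALF_DOWN):
            if exact_mean.quantize(quantum, rounding=mode) == target:
                return True
    return False
```

For example, with n = 28 the sums 145 and 146 give means of 5.18 and 5.21, so `grim_consistent("5.19", 28)` is False while `grim_consistent("5.18", 28)` is True. An inconsistent result is a review lead only: the data may not be integer-valued, or N may differ from the reported value because of missing responses.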

Benchmark

The benchmark suite contains 13 synthetic cases covering every detector family; all 13 pass and none fail. An offline run produces 66 risk signals and 47 info records.

benchmark/BENCHMARK.md · benchmark/BENCHMARK_REPORT.md

Architecture

Input materials
  Manuscripts / CSV-XLSX / summary tables / statistical text / images / code
    │
Extraction (pcr-extract)
  Heterogeneous files → CSV/TXT/JSON intermediates
    │
Deterministic routing (tool_system.py / router.py)
  Classify by shape → select applicable tools via explicit rules
    │
Thin runner (runner.py)
  Execute route-ready tools only; record skips as info
    │
Detectors (Python + R CLI)
  detectors/raw.py / crosscheck.py / tools/r/*
    │
Unified output (models.py / reporting.py)
  Finding JSON → merged Markdown report

Install

git clone https://github.com/nufegia/pre-check-research.git
cd pre-check-research
python3 -m pip install -e ".[dev]"
export PATH="$PWD/tools/r/pcr_statcheck:$PWD/tools/r/pcr_scrutiny:$PWD/tools/r/pcr_sprite:$PATH"

Optional R packages for statistical checks:

install.packages(c("statcheck", "scrutiny", "rsprite2"))

Optional image forensics:

python3 -m pip install -e ".[image]"

Try It

mkdir -p build

# Inspect which checks apply to an example summary-stat table
pcr-audit route examples/summary_stat_sample.csv --json build/route.json

# Run the applicable checks and write human + machine-readable reports
pcr-audit run examples/summary_stat_sample.csv --out build/audit.md --json build/audit.json

# Read the human report
cat build/audit.md

This example is a smoke test, not a complete review. It shows how pcr records route decisions, runs applicable checks, and emits reports. A serious pre-submission or editorial audit usually requires assembling source tables, manuscript text, figure originals, analysis scripts, references, and provenance context, then interpreting each finding against the study design and source files.

For a full package, use a project folder:

pcr-audit project path/to/project_folder --out build/project.md --json build/project.json

Usage

Single-file audit

# Route first — see which tools apply before running
pcr-audit route examples/summary_stat_sample.csv --json build/route.json

# Run with auto-detection
pcr-audit run examples/summary_stat_sample.csv --scenario auto --out build/audit.md --json build/audit.json

Auto-detection behavior by input shape:

Input shape	What runs
Raw observation tables	`raw_data_rules`: duplicates, digit distribution, column relationships, outliers
Summary-stat tables (N/mean/SD/SE/CI/p)	`crosscheck` + `scrutiny` (when R is available)
Likert or integer-score summaries	`crosscheck` + `scrutiny` + `rsprite2` (when R is available)
p-value collections	`p_value_collection`: domain checks, clustering signals
APA/NHST statistical text	`statcheck` (when R is available)
Analysis code (.py, .R, .do, .sps, .sas)	Read-only pattern scan; Python/R scripts rerun in sandbox
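
The table above can be read as a small rule set. The following Python sketch shows what rule-based routing of this kind might look like; it is not the logic in tool_system.py. The column-name hints, the `classify_shape` and `route` names, and the output keys are all assumptions made for illustration.

```python
from typing import Iterable

# Hypothetical column-name hints; pcr's real rules may differ.
SUMMARY_HINTS = {"n", "mean", "sd", "se", "ci_low", "ci_high", "p"}
P_VALUE_NAMES = {"p", "p_value", "pvalue"}

# Shape -> tools, following the table above (subset only).
ROUTES = {
    "raw": ["raw_data_rules"],
    "summary": ["crosscheck", "scrutiny"],
    "p_value_collection": ["p_value_collection"],
}


def classify_shape(columns: Iterable[str]) -> str:
    """Classify a table by its column names using fixed, explicit rules."""
    names = {c.strip().lower() for c in columns}
    if names and names <= P_VALUE_NAMES:
        return "p_value_collection"
    if len(names & SUMMARY_HINTS) >= 2:
        return "summary"
    return "raw"


def route(columns: Iterable[str], r_available: bool = True) -> dict:
    """Select tools for a table; record unavailable tools as info-level skips."""
    shape = classify_shape(columns)
    selected, skipped = [], []
    for tool in ROUTES[shape]:
        if tool == "scrutiny" and not r_available:
            skipped.append({"tool": tool, "level": "info", "reason": "R not available"})
        else:
            selected.append(tool)
    return {"shape": shape, "tools": selected, "skipped": skipped}
```

Because the rules are plain data and set operations, the same input always yields the same route, and a missing dependency shows up as an `info` record rather than an error.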

Explicit scenarios:

pcr-audit run data.csv --scenario raw --out build/raw.md --json build/raw.json
pcr-audit run summary.csv --scenario summary --out build/summary.md --json build/summary.json
pcr-audit run stats.txt --scenario text --out build/text.md --json build/text.json

Multi-material project audit

pcr-audit project path/to/project_folder --out build/project.md --json build/project.json
pcr-audit project path/to/project_folder --out build/offline.md --json build/offline.json --no-external-lookups --no-rerun-code

Project manifest (pcr-project.json):

{
  "project_id": "optional-id",
  "title": "optional title",
  "materials": [
    {"path": "paper.docx", "role": "manuscript"},
    {"path": "data.csv", "role": "raw_data"},
    {"path": "analysis.py", "role": "analysis_code"},
    {"path": "figures/", "role": "figures"}
  ],
  "settings": {
    "external_lookups": true,
    "grobid_url": "http://localhost:8070",
    "contact_email": ""
  }
}
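
Before running a project audit, it can help to confirm that every path in the manifest exists. The Python sketch below reads a manifest shaped like the example above and groups existing materials by role. It is a convenience illustration, not part of pcr; the function name `summarize_manifest` and the returned keys are hypothetical, and no assumption is made about which roles pcr accepts.

```python
import json
from pathlib import Path


def summarize_manifest(project_dir: Path) -> dict:
    """Group manifest materials by role and list paths that do not exist."""
    manifest_path = project_dir / "pcr-project.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    by_role: dict = {}
    missing: list = []
    for item in manifest.get("materials", []):
        target = project_dir / item["path"]
        if target.exists():
            by_role.setdefault(item["role"], []).append(item["path"])
        else:
            missing.append(item["path"])
    return {
        "by_role": by_role,
        "missing": missing,
        "settings": manifest.get("settings", {}),
    }
```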

Extraction and merging

# Extract tables from DOCX/PDF/XLSX. Visually arranged XLSX sheets are split into logical CSV tables when layout cues are present.
pcr-extract examples/suspicious_sample.xlsx --out build/extracted --json build/extracted.json

# Merge multiple finding JSONs
pcr-report merge build/raw.json build/crosscheck.json --out build/merged.md --json build/merged.json

Provenance and corpus

pcr-audit provenance record examples/project_minimal --json build/provenance.json
pcr-audit provenance verify examples/project_minimal --json build/verify.json
pcr-audit corpus build examples --out build/corpus-index.json
pcr-audit corpus screen examples/project_minimal --index build/corpus-index.json --out build/screen.md --json build/screen.json
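
Corpus screening is listed as using simhash and Jaccard similarity. As a rough picture of the Jaccard part only, the Python sketch below compares two texts by their overlapping word shingles. The shingle size of three words, the tokenization, and the function names are assumptions, not pcr's settings.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles in lowercase text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0  # nothing to compare
    return len(sa & sb) / len(sa | sb)
```

A high score between two manuscripts is a prompt to compare the passages by hand; shared methods text and standard phrasing are common normal explanations.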

Interpreting Results

level: info — run notes, missing dependencies, insufficient material, skip reasons. Not a risk finding.
low / medium / high — risk signals requiring human review. Not misconduct conclusions.
Each finding includes: evidence, calculation trace, normal explanations, review steps, confidence score, false-positive risk, and method limitations.
PDF/DOCX extraction can introduce table-recognition errors; verify important findings against source CSV/XLSX.
XLSX source-data workbooks may be split by borders and blank separators before table checks; review generated table names and ranges when layout is unusually complex.
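
To show how a calculation trace and normal explanations fit together in one finding, here is a Python sketch of an SE = SD / sqrt(N) consistency check. The field names, the `medium` level, and the function name are placeholders rather than pcr's actual schema or severity rules. The tolerance allows for rounding in both the reported SD and the reported SE.

```python
import math
from typing import Optional


def check_se(sd: float, n: int, reported_se: float, decimals: int = 2) -> Optional[dict]:
    """Return a finding-like dict if the reported SE cannot follow from SD and N."""
    half_unit = 0.5 * 10 ** -decimals  # rounding slack in the last reported decimal
    root_n = math.sqrt(n)
    lowest = (sd - half_unit) / root_n - half_unit
    highest = (sd + half_unit) / root_n + half_unit
    if lowest - 1e-12 <= reported_se <= highest + 1e-12:
        return None  # consistent within rounding: nothing to report
    expected = sd / root_n
    return {
        "level": "medium",  # placeholder severity
        "evidence": {"sd": sd, "n": n, "reported_se": reported_se},
        "calculation_trace": (
            f"SD/sqrt(N) = {sd}/sqrt({n}) = {expected:.4f}; reported SE = {reported_se}"
        ),
        "normal_explanations": [
            "The column may hold SD or a CI half-width rather than SE.",
            "N for this row may differ because of missing data.",
        ],
        "review_steps": ["Recompute SE from the source data for this row."],
    }
```

With SD = 2.50 and N = 25 the expected SE is 0.50, so a reported SE of 0.50 returns None and a reported SE of 0.80 returns a finding for human review.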

Commands

CLI	Runtime	Purpose
`pcr-audit route`	Python	Explain deterministic tool routing
`pcr-audit run`	Python	Single-input audit pipeline
`pcr-audit project`	Python	Multi-material project audit
`pcr-audit provenance`	Python	SHA-256 provenance ledger
`pcr-audit corpus`	Python	Cross-manuscript corpus screening
`pcr-extract`	Python	Extract tables to CSV
`pcr-raw-audit`	Python	Raw-data digit distribution scan
`pcr-crosscheck`	Python	Summary-stat math cross-checks
`pcr-report merge`	Python	Merge finding JSON to Markdown
`pcr-statcheck`	R	APA/NHST reporting consistency
`pcr-scrutiny`	R	GRIM/GRIMMER/DEBIT feasibility
`pcr-sprite`	R	SPRITE discrete distribution reconstruction

Design Principles

Prefer native CLI tools in their implementation language over forced Python wrappers.
Every tool declares applicable input, dependency status, method limits, and false-positive risk.
Data-shape recognition and tool-applicability decisions live in tool_system.py, not in agent prompts.
Every tool emits unified finding JSON for downstream merging.
Missing tools, missing dependencies, and skipped checks are recorded as info, not as risk findings.

Documentation

docs/index.md — Overview and concepts
docs/getting-started.md — First commands and workflow
docs/use-cases.md — Practical workflow examples
docs/methods.md — Detector families, limits, and interpretation
docs/interpretation-boundaries.md — Responsible reporting language
llms.txt — Concise AI-agent entry point

Examples

Built-in example projects for testing and demonstration:

examples/project_minimal — Minimal project package
examples/project_questionnaire — Questionnaire/social-science summary statistics
examples/project_biomed — Biomedical data, image checklist, reference checks
examples/summary_stat_sample.csv — Summary statistics sample
examples/suspicious_sample.csv, examples/suspicious_sample.xlsx — Raw data samples

Privacy and Security

All computation is local. File contents are never uploaded to external services.
External lookups (Crossref, OpenAlex, PubPeer, NCBI) only query public identifiers (DOI, PMID) and can be disabled with --no-external-lookups.
Code reruns execute in temporary project copies with timeouts and minimal environment variables. This is not a strong security sandbox — treat unknown code accordingly.
SHA-256 provenance ledgers are append-only and never transmit file contents.
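
The ledger idea can be sketched in a few lines of Python: hash every file, append one JSON line per snapshot, and compare the current state with the last snapshot. This is a minimal illustration, not pcr's ledger format; the entry fields and the function names `record` and `verify` are hypothetical. Only hashes and relative paths are written, never file contents.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def record(root: Path, ledger: Path) -> dict:
    """Append one snapshot line; keep the ledger outside `root` so it is not hashed."""
    entry = {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "files": snapshot(root),
    }
    with ledger.open("a", encoding="utf-8") as fh:  # append mode: earlier lines stay intact
        fh.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


def verify(root: Path, ledger: Path) -> dict:
    """Compare the current files with the most recent ledger entry."""
    last_line = ledger.read_text(encoding="utf-8").splitlines()[-1]
    last = json.loads(last_line)["files"]
    current = snapshot(root)
    return {
        "added": sorted(set(current) - set(last)),
        "removed": sorted(set(last) - set(current)),
        "changed": sorted(k for k in current.keys() & last.keys() if current[k] != last[k]),
    }
```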

Citing

If you use pcr in your research integrity workflow, please cite it using the metadata in CITATION.cff.

Contributing

Bug reports and feature requests are welcome via GitHub Issues; pull requests are welcome too.

Development install:

python3 -m pip install -e ".[dev,image]"
python3 -m pytest tests/

License

MIT — see LICENSE.
