"Every rejection is a learning opportunity. Mostly for the person who tried to sneak a pool party flyer into a metagenomics pipeline."
```
PV-SXNN
│  ││└┘
│  ││ └── Detail code (00-99)
│  │└──── Sub-category
│  └───── Step number (0-8)
└──────── PubVerse prefix
```
All error codes are printed to stdout as a single line:
```
PV-0400 | JUNK_DETECTED | That's not a paper, that's a cry for help (doc_type=junk:0.998)
```
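If you are scripting around the pipeline, the status line can be split on its pipes. A minimal sketch based only on the example format above; the function and field names are ours, not from the PubVerse source:

```python
import re

# Matches the single-line status format: "PV-NNNN | NAME | message".
# The trailing "(key=label:score)" detail, when present, stays in the message.
LINE_RE = re.compile(r"^(PV-\d{4})\s*\|\s*([A-Z_]+)\s*\|\s*(.*)$")

def parse_status_line(line):
    """Return {'code', 'name', 'message'} for a status line, or None."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    code, name, message = m.groups()
    return {"code": code, "name": name, "message": message}
```

Splitting with a regex rather than `str.split("|")` keeps any pipes inside the human-readable message intact only if they never occur; in practice the first two fields are rigid, so anchoring on them is the safer bet.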
The bouncer at the door. If your PDF can't get past here, it definitely shouldn't be anywhere near a GNN.
The PubGuard codes embed the classifier indices directly:
`PV-0{doc_type}{ai_detect}{toxicity}`, where each digit is the predicted class index.
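Composing a code from the three classifier head indices is a one-liner. A sketch of the scheme above; the function name is ours:

```python
def pubguard_code(doc_type: int, ai_detect: int, toxicity: int) -> str:
    """Build a step-0 PubGuard code from classifier head indices.

    doc_type: 0-4, ai_detect: 0-1, toxicity: 0-1, per the class lists below.
    """
    assert 0 <= doc_type <= 4 and ai_detect in (0, 1) and toxicity in (0, 1)
    return f"PV-0{doc_type}{ai_detect}{toxicity}"
```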
| Code | Name | What Happened |
|---|---|---|
| PV-0000 | ALL_CLEAR | Paper passed screening. Welcome to the lab. |
| PV-0100 | LITERATURE_REVIEW | That's a review article, not original research. We appreciate the bibliography, but we need data, not a guided tour of everyone else's. |
| PV-0200 | POSTER_DETECTED | That's a poster, not a paper. We appreciate the aesthetic effort, but we need Methods, not bullet points on a corkboard. |
| PV-0300 | ABSTRACT_ONLY | We got the trailer but not the movie. Where's the rest of the paper? |
| PV-0400 | JUNK_DETECTED | That's not a paper, that's a cry for help. Pool party invitations, invoices, and fantasy football drafts do not constitute peer-reviewed research. |
| PV-0010 | AI_GENERATED | Our classifier thinks a robot wrote this. Not necessarily disqualifying, but noted for the record. The Turing test starts at the Introduction. |
| PV-0001 | TOXIC_CONTENT | Content flagged as potentially toxic. Science should be provocative, not offensive. |
| PV-0110 | REVIEW_AND_AI | An AI-generated literature review. The robots are now reviewing each other's work. |
| PV-0210 | POSTER_AND_AI | An AI-generated poster. The future is here and it's making conference posters. |
| PV-0310 | ABSTRACT_AI | An AI-generated abstract with no paper attached. Peak efficiency. |
| PV-0410 | JUNK_AND_AI | AI-generated junk. Congratulations, you've automated mediocrity. |
| PV-0401 | JUNK_AND_TOXIC | Toxic junk. This is somehow worse than a pool party flyer. |
| PV-0411 | JUNK_AI_TOXIC | The trifecta. AI-generated toxic junk. We'd be impressed if we weren't horrified. |
The three digits after the step digit encode each classifier head's prediction index:

```
PV-0XYZ
    │││
    ││└── toxicity:  0=clean, 1=toxic
    │└─── ai_detect: 0=human, 1=ai_generated
    └──── doc_type:  0=scientific_paper, 1=literature_review,
                     2=poster, 3=abstract_only, 4=junk
```
So PV-0000 = scientific_paper + human + clean = PASS.
Any non-zero digit in the doc_type position = hard gate (blocked).
Non-zero in ai/toxicity = soft flag (reported, not blocked by default).
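The decode-and-gate logic fits in a few lines. A sketch of the rules above, covering classifier codes only (the PV-09xx system codes are out of scope); the function and dict keys are ours:

```python
DOC_TYPES = ["scientific_paper", "literature_review", "poster",
             "abstract_only", "junk"]

def decode_pubguard(code: str) -> dict:
    """Decode a PV-0XYZ classifier code into labels plus the gate decision.

    Hard gate: only doc_type index 0 passes. ai/toxicity are soft flags.
    """
    assert code.startswith("PV-0") and len(code) == 7
    x, y, z = (int(c) for c in code[4:])
    assert x <= 4, "doc_type index out of range; system codes not handled here"
    return {
        "doc_type": DOC_TYPES[x],
        "ai_generated": bool(y),
        "toxic": bool(z),
        "blocked": x != 0,  # any non-zero doc_type is a hard gate
        "flags": [name for name, on in (("ai", y), ("toxic", z)) if on],
    }
```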
Gate logic: Only scientific_paper (index 0) passes. Literature reviews,
posters, abstract-only documents, and junk are all blocked. Meta-analyses and
systematic reviews are classified as scientific_paper (they pass the gate);
only narrative and scoping reviews are classified as literature_review.
PubGuard blocks narrative and scoping literature reviews because the pipeline scores original research contributions — a review article that surveys existing work has no novel methods, data, or findings for 42DeepThought to evaluate. Feeding a review through the GNN would produce misleading scores.
What gets blocked (PV-0100):
- Narrative reviews ("A review of recent advances in...")
- Scoping reviews ("Mapping the landscape of...")
- Non-systematic overview articles
What passes as scientific_paper:
- Systematic reviews (structured methodology, PRISMA, etc.)
- Meta-analyses (quantitative synthesis of prior results)
- These contain original analytical contributions and score meaningfully.
Training data for the literature_review class comes from PubMed
PublicationType metadata and OpenAlex review-type articles. The classifier
uses the document's full text structure — not just keywords in the title — to
distinguish reviews from original research.
| Code | Name | Description |
|---|---|---|
| PV-0900 | EMPTY_INPUT | You sent us nothing. Literally nothing. The void does not require peer review. |
| PV-0901 | UNREADABLE_PDF | We can't read this PDF. Neither could PyMuPDF. If your PDF parser can't parse it, maybe it's not a PDF. |
| PV-0902 | MODELS_MISSING | PubGuard models not found. Run training first: `cd pub_check && python scripts/train_pubguard.py` |
| PV-0999 | GATE_BYPASSED | PubGuard screening was skipped (`PUBGUARD_STRICT=0`). Proceeding on faith. Good luck. |
Where we turn your lovingly formatted PDF into tab-separated values. The VLM does its best.
| Code | Name | Description |
|---|---|---|
| PV-1000 | EXTRACTION_OK | Features extracted successfully. The VLM read your paper and didn't crash. |
| PV-1100 | EXTRACTION_FAILED | VLM feature extraction failed. Your PDF defeated a 7-billion-parameter model. Impressive, actually. |
| PV-1101 | NO_TSV_OUTPUT | Extraction ran but produced no output file. The VLM started reading and apparently gave up. |
| PV-1102 | VLM_TIMEOUT | Feature extraction timed out. Your PDF is either very long or very confusing. |
| PV-1103 | GPU_DRIVER_ERROR | nvidia-smi reports a driver problem. The VLM will likely hang forever waiting for a GPU that isn't home. Fix your drivers first. |
| PV-1200 | VLM_NOT_FOUND | Feature extraction script not found. Did someone move `qwen3_local_feature_extraction_cli.py`? |
The main event. Clustering, impact analysis, topic modeling — the works.
| Code | Name | Description |
|---|---|---|
| PV-2000 | ANALYSIS_OK | PubVerse analysis completed. Your paper has been weighed, measured, and clustered. |
| PV-2100 | ANALYSIS_FAILED | `App_v5.py` crashed. This is the big one — check the logs. |
| PV-2101 | STOPWORDS_MISSING | Stopwords pickle not found. You can't do NLP without knowing which words to ignore. |
| PV-2102 | DATASET2_MISSING | Reference dataset not found. We need something to compare your paper against. |
| PV-2103 | RECYCLE_MISSING | Recycle overlay pickle not found. The cached cluster state is gone. Time to rebuild (grab coffee). |
Making sure Step 2 actually produced what it promised.
| Code | Name | Description |
|---|---|---|
| PV-3000 | ARTIFACTS_OK | All expected files found. The pipeline is holding together. |
| PV-3100 | MATRIX_MISSING | Unified adjacency matrix not found. PubVerse ran but didn't produce the matrix. Something went sideways in clustering. |
| PV-3101 | PICKLE_MISSING | Impact analysis pickle not found. The most important intermediate file is AWOL. |
Building the knowledge graph. Nodes, edges, the whole beautiful mess.
| Code | Name | Description |
|---|---|---|
| PV-4000 | GRAPH_OK | Knowledge graph constructed. It's a beautiful web of science. |
| PV-4100 | GRAPH_FAILED | Graph construction failed. `42dt_graph_clean_cli.py` didn't make it. |
| PV-4101 | GRAPH_PICKLE_MISSING | Graph pickle not produced. The nodes existed briefly, like a postdoc's optimism. |
The GNN scores your paper. This is where GPUs earn their electricity bill.
| Code | Name | Description |
|---|---|---|
| PV-5000 | SCORING_OK | 42DeepThought scored your paper. May the odds be in your favor. |
| PV-5100 | SCORING_FAILED | GNN scoring crashed. Check CUDA, check the graph, check your assumptions. |
| PV-5101 | NO_GRAPH_PICKLE | Graph pickle file not found for scoring. Step 4 must have failed silently. |
| PV-5102 | NO_SCORES_OUTPUT | Scoring ran but produced no TSV. The GNN had nothing to say about your paper. |
| PV-5103 | CHECKPOINT_CORRUPT | Model checkpoint failed to load. Delete `deepthought_model.pt` and retrain. |
| PV-5200 | DEEPTHOUGHT_MISSING | 42DeepThought directory not found. The entire scoring engine is missing. |
| PV-5201 | LABELS_MISSING | `42d_scoring.tsv` not found. No reference labels means no supervised scoring. |
Finding your paper's neighbors in topic space.
| Code | Name | Description |
|---|---|---|
| PV-6000 | CLUSTER_OK | Cluster analysis complete. Your paper's social circle has been mapped. |
| PV-6100 | CLUSTER_FAILED | Cluster analysis crashed. Your paper is a loner — or the code is. |
| PV-6101 | NO_SANITY_PICKLE | Sanitycheck pickle not found. Can't do cluster analysis without clusters. |
| PV-6102 | DB_TIMEOUT | Cluster database population timed out (>1 hour). The LLM is still thinking. |
| PV-6103 | NO_QUERY_ID | Could not extract query paper ID from TSV. Who are you, even? |
| PV-6200 | CLUSTER_SKIPPED | Cluster analysis skipped (prerequisites not met). Not fatal, just lonely. |
Merging all the analysis results into one JSON payload.
| Code | Name | Description |
|---|---|---|
| PV-7000 | ENRICH_OK | Data enrichment complete. Everything is unified and beautiful. |
| PV-7100 | ENRICH_FAILED | Enrichment script crashed. The data refused to be unified. |
| PV-7200 | ENRICH_SKIPPED | Enrichment skipped (cluster analysis didn't complete). Can't enrich what doesn't exist. |
The grand finale. Turning data into something you can click on.
| Code | Name | Description |
|---|---|---|
| PV-8000 | VIZ_OK | Visualization generated. Open the HTML and admire your knowledge graph. |
| PV-8100 | VIZ_FAILED | Visualization generation failed. You'll have to imagine the graph. |
| PV-8200 | VIZ_SKIPPED | Visualization skipped (enrichment didn't complete). No data, no graph, no glory. |
The pipeline's exit code identifies the first fatally failing step (step number + 1, so step 0 maps to exit 1):

| Exit Code | Meaning |
|---|---|
| 0 | Success — all steps completed (or non-fatal warnings only) |
| 1 | Step 0 — PubGuard rejected the input |
| 2 | Step 1 — Feature extraction failed |
| 3 | Step 2 — PubVerse analysis failed |
| 4 | Step 3 — Required artifacts missing |
| 5 | Step 4 — Graph construction failed |
| 6 | Step 5 — 42DeepThought scoring failed |
| 7 | Step 6 — Cluster analysis failed (fatal only if no fallback) |
| 8 | Step 7/8 — Enrichment or visualization failed |
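A wrapper script can turn the exit status into the name of the failing step. A minimal sketch; the function name and message strings are ours, only the numeric codes come from the table above:

```shell
#!/bin/sh
# Map a PubVerse pipeline exit status to the step that failed.
# Illustrative helper, not part of the PubVerse source.
exit_step() {
    case "$1" in
        0) echo "success" ;;
        1) echo "step 0: PubGuard rejected the input" ;;
        2) echo "step 1: feature extraction failed" ;;
        3) echo "step 2: PubVerse analysis failed" ;;
        4) echo "step 3: required artifacts missing" ;;
        5) echo "step 4: graph construction failed" ;;
        6) echo "step 5: 42DeepThought scoring failed" ;;
        7) echo "step 6: cluster analysis failed" ;;
        8) echo "step 7/8: enrichment or visualization failed" ;;
        *) echo "unknown exit code: $1" ;;
    esac
}
```

Typical use: run the pipeline, then `exit_step $?` in your job log so failures are legible without the decoder ring.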
Quick decoder ring for the PV-0XYZ codes:
- X (doc_type): 0=paper ✅ 1=review 📚 2=poster 📋 3=abstract 📄 4=junk 🗑️
- Y (ai_detect): 0=human ✍️ 1=ai 🤖
- Z (toxicity): 0=clean ✅ 1=toxic ☠️
Examples:

- PV-0000 → Paper + Human + Clean → PASS ✅
- PV-0100 → Review + Human + Clean → "That's a review, not original research" 📚
- PV-0400 → Junk + Human + Clean → "That's not a paper" 🗑️
- PV-0011 → Paper + AI + Toxic → "Human-passing but spicy" ⚠️
- PV-0411 → Junk + AI + Toxic → "The absolute worst" 🚫
PubGuard v0.1.0 — Because science has standards, even if your PDF doesn't.