Aperivue
diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
Lines changed: 1 addition & 0 deletions
diff --git a/.github/workflows/validate.yml b/.github/workflows/validate.yml
Lines changed: 6 additions & 0 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 21 additions & 0 deletions
diff --git a/MEDSCI_AUDIT.md b/MEDSCI_AUDIT.md
Lines changed: 6 additions & 6 deletions
diff --git a/README.md b/README.md
Lines changed: 3 additions & 2 deletions
diff --git a/docs/skills/README.md b/docs/skills/README.md
Lines changed: 1 addition & 0 deletions
diff --git a/docs/skills/model-scaffold.md b/docs/skills/model-scaffold.md
Lines changed: 53 additions & 0 deletions
diff --git a/metadata/catalog_counts.json b/metadata/catalog_counts.json
Lines changed: 2 additions & 2 deletions
diff --git a/metadata/detectors_catalog.json b/metadata/detectors_catalog.json
Lines changed: 10 additions & 2 deletions
@@ -32,6 +32,7 @@
         "./skills/design-ai-benchmarking",
         "./skills/design-study",
         "./skills/generate-codebook",
+        "./skills/model-scaffold",
         "./skills/model-validation",
         "./skills/version-dataset"
       ]
 
@@ -239,6 +239,12 @@ jobs:
       - name: Run model-validation split-leakage gate test
         run: bash skills/model-validation/tests/test_split_leakage.sh
 
+      - name: Run model-scaffold build→validate challenge
+        run: bash skills/model-scaffold/scripts/scaffold_challenge/verify.sh
+
+      - name: Run model-scaffold training-hygiene gate test
+        run: bash skills/model-scaffold/tests/test_training_hygiene.sh
+
       - name: Run analyze-stats generated-code gate test
         run: bash skills/analyze-stats/tests/test_generated_code.sh
 
 
@@ -38,6 +38,27 @@
     (Major), `MISSING_SEED` (Major), `SINGLE_PARTITION` (Minor); train/validation/holdout synonyms
     collapse so a labelling variant never trips it. Stdlib-only, network-free, with a reproducible
     challenge card + CI-wired regression test. Integrity detectors 36 → 37.
+- **Medical-AI model-engineering lane — Phase 2 (build/scaffold).** Completes the
+  build → validate chain in-repo, staged after Phase 1's verification contract. Clinician-anchored
+  (a *reproducible research scaffold generator that integrates MONAI / nnU-Net*, not a replacement);
+  default CI stays torch-free.
+  - **New skill `/model-scaffold`** (Layer B) — `scaffold.py` stamps out a runnable PyTorch
+    segmentation training repo (configurable U-Net, `dataset.py`, `losses.py`, `train.py`,
+    `evaluate.py`, `config.yaml`, `requirements.txt`, `REPRODUCIBILITY.md`, `methods_stub.md`) with
+    the reproducibility guarantees baked in **by construction**: a patient-level seed-locked split
+    written as an auditable artifact (`splits/split_assignment.csv` + `split_seed.txt`, disjoint by
+    construction so it clears `/model-validation`'s `check_split_leakage`), all-RNG seeding + cuDNN
+    determinism, a train-only loader, and `eval()` + `no_grad()` inference. No fabricated numbers
+    (`[VERIFY]` placeholders). Skills 46 → 47.
+  - **New deterministic detector `check_training_hygiene.py`** (`/model-scaffold`) — conservative
+    AST linter (flag-not-prove, the training-code analogue of `check_generated_code`): all RNGs
+    seeded, cuDNN deterministic, `eval()` + `no_grad()` inference, no training on a non-train split.
+    Verdicts `SEED_INCOMPLETE` / `MISSING_EVAL_MODE` / `TRAIN_ON_NONTRAIN_SPLIT` (Major),
+    `CUDNN_NONDETERMINISTIC` / `EVAL_SHUFFLE` (Minor). Integrity detectors 37 → 38.
+  - **`scaffold_challenge`** executes the build → validate chain network-free: scaffold a repo →
+    deterministic split matches the frozen expected split and is patient-disjoint (proven inline) →
+    passes `check_training_hygiene` → a **self-skipping** torch tier (forward shape + gradients +
+    reproducible loss when torch is installed; reports `SKIP`, never counted as CI coverage of runnability, when absent).
 
 ## [4.10.0] - 2026-06-28
 
 
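Review note on the CHANGELOG entry above: the "patient-level seed-locked split, disjoint by construction" idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of `scaffold.py`. The function names `assign_splits` and `is_patient_disjoint`, the seed-salted hash ordering, and the 70/15/15 fractions are all hypothetical.

```python
import hashlib


def assign_splits(patient_ids, seed, fractions=(0.7, 0.15, 0.15)):
    """Map each patient ID to train/val/test deterministically.

    Hypothetical sketch: patients are ranked by a seed-salted SHA-256 digest,
    so the result depends only on (seed, set of IDs), never on input order
    or on global RNG state.
    """
    unique = sorted(set(patient_ids))
    ranked = sorted(
        unique,
        key=lambda pid: hashlib.sha256(f"{seed}:{pid}".encode()).hexdigest(),
    )
    n_train = int(len(ranked) * fractions[0])
    n_val = int(len(ranked) * fractions[1])
    assignment = {}
    for i, pid in enumerate(ranked):
        if i < n_train:
            assignment[pid] = "train"
        elif i < n_train + n_val:
            assignment[pid] = "val"
        else:
            assignment[pid] = "test"
    return assignment


def is_patient_disjoint(rows):
    """rows: iterable of (patient_id, split) pairs, one per image.

    Returns True when no patient appears under two different splits.
    """
    seen = {}
    for pid, split in rows:
        if seen.setdefault(pid, split) != split:
            return False
    return True
```

Because the split is assigned per patient and every image inherits its patient's label, disjointness follows from the data structure itself; the second function only re-checks it on the emitted table.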
@@ -1,6 +1,6 @@
 # MedSci-Audit
 
-**MedSci-Audit** is the named deterministic verification layer inside [MedSci Skills](README.md): a suite of **37 stdlib-only detectors** that catch fabricated, drifted, or non-compliant content in a medical manuscript *before* it reaches a reviewer. The detectors run inside the skills that own them (e.g. `/self-review`, `/check-reporting`, `/sync-submission`, `/verify-refs`); this document names and indexes that suite so it can be cited and reasoned about as one thing.
+**MedSci-Audit** is the named deterministic verification layer inside [MedSci Skills](README.md): a suite of **38 stdlib-only detectors** that catch fabricated, drifted, or non-compliant content in a medical manuscript *before* it reaches a reviewer. The detectors run inside the skills that own them (e.g. `/self-review`, `/check-reporting`, `/sync-submission`, `/verify-refs`); this document names and indexes that suite so it can be cited and reasoned about as one thing.
 
 The detectors are **deterministic** — same input, same verdict, no LLM in the decision path — so a flagged defect is reproducible and a clean run is meaningful.
 
@@ -15,24 +15,24 @@ MedSci-Audit detectors **find** integrity problems; they deliberately do **not**
 
 The authoritative, machine-readable list is **[`metadata/detectors_catalog.json`](metadata/detectors_catalog.json)** — generated from the detectors under `skills/*/scripts/` by [`scripts/gen_detectors_catalog_json.py`](scripts/gen_detectors_catalog_json.py) and CI-gated with `--check` (it uses the same discovery glob as `validate_catalog_consistency.py`, so its `detector_count` always equals `catalog_counts.json::integrity_detectors`). Do not hand-maintain a parallel list; read the JSON.
 
-The 37 detectors fall into six audit families:
+The 38 detectors fall into six audit families:
 
 | Family | Count | Examples |
 |--------|------:|----------|
 | Numerical, cohort & pool arithmetic | 5 | `check_cohort_arithmetic`, `check_pool_consistency`, `check_artifact_coverage`, `detect_copy_divergence` |
 | Citation & reference integrity | 7 | `verify_refs`, `check_citation_keys`, `check_xref`, `check_csl_render`, `check_reference_adequacy`, `check_placeholders`, `check_reference_duplication` |
-| Style & review-process integrity | 5 | `check_classical_style`, `check_generated_code`, `check_panel_diversity`, `check_reviewer_team_consistency`, `check_paren_spans` |
+| Style & review-process integrity | 6 | `check_classical_style`, `check_generated_code`, `check_panel_diversity`, `check_reviewer_team_consistency`, `check_paren_spans`, `check_training_hygiene` |
 | Confounding, scope & estimand contracts | 4 | `check_scope_coherence`, `check_confounding_completeness`, `check_claim_artifact`, `check_null_calibration` |
 | Reporting compliance | 9 | `check_framework_naming`, `check_checklist_exists`, `check_checklist_version`, `check_prisma_figure`, `check_wordcount_cap`, `check_disclosure_availability`, `check_summary_box`, `check_supplement_hygiene`, `check_citation_order` |
 | Data preparation & validation | 7 | `check_structural_zero`, `check_reverse_coding`, `check_asset_anonymization`, `check_cross_artifact_stale`, `check_checklist_dump_leak`, `check_binning_consistency`, `check_split_leakage` |
 
 ## Evidence
 
-The suite's evaluation evidence and its current size are **two separate facts** — they are reported at different versions, and should not be collapsed into a single "37 detectors, validated by E1/E7" claim.
+The suite's evaluation evidence and its current size are **two separate facts** — they are reported at different versions, and should not be collapsed into a single "38 detectors, validated by E1/E7" claim.
 
-- **Current detector catalog: 37** (the enumerated list in `metadata/detectors_catalog.json`).
+- **Current detector catalog: 38** (the enumerated list in `metadata/detectors_catalog.json`).
 - **Canonical evaluation runs are v3.8-era and validate the then-current subset.** The seeded-defect benchmark (**E1**) is built on **19 `DefectSpec` rows / 17 deterministic injectors** ([`evaluation/h1_seeded_defects/DEFECT_RATIONALE.md`](evaluation/h1_seeded_defects/DEFECT_RATIONALE.md)), and the coverage inventory (**E7**) is **n=21** ([`evaluation/runs/canonical/E7/limitations.md`](evaluation/runs/canonical/E7/limitations.md)). Both predate the A1–A4 detectors that brought the catalog to 24. The frozen canonical runs under [`evaluation/runs/canonical/`](evaluation/runs/canonical/) are pinned to the published methods artifacts and are intentionally left unchanged.
-- **Detectors added since v3.8 are covered by their own per-skill CI tests** (e.g. `skills/sync-submission/tests/test_asset_anonymization.sh`, `skills/check-reporting/tests/test_checklist_version.sh`, `skills/write-paper/tests/test_placeholders.sh`), run on every push via [`.github/workflows/validate.yml`](.github/workflows/validate.yml) — not by a re-run of the frozen E1/E7. A refresh of E1/E7 to cover all 37 detectors is a separate evaluation effort and is **not** part of this registry.
+- **Detectors added since v3.8 are covered by their own per-skill CI tests** (e.g. `skills/sync-submission/tests/test_asset_anonymization.sh`, `skills/check-reporting/tests/test_checklist_version.sh`, `skills/write-paper/tests/test_placeholders.sh`), run on every push via [`.github/workflows/validate.yml`](.github/workflows/validate.yml) — not by a re-run of the frozen E1/E7. A refresh of E1/E7 to cover all 38 detectors is a separate evaluation effort and is **not** part of this registry.
 
 For the broader evaluation harness (E1–E9: seeded-defects, LLM baseline, cost/time, fresh-clone reproducibility, audit-trail completeness, portability, inventory, drift, self-review convergence), see [`evaluation/`](evaluation/).
 
 
@@ -2,14 +2,14 @@
 
 # MedSci Skills
 
-**46 skills that actually work.** Built by a physician-researcher, tested on real publications.
+**47 skills that actually work.** Built by a physician-researcher, tested on real publications.
 
 *MedSci Skills is a submission-grade clinical manuscript workflow, not a generic biomedical skill catalog. Its moat is the compliance layer — 38 reporting guidelines and risk-of-bias tools, reference/citation verification, and deterministic integrity gates, before peer review sees the manuscript. It competes on clinical submission reliability, not skill count.*
 
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
 [![Release](https://img.shields.io/github/v/release/Aperivue/medsci-skills?style=flat-square&color=blue)](https://github.com/Aperivue/medsci-skills/releases/latest)
 [![CI](https://img.shields.io/github/actions/workflow/status/Aperivue/medsci-skills/validate.yml?branch=main&style=flat-square&label=CI)](https://github.com/Aperivue/medsci-skills/actions/workflows/validate.yml)
-![Skills](https://img.shields.io/badge/Skills-46-brightgreen?style=flat-square)
+![Skills](https://img.shields.io/badge/Skills-47-brightgreen?style=flat-square)
 [![npm](https://img.shields.io/npm/v/medsci-skills?style=flat-square&label=npm&color=cb3837)](https://www.npmjs.com/package/medsci-skills)
 [![Watch the 2-min intro](https://img.shields.io/badge/▶_Watch-2--min_intro-FF0000?style=flat-square&logo=youtube&logoColor=white)](https://youtu.be/MclQ_RIofpE)
 [![good first issues](https://img.shields.io/github/issues/Aperivue/medsci-skills/good%20first%20issue?style=flat-square&label=good%20first%20issues&color=7057ff)](https://github.com/Aperivue/medsci-skills/contribute)
@@ -452,6 +452,7 @@ ma-scout -> search-lit -> fulltext-retrieval -> design-study ──> write-proto
 | **design-study** | Study design review: identifies analysis unit, cohort logic, data leakage risks, comparator design, validation strategy, and reporting guideline fit. |
 | **design-ai-benchmarking** | Design and validity review for benchmarking AI system(s) against a human-expert panel: evaluation-question and arm definition, decoupled multi-dimensional rubrics with anchors, planted calibration probes (positive-control / known-bad / instability / mechanism-contradiction), reviewer-panel construction with per-reviewer randomization, inter-rater reliability targets with separate control-item reliability, LLM-as-judge vs human-as-judge adjudication, construct-independence guards, and a structured JSON rating-export schema. Locks the rubric before data collection. |
 | **model-validation** | Design or audit the clinical-validation study for an engineer-built medical-imaging model (segmentation / classification / detection): patient-level split disjointness and the data-leakage taxonomy, tuning-on-test, internal vs genuine external validation, comparator design, single-run vs multi-seed variance, task-correct metric selection (Metrics Reloaded), test-set sizing, and CLAIM 2024 / TRIPOD+AI / STARD-AI reporting fit. Ships a deterministic split-leakage gate that proves patient disjointness by set arithmetic on the emitted split table. Integrates with MONAI / nnU-Net — does not replace them. |
+| **model-scaffold** | Generate a reproducible, runnable PyTorch training repo for a medical-imaging segmentation task — the missing middle link between choosing an architecture and validating a trained model. Emits a patient-level seed-locked split as an auditable artifact, a configurable U-Net, train/evaluate scripts that seed every RNG and infer under eval mode, a config, requirements, a reproducibility record, and a Methods stub with VERIFY placeholders (no fabricated numbers). Reproducibility holds by construction; ships a `check_training_hygiene` AST gate + a network-free build→validate challenge. Integrates with MONAI / nnU-Net / TorchIO — does not reimplement them. |
 | **intake-project** | Classifies new research projects, summarizes current state, identifies missing inputs, and recommends next steps. |
 | **grant-builder** | Structures grant proposals: significance, innovation, approach, milestones, and consortium roles. |
 | **present-paper** | Academic presentation preparation: paper analysis, supporting research, speaker scripts, slide note injection, and Q&A prep. |
 
@@ -32,6 +32,7 @@ One reference page per skill, generated from each skill's `SKILL.md` and `skill.
 - [manage-project](manage-project.md) — Research project management for medical manuscripts. _(evidence: manual_workflow)_
 - [manage-refs](manage-refs.md) — Cross-cutting reference manager for medical manuscripts. _(evidence: bundled_script)_
 - [meta-analysis](meta-analysis.md) — Systematic review and meta-analysis pipeline for medical research. _(evidence: demo)_
+- [model-scaffold](model-scaffold.md) — Generate a reproducible, runnable PyTorch training repo for a medical-imaging segmentation task — the missing middle link between choosing an architecture and validating a trained model. _(evidence: ci_validator)_
 - [model-validation](model-validation.md) — Design or audit the clinical-validation study for an engineer-built medical-imaging model (segmentation, classification, or detection) before the validation report or manuscript is written. _(evidence: ci_validator)_
 - [orchestrate](orchestrate.md) — General-purpose research orchestrator. _(evidence: demo)_
 - [peer-review](peer-review.md) — Peer review assistant for medical journals. _(evidence: manual_workflow)_
 
@@ -0,0 +1,53 @@
+<!-- AUTO-GENERATED from skills/model-scaffold/SKILL.md by scripts/gen_skill_docs.py. Do not edit by hand. -->
+
+# model-scaffold
+
+> Generate a reproducible, runnable PyTorch training repo for a medical-imaging segmentation task — the missing middle link between choosing an architecture and validating a trained model. Emits a patient-level seed-locked split as an auditable artifact, a configurable U-Net, train and evaluate scripts that seed every RNG and infer under eval mode, a config, requirements, a reproducibility record, and a Methods stub with VERIFY placeholders (no fabricated numbers). The reproducibility guarantees hold by construction, so the build is leakage-safe before any training runs. Integrates with MONAI, nnU-Net, and TorchIO — it does not reimplement them.
+
+**Invoke:** `/model-scaffold` · **Tools:** Read, Write, Edit, Bash, Grep, Glob · **Model:** inherit
+
+## When to use
+
+`model-scaffold` activates on requests such as: model scaffold, scaffold a model, training repo, PyTorch repo, build a model, train a segmentation model, U-Net, UNet, segmentation model, nnU-Net, MONAI, dataloader, train.py, patient-level split, reproducible training, seed everything, generate training code, medical imaging model.
+
+## Quality Card
+
+**Purpose** — Generate a leakage-safe, reproducible training repo for a medical-imaging model so the reproducibility guarantees (patient-disjoint seed-locked split, all-RNG seeding, cuDNN determinism, eval-mode inference) hold by construction rather than by hand-editing.
+
+**Safety boundaries**
+
+- The split is patient-level and seed-locked by construction (deterministic group split); the generator never emits an image-level or unseeded split.
+- No metric is fabricated — methods_stub.md carries [VERIFY] placeholders; numbers come only from the user's executed training and from model-evaluation / analyze-stats.
+
+**Known limitations**
+
+- Runnability of the generated repo (build + forward pass) is verified by an optional local torch-cpu command, not by the default CI gate (which checks the network-free parts: split disjointness + training hygiene).
+- Dataset I/O is a stub (the user plugs in their DICOM / NIfTI / TIFF reader); the generator does not read pixels.
+
+**Validation**
+
+- `python3 scripts/scaffold.py --manifest <manifest.csv> --out model_repo --seed 42`
+- `python3 scripts/check_training_hygiene.py --repo model_repo --strict`
+- `bash scripts/scaffold_challenge/verify.sh  # deterministic, network-free (torch tier self-skips)`
+
+**Evidence** — `ci_validator`
+
+## Bundled resources
+
+**References** (`skills/model-scaffold/references/`):
+
+- `training_guide.md`
+
+**Scripts** (`skills/model-scaffold/scripts/`):
+
+- `check_training_hygiene.py`
+- `scaffold.py`
+- `scaffold_challenge/` (4 files)
+
+## Source
+
+Canonical definition: [`skills/model-scaffold/SKILL.md`](../../skills/model-scaffold/SKILL.md)
+
+---
+
+*Part of [MedSci Skills](../../README.md) — Claude Code skills for the medical research lifecycle. This page is generated from the skill's `SKILL.md`; edit that file and re-run `scripts/gen_skill_docs.py`.*
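Review note on the generated page above: the "conservative AST linter (flag-not-prove)" behind `check_training_hygiene.py` might look roughly like this. It is a sketch only, not the shipped detector. The verdict strings come from the CHANGELOG; the function names, the required-call set, and the matching of `np` as the NumPy alias are assumptions.

```python
import ast

# Assumed set of seeding calls; the real detector may require more or fewer.
REQUIRED_SEED_CALLS = {"random.seed", "np.random.seed", "torch.manual_seed"}


def _dotted_name(node):
    """Return 'a.b.c' for nested Attribute/Name nodes, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def lint_training_source(source):
    """Flag (not prove) missing seeding or eval-mode calls in one script.

    The source is parsed, never executed, so the check is deterministic
    and needs neither torch nor network access.
    """
    called = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name:
                called.add(name)

    verdicts = []
    if not REQUIRED_SEED_CALLS <= called:
        verdicts.append("SEED_INCOMPLETE")
    has_eval = any(n.endswith(".eval") for n in called)
    has_no_grad = any(n.endswith("no_grad") for n in called)
    if not (has_eval and has_no_grad):
        verdicts.append("MISSING_EVAL_MODE")
    return verdicts
```

A presence check like this can only flag an absent call; it cannot show that the calls run in the right order or on the right objects, which matches the flag-not-prove framing.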
@@ -1,8 +1,8 @@
 {
   "_comment": "Single source of truth for catalog counts cited in public docs (README, orchestrate, check-reporting). scripts/validate_catalog_consistency.py recomputes every value from disk, asserts this file matches, and asserts the doc claims match. Do not hand-edit a value without running that script \u2014 CI fails on drift.",
-  "skills": 46,
+  "skills": 47,
   "reporting_guidelines": 38,
   "journal_profiles_find": 73,
   "journal_profiles_write": 55,
-  "integrity_detectors": 37
+  "integrity_detectors": 38
 }
@@ -1,6 +1,6 @@
 {
   "_comment": "AUTO-GENERATED by scripts/gen_detectors_catalog_json.py from the analysis-integrity detectors under skills/*/scripts/ (same glob as validate_catalog_consistency.py). Machine-readable registry of the MedSci-Audit detector suite (single source of truth). Do not hand-edit; CI gate: python3 scripts/gen_detectors_catalog_json.py --check.",
-  "detector_count": 37,
+  "detector_count": 38,
   "families": [
     {
       "key": "numerical_cohort",
@@ -34,7 +34,8 @@
         "check_generated_code",
         "check_panel_diversity",
         "check_paren_spans",
-        "check_reviewer_team_consistency"
+        "check_reviewer_team_consistency",
+        "check_training_hygiene"
       ]
     },
     {
@@ -301,6 +302,13 @@
       "family_label": "Reporting compliance",
       "description": "Reader-facing supplement / tables / caption hygiene gate (self-review §J supplement pass)."
     },
+    {
+      "id": "check_training_hygiene",
+      "skill": "model-scaffold",
+      "family": "style_review",
+      "family_label": "Style & review-process integrity",
+      "description": "Training-script reproducibility-hygiene linter for a generated model repo (model-scaffold)."
+    },
     {
       "id": "check_wordcount_cap",
       "skill": "sync-submission",
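Review note on the two metadata files: both `_comment` fields describe a CI gate that recomputes counts and fails on drift. A minimal sketch of that invariant is below, assuming the two JSON files are already loaded as dicts. The keys `detector_count`, `families`, and `integrity_detectors` appear in this diff; the function name `check_counts` and the per-family member key `detectors` are hypothetical, since the diff does not show that key.

```python
def check_counts(catalog, counts, members_key="detectors"):
    """Return a list of human-readable drift messages (empty when consistent).

    catalog: parsed detectors_catalog.json
    counts:  parsed catalog_counts.json
    members_key: assumed name of each family's detector-id list
    """
    problems = []
    listed = sum(len(family.get(members_key, [])) for family in catalog["families"])
    if listed != catalog["detector_count"]:
        problems.append(
            f"families list {listed} detectors, detector_count says {catalog['detector_count']}"
        )
    if catalog["detector_count"] != counts["integrity_detectors"]:
        problems.append(
            f"detector_count {catalog['detector_count']} != "
            f"integrity_detectors {counts['integrity_detectors']}"
        )
    return problems
```

Returning messages instead of raising lets a caller print every mismatch before exiting non-zero.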