Commit 90a6b58

Yoojin-nam and claude authored
feat(model-lane): v5.0 Phase 2 — model-scaffold (runnable PyTorch repo generator) + training-hygiene gate (#218)
Completes the build → validate chain in-repo, staged after Phase 1's verification contract. Clinician-anchored: a reproducible research scaffold generator that INTEGRATES MONAI/nnU-Net/TorchIO, not a replacement. Default CI stays torch-free. Additive: skills 46→47, integrity detectors 37→38; reporting_guidelines unchanged.

- /model-scaffold (Layer B): scaffold.py stamps a runnable PyTorch segmentation repo (configurable U-Net, dataset/losses/train/evaluate, config, requirements, REPRODUCIBILITY.md, methods_stub.md) with reproducibility baked in BY CONSTRUCTION: a patient-level seed-locked split written as an auditable artifact (disjoint by a deterministic group split, so it clears /model-validation's check_split_leakage), all-RNG seeding + cuDNN determinism, train-only loader, eval()+no_grad() inference. No fabricated numbers ([VERIFY] placeholders). stdlib+numpy generator (CI parity).
- check_training_hygiene.py: conservative AST linter (flag-not-prove, the training-code analogue of check_generated_code). Verdicts: SEED_INCOMPLETE / MISSING_EVAL_MODE / TRAIN_ON_NONTRAIN_SPLIT (Major), CUDNN_NONDETERMINISTIC / EVAL_SHUFFLE (Minor). style_review family.
- scaffold_challenge: build → validate executed network-free (scaffold → frozen deterministic split + inline disjointness proof → check_training_hygiene → a self-skipping torch forward tier that is SKIP, never CI coverage of runnability, when torch is absent). + CI-wired regression test (13 cases).

All CI-mirror gates green locally (validate_skills, all gen_* --check, validate_catalog_consistency, probe_sync, frontmatter, routing-assets, locale, version, npm audit, both new CI steps). Version left at 4.10.0; release is separate.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 52f7bb2 commit 90a6b58

26 files changed

Lines changed: 1377 additions & 13 deletions

.claude-plugin/marketplace.json

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@
3232
"./skills/design-ai-benchmarking",
3333
"./skills/design-study",
3434
"./skills/generate-codebook",
35+
"./skills/model-scaffold",
3536
"./skills/model-validation",
3637
"./skills/version-dataset"
3738
]

.github/workflows/validate.yml

Lines changed: 6 additions & 0 deletions
@@ -239,6 +239,12 @@ jobs:
239239
- name: Run model-validation split-leakage gate test
240240
run: bash skills/model-validation/tests/test_split_leakage.sh
241241

242+
- name: Run model-scaffold build→validate challenge
243+
run: bash skills/model-scaffold/scripts/scaffold_challenge/verify.sh
244+
245+
- name: Run model-scaffold training-hygiene gate test
246+
run: bash skills/model-scaffold/tests/test_training_hygiene.sh
247+
242248
- name: Run analyze-stats generated-code gate test
243249
run: bash skills/analyze-stats/tests/test_generated_code.sh
244250

CHANGELOG.md

Lines changed: 21 additions & 0 deletions
@@ -38,6 +38,27 @@
3838
(Major), `MISSING_SEED` (Major), `SINGLE_PARTITION` (Minor); train/validation/holdout synonyms
3939
collapse so a labelling variant never trips it. Stdlib-only, network-free, with a reproducible
4040
challenge card + CI-wired regression test. Integrity detectors 36 → 37.
41+
- **Medical-AI model-engineering lane — Phase 2 (build/scaffold).** Completes the
42+
build → validate chain in-repo, staged after Phase 1's verification contract. Clinician-anchored
43+
(a *reproducible research scaffold generator that integrates MONAI / nnU-Net*, not a replacement);
44+
default CI stays torch-free.
45+
- **New skill `/model-scaffold`** (Layer B) — `scaffold.py` stamps out a runnable PyTorch
46+
segmentation training repo (configurable U-Net, `dataset.py`, `losses.py`, `train.py`,
47+
`evaluate.py`, `config.yaml`, `requirements.txt`, `REPRODUCIBILITY.md`, `methods_stub.md`) with
48+
the reproducibility guarantees baked in **by construction**: a patient-level seed-locked split
49+
written as an auditable artifact (`splits/split_assignment.csv` + `split_seed.txt`, disjoint by
50+
construction so it clears `/model-validation`'s `check_split_leakage`), all-RNG seeding + cuDNN
51+
determinism, a train-only loader, and `eval()` + `no_grad()` inference. No fabricated numbers
52+
(`[VERIFY]` placeholders). Skills 46 → 47.
53+
- **New deterministic detector `check_training_hygiene.py`** (`/model-scaffold`) — conservative
54+
AST linter (flag-not-prove, the training-code analogue of `check_generated_code`): all RNGs
55+
seeded, cuDNN deterministic, `eval()` + `no_grad()` inference, no training on a non-train split.
56+
Verdicts `SEED_INCOMPLETE` / `MISSING_EVAL_MODE` / `TRAIN_ON_NONTRAIN_SPLIT` (Major),
57+
`CUDNN_NONDETERMINISTIC` / `EVAL_SHUFFLE` (Minor). Integrity detectors 37 → 38.
58+
- **`scaffold_challenge`** executes the build → validate chain network-free: scaffold a repo →
59+
deterministic split matches the frozen expected + is patient-disjoint (proven inline) → passes
60+
`check_training_hygiene` → a **self-skipping** torch tier (forward shape + gradients + reproducible
61+
loss when torch is installed; `SKIP`, never CI coverage of runnability, when absent).
4162

4263
## [4.10.0] - 2026-06-28
4364
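
The patient-level, seed-locked split named in the Phase 2 entry can be pictured with a minimal stdlib-only sketch. This is not the code `scaffold.py` ships: the function name `patient_level_split`, the 70/15/15 fractions, and the `(image_id, patient_id)` manifest shape are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def patient_level_split(manifest, seed, fractions=(0.7, 0.15, 0.15)):
    """Assign every image to train/val/test by patient, never by image.

    manifest: iterable of (image_id, patient_id) pairs (hypothetical shape).
    Returns {image_id: split_name}.
    """
    by_patient = defaultdict(list)
    for image_id, patient_id in manifest:
        by_patient[patient_id].append(image_id)

    # Sort before shuffling so the result depends only on the seed,
    # not on the order in which rows appear in the manifest.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n = len(patients)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],  # remainder goes to test
    }

    # Every image inherits the split of its patient, so no patient
    # can appear in two splits.
    assignment = {}
    for split, members in groups.items():
        for patient_id in members:
            for image_id in by_patient[patient_id]:
                assignment[image_id] = split
    return assignment
```

Because the shuffle runs over patient IDs rather than images, disjointness follows from how the split is built instead of from a later check. Re-running with the same seed and the same set of patients yields the same assignment.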

MEDSCI_AUDIT.md

Lines changed: 6 additions & 6 deletions
@@ -1,6 +1,6 @@
11
# MedSci-Audit
22

3-
**MedSci-Audit** is the named deterministic verification layer inside [MedSci Skills](README.md): a suite of **37 stdlib-only detectors** that catch fabricated, drifted, or non-compliant content in a medical manuscript *before* it reaches a reviewer. The detectors run inside the skills that own them (e.g. `/self-review`, `/check-reporting`, `/sync-submission`, `/verify-refs`); this document names and indexes that suite so it can be cited and reasoned about as one thing.
3+
**MedSci-Audit** is the named deterministic verification layer inside [MedSci Skills](README.md): a suite of **38 stdlib-only detectors** that catch fabricated, drifted, or non-compliant content in a medical manuscript *before* it reaches a reviewer. The detectors run inside the skills that own them (e.g. `/self-review`, `/check-reporting`, `/sync-submission`, `/verify-refs`); this document names and indexes that suite so it can be cited and reasoned about as one thing.
44

55
The detectors are **deterministic** — same input, same verdict, no LLM in the decision path — so a flagged defect is reproducible and a clean run is meaningful.
66

@@ -15,24 +15,24 @@ MedSci-Audit detectors **find** integrity problems; they deliberately do **not**
1515

1616
The authoritative, machine-readable list is **[`metadata/detectors_catalog.json`](metadata/detectors_catalog.json)** — generated from the detectors under `skills/*/scripts/` by [`scripts/gen_detectors_catalog_json.py`](scripts/gen_detectors_catalog_json.py) and CI-gated with `--check` (it uses the same discovery glob as `validate_catalog_consistency.py`, so its `detector_count` always equals `catalog_counts.json::integrity_detectors`). Do not hand-maintain a parallel list; read the JSON.
1717

18-
The 37 detectors fall into six audit families:
18+
The 38 detectors fall into six audit families:
1919

2020
| Family | Count | Examples |
2121
|--------|------:|----------|
2222
| Numerical, cohort & pool arithmetic | 5 | `check_cohort_arithmetic`, `check_pool_consistency`, `check_artifact_coverage`, `detect_copy_divergence` |
2323
| Citation & reference integrity | 7 | `verify_refs`, `check_citation_keys`, `check_xref`, `check_csl_render`, `check_reference_adequacy`, `check_placeholders`, `check_reference_duplication` |
24-
| Style & review-process integrity | 5 | `check_classical_style`, `check_generated_code`, `check_panel_diversity`, `check_reviewer_team_consistency`, `check_paren_spans` |
24+
| Style & review-process integrity | 6 | `check_classical_style`, `check_generated_code`, `check_panel_diversity`, `check_reviewer_team_consistency`, `check_paren_spans`, `check_training_hygiene` |
2525
| Confounding, scope & estimand contracts | 4 | `check_scope_coherence`, `check_confounding_completeness`, `check_claim_artifact`, `check_null_calibration` |
2626
| Reporting compliance | 9 | `check_framework_naming`, `check_checklist_exists`, `check_checklist_version`, `check_prisma_figure`, `check_wordcount_cap`, `check_disclosure_availability`, `check_summary_box`, `check_supplement_hygiene`, `check_citation_order` |
2727
| Data preparation & validation | 7 | `check_structural_zero`, `check_reverse_coding`, `check_asset_anonymization`, `check_cross_artifact_stale`, `check_checklist_dump_leak`, `check_binning_consistency`, `check_split_leakage` |
2828

2929
## Evidence
3030

31-
The suite's evaluation evidence and its current size are **two separate facts** — they are reported at different versions, and should not be collapsed into a single "37 detectors, validated by E1/E7" claim.
31+
The suite's evaluation evidence and its current size are **two separate facts** — they are reported at different versions, and should not be collapsed into a single "38 detectors, validated by E1/E7" claim.
3232

33-
- **Current detector catalog: 37** (the enumerated list in `metadata/detectors_catalog.json`).
33+
- **Current detector catalog: 38** (the enumerated list in `metadata/detectors_catalog.json`).
3434
- **Canonical evaluation runs are v3.8-era and validate the then-current subset.** The seeded-defect benchmark (**E1**) is built on **19 `DefectSpec` rows / 17 deterministic injectors** ([`evaluation/h1_seeded_defects/DEFECT_RATIONALE.md`](evaluation/h1_seeded_defects/DEFECT_RATIONALE.md)), and the coverage inventory (**E7**) is **n=21** ([`evaluation/runs/canonical/E7/limitations.md`](evaluation/runs/canonical/E7/limitations.md)). Both predate the A1–A4 detectors that brought the catalog to 24. The frozen canonical runs under [`evaluation/runs/canonical/`](evaluation/runs/canonical/) are pinned to the published methods artifacts and are intentionally left unchanged.
35-
- **Detectors added since v3.8 are covered by their own per-skill CI tests** (e.g. `skills/sync-submission/tests/test_asset_anonymization.sh`, `skills/check-reporting/tests/test_checklist_version.sh`, `skills/write-paper/tests/test_placeholders.sh`), run on every push via [`.github/workflows/validate.yml`](.github/workflows/validate.yml) — not by a re-run of the frozen E1/E7. A refresh of E1/E7 to cover all 37 detectors is a separate evaluation effort and is **not** part of this registry.
35+
- **Detectors added since v3.8 are covered by their own per-skill CI tests** (e.g. `skills/sync-submission/tests/test_asset_anonymization.sh`, `skills/check-reporting/tests/test_checklist_version.sh`, `skills/write-paper/tests/test_placeholders.sh`), run on every push via [`.github/workflows/validate.yml`](.github/workflows/validate.yml) — not by a re-run of the frozen E1/E7. A refresh of E1/E7 to cover all 38 detectors is a separate evaluation effort and is **not** part of this registry.
3636

3737
For the broader evaluation harness (E1–E9: seeded-defects, LLM baseline, cost/time, fresh-clone reproducibility, audit-trail completeness, portability, inventory, drift, self-review convergence), see [`evaluation/`](evaluation/).
3838

README.md

Lines changed: 3 additions & 2 deletions
@@ -2,14 +2,14 @@
22

33
# MedSci Skills
44

5-
**46 skills that actually work.** Built by a physician-researcher, tested on real publications.
5+
**47 skills that actually work.** Built by a physician-researcher, tested on real publications.
66

77
*MedSci Skills is a submission-grade clinical manuscript workflow, not a generic biomedical skill catalog. Its moat is the compliance layer — 38 reporting guidelines and risk-of-bias tools, reference/citation verification, and deterministic integrity gates, before peer review sees the manuscript. It competes on clinical submission reliability, not skill count.*
88

99
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
1010
[![Release](https://img.shields.io/github/v/release/Aperivue/medsci-skills?style=flat-square&color=blue)](https://github.com/Aperivue/medsci-skills/releases/latest)
1111
[![CI](https://img.shields.io/github/actions/workflow/status/Aperivue/medsci-skills/validate.yml?branch=main&style=flat-square&label=CI)](https://github.com/Aperivue/medsci-skills/actions/workflows/validate.yml)
12-
![Skills](https://img.shields.io/badge/Skills-46-brightgreen?style=flat-square)
12+
![Skills](https://img.shields.io/badge/Skills-47-brightgreen?style=flat-square)
1313
[![npm](https://img.shields.io/npm/v/medsci-skills?style=flat-square&label=npm&color=cb3837)](https://www.npmjs.com/package/medsci-skills)
1414
[![Watch the 2-min intro](https://img.shields.io/badge/▶_Watch-2--min_intro-FF0000?style=flat-square&logo=youtube&logoColor=white)](https://youtu.be/MclQ_RIofpE)
1515
[![good first issues](https://img.shields.io/github/issues/Aperivue/medsci-skills/good%20first%20issue?style=flat-square&label=good%20first%20issues&color=7057ff)](https://github.com/Aperivue/medsci-skills/contribute)
@@ -452,6 +452,7 @@ ma-scout -> search-lit -> fulltext-retrieval -> design-study ──> write-proto
452452
| **design-study** | Study design review: identifies analysis unit, cohort logic, data leakage risks, comparator design, validation strategy, and reporting guideline fit. |
453453
| **design-ai-benchmarking** | Design and validity review for benchmarking AI system(s) against a human-expert panel: evaluation-question and arm definition, decoupled multi-dimensional rubrics with anchors, planted calibration probes (positive-control / known-bad / instability / mechanism-contradiction), reviewer-panel construction with per-reviewer randomization, inter-rater reliability targets with separate control-item reliability, LLM-as-judge vs human-as-judge adjudication, construct-independence guards, and a structured JSON rating-export schema. Locks the rubric before data collection. |
454454
| **model-validation** | Design or audit the clinical-validation study for an engineer-built medical-imaging model (segmentation / classification / detection): patient-level split disjointness and the data-leakage taxonomy, tuning-on-test, internal vs genuine external validation, comparator design, single-run vs multi-seed variance, task-correct metric selection (Metrics Reloaded), test-set sizing, and CLAIM 2024 / TRIPOD+AI / STARD-AI reporting fit. Ships a deterministic split-leakage gate that proves patient disjointness by set arithmetic on the emitted split table. Integrates with MONAI / nnU-Net — does not replace them. |
455+
| **model-scaffold** | Generate a reproducible, runnable PyTorch training repo for a medical-imaging segmentation task — the missing middle link between choosing an architecture and validating a trained model. Emits a patient-level seed-locked split as an auditable artifact, a configurable U-Net, train/evaluate scripts that seed every RNG and infer under eval mode, a config, requirements, a reproducibility record, and a Methods stub with VERIFY placeholders (no fabricated numbers). Reproducibility holds by construction; ships a `check_training_hygiene` AST gate + a network-free build→validate challenge. Integrates with MONAI / nnU-Net / TorchIO — does not reimplement them. |
455456
| **intake-project** | Classifies new research projects, summarizes current state, identifies missing inputs, and recommends next steps. |
456457
| **grant-builder** | Structures grant proposals: significance, innovation, approach, milestones, and consortium roles. |
457458
| **present-paper** | Academic presentation preparation: paper analysis, supporting research, speaker scripts, slide note injection, and Q&A prep. |
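
The `model-validation` row above mentions proving patient disjointness "by set arithmetic on the emitted split table". A rough sketch of that idea follows. It is not the shipped `check_split_leakage` gate, and the column names `patient_id` and `split` are assumed here, since the diff does not show the split table's schema.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations


def find_split_leakage(csv_text):
    """Return {(split_a, split_b): shared_patient_ids} for each overlapping pair.

    An empty dict means every pair of splits is patient-disjoint.
    Column names 'patient_id' and 'split' are hypothetical.
    """
    patients_by_split = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        patients_by_split[row["split"]].add(row["patient_id"])

    leaks = {}
    for a, b in combinations(sorted(patients_by_split), 2):
        shared = patients_by_split[a] & patients_by_split[b]  # set intersection
        if shared:
            leaks[(a, b)] = shared
    return leaks
```

The check only needs the split table, not the images or the model, which is why a gate of this kind can run in a torch-free, network-free CI job.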

docs/skills/README.md

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ One reference page per skill, generated from each skill's `SKILL.md` and `skill.
3232
- [manage-project](manage-project.md) — Research project management for medical manuscripts. _(evidence: manual_workflow)_
3333
- [manage-refs](manage-refs.md) — Cross-cutting reference manager for medical manuscripts. _(evidence: bundled_script)_
3434
- [meta-analysis](meta-analysis.md) — Systematic review and meta-analysis pipeline for medical research. _(evidence: demo)_
35+
- [model-scaffold](model-scaffold.md) — Generate a reproducible, runnable PyTorch training repo for a medical-imaging segmentation task — the missing middle link between choosing an architecture and validating a trained model. _(evidence: ci_validator)_
3536
- [model-validation](model-validation.md) — Design or audit the clinical-validation study for an engineer-built medical-imaging model (segmentation, classification, or detection) before the validation report or manuscript is written. _(evidence: ci_validator)_
3637
- [orchestrate](orchestrate.md) — General-purpose research orchestrator. _(evidence: demo)_
3738
- [peer-review](peer-review.md) — Peer review assistant for medical journals. _(evidence: manual_workflow)_

docs/skills/model-scaffold.md

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
1+
<!-- AUTO-GENERATED from skills/model-scaffold/SKILL.md by scripts/gen_skill_docs.py. Do not edit by hand. -->
2+
3+
# model-scaffold
4+
5+
> Generate a reproducible, runnable PyTorch training repo for a medical-imaging segmentation task — the missing middle link between choosing an architecture and validating a trained model. Emits a patient-level seed-locked split as an auditable artifact, a configurable U-Net, train and evaluate scripts that seed every RNG and infer under eval mode, a config, requirements, a reproducibility record, and a Methods stub with VERIFY placeholders (no fabricated numbers). The reproducibility guarantees hold by construction, so the build is leakage-safe before any training runs. Integrates with MONAI, nnU-Net, and TorchIO — it does not reimplement them.
6+
7+
**Invoke:** `/model-scaffold` · **Tools:** Read, Write, Edit, Bash, Grep, Glob · **Model:** inherit
8+
9+
## When to use
10+
11+
`model-scaffold` activates on requests such as: model scaffold, scaffold a model, training repo, PyTorch repo, build a model, train a segmentation model, U-Net, UNet, segmentation model, nnU-Net, MONAI, dataloader, train.py, patient-level split, reproducible training, seed everything, generate training code, medical imaging model.
12+
13+
## Quality Card
14+
15+
**Purpose** — Generate a leakage-safe, reproducible training repo for a medical-imaging model so the reproducibility guarantees (patient-disjoint seed-locked split, all-RNG seeding, cuDNN determinism, eval-mode inference) hold by construction rather than by hand-editing.
16+
17+
**Safety boundaries**
18+
19+
- The split is patient-level and seed-locked by construction (deterministic group split); the generator never emits an image-level or unseeded split.
20+
- No metric is fabricated — methods_stub.md carries [VERIFY] placeholders; numbers come only from the user's executed training and from model-evaluation / analyze-stats.
21+
22+
**Known limitations**
23+
24+
- Runnability of the generated repo (build + forward pass) is verified by an optional local torch-cpu command, not by the default CI gate (which checks the network-free parts: split disjointness + training hygiene).
25+
- Dataset I/O is a stub (the user plugs in their DICOM / NIfTI / TIFF reader); the generator does not read pixels.
26+
27+
**Validation**
28+
29+
- `python3 scripts/scaffold.py --manifest <manifest.csv> --out model_repo --seed 42`
30+
- `python3 scripts/check_training_hygiene.py --repo model_repo --strict`
31+
- `bash scripts/scaffold_challenge/verify.sh # deterministic, network-free (torch tier self-skips)`
32+
33+
**Evidence** `ci_validator`
34+
35+
## Bundled resources
36+
37+
**References** (`skills/model-scaffold/references/`):
38+
39+
- `training_guide.md`
40+
41+
**Scripts** (`skills/model-scaffold/scripts/`):
42+
43+
- `check_training_hygiene.py`
44+
- `scaffold.py`
45+
- `scaffold_challenge/` (4 files)
46+
47+
## Source
48+
49+
Canonical definition: [`skills/model-scaffold/SKILL.md`](../../skills/model-scaffold/SKILL.md)
50+
51+
---
52+
53+
*Part of [MedSci Skills](../../README.md) — Claude Code skills for the medical research lifecycle. This page is generated from the skill's `SKILL.md`; edit that file and re-run `scripts/gen_skill_docs.py`.*
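
The "all-RNG seeding" and "cuDNN determinism" guarantees listed in the Quality Card could look roughly like the helper below. It is a sketch under stated assumptions, not the generated `train.py`: the name `seed_everything` and the optional-import pattern are illustrative, chosen so the snippet also runs where numpy or torch are not installed.

```python
import importlib
import random


def seed_everything(seed):
    """Seed every RNG that is importable; return the names that were seeded."""
    random.seed(seed)
    seeded = ["random"]

    try:
        np = importlib.import_module("numpy")
    except ImportError:
        np = None  # numpy absent: skip rather than fail
    if np is not None:
        np.random.seed(seed)
        seeded.append("numpy")

    try:
        torch = importlib.import_module("torch")
    except ImportError:
        torch = None  # torch absent: skip, mirroring a torch-free CI run
    if torch is not None:
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
        # Prefer deterministic cuDNN kernels and disable the autotuner,
        # which may otherwise pick different algorithms between runs.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        seeded.append("torch")
    return seeded
```

Calling the helper once at the top of a training script, before any model or data loader is constructed, keeps the seed in a single place. Inference code would additionally call `model.eval()` and wrap the forward pass in `torch.no_grad()`.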

metadata/catalog_counts.json

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
11
{
22
"_comment": "Single source of truth for catalog counts cited in public docs (README, orchestrate, check-reporting). scripts/validate_catalog_consistency.py recomputes every value from disk, asserts this file matches, and asserts the doc claims match. Do not hand-edit a value without running that script \u2014 CI fails on drift.",
3-
"skills": 46,
3+
"skills": 47,
44
"reporting_guidelines": 38,
55
"journal_profiles_find": 73,
66
"journal_profiles_write": 55,
7-
"integrity_detectors": 37
7+
"integrity_detectors": 38
88
}
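
The `_comment` field describes a drift check between this file and the generated detector registry. The snippet below is a deliberately reduced illustration of one such assertion, comparing `integrity_detectors` against `detector_count`. The real `scripts/validate_catalog_consistency.py` recomputes values from disk and is not reproduced here; the function name `count_drift` is hypothetical.

```python
import json


def count_drift(counts_json, detectors_json):
    """Return a list of mismatch messages; an empty list means consistent."""
    counts = json.loads(counts_json)
    catalog = json.loads(detectors_json)

    problems = []
    declared = counts["integrity_detectors"]
    generated = catalog["detector_count"]
    if declared != generated:
        problems.append(
            f"integrity_detectors={declared} but detector_count={generated}"
        )
    return problems
```

For the values in this commit, both files state 38, so a check of this shape would report no drift.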

metadata/detectors_catalog.json

Lines changed: 10 additions & 2 deletions
@@ -1,6 +1,6 @@
11
{
22
"_comment": "AUTO-GENERATED by scripts/gen_detectors_catalog_json.py from the analysis-integrity detectors under skills/*/scripts/ (same glob as validate_catalog_consistency.py). Machine-readable registry of the MedSci-Audit detector suite (single source of truth). Do not hand-edit; CI gate: python3 scripts/gen_detectors_catalog_json.py --check.",
3-
"detector_count": 37,
3+
"detector_count": 38,
44
"families": [
55
{
66
"key": "numerical_cohort",
@@ -34,7 +34,8 @@
3434
"check_generated_code",
3535
"check_panel_diversity",
3636
"check_paren_spans",
37-
"check_reviewer_team_consistency"
37+
"check_reviewer_team_consistency",
38+
"check_training_hygiene"
3839
]
3940
},
4041
{
@@ -301,6 +302,13 @@
301302
"family_label": "Reporting compliance",
302303
"description": "Reader-facing supplement / tables / caption hygiene gate (self-review §J supplement pass)."
303304
},
305+
{
306+
"id": "check_training_hygiene",
307+
"skill": "model-scaffold",
308+
"family": "style_review",
309+
"family_label": "Style & review-process integrity",
310+
"description": "Training-script reproducibility-hygiene linter for a generated model repo (model-scaffold)."
311+
},
304312
{
305313
"id": "check_wordcount_cap",
306314
"skill": "sync-submission",
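
To make the "training-script reproducibility-hygiene linter" description concrete, here is a small flag-not-prove sketch built on Python's `ast` module. It covers only one verdict name from the commit message, `MISSING_EVAL_MODE`, and the rule it applies (both an `.eval()` call and a `no_grad` call must appear somewhere in the file) is an assumption made for illustration, not the logic of the shipped `check_training_hygiene.py`.

```python
import ast


def check_eval_hygiene(source):
    """Return ['MISSING_EVAL_MODE'] unless both .eval() and no_grad() calls appear.

    Purely syntactic: it flags an absence, it does not prove that
    inference actually runs under eval mode.
    """
    has_eval_call = False
    has_no_grad = False
    for node in ast.walk(ast.parse(source)):
        # Match attribute calls such as model.eval() or torch.no_grad(),
        # including no_grad() used as a context manager or decorator.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr == "eval":
                has_eval_call = True
            elif node.func.attr == "no_grad":
                has_no_grad = True

    if has_eval_call and has_no_grad:
        return []
    return ["MISSING_EVAL_MODE"]
```

Because the check walks the syntax tree instead of executing the script, it needs neither torch nor a GPU. That is consistent with the commit's statement that default CI stays torch-free.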
