Completes the build → validate chain in-repo, staged after Phase 1's verification
contract. Clinician-anchored: a reproducible research scaffold generator that
INTEGRATES MONAI/nnU-Net/TorchIO, not a replacement. Default CI stays torch-free.
Additive: skills 46→47, integrity detectors 37→38; reporting_guidelines unchanged.
- /model-scaffold (Layer B): scaffold.py stamps a runnable PyTorch segmentation repo
(configurable U-Net, dataset/losses/train/evaluate, config, requirements,
REPRODUCIBILITY.md, methods_stub.md) with reproducibility baked in BY CONSTRUCTION:
a patient-level seed-locked split written as an auditable artifact (disjoint by a
deterministic group split, so it clears /model-validation's check_split_leakage),
all-RNG seeding + cuDNN determinism, train-only loader, eval()+no_grad() inference.
No fabricated numbers ([VERIFY] placeholders). stdlib+numpy generator (CI parity).
- check_training_hygiene.py: conservative AST linter (flag-not-prove, the training-code
analogue of check_generated_code) — SEED_INCOMPLETE / MISSING_EVAL_MODE /
TRAIN_ON_NONTRAIN_SPLIT (Major), CUDNN_NONDETERMINISTIC / EVAL_SHUFFLE (Minor).
style_review family.
- scaffold_challenge: build → validate executed network-free (scaffold → frozen
deterministic split + inline disjointness proof → check_training_hygiene → a
torch forward-pass tier that self-skips when torch is absent; a SKIP is never
counted as CI coverage of runnability). Adds a CI-wired regression test (13 cases).
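
A minimal sketch of what a patient-level, seed-locked split with an inline disjointness proof can look like, using only the standard library. It is an illustration and not the code in scaffold.py: the names `patient_level_split` and `assert_disjoint`, the `val_frac` parameter, and the hash-based ordering are all assumptions made for this example.

```python
import hashlib


def patient_level_split(patient_ids, seed=0, val_frac=0.2):
    """Assign each patient (never each image) to 'train' or 'val'.

    Hypothetical helper: patients are ordered by a hash of (seed, patient id),
    so the result depends only on the seed and the set of ids, not on input
    order or on how many images each patient contributes.
    """
    def order_key(pid):
        return hashlib.sha256(f"{seed}:{pid}".encode("utf-8")).hexdigest()

    ordered = sorted(set(patient_ids), key=order_key)
    n_val = max(1, round(len(ordered) * val_frac))
    return {"val": set(ordered[:n_val]), "train": set(ordered[n_val:])}


def assert_disjoint(split):
    """Disjointness check by set arithmetic on the emitted split."""
    overlap = split["train"] & split["val"]
    if overlap:
        raise ValueError(f"patient leakage across splits: {sorted(overlap)}")


if __name__ == "__main__":
    # One entry per image; patients P001 and P002 each have two images.
    image_patients = ["P001", "P001", "P002", "P002", "P003", "P004", "P005"]
    split = patient_level_split(image_patients, seed=42)
    assert_disjoint(split)
    print({name: sorted(ids) for name, ids in split.items()})
```

Because the split is keyed on the patient id, every image of a patient lands on the same side, which is the property a downstream leakage check would look for.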
All CI-mirror gates green locally (validate_skills, all gen_* --check,
validate_catalog_consistency, probe_sync, frontmatter, routing-assets, locale,
version, npm audit, both new CI steps). Version left at 4.10.0 — release is separate.
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
MEDSCI_AUDIT.md: 6 additions & 6 deletions
@@ -1,6 +1,6 @@
 # MedSci-Audit

-**MedSci-Audit** is the named deterministic verification layer inside [MedSci Skills](README.md): a suite of **37 stdlib-only detectors** that catch fabricated, drifted, or non-compliant content in a medical manuscript *before* it reaches a reviewer. The detectors run inside the skills that own them (e.g. `/self-review`, `/check-reporting`, `/sync-submission`, `/verify-refs`); this document names and indexes that suite so it can be cited and reasoned about as one thing.
+**MedSci-Audit** is the named deterministic verification layer inside [MedSci Skills](README.md): a suite of **38 stdlib-only detectors** that catch fabricated, drifted, or non-compliant content in a medical manuscript *before* it reaches a reviewer. The detectors run inside the skills that own them (e.g. `/self-review`, `/check-reporting`, `/sync-submission`, `/verify-refs`); this document names and indexes that suite so it can be cited and reasoned about as one thing.

 The detectors are **deterministic** — same input, same verdict, no LLM in the decision path — so a flagged defect is reproducible and a clean run is meaningful.

@@ -15,24 +15,24 @@ MedSci-Audit detectors **find** integrity problems; they deliberately do **not**

 The authoritative, machine-readable list is **[`metadata/detectors_catalog.json`](metadata/detectors_catalog.json)** — generated from the detectors under `skills/*/scripts/` by [`scripts/gen_detectors_catalog_json.py`](scripts/gen_detectors_catalog_json.py) and CI-gated with `--check` (it uses the same discovery glob as `validate_catalog_consistency.py`, so its `detector_count` always equals `catalog_counts.json::integrity_detectors`). Do not hand-maintain a parallel list; read the JSON.
-The suite's evaluation evidence and its current size are **two separate facts** — they are reported at different versions, and should not be collapsed into a single "37 detectors, validated by E1/E7" claim.
+The suite's evaluation evidence and its current size are **two separate facts** — they are reported at different versions, and should not be collapsed into a single "38 detectors, validated by E1/E7" claim.

-- **Current detector catalog: 37** (the enumerated list in `metadata/detectors_catalog.json`).
+- **Current detector catalog: 38** (the enumerated list in `metadata/detectors_catalog.json`).
 - **Canonical evaluation runs are v3.8-era and validate the then-current subset.** The seeded-defect benchmark (**E1**) is built on **19 `DefectSpec` rows / 17 deterministic injectors** ([`evaluation/h1_seeded_defects/DEFECT_RATIONALE.md`](evaluation/h1_seeded_defects/DEFECT_RATIONALE.md)), and the coverage inventory (**E7**) is **n=21** ([`evaluation/runs/canonical/E7/limitations.md`](evaluation/runs/canonical/E7/limitations.md)). Both predate the A1–A4 detectors that brought the catalog to 24. The frozen canonical runs under [`evaluation/runs/canonical/`](evaluation/runs/canonical/) are pinned to the published methods artifacts and are intentionally left unchanged.
-- **Detectors added since v3.8 are covered by their own per-skill CI tests** (e.g. `skills/sync-submission/tests/test_asset_anonymization.sh`, `skills/check-reporting/tests/test_checklist_version.sh`, `skills/write-paper/tests/test_placeholders.sh`), run on every push via [`.github/workflows/validate.yml`](.github/workflows/validate.yml) — not by a re-run of the frozen E1/E7. A refresh of E1/E7 to cover all 37 detectors is a separate evaluation effort and is **not** part of this registry.
+- **Detectors added since v3.8 are covered by their own per-skill CI tests** (e.g. `skills/sync-submission/tests/test_asset_anonymization.sh`, `skills/check-reporting/tests/test_checklist_version.sh`, `skills/write-paper/tests/test_placeholders.sh`), run on every push via [`.github/workflows/validate.yml`](.github/workflows/validate.yml) — not by a re-run of the frozen E1/E7. A refresh of E1/E7 to cover all 38 detectors is a separate evaluation effort and is **not** part of this registry.

 For the broader evaluation harness (E1–E9: seeded-defects, LLM baseline, cost/time, fresh-clone reproducibility, audit-trail completeness, portability, inventory, drift, self-review convergence), see [`evaluation/`](evaluation/).
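
The `--check` gate described in this file for `metadata/detectors_catalog.json` follows a common pattern: regenerate the artifact in memory and fail if the committed file differs. The sketch below illustrates that pattern only. The function names, the JSON layout, and in particular the `check_*.py` discovery glob are assumptions for this example; the real glob used by `gen_detectors_catalog_json.py` is not shown here.

```python
import json
from pathlib import Path


def build_catalog(repo_root):
    """Discover detector scripts; the count is derived from the same list."""
    root = Path(repo_root)
    # ASSUMPTION: hypothetical glob, chosen only to make the sketch concrete.
    detectors = sorted(
        p.relative_to(root).as_posix()
        for p in root.glob("skills/*/scripts/check_*.py")
    )
    return {"detector_count": len(detectors), "detectors": detectors}


def render(catalog):
    """Serialize deterministically so a byte comparison is meaningful."""
    return json.dumps(catalog, indent=2, sort_keys=True) + "\n"


def write_catalog(repo_root, catalog_path):
    """Generator mode: overwrite the committed catalog."""
    Path(catalog_path).write_text(render(build_catalog(repo_root)), encoding="utf-8")


def check_catalog(repo_root, catalog_path):
    """Check mode: True only if the file on disk matches a fresh regeneration."""
    path = Path(catalog_path)
    on_disk = path.read_text(encoding="utf-8") if path.exists() else ""
    return on_disk == render(build_catalog(repo_root))
```

Deriving `detector_count` from the discovered list, instead of storing it separately, is what keeps a count and its enumeration from drifting apart in a design like this.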
README.md: 3 additions & 2 deletions
@@ -2,14 +2,14 @@

 # MedSci Skills

-**46 skills that actually work.** Built by a physician-researcher, tested on real publications.
+**47 skills that actually work.** Built by a physician-researcher, tested on real publications.

 *MedSci Skills is a submission-grade clinical manuscript workflow, not a generic biomedical skill catalog. Its moat is the compliance layer — 38 reporting guidelines and risk-of-bias tools, reference/citation verification, and deterministic integrity gates, before peer review sees the manuscript. It competes on clinical submission reliability, not skill count.*
 [](https://youtu.be/MclQ_RIofpE)
 [](https://github.com/Aperivue/medsci-skills/contribute)
 |**design-study**| Study design review: identifies analysis unit, cohort logic, data leakage risks, comparator design, validation strategy, and reporting guideline fit. |
 |**design-ai-benchmarking**| Design and validity review for benchmarking AI system(s) against a human-expert panel: evaluation-question and arm definition, decoupled multi-dimensional rubrics with anchors, planted calibration probes (positive-control / known-bad / instability / mechanism-contradiction), reviewer-panel construction with per-reviewer randomization, inter-rater reliability targets with separate control-item reliability, LLM-as-judge vs human-as-judge adjudication, construct-independence guards, and a structured JSON rating-export schema. Locks the rubric before data collection. |
 |**model-validation**| Design or audit the clinical-validation study for an engineer-built medical-imaging model (segmentation / classification / detection): patient-level split disjointness and the data-leakage taxonomy, tuning-on-test, internal vs genuine external validation, comparator design, single-run vs multi-seed variance, task-correct metric selection (Metrics Reloaded), test-set sizing, and CLAIM 2024 / TRIPOD+AI / STARD-AI reporting fit. Ships a deterministic split-leakage gate that proves patient disjointness by set arithmetic on the emitted split table. Integrates with MONAI / nnU-Net — does not replace them. |
+|**model-scaffold**| Generate a reproducible, runnable PyTorch training repo for a medical-imaging segmentation task — the missing middle link between choosing an architecture and validating a trained model. Emits a patient-level seed-locked split as an auditable artifact, a configurable U-Net, train/evaluate scripts that seed every RNG and infer under eval mode, a config, requirements, a reproducibility record, and a Methods stub with VERIFY placeholders (no fabricated numbers). Reproducibility holds by construction; ships a `check_training_hygiene` AST gate + a network-free build→validate challenge. Integrates with MONAI / nnU-Net / TorchIO — does not reimplement them. |
 |**intake-project**| Classifies new research projects, summarizes current state, identifies missing inputs, and recommends next steps. |
 |**grant-builder**| Structures grant proposals: significance, innovation, approach, milestones, and consortium roles. |
 |**present-paper**| Academic presentation preparation: paper analysis, supporting research, speaker scripts, slide note injection, and Q&A prep. |
docs/skills/README.md: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ One reference page per skill, generated from each skill's `SKILL.md` and `skill.
 - [manage-project](manage-project.md) — Research project management for medical manuscripts. _(evidence: manual_workflow)_
 - [manage-refs](manage-refs.md) — Cross-cutting reference manager for medical manuscripts. _(evidence: bundled_script)_
 - [meta-analysis](meta-analysis.md) — Systematic review and meta-analysis pipeline for medical research. _(evidence: demo)_
+- [model-scaffold](model-scaffold.md) — Generate a reproducible, runnable PyTorch training repo for a medical-imaging segmentation task — the missing middle link between choosing an architecture and validating a trained model. _(evidence: ci_validator)_
 - [model-validation](model-validation.md) — Design or audit the clinical-validation study for an engineer-built medical-imaging model (segmentation, classification, or detection) before the validation report or manuscript is written. _(evidence: ci_validator)_
 - [orchestrate](orchestrate.md) — General-purpose research orchestrator. _(evidence: demo)_
 - [peer-review](peer-review.md) — Peer review assistant for medical journals. _(evidence: manual_workflow)_
+<!-- AUTO-GENERATED from skills/model-scaffold/SKILL.md by scripts/gen_skill_docs.py. Do not edit by hand. -->
+
+# model-scaffold
+
+> Generate a reproducible, runnable PyTorch training repo for a medical-imaging segmentation task — the missing middle link between choosing an architecture and validating a trained model. Emits a patient-level seed-locked split as an auditable artifact, a configurable U-Net, train and evaluate scripts that seed every RNG and infer under eval mode, a config, requirements, a reproducibility record, and a Methods stub with VERIFY placeholders (no fabricated numbers). The reproducibility guarantees hold by construction, so the build is leakage-safe before any training runs. Integrates with MONAI, nnU-Net, and TorchIO — it does not reimplement them.
+`model-scaffold` activates on requests such as: model scaffold, scaffold a model, training repo, PyTorch repo, build a model, train a segmentation model, U-Net, UNet, segmentation model, nnU-Net, MONAI, dataloader, train.py, patient-level split, reproducible training, seed everything, generate training code, medical imaging model.
+
+## Quality Card
+
+**Purpose** — Generate a leakage-safe, reproducible training repo for a medical-imaging model so the reproducibility guarantees (patient-disjoint seed-locked split, all-RNG seeding, cuDNN determinism, eval-mode inference) hold by construction rather than by hand-editing.
+
+**Safety boundaries**
+
+- The split is patient-level and seed-locked by construction (deterministic group split); the generator never emits an image-level or unseeded split.
+- No metric is fabricated — methods_stub.md carries [VERIFY] placeholders; numbers come only from the user's executed training and from model-evaluation / analyze-stats.
+
+**Known limitations**
+
+- Runnability of the generated repo (build + forward pass) is verified by an optional local torch-cpu command, not by the default CI gate (which checks the network-free parts: split disjointness + training hygiene).
+- Dataset I/O is a stub (the user plugs in their DICOM / NIfTI / TIFF reader); the generator does not read pixels.
+*Part of [MedSci Skills](../../README.md) — Claude Code skills for the medical research lifecycle. This page is generated from the skill's `SKILL.md`; edit that file and re-run `scripts/gen_skill_docs.py`.*
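
The training-hygiene check referenced in the page above is described elsewhere in this commit as a conservative AST linter that flags and does not prove. As a rough illustration of that style of check, the sketch below flags inference-like functions that never call `.eval()`. It is not the code in `check_training_hygiene.py`: the function name `find_missing_eval_mode`, the name-based heuristic, and the hint list are assumptions for this example.

```python
import ast

# ASSUMPTION: a function is treated as "inference-like" purely by its name.
INFERENCE_HINTS = ("eval", "infer", "predict")


def find_missing_eval_mode(source):
    """Return (function name, line number) for inference-like functions
    that contain no `<something>.eval()` call anywhere in their body."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if not any(hint in node.name.lower() for hint in INFERENCE_HINTS):
            continue
        calls_eval = any(
            isinstance(child, ast.Call)
            and isinstance(child.func, ast.Attribute)
            and child.func.attr == "eval"
            for child in ast.walk(node)
        )
        if not calls_eval:
            findings.append((node.name, node.lineno))
    return findings


if __name__ == "__main__":
    snippet = (
        "def evaluate(model, loader):\n"
        "    for batch in loader:\n"
        "        model(batch)\n"
    )
    for name, line in find_missing_eval_mode(snippet):
        print(f"MISSING_EVAL_MODE: {name} (line {line})")
```

A check like this only parses the source and never imports torch, which is consistent with keeping the default CI torch-free; it can miss cases (for example, `.eval()` called by the caller), so a finding is a prompt for review, not a verdict.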
"_comment": "Single source of truth for catalog counts cited in public docs (README, orchestrate, check-reporting). scripts/validate_catalog_consistency.py recomputes every value from disk, asserts this file matches, and asserts the doc claims match. Do not hand-edit a value without running that script \u2014 CI fails on drift.",
metadata/detectors_catalog.json: 10 additions & 2 deletions
1
1
{
2
2
"_comment": "AUTO-GENERATED by scripts/gen_detectors_catalog_json.py from the analysis-integrity detectors under skills/*/scripts/ (same glob as validate_catalog_consistency.py). Machine-readable registry of the MedSci-Audit detector suite (single source of truth). Do not hand-edit; CI gate: python3 scripts/gen_detectors_catalog_json.py --check.",