
Commit e274b08

Yoojin-nam and claude authored
feat(model-lane): v5.0 Phase 3 — model-card (Model Card + Datasheet + METRIC) + completeness gate (#220)
The reporting/documentation seam of the model-engineering lane, after validation (Phase 1) and build (Phase 2). Clinician-anchored, additive: skills 48→49, detectors 38→39; reporting_guidelines UNCHANGED (Model Card/Datasheet are documentation standards vendored as uncounted templates, per the appraisal_tools precedent; Codex policy).

- /model-card (Layer C): generate a Model Card (Mitchell 2019) + dataset Datasheet (Gebru 2021) + a METRIC-informed data-quality pass (Schwabe 2024), filled from user-supplied facts, never fabricated (intended use, out-of-scope, training data, per-subgroup performance, caveats, provenance, consent, licence; unknown stays [NEEDS INPUT]). Mirrors version-dataset structurally (generate + verify).
- check_model_card_complete.py: presence gate. Every required Model Card/Datasheet section must be present and non-empty (not missing, not an unfilled placeholder). Verdicts MISSING_SECTION / EMPTY_REQUIRED_SECTION (Major); presence, not truth. Flattens bodies + strips whole placeholder spans + bold field-labels so wrapped [NEEDS INPUT] reads as unfilled; N/A and None count as filled answers. reporting_compliance family.
- references/ templates (uncounted): model_card_template.md, datasheet_template.md, metric_dimensions.md.
- Reproducible challenge (synthetic complete + incomplete fixtures) + CI-wired regression test (8 cases).

All CI-mirror gates green locally (validate_skills, all gen_* --check, validate_catalog_consistency, frontmatter, routing-assets, locale, version, npm, both new CI steps). Version left at 4.10.0; release is a separate gated step.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
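The presence-gate behaviour described in the commit message can be illustrated with a minimal sketch. This is not the shipped `check_model_card_complete.py`: the helper names (`split_sections`, `is_filled`, `check`), the regular expressions, and the section titles in the usage note are assumptions made for illustration. Only the verdict names and the stated rules (flatten wrapped bodies, strip `[NEEDS INPUT]` spans and bold field-labels, treat `N/A` and `None` as filled) come from the message.

```python
import re

# Assumed patterns; the real detector may recognise placeholders differently.
PLACEHOLDER = re.compile(r"\[\s*NEEDS\s+INPUT\s*\]", re.IGNORECASE)
BOLD_LABEL = re.compile(r"\*\*[^*]+\*\*:?")
HEADING = re.compile(r"^#{1,6}\s+(.*?)\s*$")


def split_sections(markdown):
    """Map each heading title (lowercased) to the text that follows it."""
    sections = {}
    current = None
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(1).strip().lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body) for title, body in sections.items()}


def is_filled(body):
    """True if any word character survives after removing placeholders and labels."""
    flat = " ".join(body.split())      # flatten so a wrapped placeholder becomes one span
    flat = PLACEHOLDER.sub("", flat)   # drop whole [NEEDS INPUT] spans
    flat = BOLD_LABEL.sub("", flat)    # drop bold field-labels such as **Licence:**
    return bool(re.search(r"\w", flat))


def check(markdown, required):
    """Return (verdict, section) pairs; checks presence only, never truth."""
    sections = split_sections(markdown)
    findings = []
    for title in required:
        key = title.lower()
        if key not in sections:
            findings.append(("MISSING_SECTION", title))
        elif not is_filled(sections[key]):
            findings.append(("EMPTY_REQUIRED_SECTION", title))
    return findings
```

Under these assumptions, a card whose `Licence` section contains only `**Licence:** [NEEDS` wrapped onto `INPUT]` is reported as `EMPTY_REQUIRED_SECTION`, a section answered with `N/A` passes, and a required section with no heading at all is reported as `MISSING_SECTION`.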
1 parent a0d18ba commit e274b08

26 files changed

Lines changed: 890 additions & 12 deletions

.claude-plugin/marketplace.json

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@
3333
"./skills/design-ai-benchmarking",
3434
"./skills/design-study",
3535
"./skills/generate-codebook",
36+
"./skills/model-card",
3637
"./skills/model-scaffold",
3738
"./skills/model-validation",
3839
"./skills/version-dataset"

.github/workflows/validate.yml

Lines changed: 6 additions & 0 deletions
@@ -245,6 +245,12 @@ jobs:
245245
- name: Run model-scaffold training-hygiene gate test
246246
run: bash skills/model-scaffold/tests/test_training_hygiene.sh
247247

248+
- name: Run model-card completeness challenge
249+
run: bash skills/model-card/scripts/check_model_card_complete_challenge/verify.sh
250+
251+
- name: Run model-card completeness gate test
252+
run: bash skills/model-card/tests/test_model_card_complete.sh
253+
248254
- name: Run analyze-stats generated-code gate test
249255
run: bash skills/analyze-stats/tests/test_generated_code.sh
250256

CHANGELOG.md

Lines changed: 15 additions & 0 deletions
@@ -69,6 +69,21 @@
6969
MedSAM2 / TotalSegmentator / SegVol / BiomedCLIP / DINO / MAE / SimCLR / MoCo) families. Every
7070
recommendation names its source paper; it teaches archetypes, not a live SOTA leaderboard. Skills
7171
47 → 48.
72+
- **Medical-AI model-engineering lane — Phase 3 (reporting).** The documentation seam of the lane,
73+
after validation (Phase 1) and build (Phase 2). Clinician-anchored, additive.
74+
- **New skill `/model-card`** (Layer C) — generate the documentation an engineer-built model must
75+
carry: a **Model Card** (Mitchell et al., *FAccT* 2019), a dataset **Datasheet** (Gebru et al.,
76+
*CACM* 2021), and a **METRIC-informed data-quality pass** (Schwabe et al., *npj Digit Med* 2024),
77+
filled from user-supplied facts — never fabricated (intended use, out-of-scope use, training data,
78+
per-subgroup performance, caveats, provenance, consent, licence). Templates live in `references/`
79+
and are **uncounted** (documentation standards, not clinical reporting checklists — same treatment
80+
as `appraisal_tools/METRICS.md`), so `reporting_guidelines` is unchanged. Skills 48 → 49.
81+
- **New deterministic detector `check_model_card_complete.py`** (`/model-card`) — verifies every
82+
required Model Card / Datasheet section is **present and non-empty** (not missing, not an unfilled
83+
`[NEEDS INPUT]` placeholder). Verdicts `MISSING_SECTION` / `EMPTY_REQUIRED_SECTION` (Major); a
84+
presence check, not a truth check. `reporting_compliance` family. Integrity detectors 38 → 39.
85+
- Reproducible challenge (`check_model_card_complete_challenge`, synthetic complete + incomplete
86+
fixtures) + CI-wired regression test (8 cases).
7287

7388
## [4.10.0] - 2026-06-28
7489

MEDSCI_AUDIT.md

Lines changed: 6 additions & 6 deletions
@@ -1,6 +1,6 @@
11
# MedSci-Audit
22

3-
**MedSci-Audit** is the named deterministic verification layer inside [MedSci Skills](README.md): a suite of **38 stdlib-only detectors** that catch fabricated, drifted, or non-compliant content in a medical manuscript *before* it reaches a reviewer. The detectors run inside the skills that own them (e.g. `/self-review`, `/check-reporting`, `/sync-submission`, `/verify-refs`); this document names and indexes that suite so it can be cited and reasoned about as one thing.
3+
**MedSci-Audit** is the named deterministic verification layer inside [MedSci Skills](README.md): a suite of **39 stdlib-only detectors** that catch fabricated, drifted, or non-compliant content in a medical manuscript *before* it reaches a reviewer. The detectors run inside the skills that own them (e.g. `/self-review`, `/check-reporting`, `/sync-submission`, `/verify-refs`); this document names and indexes that suite so it can be cited and reasoned about as one thing.
44

55
The detectors are **deterministic** — same input, same verdict, no LLM in the decision path — so a flagged defect is reproducible and a clean run is meaningful.
66

@@ -15,24 +15,24 @@ MedSci-Audit detectors **find** integrity problems; they deliberately do **not**
1515

1616
The authoritative, machine-readable list is **[`metadata/detectors_catalog.json`](metadata/detectors_catalog.json)** — generated from the detectors under `skills/*/scripts/` by [`scripts/gen_detectors_catalog_json.py`](scripts/gen_detectors_catalog_json.py) and CI-gated with `--check` (it uses the same discovery glob as `validate_catalog_consistency.py`, so its `detector_count` always equals `catalog_counts.json::integrity_detectors`). Do not hand-maintain a parallel list; read the JSON.
1717

18-
The 38 detectors fall into six audit families:
18+
The 39 detectors fall into six audit families:
1919

2020
| Family | Count | Examples |
2121
|--------|------:|----------|
2222
| Numerical, cohort & pool arithmetic | 5 | `check_cohort_arithmetic`, `check_pool_consistency`, `check_artifact_coverage`, `detect_copy_divergence` |
2323
| Citation & reference integrity | 7 | `verify_refs`, `check_citation_keys`, `check_xref`, `check_csl_render`, `check_reference_adequacy`, `check_placeholders`, `check_reference_duplication` |
2424
| Style & review-process integrity | 6 | `check_classical_style`, `check_generated_code`, `check_panel_diversity`, `check_reviewer_team_consistency`, `check_paren_spans`, `check_training_hygiene` |
2525
| Confounding, scope & estimand contracts | 4 | `check_scope_coherence`, `check_confounding_completeness`, `check_claim_artifact`, `check_null_calibration` |
26-
| Reporting compliance | 9 | `check_framework_naming`, `check_checklist_exists`, `check_checklist_version`, `check_prisma_figure`, `check_wordcount_cap`, `check_disclosure_availability`, `check_summary_box`, `check_supplement_hygiene`, `check_citation_order` |
26+
| Reporting compliance | 10 | `check_framework_naming`, `check_checklist_exists`, `check_checklist_version`, `check_prisma_figure`, `check_wordcount_cap`, `check_disclosure_availability`, `check_summary_box`, `check_supplement_hygiene`, `check_citation_order`, `check_model_card_complete` |
2727
| Data preparation & validation | 7 | `check_structural_zero`, `check_reverse_coding`, `check_asset_anonymization`, `check_cross_artifact_stale`, `check_checklist_dump_leak`, `check_binning_consistency`, `check_split_leakage` |
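The family table can be cross-checked by simple arithmetic: the six family counts should sum to the headline figure. The short sketch below uses only the numbers printed in the table (with Reporting compliance at 9 before this commit and 10 after); the variable names are illustrative.

```python
# Family counts as listed in the updated table.
family_counts = {
    "Numerical, cohort & pool arithmetic": 5,
    "Citation & reference integrity": 7,
    "Style & review-process integrity": 6,
    "Confounding, scope & estimand contracts": 4,
    "Reporting compliance": 10,
    "Data preparation & validation": 7,
}

total_after = sum(family_counts.values())

# Before this commit the Reporting compliance family had 9 detectors.
total_before = total_after - family_counts["Reporting compliance"] + 9

print(total_before, total_after)
```

The sum is 5 + 7 + 6 + 4 + 10 + 7 = 39, and substituting the old Reporting compliance count of 9 gives 38, which matches the 38 → 39 change stated elsewhere in the commit.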
2828

2929
## Evidence
3030

31-
The suite's evaluation evidence and its current size are **two separate facts** — they are reported at different versions, and should not be collapsed into a single "38 detectors, validated by E1/E7" claim.
31+
The suite's evaluation evidence and its current size are **two separate facts** — they are reported at different versions, and should not be collapsed into a single "39 detectors, validated by E1/E7" claim.
3232

33-
- **Current detector catalog: 38** (the enumerated list in `metadata/detectors_catalog.json`).
33+
- **Current detector catalog: 39** (the enumerated list in `metadata/detectors_catalog.json`).
3434
- **Canonical evaluation runs are v3.8-era and validate the then-current subset.** The seeded-defect benchmark (**E1**) is built on **19 `DefectSpec` rows / 17 deterministic injectors** ([`evaluation/h1_seeded_defects/DEFECT_RATIONALE.md`](evaluation/h1_seeded_defects/DEFECT_RATIONALE.md)), and the coverage inventory (**E7**) is **n=21** ([`evaluation/runs/canonical/E7/limitations.md`](evaluation/runs/canonical/E7/limitations.md)). Both predate the A1–A4 detectors that brought the catalog to 24. The frozen canonical runs under [`evaluation/runs/canonical/`](evaluation/runs/canonical/) are pinned to the published methods artifacts and are intentionally left unchanged.
35-
- **Detectors added since v3.8 are covered by their own per-skill CI tests** (e.g. `skills/sync-submission/tests/test_asset_anonymization.sh`, `skills/check-reporting/tests/test_checklist_version.sh`, `skills/write-paper/tests/test_placeholders.sh`), run on every push via [`.github/workflows/validate.yml`](.github/workflows/validate.yml) — not by a re-run of the frozen E1/E7. A refresh of E1/E7 to cover all 38 detectors is a separate evaluation effort and is **not** part of this registry.
35+
- **Detectors added since v3.8 are covered by their own per-skill CI tests** (e.g. `skills/sync-submission/tests/test_asset_anonymization.sh`, `skills/check-reporting/tests/test_checklist_version.sh`, `skills/write-paper/tests/test_placeholders.sh`), run on every push via [`.github/workflows/validate.yml`](.github/workflows/validate.yml) — not by a re-run of the frozen E1/E7. A refresh of E1/E7 to cover all 39 detectors is a separate evaluation effort and is **not** part of this registry.
3636

3737
For the broader evaluation harness (E1–E9: seeded-defects, LLM baseline, cost/time, fresh-clone reproducibility, audit-trail completeness, portability, inventory, drift, self-review convergence), see [`evaluation/`](evaluation/).
3838

README.md

Lines changed: 3 additions & 2 deletions
@@ -2,14 +2,14 @@
22

33
# MedSci Skills
44

5-
**48 skills that actually work.** Built by a physician-researcher, tested on real publications.
5+
**49 skills that actually work.** Built by a physician-researcher, tested on real publications.
66

77
*MedSci Skills is a submission-grade clinical manuscript workflow, not a generic biomedical skill catalog. Its moat is the compliance layer — 38 reporting guidelines and risk-of-bias tools, reference/citation verification, and deterministic integrity gates, before peer review sees the manuscript. It competes on clinical submission reliability, not skill count.*
88

99
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
1010
[![Release](https://img.shields.io/github/v/release/Aperivue/medsci-skills?style=flat-square&color=blue)](https://github.com/Aperivue/medsci-skills/releases/latest)
1111
[![CI](https://img.shields.io/github/actions/workflow/status/Aperivue/medsci-skills/validate.yml?branch=main&style=flat-square&label=CI)](https://github.com/Aperivue/medsci-skills/actions/workflows/validate.yml)
12-
![Skills](https://img.shields.io/badge/Skills-48-brightgreen?style=flat-square)
12+
![Skills](https://img.shields.io/badge/Skills-49-brightgreen?style=flat-square)
1313
[![npm](https://img.shields.io/npm/v/medsci-skills?style=flat-square&label=npm&color=cb3837)](https://www.npmjs.com/package/medsci-skills)
1414
[![Watch the 2-min intro](https://img.shields.io/badge/▶_Watch-2--min_intro-FF0000?style=flat-square&logo=youtube&logoColor=white)](https://youtu.be/MclQ_RIofpE)
1515
[![good first issues](https://img.shields.io/github/issues/Aperivue/medsci-skills/good%20first%20issue?style=flat-square&label=good%20first%20issues&color=7057ff)](https://github.com/Aperivue/medsci-skills/contribute)
@@ -454,6 +454,7 @@ ma-scout -> search-lit -> fulltext-retrieval -> design-study ──> write-proto
454454
| **model-validation** | Design or audit the clinical-validation study for an engineer-built medical-imaging model (segmentation / classification / detection): patient-level split disjointness and the data-leakage taxonomy, tuning-on-test, internal vs genuine external validation, comparator design, single-run vs multi-seed variance, task-correct metric selection (Metrics Reloaded), test-set sizing, and CLAIM 2024 / TRIPOD+AI / STARD-AI reporting fit. Ships a deterministic split-leakage gate that proves patient disjointness by set arithmetic on the emitted split table. Integrates with MONAI / nnU-Net — does not replace them. |
455455
| **model-scaffold** | Generate a reproducible, runnable PyTorch training repo for a medical-imaging segmentation task — the missing middle link between choosing an architecture and validating a trained model. Emits a patient-level seed-locked split as an auditable artifact, a configurable U-Net, train/evaluate scripts that seed every RNG and infer under eval mode, a config, requirements, a reproducibility record, and a Methods stub with VERIFY placeholders (no fabricated numbers). Reproducibility holds by construction; ships a `check_training_hygiene` AST gate + a network-free build→validate challenge. Integrates with MONAI / nnU-Net / TorchIO — does not reimplement them. |
456456
| **architecture-zoo** | "Which architecture for which research question" decision tool: maps task (classification / segmentation / detection / transfer), modality, data scale, and class imbalance to a paper-grounded architecture shortlist. Curates the foundational curriculum (ResNet / DenseNet / EfficientNet / ViT / Swin; U-Net / 3-D U-Net / Attention & Residual U-Net / nnU-Net / Mask R-CNN; SAM/MedSAM / TotalSegmentator / BiomedCLIP / DINO / MAE / SimCLR) — each with core idea, when-to-use, medical-imaging use, reference implementation, validation setup, and the matching model-scaffold template. Advisory; teaches archetypes, not a live SOTA leaderboard. |
457+
| **model-card** | Generate the documentation an engineer-built medical-imaging model must carry — a Model Card (Mitchell et al. 2019), a Datasheet for its dataset (Gebru et al. 2021), and a METRIC-informed data-quality pass — filled from user-supplied facts (never fabricated), then verify every required section is present and non-empty with a deterministic completeness gate (`check_model_card_complete`). Model Card / Datasheet are documentation standards vendored as templates, not counted reporting checklists. |
457458
| **intake-project** | Classifies new research projects, summarizes current state, identifies missing inputs, and recommends next steps. |
458459
| **grant-builder** | Structures grant proposals: significance, innovation, approach, milestones, and consortium roles. |
459460
| **present-paper** | Academic presentation preparation: paper analysis, supporting research, speaker scripts, slide note injection, and Q&A prep. |

docs/skills/README.md

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@ One reference page per skill, generated from each skill's `SKILL.md` and `skill.
3333
- [manage-project](manage-project.md) — Research project management for medical manuscripts. _(evidence: manual_workflow)_
3434
- [manage-refs](manage-refs.md) — Cross-cutting reference manager for medical manuscripts. _(evidence: bundled_script)_
3535
- [meta-analysis](meta-analysis.md) — Systematic review and meta-analysis pipeline for medical research. _(evidence: demo)_
36+
- [model-card](model-card.md) — Generate the documentation an engineer-built medical-imaging model must carry — a Model Card (Mitchell et al. _(evidence: ci_validator)_
3637
- [model-scaffold](model-scaffold.md) — Generate a reproducible, runnable PyTorch training repo for a medical-imaging segmentation task — the missing middle link between choosing an architecture and validating a trained model. _(evidence: ci_validator)_
3738
- [model-validation](model-validation.md) — Design or audit the clinical-validation study for an engineer-built medical-imaging model (segmentation, classification, or detection) before the validation report or manuscript is written. _(evidence: ci_validator)_
3839
- [orchestrate](orchestrate.md) — General-purpose research orchestrator. _(evidence: demo)_

docs/skills/model-card.md

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
1+
<!-- AUTO-GENERATED from skills/model-card/SKILL.md by scripts/gen_skill_docs.py. Do not edit by hand. -->
2+
3+
# model-card
4+
5+
> Generate the documentation an engineer-built medical-imaging model must carry — a Model Card (Mitchell et al. 2019), a Datasheet for its dataset (Gebru et al. 2021), and a METRIC-informed data-quality pass — filled from user-supplied facts, then verify every required section is present and non-empty before the card ships to a repo, Hugging Face card, or manuscript supplement. Never fabricates numbers, provenance, consent, or licence; unfilled fields stay flagged. Ships a deterministic completeness gate. Model Card and Datasheet are documentation standards vendored here as templates, not counted reporting checklists.
6+
7+
**Invoke:** `/model-card` · **Tools:** Read, Write, Edit, Bash, Grep, Glob · **Model:** inherit
8+
9+
## When to use
10+
11+
`model-card` activates on requests such as: model card, model cards, datasheet, datasheet for datasets, dataset documentation, model documentation, hugging face card, model metadata, intended use, out-of-scope, data quality, METRIC framework, model reporting, document a model.
12+
13+
## Quality Card
14+
15+
**Purpose** — Produce an auditable Model Card + Datasheet so an engineer-built model carries its intended-use, out-of-scope, training-data, per-subgroup-performance, and limitations record into clinical evaluation and publication — with a deterministic gate that no required section is missing or left as an unfilled placeholder.
16+
17+
**Safety boundaries**
18+
19+
- Templates are filled only from user-supplied facts; an empty required field stays [NEEDS INPUT] and is flagged, never auto-filled or guessed.
20+
- Completeness is reproduced by a stdlib script; it checks presence, not the truth of a stated fact (that is model-validation / check-reporting).
21+
22+
**Known limitations**
23+
24+
- Documents what is supplied; it cannot verify that a stated performance number or provenance claim is real.
25+
- Model Card / Datasheet are documentation standards, not clinical reporting guidelines — they are vendored as templates here, not counted reporting checklists.
26+
27+
**Validation**
28+
29+
- `python3 scripts/check_model_card_complete.py --card MODEL_CARD.md --datasheet DATASHEET.md --strict`
30+
- `bash scripts/check_model_card_complete_challenge/verify.sh # deterministic, network-free`
31+
32+
**Evidence** `ci_validator`
33+
34+
## Bundled resources
35+
36+
**References** (`skills/model-card/references/`):
37+
38+
- `datasheet_template.md`
39+
- `metric_dimensions.md`
40+
- `model_card_template.md`
41+
42+
**Scripts** (`skills/model-card/scripts/`):
43+
44+
- `check_model_card_complete.py`
45+
- `check_model_card_complete_challenge/` (5 files)
46+
47+
## Source
48+
49+
Canonical definition: [`skills/model-card/SKILL.md`](../../skills/model-card/SKILL.md)
50+
51+
---
52+
53+
*Part of [MedSci Skills](../../README.md) — Claude Code skills for the medical research lifecycle. This page is generated from the skill's `SKILL.md`; edit that file and re-run `scripts/gen_skill_docs.py`.*

metadata/catalog_counts.json

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
11
{
22
"_comment": "Single source of truth for catalog counts cited in public docs (README, orchestrate, check-reporting). scripts/validate_catalog_consistency.py recomputes every value from disk, asserts this file matches, and asserts the doc claims match. Do not hand-edit a value without running that script \u2014 CI fails on drift.",
3-
"skills": 48,
3+
"skills": 49,
44
"reporting_guidelines": 38,
55
"journal_profiles_find": 73,
66
"journal_profiles_write": 55,
7-
"integrity_detectors": 38
7+
"integrity_detectors": 39
88
}
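The `_comment` field describes a drift check: recompute each count, compare it with this file, and compare the public doc claims with the same values. A rough sketch of that idea follows. It is not `scripts/validate_catalog_consistency.py`: the function names and the README regular expression are hypothetical, and the recomputed values are passed in rather than discovered from disk.

```python
import re


def find_drift(catalog_counts, recomputed):
    """Return {key: (recorded, recomputed)} for every count that disagrees."""
    return {
        key: (catalog_counts.get(key), value)
        for key, value in recomputed.items()
        if catalog_counts.get(key) != value
    }


def claimed_skill_count(readme_text):
    """Extract N from a headline such as '**49 skills that actually work.**'."""
    match = re.search(r"\*\*(\d+) skills\b", readme_text)
    return int(match.group(1)) if match else None


# Values taken from this commit's catalog_counts.json.
recorded = {"skills": 49, "reporting_guidelines": 38, "integrity_detectors": 39}
```

With the values from this commit, `find_drift(recorded, {"skills": 49, "integrity_detectors": 39})` returns an empty dict, while a stale file that still says `"skills": 48` would yield `{"skills": (48, 49)}`. A CI wrapper would exit non-zero whenever that dict is non-empty or a doc claim disagrees with the recorded count.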

metadata/detectors_catalog.json

Lines changed: 9 additions & 1 deletion
@@ -1,6 +1,6 @@
11
{
22
"_comment": "AUTO-GENERATED by scripts/gen_detectors_catalog_json.py from the analysis-integrity detectors under skills/*/scripts/ (same glob as validate_catalog_consistency.py). Machine-readable registry of the MedSci-Audit detector suite (single source of truth). Do not hand-edit; CI gate: python3 scripts/gen_detectors_catalog_json.py --check.",
3-
"detector_count": 38,
3+
"detector_count": 39,
44
"families": [
55
{
66
"key": "numerical_cohort",
@@ -57,6 +57,7 @@
5757
"check_citation_order",
5858
"check_disclosure_availability",
5959
"check_framework_naming",
60+
"check_model_card_complete",
6061
"check_prisma_figure",
6162
"check_summary_box",
6263
"check_supplement_hygiene",
@@ -197,6 +198,13 @@
197198
"family_label": "Style & review-process integrity",
198199
"description": "Generated-code quality gate for analysis scripts (analyze-stats Phase 3.5)."
199200
},
201+
{
202+
"id": "check_model_card_complete",
203+
"skill": "model-card",
204+
"family": "reporting_compliance",
205+
"family_label": "Reporting compliance",
206+
"description": "Model Card / Datasheet completeness gate (model-card)."
207+
},
200208
{
201209
"id": "check_null_calibration",
202210
"skill": "self-review",
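As a hedged illustration of how a registry like this might be sanity-checked, the sketch below verifies that `detector_count` equals the number of entries, that it agrees with `integrity_detectors` in `catalog_counts.json`, that every entry's `family` is a declared family key, and that ids are unique. The top-level key `"detectors"` for the entry list is an assumption (the diff does not show it), and `registry_problems` is a hypothetical helper, not part of the repository.

```python
def registry_problems(detectors_catalog, catalog_counts):
    """Return a list of human-readable inconsistencies; empty means consistent."""
    entries = detectors_catalog["detectors"]  # assumed key name, not shown in the diff
    problems = []

    if detectors_catalog["detector_count"] != len(entries):
        problems.append("detector_count does not match the number of entries")
    if detectors_catalog["detector_count"] != catalog_counts["integrity_detectors"]:
        problems.append("detector_count differs from integrity_detectors")

    known_families = {family["key"] for family in detectors_catalog["families"]}
    for entry in entries:
        if entry["family"] not in known_families:
            problems.append(f"{entry['id']}: unknown family {entry['family']}")

    ids = [entry["id"] for entry in entries]
    if len(ids) != len(set(ids)):
        problems.append("duplicate detector id")
    return problems
```

For example, a two-entry catalog that declares `detector_count: 2` with both families listed passes cleanly, whereas adding the `check_model_card_complete` entry without bumping `detector_count` would surface the first problem.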
