
Reproducibility & MLSecOps Improvement Plan

Date: 2026-02-08
Branch: chore/final-housekeeping
Scope: Open-ended characterization of further improvements for foundation-PLR
Status: Enhancement roadmap (post-publication, not blocking handover)


Executive Summary

This document maps the current foundation-PLR security and reproducibility posture against the latest (Jan 2025 -- Feb 2026) developments in MLSecOps, clinical AI governance, and reproducible ML research. It synthesizes insights from:

  • 10 seed papers (SecureAI-Flow, AIAppOps, Maria Platform, ML Monitoring MLR, event sourcing for reproducibility, biomedical AI reproducibility challenges, and more)
  • Clinical ML knowledge base (IEC 62304, ISO 13485/14971, FDA SaMD, QMS, RegOps)
  • UAD-copilot MLSecOps reference (CISO Assistant, SOC2, ISO 42001, model governance, Cards Trail)
  • 2025-2026 web search (OWASP ML Top 10, model signing, EU AI Act, NIST AI RMF, ML-BOMs)

Current engineering maturity: strong -- the repo is publication-ready with excellent CI/CD, pre-commit gating, privacy architecture, and reproducibility pipeline. Clinical governance maturity is lower (no model card, no demographic reporting, no FMEA). The improvements below are aspirational enhancements for long-term maintenance and potential clinical translation.


Table of Contents

  1. Current State Assessment
  2. Threat Landscape for Research ML Pipelines
  3. Reproducibility Hardening
  4. Supply Chain Security
  5. Model & Data Governance
  6. Clinical Translation Readiness
  7. Monitoring & Drift Detection
  8. Regulatory Alignment
  9. Recommended GitHub Issues
  10. References

1. Current State Assessment

What's Already Excellent

| Domain | Status | Evidence | Key Gap |
|---|---|---|---|
| CI/CD pipeline | Excellent | 5 workflows, tiered testing, Docker reproducibility | -- |
| Pre-commit gating | Excellent | 9 hooks: registry anti-cheat, computation decoupling, R hardcoding | No nbstripout for notebooks |
| Privacy architecture | Good | PLRxxxx anonymization, gitignore enforcement, compliance scanning | No git history audit, no re-identification risk assessment |
| Reproducibility | Good | Two-block architecture, Docker, lockfiles, frozen experiment configs | No external validation, no data versioning |
| Testing | Excellent | Tiered (unit/integration/e2e), figure QA, registry validation | -- |
| Code quality | Excellent | ruff + mypy + guardrail tests | -- |
| Dependency management | Excellent | Dependabot plan, transitive floor bumps, uv lockfile | -- |
| Clinical governance | Weak | STRATOS metrics compliant | No model card, data card, FMEA, or demographic reporting |

Gaps Identified

| Gap | Priority | Effort |
|---|---|---|
| No SECURITY.md with vulnerability disclosure process | High | 30 min |
| No formal Model Card for CatBoost classifier | High | 1 hour |
| No formal Data Card for SERI PLR dataset | High | 1 hour |
| No demographic bias assessment (age, sex reporting) | High | 4 hours |
| `weights_only=True` NOT consistently used for `torch.load()` | High | 2 hours |
| 12+ bare `pickle.load()` calls without sandboxing | High | 2 hours |
| No IRB/ethics approval documentation in governance | High | 1 hour |
| No ML-BOM (ML Bill of Materials) | Medium | 2 hours |
| No model artifact signing/provenance | Medium | 4 hours |
| No data versioning (DVC or equivalent) | Medium | 4 hours |
| No explicit branch protection rules on GitHub | Medium | 15 min |
| No CODEOWNERS file | Low | 10 min |
| No signed commits enforcement | Low | 30 min |
| No runtime monitoring framework (drift detection) | Low | 8+ hours |

2. Threat Landscape for Research ML Pipelines {#2-threat-landscape}

OWASP ML Top 10 (v0.3, 2023) Applied to Foundation PLR

| ID | Risk | Applicability | Current Mitigation |
|---|---|---|---|
| ML01 | Input manipulation (adversarial examples) | LOW -- research pipeline, no live inference | N/A |
| ML02 | Data poisoning | LOW -- fixed dataset from Najjar 2023 | Data provenance documented |
| ML03 | Model inversion | LOW -- no public API | Models not exposed |
| ML04 | Membership inference | MEDIUM -- patient data | Anonymization (PLR→H/G codes) |
| ML05 | Model theft | LOW -- academic, MIT license | Open source by design |
| ML06 | Supply chain attacks | HIGH -- 50+ Python deps | Dependabot + lockfile |
| ML07 | Transfer learning attack | MEDIUM -- MOMENT fine-tuning | Fixed checkpoints |
| ML08 | Model skewing (train/serve mismatch) | LOW -- no serving | N/A |
| ML09 | Output integrity attack | LOW -- local pipeline | N/A |
| ML10 | Model poisoning | LOW -- own training only | MLflow tracking |

Primary threat: ML06 (Supply Chain). The 2025 CVEs in MLflow validate this concern:

  • CVE-2025-11200 (auth bypass, CVSS 8.1): fixed in 2.22.0
  • CVE-2025-11201 (directory traversal RCE): fixed in 2.17.2
  • CVE-2025-52967 (SSRF): fixed in 3.1.0 (NOT covered by pyproject.toml floor of >=2.22.4)

Our uv.lock resolves mlflow to 3.9.0, which covers all three CVEs. However, the pyproject.toml floor of >=2.22.4 does not protect against CVE-2025-52967: a fresh resolution could legally select a version <3.1.0. Consider raising the floor to >=3.1.0.

Secondary threat: Pickle deserialization. Multiple torch.load() calls lack weights_only=True (see Section 4.4) and there are 12+ bare pickle.load() calls across src/orchestration/, src/data_io/, src/stats/, src/viz/, and src/log_helpers/. The 542 MLflow pickle artifacts are the primary data transport format.

SecureAI-Flow Threat Model (Rahman & Biplob 2025)

Maps ML-specific threats to CI/CD stages:

| Stage | Threat | Foundation PLR Status |
|---|---|---|
| Data collection | Data poisoning | Fixed dataset, provenance documented |
| Training | Metric spoofing | MLflow tracks all runs, registry validates |
| Model storage | Model tampering | No signing yet (gap) |
| Deployment | Drift exploitation | No monitoring yet (gap, but no deployment) |
| Dependencies | Supply chain compromise | Dependabot + uv lockfile + floor bumps |

3. Reproducibility Hardening {#3-reproducibility-hardening}

3.1 Sources of Irreproducibility (Han 2025)

Han (2025, BMC Medical Genomics) identifies 5 categories of irreproducibility in biomedical AI. Current status for foundation-PLR:

| Source | Risk Level | Current Mitigation | Improvement |
|---|---|---|---|
| Random seeds | MEDIUM | Seeds set in config, but not all paths documented | Document ALL seed locations in a reproducibility checklist |
| Hardware non-determinism | LOW | Docker containers standardize environment | Add `torch.use_deterministic_algorithms(True)` option (see sketch below) |
| Data variations | LOW | Fixed dataset, frozen configs | Add data hashing to verify dataset integrity |
| Preprocessing randomness | LOW | Deterministic pipeline (DuckDB queries) | Already mitigated by two-block architecture |
| Framework differences | LOW | Pinned via uv.lock | Add environment card documenting exact versions |
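
The determinism option flagged in the table above is a small amount of code. A minimal sketch, assuming torch is already a dependency; the helper name and default seed are illustrative, not the repo's actual implementation:

```python
import os
import random

import numpy as np
import torch


def set_deterministic(seed: int = 42) -> None:
    """Seed every RNG the pipeline touches and force deterministic kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Raises at runtime if a nondeterministic CUDA kernel is hit, turning
    # silent irreproducibility into a loud failure.
    torch.use_deterministic_algorithms(True)
    # Required by cuBLAS for deterministic matmul on CUDA >= 10.2.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```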

3.2 Event Sourcing for Reproducibility (Beber 2025)

Beber (2025) proposes event sourcing (immutable event logs) for perfect replication:

Already implemented (analogous):

  • Git history = immutable event log of code changes
  • MLflow = immutable event log of experiment runs
  • Config versioning system = content-hashed configs with frozen protection

Could improve:

  • Standards compliance: PMML/ONNX model export for interoperability (Rahrooh 2023)
  • Contribution tracking: ORCID integration for multi-author provenance

3.3 Reproducibility Framework (Desai et al. 2025)

Desai et al. (2025, AI Magazine) propose a taxonomy:

| Level | Definition | Foundation PLR Status |
|---|---|---|
| Repeatability | Same team, same setup, same result | ACHIEVED (Docker + lockfiles) |
| Dependent reproducibility | Different team, same code/data | ACHIEVABLE (public repo + demo data) |
| Independent reproducibility | Different team, different implementation | PARTIALLY (frozen configs document choices) |
| Direct replicability | Same method, new data | NOT YET (no external dataset tested) |
| Conceptual replicability | Same concept, different method | OUT OF SCOPE |

Recommendation: Document the reproducibility level explicitly in a REPRODUCIBILITY.md or Model Card.

3.4 NeurIPS 2025 Reproducibility Checklist

The NeurIPS checklist provides a publication-ready standard:

  • Code availability (public GitHub repo)
  • Data availability (SERI dataset reference, demo data)
  • Computing infrastructure documented (Docker, GPU requirements)
  • Statistical significance (bootstrap CIs)
  • Experiment repetition details (1000 bootstrap iterations)
  • Random seed documentation
  • Hyperparameter search details (Optuna/hyperopt configs)
  • Train/val/test split methodology (patient-level, documented)

Status: Most items covered. Gap: formal checklist document linking to evidence.
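
For the bootstrap items above, the CI protocol is simple enough to sketch. A minimal illustration of a stratified 1000-iteration bootstrap for AUROC (function and argument names are hypothetical, not the repo's implementation):

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=42):
    """Point estimate plus percentile CI from a stratified bootstrap."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    pos, neg = np.where(y_true == 1)[0], np.where(y_true == 0)[0]
    stats = []
    for _ in range(n_boot):
        # Resample within each class so every replicate keeps the class balance.
        idx = np.concatenate([rng.choice(pos, pos.size), rng.choice(neg, neg.size)])
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return roc_auc_score(y_true, y_score), (lo, hi)
```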


4. Supply Chain Security {#4-supply-chain-security}

4.1 ML Bill of Materials (ML-BOM)

Two competing standards matured to production quality in 2025, with a third emerging:

| Standard | ML Support | Maturity | Tooling |
|---|---|---|---|
| CycloneDX | v1.5+ ML-BOM profile (models, datasets, training configs) | Production | cyclonedx-bom CLI |
| SPDX 3.0.1 | AI/Dataset profiles | Production | spdx-tools |
| OWASP AI-BOM | Comprehensive (data + model + infra) | v0.1 (Nov 2025) | Early tooling |

Recommended for foundation-PLR: CycloneDX ML-BOM because:

  1. Captures model architecture, training data provenance, preprocessing pipeline
  2. Machine-readable (JSON/XML)
  3. Compatible with CISA 2025 SBOM minimum elements
  4. cyclonedx-py generates Python dependency SBOMs directly from pyproject.toml

4.2 Model Signing (OpenSSF OMS v1.0, April 2025)

The OpenSSF Model Signing specification reached v1.0 in April 2025. NVIDIA now signs all NGC models; Google is prototyping on Kaggle Hub.

How it works:

  1. Hash model contents (weights, configs, tokenizers)
  2. Sign with Sigstore (keyless) or PKI
  3. Store detached signature in OMS format
  4. Log signing event in Rekor transparency log
  5. Verify at load time

For foundation-PLR: Sign MLflow artifacts and DuckDB checkpoint databases:

```bash
uv add model-signing
# Sign a model artifact
model-signing sign --model mlruns/253031330985650090/.../artifacts/model
# Verify before loading
model-signing verify --model mlruns/253031330985650090/.../artifacts/model
```

4.3 OpenSSF MLSecOps Whitepaper (August 2025)

The OpenSSF published a comprehensive whitepaper mapping security controls to ML pipeline stages using SLSA, Sigstore, and OpenSSF Scorecard. Key recommendations:

| Stage | Control | Foundation PLR Applicability |
|---|---|---|
| Data ingestion | Data provenance, integrity checks | Dataset hash verification |
| Training | Experiment tracking, reproducible builds | MLflow + Docker |
| Model storage | Signing, access control | Model signing (gap) |
| CI/CD | SLSA provenance, dependency scanning | Dependabot + lockfile |
| Deployment | Runtime monitoring, model verification | Not deployed (N/A for now) |

4.4 ModelScan (Protect AI)

Scans ML model files for unsafe code (pickle deserialization attacks, embedded scripts):

```bash
uv add modelscan
modelscan scan -p path/to/model.pkl
```

Relevant for: Any torch.load() or pickle.load() in the pipeline. WARNING: Contrary to project documentation, weights_only=True is NOT consistently used:

  • src/anomaly_detection/momentfm_outlier/moment_io.py: torch.load() without weights_only
  • src/anomaly_detection/units/units_outlier.py: torch.load() without weights_only
  • src/classification/tabpfn/model/loading.py: weights_only=None (defaults to False)
  • src/classification/tabpfn_v1/scripts/model_builder.py: torch.load() without weights_only
  • 12+ bare pickle.load() calls across src/ without sandboxing

ModelScan provides defense-in-depth by scanning serialized files for embedded code before loading.
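
For reference, the torch.load() fix is a one-argument change. A minimal sketch (the checkpoint path is a placeholder):

```python
import torch

# SAFE: weights_only=True restricts unpickling to tensors and primitive
# containers, so a tampered checkpoint cannot execute arbitrary code.
state_dict = torch.load("checkpoint.pth", map_location="cpu", weights_only=True)

# UNSAFE (the current pattern in the files listed above): the default allows
# full pickle deserialization, i.e. code execution at load time.
# state_dict = torch.load("checkpoint.pth", map_location="cpu")
```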


5. Model & Data Governance {#5-model--data-governance}

5.1 Model Cards (Increasingly Mandated)

Model cards have evolved from a Google research proposal (Mitchell et al. 2019) to being effectively required by regulation, though the "model card" format itself is not prescribed:

  • EU AI Act (August 2025): GPAI providers must publish technical documentation (Annex IV) -- legally binding
  • FDA SaMD draft guidance (January 2025): Recommends model description, data lineage, bias analysis -- draft, not yet legally binding
  • NIST AI RMF (supplementary materials, 2025): Recommends model provenance and documentation -- voluntary framework

Recommended Model Card for CatBoost classifier:

```yaml
# configs/governance/model_card.yaml
model_name: "CatBoost Glaucoma Classifier"
version: "1.0 (publication freeze)"
model_type: "Gradient Boosted Decision Tree (CatBoost)"
task: "Binary classification (glaucoma vs control)"
intended_use:
  primary: "Research benchmark for preprocessing effect evaluation"
  out_of_scope: "Clinical diagnosis without further validation"
training_data:
  source: "Najjar et al. 2023, Br J Ophthalmol"
  n_subjects: 208
  class_distribution: "152 control (73.1%) + 56 glaucoma (26.9%)"
  geographic_origin: "Singapore (SNEC)"
  demographic_note: "Single-center, predominantly Asian population"
evaluation_metrics:
  auroc: "0.913 [95% CI from 1000 bootstrap iterations]"
  calibration_slope: "See DuckDB essential_metrics table"
  calibration_intercept: "See DuckDB essential_metrics table"
  net_benefit: "See DCA curves at 5-20% thresholds"
  brier_score: "See DuckDB essential_metrics table"
clinical_context:
  disease: "Open-angle glaucoma"
  clinical_task: "Population screening (secondary analysis of PLR recordings)"
  clinical_workflow: "Not integrated into any clinical workflow (research only)"
  decision_threshold: "Not established; DCA curves available for threshold selection"
  intended_use_statement: >
    Research tool for evaluating effect of preprocessing choices on classification
    performance. NOT for clinical decision-making without further validation.
regulatory_status:
  fda: "Not submitted; research use only"
  ce_mark: "Not applicable"
  irb_ethics: "Original SERI study approval [reference needed from Najjar 2023]"
sample_size_adequacy:
  n_total: 208
  n_events: 56
  events_per_variable: "TBD based on final feature count"
  note: "Borderline for stable AUROC (Riley: ~100 events recommended)"
validation:
  internal: "Stratified bootstrap (1000 iterations)"
  external: "None -- single-center dataset (primary scientific limitation)"
  temporal: "None -- cross-sectional"
limitations:
  - "Single-center dataset (Singapore/SNEC only, predominantly Asian population)"
  - "Limited to handcrafted PLR features"
  - "Small sample size (N=208, 56 events) limits generalizability"
  - "No external validation cohort"
  - "Disease prevalence in dataset (26.9%) >> population prevalence (3.54%)"
  - "All reported metrics are analytical performance only (not clinical performance)"
known_failure_modes:
  - "Poor signal quality PLR recordings (high artifact percentage)"
  - "Patients on pupil-affecting medications (miotics, mydriatics, alpha-agonists)"
  - "Non-glaucomatous pupil abnormalities (Adie's, Horner's, third nerve palsy)"
ethical_considerations:
  - "Not validated for clinical use -- screening only, not diagnostic"
  - "Potential bias toward Asian populations (Singapore single-center)"
  - "Subject privacy protected via anonymization pipeline"
  - "Generalizability to other ethnicities/populations is unknown"
  - "Known physiological variations in pupil dynamics across populations"

5.2 Data Cards (Gebru et al. 2021)

```yaml
# configs/governance/data_card.yaml
dataset_name: "SERI PLR Glaucoma Dataset (Subset)"
source: "Najjar et al. 2023, DOI: 10.1136/bjophthalmol-2021-319938"
collection_institution: "Singapore Eye Research Institute (SERI)"
subjects:
  total_preprocess: 507
  total_classify: 208
  controls: 152
  glaucoma: 56
data_format:
  raw: "DuckDB (SERI_PLR_GLAUCOMA.db)"
  processed: "DuckDB (foundation_plr_results.db)"
  features: "Handcrafted physiological PLR features"
ground_truth:
  outlier_masks: "Expert-annotated blink/artifact masks for 507 subjects"
  classification_labels: "Clinical diagnosis (glaucoma/control) for 208 subjects"
known_biases:
  geographic: "Singapore only (predominantly Asian population)"
  device: "Single pupillometer model"
  temporal: "Cross-sectional (single timepoint per subject)"
ethics:
  original_study_approval: "Reference: Najjar et al. 2023 (ethics approval details in original paper)"
  secondary_use: "Covered by original consent [verify with SERI DUA]"
  re_identification_risk: "Low (pseudonymized, no public individual-level outputs)"
privacy:
  anonymization: "Original PLRxxxx codes replaced with Hxxx/Gxxx codes"
  private_data: "data/private/ (gitignored), subject-level JSONs excluded"
  public_data: "Aggregate statistics only in data/public/"
  git_history: "Should be audited for any historical PLRxxxx leaks before making repo fully public"
demographics:
  age_distribution: "TBD -- document from original dataset"
  sex_distribution: "TBD -- document from original dataset"
  ethnicity: "Predominantly Chinese, Malay, Indian (Singapore population)"
  note: "Generalizability to non-Asian populations unknown"
splits:
  method: "Patient-level splitting (no data leakage)"
  validation: "Stratified bootstrap (1000 iterations)"

5.3 Extended Cards Trail (from Appendix-Cards + UAD-Copilot)

Full card taxonomy adapted for an academic biomedical ML repository. Sources: Appendix B of MCP Review, UAD-copilot Cards Trail methodology.

Essential Cards (Implement Pre-Publication)

| Card | Purpose | Content Source | Effort |
|---|---|---|---|
| Data Card | Dataset provenance, demographics, biases, privacy | 80% from methods.tex + results.tex (TRIPOD Item 20) | 1 hr (mostly extraction) |
| Model Card | CatBoost architecture, STRATOS metrics, limitations | 70% from methods.tex + supplementary.tex (Table S3) | 1.5 hrs |
| Use Case Card | Clinical decision context, risk taxonomy, stakeholders | 60% from discussion.tex (clinical translation section) | 1 hr |

Recommended Cards (Strengthen Reproducibility & Safety)

| Card | Purpose | Content Source | Effort |
|---|---|---|---|
| Failure Notes | Edge cases, failure modes, mitigations | discussion.tex (limitations), results.tex (calibration caveats, EPV issues) | 1 hr |
| Environment Card | Python/R/DuckDB versions, Docker digest, OS | methods.tex lines 114-116, pyproject.toml, renv.lock | 30 min |
| Reproducibility Checklist | NeurIPS-style seeds/configs/harness checklist | Existing Hydra configs + supplementary.tex (Table S1 hyperparams) | 45 min |
| Saliency/XAI Card | Feature importance, SHAP with VIF caveats | supplementary.tex lines 199-233 (SHAP + VIF analysis) | 1 hr |

Optional Cards (Post-Publication or If Deployed)

| Card | Purpose | When Needed | Effort |
|---|---|---|---|
| Feedback Card | Stakeholder corrections/overrides (Zhou et al. 2023) | If deployed clinically with HITL | 2 hrs |
| AI Usage Card | Where/when AI used in pipeline (Luccioni et al. 2023) | Manuscript supplementary or audit | 30 min |
| ESG Card | Compute costs, equity impact (Binns et al. 2023) | Grant proposals, equity assessment | 2 hrs |
| Schema Evolution Card | DuckDB schema changes, config migrations | If pipeline is actively maintained | 1 hr |
| Deployment Lineage Card | Git SHA + DVC tag + CI metadata | If artifacts are distributed | 30 min |
| Prompt Card | Version control for LLM prompt templates | If using Claude for analysis generation | 30 min |

Why Feedback Cards Are Low Priority

Feedback Cards (Zhou et al. 2023) document stakeholder corrections, overrides, and continual learning feedback. They are essential for production HITL systems (e.g., Maria Platform's 81% approve/19% correct workflow). For foundation-PLR:

  • No deployment = no stakeholder feedback loop
  • Frozen model = no continual learning
  • Becomes relevant if: Model is deployed in a virtual glaucoma clinic with ophthalmologist review

5.4 Manuscript Content Reuse Map

Most card content already exists in the manuscript -- the effort is extraction and reformatting, not creation from scratch.

| Card Field | Manuscript Source | Lines | Extraction Effort |
|---|---|---|---|
| Study population | methods.tex | 12-20 | Copy + reformat |
| Signal protocol | methods.tex | 18 | Copy |
| Ground truth limitations | methods.tex | 29-33 | Copy |
| Pipeline architecture | methods.tex | 35-66 | Summarize |
| 11 outlier methods | methods.tex | 39-50 | Copy from registry |
| 8 imputation methods | methods.tex | 51-59 | Copy from registry |
| STRATOS metrics | methods.tex | 104-112 | Copy |
| Software stack | methods.tex | 114-116 | Copy |
| Demographics (N=208) | results.tex | 13-16 | Copy (TRIPOD Item 20) |
| Calibration caveats | results.tex | 106-113 | Copy |
| DCA caveats | results.tex | 121-128 | Copy |
| Key AUROC results | results.tex | 62-72 | Extract numbers |
| Stability results | results.tex | 178-185 | Extract numbers |
| VIF/SHAP analysis | supplementary.tex | 224-233 | Summarize |
| Hyperparameter table | supplementary.tex | 242-271 | Copy |
| Age confounding | discussion.tex | 71-73 | Copy |
| Spectrum bias | discussion.tex | 79-80 | Copy |
| Single-annotator caveat | discussion.tex | 77 | Copy |
| Multimodal architecture | discussion.tex | 50-58 | Summarize |
| Clinical deployment context | discussion.tex | 41-58 | Summarize |

Estimated total effort for all Essential + Recommended cards: ~6 hours (mostly copy-edit from manuscript, not writing from scratch).

5.5 Manuscript-to-Repository Transfer Plan

When the manuscript is accepted for publication, governance artifacts should be transferred to the code repository:

```
Manuscript (sci-llm-writer)          Code Repository (foundation_PLR)
─────────────────────────────        ──────────────────────────────────
methods.tex (pipeline desc)     →    configs/governance/model_card.yaml
results.tex (TRIPOD Item 20)    →    configs/governance/data_card.yaml
discussion.tex (limitations)    →    configs/governance/failure_notes.md
supplementary.tex (Table S1)    →    configs/governance/hyperparameters.yaml
results-ground-truth.json       →    data/public/verified_results.json
DATA-SOURCE-MAP.md              →    docs/DATA-SOURCE-MAP.md
```

Transfer protocol:

  1. After acceptance, create a docs/manuscript/ directory with accepted PDF + supplementary
  2. Extract governance YAML/MD cards from LaTeX source
  3. Cross-reference card fields to manuscript section numbers for traceability
  4. Add manuscript_doi: "10.xxxx/..." to all cards once published
  5. Update README with publication reference

6. Clinical Translation Readiness {#6-clinical-translation-readiness}

6.1 FDA SaMD/AI Guidance (January 2025)

The FDA's Total Product Life Cycle (TPLC) approach for AI-enabled devices requires:

| Requirement | Foundation PLR Status | Gap |
|---|---|---|
| Model description & architecture | Partial (code exists, no formal doc) | Create model card |
| Data lineage documentation | Good (two-block pipeline) | Formalize data card |
| Performance metrics tied to clinical claims | Excellent (STRATOS metrics) | None |
| Bias analysis & mitigation | Partial (acknowledged, not quantified) | Demographic analysis |
| Human-AI workflow documentation | N/A (research, not deployed) | Future |
| Post-market monitoring plan | N/A | Future |
| SBOM for all software components | Partial (pyproject.toml + lockfile) | Generate CycloneDX ML-BOM |
| Predetermined Change Control Plan | N/A (frozen for publication) | Future if model updates planned |

6.2 IEC 62304 Alignment (Software Life Cycle)

Although foundation-PLR is academic research (not a medical device), IEC 62304 thinking helps structure documentation:

| IEC 62304 Process | Foundation PLR Equivalent |
|---|---|
| Development planning | docs/planning/ + CLAUDE.md |
| Requirements specification | Research question doc + STRATOS metrics |
| Design specification | ARCHITECTURE.md + two-block pipeline |
| Implementation | src/ with pre-commit quality gates |
| Integration testing | Tiered CI/CD (unit → integration → e2e) |
| Verification | pytest + figure QA + registry validation |
| Risk management (ISO 14971) | Partial: data leakage prevention, no formal FMEA |
| Configuration management | Config versioning system + frozen configs |
| Problem resolution | GitHub Issues + Dependabot |

Safety class assessment (open question -- depends on clinical workflow):

  • Class B (non-serious injury possible): Defensible if the tool is one of multiple screening mechanisms and a missed case still gets other screening opportunities
  • Class C (serious injury possible): If the tool is a first-line screener where a false negative means delayed treatment and progressive irreversible vision loss from untreated glaucoma

The determination depends on the software's role in the clinical workflow. This is a risk assessment question that should be formally resolved during any clinical translation effort.

6.3 Quality Management (QMS) Insights

From the Clinical ML QMS knowledge base:

  • Continuous Audit-Based Certification (CABC): Automated audits of MLOps artifacts > point-in-time audits. Foundation-PLR's CI/CD pipeline already provides this pattern.
  • Acceptance testing: The pytest tests/test_figure_qa/ suite is analogous to clinical acceptance testing -- verifying outputs meet quality standards before release.
  • Shared responsibility: Clearly document that foundation-PLR is a research tool, not a clinical product. Clinical deployment requires additional validation by healthcare institutions.

6.4 Maria Platform Lessons (Lopes et al. 2026)

The Maria Platform (production healthcare AI in Brazil) provides reference patterns:

| Pattern | Maria Implementation | Foundation PLR Adaptation |
|---|---|---|
| Clean Architecture | Domain logic isolated from infra | Two-block pipeline (extraction vs analysis) |
| Event-Driven Architecture | Publish-subscribe for auditability | Git history + MLflow as event logs |
| Human-in-the-Loop | Clinician review + feedback | N/A (research) but documented for future |
| Shadow/canary deployment | Gradual rollout with auto-rollback | N/A |
| Golden set evaluation | Curated test cases | Ground truth subjects (507 with masks) |

Key lesson: Maria's 81% approve / 19% correct / 0% reject rate shows that upstream engineering quality (pre-commit hooks, CI/CD gating) prevents catastrophic failures in production. Foundation-PLR's strict gating already follows this pattern.


7. Monitoring & Drift Detection {#7-monitoring--drift-detection}

7.1 ML Monitoring Landscape (Naveed 2025)

Naveed's systematic review (136 papers) identifies 5 monitoring goals:

| Goal | Relevance to Foundation PLR |
|---|---|
| Detect model degradation (55.1%) | FUTURE: If deployed, track AUROC/calibration drift |
| Detect data quality issues (48.5%) | CURRENT: Pre-commit checks, figure QA |
| Ensure responsible ML (24.3%) | PARTIAL: Privacy enforced, fairness not quantified |
| Operational monitoring (23.0%) | LOW: Local pipeline, no latency/throughput concerns |
| Label scarcity handling (3.3%) | LOW: Fixed labeled dataset |

7.2 What Monitoring Would Look Like for PLR

If foundation-PLR were deployed clinically:

```yaml
# Hypothetical monitoring config
monitoring:
  data_drift:
    method: "Kolmogorov-Smirnov test on feature distributions"
    frequency: "daily"
    alert_threshold: 0.05
    features_tracked:
      - "constriction_velocity"
      - "redilation_t63"
      - "baseline_pupil_diameter"

  model_performance:
    method: "Rolling window AUROC on labeled subset"
    window_size: 100
    alert_threshold: 0.85  # Alert if AUROC drops below

  calibration_drift:
    method: "Calibration slope on rolling window"
    acceptable_range: [0.8, 1.2]

  fairness:
    method: "Demographic parity across age groups"
    groups: ["<40", "40-60", ">60"]

Tools referenced in literature: Evidently AI, WhyLabs, Alibi Detect, Great Expectations, Amazon SageMaker Model Monitor.
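
A minimal sketch of the KS drift check from the hypothetical config above, using scipy (the feature dictionaries and alpha threshold are illustrative):

```python
from scipy.stats import ks_2samp


def check_feature_drift(reference: dict, incoming: dict, alpha: float = 0.05):
    """Return features whose incoming distribution differs from the reference.

    Both arguments map feature names to 1-D arrays of observed values.
    """
    drifted = []
    for name, ref_values in reference.items():
        stat, p_value = ks_2samp(ref_values, incoming[name])
        if p_value < alpha:
            drifted.append((name, stat, p_value))
    return drifted
```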

7.3 AIAppOps Framework (Jonsson et al. 2026)

Emphasizes monitoring as the unifying mechanism across the ML lifecycle:

  • Statistical verification (distribution tests, performance metrics)
  • Formal verification (invariant checking, constraint satisfaction)
  • Runtime verification (safety-critical AI extensions)

For foundation-PLR: The existing pre-commit + CI/CD pipeline is already a form of "build-time monitoring." Runtime monitoring becomes relevant only if the pipeline is operationalized.


8. Regulatory Alignment {#8-regulatory-alignment}

8.1 EU AI Act Timeline

| Date | Milestone | Impact on Foundation PLR |
|---|---|---|
| Feb 2025 | Prohibited practices apply | N/A (Article 2(6) research exemption: "AI systems exclusively for scientific research and development") |
| Aug 2025 | GPAI obligations apply | N/A (not a GPAI provider) |
| Aug 2026 | Full applicability | If commercialized: definitively high-risk (Annex III, Section 1a -- medical device Class IIa+ under MDR) |

Key requirement if commercialized: Technical documentation making model development, training, and evaluation traceable. Foundation-PLR's frozen configs + MLflow tracking already provide this foundation.

8.2 NIST AI RMF (March 2025 Update)

The AI RMF four-function structure maps well:

| Function | Foundation PLR Implementation |
|---|---|
| GOVERN | CLAUDE.md rules, pre-commit hooks, CI/CD gating |
| MAP | Research question document, STRATOS metrics scope |
| MEASURE | Bootstrap CIs, calibration metrics, DCA curves |
| MANAGE | Frozen configs, version control, Dependabot |

8.3 Cross-Framework Compliance Matrix

| Requirement | EU AI Act | FDA SaMD | NIST RMF | IEC 62304 | Status |
|---|---|---|---|---|---|
| Model documentation | Required | Required | Recommended | Required | GAP: Need model card |
| Data documentation | Required | Required | Recommended | Required | GAP: Need data card |
| Risk assessment | Required | Required | Required | Required | PARTIAL: Documented limitations |
| Performance metrics | Required | Required | Required | Required | DONE: STRATOS compliant |
| Bias assessment | Required | Required | Recommended | N/A | GAP: Not quantified |
| Traceability | Required | Required | Required | Required | DONE: MLflow + git + DuckDB |
| Post-market monitoring | Required | Required | Recommended | Required | N/A (research) |
| SBOM/ML-BOM | Required | Required | Recommended | Recommended | GAP: No formal BOM |
| Version control | Required | Required | Required | Required | DONE: git + uv.lock + frozen |
| Human oversight | Required | Required | Recommended | N/A | N/A (research) |
| Clinical validation evidence | Required (Annex IV) | Required | N/A | N/A | GAP: No clinical study |
| Demographic subgroup analysis | Required | Required | Recommended | N/A | GAP: Not quantified |

8.4 Preliminary Failure Mode and Effects Analysis (FMEA)

If foundation-PLR were deployed clinically, these are the critical failure modes:

| Failure Mode | Effect | Severity | Likelihood | Detection | Risk |
|---|---|---|---|---|---|
| False negative (missed glaucoma) | Delayed treatment, progressive irreversible vision loss | HIGH | Medium (AUROC 0.913) | LOW (patient may not return for screening) | HIGH |
| False positive (false alarm) | Unnecessary specialist referral, patient anxiety, cost | Low-Medium | Medium | HIGH (specialist exam reveals no disease) | MEDIUM |
| Signal quality failure (poor PLR) | Unreliable prediction on noisy input | Medium | Medium (artifact prevalence varies) | MEDIUM (if quality metrics checked) | MEDIUM |
| Calibration drift | Incorrect risk stratification if probabilities miscalibrated | HIGH | LOW (frozen model) | LOW (without monitoring) | MEDIUM |
| Prevalence mismatch | PPV/NPV invalid at population prevalence (3.54% vs 26.9% training) | HIGH | HIGH (deployment differs from training) | LOW | HIGH |
| Population mismatch | Unknown performance on non-Asian populations | HIGH | HIGH (if deployed outside Singapore) | LOW | HIGH |

Note: This is a qualitative preliminary FMEA for planning purposes. Formal ISO 14971 risk management requires a multi-disciplinary risk assessment team.

8.5 Additional Security Tools Not Yet Mentioned

| Tool | Purpose | Applicability |
|---|---|---|
| gitleaks / trufflehog | Scan git history for secrets (hundreds of patterns vs our 5 regex) | HIGH -- supplement compliance_check.py |
| nbstripout | Strip Jupyter notebook outputs to prevent patient data leakage | HIGH -- add to pre-commit |
| pip-audit / uv audit | Local vulnerability scanning of Python dependency tree | MEDIUM -- complement Dependabot |
| SLSA (Supply-chain Levels for Software Artifacts) | Documented, hosted build process provenance | MEDIUM -- Level 1 is nearly free with GitHub Actions |
| Cosign | Sign Docker container images (complement to OMS for models) | LOW -- relevant if images are distributed |
| OWASP AI Testing Guide (2026) | Concrete test cases for ML systems | LOW -- more actionable than ML Top 10 |

8.6 Low-Hanging Fruit Triage

Practical assessment of what's trivial vs what needs real work:

Trivial (< 30 min each, can do in one session)

| Fix | Time | What to Do |
|---|---|---|
| SECURITY.md | 15 min | Copy GitHub template, fill in email, supported versions, scope |
| CODEOWNERS | 5 min | `* @petteriTeikari` |
| Branch protection on main | 10 min | GitHub Settings → Branches → Add rule |
| Environment Card | 30 min | Extract versions from pyproject.toml, renv.lock, Dockerfiles into YAML |
| dependabot.yml | 10 min | Add weekly schedule for pip + npm ecosystems |

Light Work (1-2 hours, mostly extraction from manuscript)

| Fix | Time | What to Do |
|---|---|---|
| Reproducibility checklist | 45 min | NeurIPS template + link to existing evidence (configs, Docker, lockfiles) |
| Data Card | 1 hr | 80% copy from methods.tex + results.tex → YAML |
| Model Card | 1.5 hr | 70% copy from manuscript → YAML + add clinical fields |
| Data integrity checksums | 30 min | `sha256sum data/public/*.db > data/_checksums.sha256` + CI step |
| AI Usage Card | 30 min | Document which pipeline stages use which methods |

Moderate Work (2-4 hours)

| Fix | Time | What to Do |
|---|---|---|
| Failure Notes | 1 hr | Extract from discussion.tex limitations + add PLR-specific edge cases |
| Use Case Card | 1 hr | Clinical decision context from discussion.tex |
| Demographic reporting | 4 hrs | Query original dataset for age/sex distributions, compute subgroup AUROCs |
| ML-BOM generation | 2 hrs | Install cyclonedx-py, generate SBOM, add `make sbom` target |
| Saliency/XAI Card | 1 hr | Extract SHAP + VIF analysis from supplementary.tex |

Defer / Low Priority

| Item | Why Defer | Revisit When |
|---|---|---|
| Pickle safety (ModelScan, weights_only) | All pickles are internal MLflow artifacts from own training; no third-party pickles loaded. Shared reproducibility artifact is DuckDB (not pickle). Risk is theoretical for this use case. | If pipeline accepts external model artifacts or is deployed as a service |
| Model signing (Sigstore/OMS) | No model distribution. Artifacts are local. | If models are published to a registry or shared externally |
| Runtime monitoring | No deployment. Academic publication. | If clinical translation begins |
| DVC | Dataset is frozen. SHA256 checksums provide equivalent integrity for fixed data. | If dataset changes or multi-version experiments begin |
| Feedback Card | No HITL loop. Frozen model. | If deployed with clinician review |

9. Recommended GitHub Issues {#9-recommended-github-issues}

Tier 1: High Priority (Pre-Handover or Early Post-Publication)

Issue A: Add Reproducibility Checklist Document

Labels: documentation, reproducibility. Effort: 1 hour.

Create REPRODUCIBILITY.md with NeurIPS-style checklist linking to evidence:

  • Code availability: GitHub public repo (MIT license)
  • Data availability: SERI dataset reference (Najjar 2023) + demo data
  • Computing infrastructure: Docker images documented, GPU requirements
  • Statistical significance: Bootstrap CIs (1000 iterations, 95% CI)
  • Random seed documentation: All seed locations listed with values
  • Hyperparameter search: Optuna/hyperopt configs in configs/CLS_HYPERPARAMS/
  • Train/test split: Patient-level, stratified (no data leakage)
  • Environment specification: Python 3.11, R 4.4+, exact versions in uv.lock/renv.lock
  • Reproducibility level: Dependent reproducibility (Desai et al. 2025 taxonomy)

Acceptance criteria: Document exists, all items checked or explicitly marked as gaps, linked from README.

Rationale: Semmelrock et al. (2025, AI Magazine) identify poor documentation as a primary reproducibility barrier. NeurIPS, ICML, and all major venues now require this. Highest ROI for peer review.


Issue B: Create Model Card and Data Card (Governance Documentation)

Labels: documentation, enhancement. Effort: 2 hours.

Create configs/governance/model_card.yaml and configs/governance/data_card.yaml following Mitchell et al. 2019 and Gebru et al. 2021 templates:

Model Card contents:

  • CatBoost classifier architecture, hyperparameters, HPO method
  • Training data demographics (N=208, Singapore, SNEC)
  • STRATOS evaluation metrics with bootstrap CIs
  • Limitations (single-center, small N, no external validation)
  • Ethical considerations (not for clinical use without validation)

Data Card contents:

  • SERI PLR dataset provenance (Najjar 2023)
  • Subject counts per task (507 preprocess, 208 classify)
  • Ground truth annotation methodology
  • Known biases (geographic, device, temporal)
  • Privacy architecture (anonymization, gitignore enforcement)

Additional fields to include (from clinical reviewer):

  • Clinical context (disease, workflow, decision threshold)
  • Regulatory status (research use only)
  • Sample size adequacy (Riley criteria, events per variable)
  • Known failure modes (poor signal quality, medications, non-glaucomatous pupil abnormalities)
  • IRB/ethics approval reference
  • Prevalence-adjusted PPV/NPV at population prevalence (3.54%) -- see the worked sketch after this list
  • Feature importance rankings (SHAP or CatBoost feature importance)
  • TRIPOD+AI checklist cross-references
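
The prevalence-adjustment bullet is a direct application of Bayes' rule. A minimal worked sketch (sensitivity/specificity values are placeholders; real values would come from the DuckDB metrics tables at the chosen threshold):

```python
def adjusted_ppv_npv(sens: float, spec: float, prev: float) -> tuple[float, float]:
    """Recompute PPV/NPV for a given deployment prevalence via Bayes' rule."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# Same operating point, two prevalences: 26.9% (dataset) vs 3.54% (population).
print(adjusted_ppv_npv(0.80, 0.85, 0.269))   # PPV looks healthy in-sample
print(adjusted_ppv_npv(0.80, 0.85, 0.0354))  # PPV collapses at screening prevalence
```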

Acceptance criteria: Both YAML files exist, all fields populated or explicitly marked "TBD", referenced from README.

Rationale: Model card content is effectively required by EU AI Act (Annex IV technical documentation) and recommended by FDA SaMD draft guidance. Even for academic work, they demonstrate rigor and facilitate future clinical translation.


Issue C: Demographic Reporting and Bias Assessment

Labels: documentation, fairness, enhancement. Effort: 4 hours.

Even with N=208, document and report:

  1. Age and sex distributions for the full cohort (controls vs glaucoma)
  2. Age-stratified AUROC (even if CIs are wide, point estimates are informative -- see the sketch below)
  3. Explicit statement that dataset is single-center, single-ethnicity (Singaporean: Chinese, Malay, Indian)
  4. Known physiological variations in pupil dynamics across populations (pupil size varies with iris pigmentation, age, medication)
  5. Generalizability limitations documented in model card

Acceptance criteria: Demographics table in data card, at minimum 2 subgroup analyses reported (age-stratified, sex-stratified), limitations documented.

Rationale: FDA SaMD guidance (2025) requires test data representative of intended use populations. EU AI Act requires bias analysis for high-risk systems. Journals increasingly flag the absence of demographic reporting as a deficiency regardless of sample size. Even acknowledging that the dataset is too small for robust subgroup analysis is itself valuable documentation.
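
A minimal sketch of the age-stratified AUROC from item 2 (the column names age/label/score and the age bins are hypothetical):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score


def subgroup_auroc(df: pd.DataFrame, bins=(0, 40, 60, 120),
                   labels=("<40", "40-60", ">60")) -> pd.Series:
    """Per-age-group AUROC point estimates; skips strata with a single class."""
    strata = pd.cut(df["age"], bins=bins, labels=list(labels))
    out = {}
    for name, group in df.groupby(strata, observed=True):
        if group["label"].nunique() == 2:  # AUROC undefined otherwise
            out[name] = roc_auc_score(group["label"], group["score"])
    return pd.Series(out)
```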


Tier 2: Medium Priority (Post-Publication Enhancements)

Issue D: Generate ML-BOM (Machine Learning Bill of Materials)

Labels: security, enhancement. Effort: 2 hours.

Generate a CycloneDX ML-BOM capturing:

  1. Python dependency SBOM from pyproject.toml (via cyclonedx-py)
  2. Model component metadata (CatBoost version, training config hash)
  3. Dataset provenance (Najjar 2023 DOI, preprocessing pipeline version)
  4. R dependency listing from renv.lock
  5. npm dependencies from apps/visualization/package.json

Add make sbom target and CI step to regenerate on dependency changes.

Rationale: CISA 2025 Minimum Elements for SBOM now recommended for all software. FDA SaMD guidance requires SBOM. CycloneDX ML-BOM extends traditional SBOM to cover ML-specific components (models, datasets, training configs).



Issue E: Implement Model Artifact Signing with Sigstore/OMS

Labels: security, enhancement. Effort: 4 hours.

Adopt OpenSSF Model Signing (OMS v1.0, April 2025) for model provenance:

  1. Sign MLflow model artifacts at training completion
  2. Sign DuckDB checkpoint database after extraction
  3. Verify signatures before loading in analysis pipeline
  4. Log signing events to Sigstore transparency log (Rekor)
```bash
uv add model-signing
# Sign:   model-signing sign --model path/to/artifact
# Verify: model-signing verify --model path/to/artifact
```

Add make sign-models and make verify-models targets.

Rationale: NVIDIA has signed all NGC models since March 2025, and Google is prototyping signing on Kaggle Hub. OMS v1.0 is production-ready. Signing provides cryptographic proof that model artifacts have not been tampered with -- essential for research integrity and future clinical translation.



Issue F: Add Data Integrity Verification

Labels: enhancement, reproducibility. Effort: 1 hour.

For a frozen dataset, a lightweight checksum approach is more appropriate than full DVC:

  1. Create data/_checksums.sha256 with SHA256 hashes of all data files
  2. Add CI step: sha256sum -c data/_checksums.sha256 to verify integrity
  3. Add make verify-data target
  4. Document data provenance (Najjar 2023 DOI, extraction pipeline version) in data card

Note: Full DVC is recommended only if the dataset changes in future work. For the current frozen publication, checksums provide equivalent integrity guarantees with zero additional tooling.

Rationale: Code versioning (git) without data integrity verification leaves a reproducibility gap. SHA256 checksums are the simplest effective solution for fixed datasets.
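
A minimal Python equivalent of the sha256sum -c step, usable as a pytest or CI guard (assumes the manifest uses standard sha256sum output: hex digest, two spaces, file path):

```python
import hashlib
from pathlib import Path


def verify_checksums(manifest: Path = Path("data/_checksums.sha256")) -> None:
    """Fail loudly if any data file's SHA256 digest deviates from the manifest."""
    for line in manifest.read_text().splitlines():
        expected, _, name = line.partition("  ")
        digest = hashlib.sha256(Path(name).read_bytes()).hexdigest()
        assert digest == expected, f"Integrity check failed for {name}"
```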


Issue G: Document Pickle Safety Posture (Deferred Fix)

Labels: security, documentation. Effort: 1 hour.

Context: All pickle/torch.load calls in this repo load the project's own MLflow artifacts from its own training runs -- no third-party or untrusted pickles are loaded. The shared reproducibility artifact is DuckDB (not pickle). The risk is theoretical for this use case.

What to do now (documentation):

  1. Document the pickle safety posture in SECURITY.md: "All serialized model artifacts are from internal training. No external pickles are loaded."
  2. Add # SAFETY: internal artifact, no untrusted input comments to key torch.load() and pickle.load() calls
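
A minimal sketch of the annotation convention from item 2 (the artifact path is a placeholder):

```python
import pickle

with open("mlruns/<run_id>/artifacts/metrics.pkl", "rb") as f:
    # SAFETY: internal artifact, no untrusted input -- written by our own
    # training run; this pipeline never loads externally sourced pickles.
    metrics = pickle.load(f)
```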

What to do later (if the threat model changes):

  1. Add weights_only=True to torch.load() calls where possible
  2. Add ModelScan for defense-in-depth
  3. Audit all serialization formats

Rationale: Pickle deserialization is a real ML supply chain threat (OWASP ML06), but the actual risk for this repo is LOW because all artifacts are self-generated. Document the posture now; implement technical controls if/when external artifacts are introduced.


Issue H: Add SECURITY.md with Vulnerability Disclosure Process

Labels: documentation, security. Effort: 30 minutes.

Create a SECURITY.md following GitHub's standard template:

  • Supported versions (current publication version)
  • Reporting process (email, not public issues, for security vulnerabilities)
  • Response timeline (90-day responsible disclosure)
  • Known limitations (local pipeline, not production-hardened)

Acceptance criteria: File exists at repo root, linked from README.

Rationale: GitHub recommends this for all public repos. Signals security awareness to reviewers.


Issue I: GitHub Repository Hardening (Branch Protection)

Labels: security, devops. Effort: 1 hour.

  1. Enable branch protection rules on main:
    • Require PR review (1+ approval)
    • Require CI pass before merge
    • Dismiss stale reviews on new commits
    • Restrict force pushes
  2. Add CODEOWNERS file (assign @petteriTeikari to all paths)
  3. Configure dependabot.yml for automated weekly security scans

Rationale: Prevents accidental force-push to main (already happened once with PR #47). Low effort, high value for a publication-frozen repo.


Tier 3: Low Priority (Long-Term / If Operationalized)

Issue J: Implement Runtime Monitoring Framework (If Deployed)

Labels: enhancement, monitoring. Effort: 8+ hours.

Prototype monitoring for potential clinical deployment:

  1. Data drift detection (KS test on feature distributions)
  2. Model performance tracking (rolling AUROC on labeled subset -- sketched below)
  3. Calibration drift monitoring (slope in [0.8, 1.2] range)
  4. Alert pipeline (Evidently AI or custom dashboard)
  5. Document monitoring SLAs and escalation procedures
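
A minimal sketch of item 2's rolling-window alert (class name, window size, and the 0.85 threshold mirror the hypothetical config in Section 7.2):

```python
from collections import deque

from sklearn.metrics import roc_auc_score


class RollingAuroc:
    """Track AUROC over the most recent labeled predictions."""

    def __init__(self, window_size: int = 100, alert_threshold: float = 0.85):
        self.buffer = deque(maxlen=window_size)
        self.alert_threshold = alert_threshold

    def update(self, label: int, score: float) -> bool:
        """Add one labeled prediction; return True if an alert should fire."""
        self.buffer.append((label, score))
        labels, scores = zip(*self.buffer)
        if len(set(labels)) < 2:  # AUROC undefined with a single class
            return False
        return roc_auc_score(labels, scores) < self.alert_threshold
```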

Rationale: Naveed (2025) systematic review shows monitoring is the most common gap in ML systems. Jonsson et al. (2026) AIAppOps framework positions monitoring as the unifying mechanism for compliance. Not needed for academic publication but essential for translation.


Issue K: Environment Card and Docker Digest Pinning

Labels: reproducibility, security. Effort: 2 hours.

  1. Create configs/governance/environment_card.yaml:
    • Python version, R version, Node.js version
    • OS (Ubuntu version in Docker)
    • GPU requirements (optional, for training)
    • Docker image digest (SHA256)
  2. Pin Docker base images with SHA256 digests (relates to existing GH#43)
  3. Add CI check that environment card matches actual Docker build

Rationale: Han (2025) identifies hardware/software environment differences as a key source of irreproducibility. Environment cards (UAD-copilot Cards Trail) provide a formal record.


Issue Priority Matrix

| Issue | Title | Tier | Effort | Primary Driver |
|---|---|---|---|---|
| A | Reproducibility checklist | 1 (High) | 1 hr | NeurIPS/peer review |
| B | Model Card + Data Card | 1 (High) | 2 hrs | EU AI Act, FDA SaMD draft, TRIPOD+AI |
| C | Demographic bias assessment | 1 (High) | 4 hrs | FDA SaMD, EU AI Act, journal expectations |
| D | ML-BOM generation | 2 (Medium) | 2 hrs | CISA 2025, SBOM best practice |
| E | Model signing (Sigstore/OMS) | 2 (Medium) | 4 hrs | OpenSSF, supply chain integrity |
| F | Data integrity verification | 2 (Medium) | 1 hr | Reproducibility |
| G | Document pickle safety posture | 2 (Medium) | 1 hr | OWASP ML06, documentation |
| H | SECURITY.md | 2 (Medium) | 30 min | GitHub best practice |
| I | GitHub repo hardening | 2 (Medium) | 1 hr | Prevent force-push accidents |
| J | Runtime monitoring (if deployed) | 3 (Low) | 8+ hrs | Clinical translation |
| K | Environment card + Docker digest | 3 (Low) | 2 hrs | Reproducibility |

10. References {#10-references}

Seed Papers (from biblio)

  1. Beber (2025). Total reproducibility via event sourcing in computational systems biology.
  2. Denniss et al. (2025). Pupil melanopsin biomarker glaucoma study (reproducible clinical protocols).
  3. Donkor et al. (2024). Computing in life sciences: algorithms to modern AI.
  4. Han (2025). Challenges of reproducible AI in biomedical data science. BMC Medical Genomics.
  5. Jonsson et al. (2026). AIAppOps: Socio-technical framework for data-driven AI operations.
  6. Larrazabal et al. (2022). Mitigating bias in radiology ML. RSNA.
  7. Lopes et al. (2026). Maria Platform: Clinical AI case study. CAIN '26.
  8. Naveed (2025). ML monitoring systems: Multivocal literature review (136 papers).
  9. Rahrooh et al. (2023). Interoperability and reproducibility framework (PREMIERE/PMML). J Biomed Inform.
  10. Rahman & Biplob (2025). SecureAI-Flow: CI/CD security framework (multi-agent DevSecOps).

Standards & Frameworks

  1. OWASP ML Security Top 10 v0.3 (2023). mltop10.info
  2. OpenSSF Model Signing Spec v1.0 (April 2025). github.com/ossf/model-signing-spec
  3. OpenSSF MLSecOps Whitepaper (August 2025). openssf.org
  4. CycloneDX ML-BOM (2025). cyclonedx.org/capabilities/mlbom/
  5. CISA 2025 Minimum Elements for SBOM. cisa.gov
  6. SPDX 3.0.1 AI Profile (2025). spdx.dev
  7. OWASP AI-BOM v0.1 (November 2025). owasp.org

Regulatory

  1. EU AI Act (effective August 2024, full applicability August 2026). digital-strategy.ec.europa.eu
  2. FDA Draft Guidance: AI-Enabled DSFs (January 2025). fda.gov
  3. NIST AI RMF 1.0 (March 2025 update). nist.gov
  4. NIST AI 600-1: GenAI Risk Profile (July 2024). nvlpubs.nist.gov
  5. NIST IR 8596: Cyber AI Profile (December 2025 draft). nvlpubs.nist.gov
  6. IEC 62304: Medical device software lifecycle processes.
  7. ISO 14971: Medical devices -- risk management.
  8. ISO/IEC 42001:2023: AI management system requirements.

Reproducibility

  1. Semmelrock et al. (2025). Reproducibility in ML-based research: Overview, barriers, and drivers. AI Magazine. DOI: 10.1002/aaai.70002
  2. Desai et al. (2025). What is reproducibility in AI/ML research? AI Magazine. DOI: 10.1002/aaai.70004
  3. NeurIPS 2025 Paper Checklist. neurips.cc
  4. ML Reproducibility Challenge 2025 (Princeton). reproml.org

Model Documentation

  1. Mitchell et al. (2019). Model Cards for Model Reporting. ACM FAccT.
  2. Gebru et al. (2021). Datasheets for Datasets. CACM.

Security Incidents

  1. CVE-2025-11200: MLflow authentication bypass. zeropath.com
  2. CVE-2025-11201: MLflow directory traversal RCE. zeropath.com
  3. CVE-2025-52967: MLflow SSRF. wiz.io

Industry

  1. Protect AI ModelScan. github.com/protectai/modelscan
  2. Sigstore Model Transparency. github.com/sigstore/model-transparency
  3. NVIDIA NGC Model Signing (March 2025). developer.nvidia.com

Clinical & Methodological

  1. Collins GS et al. (2024). TRIPOD+AI statement. BMJ.
  2. Riley RD et al. (2021). Minimum sample size for developing a multivariable prediction model (pmsampsize). Statistics in Medicine.
  3. Van Calster B et al. (2019). Calibration: the Achilles heel of predictive analytics. BMC Medicine.
  4. Vickers AJ, Elkin EB (2006). Decision curve analysis. Medical Decision Making.
  5. FDA (2021). Good Machine Learning Practice for Medical Device Development: Guiding Principles.
  6. Legha A et al. (2026). Sample size for clinical prediction model studies. J Clin Epidemiol.

Appendix: Review Process

This document was reviewed by two specialized AI reviewer agents:

  1. Security reviewer (MLSecOps engineer perspective): Identified factual errors (weights_only=True claim, workflow/hook counts, MLflow CVE coverage gap, pip install violations), missing threats (git history patient ID leaks, pickle deserialization exposure, Dropbox sync surface), and priority adjustments (ModelScan elevated, reproducibility checklist elevated).

  2. Clinical AI reviewer (FDA/EU AI Act/IEC 62304 expert perspective): Identified missing clinical content (intended use statement, analytical vs clinical performance distinction, FMEA, TRIPOD+AI alignment, sample size adequacy, external validation planning), regulatory accuracy corrections (draft guidance vs legal requirement, safety class nuance), and priority adjustments (demographic reporting elevated to Tier 1, ML-BOM deprioritized).

Key corrections applied from reviews:

  • Fixed false claim about weights_only=True being consistently used (it is not)
  • Corrected "legally required" to "effectively required / recommended" for model cards
  • Added clinical fields to model card template (failure modes, prevalence adjustment, clinical context)
  • Added FMEA table and additional security tools section
  • Elevated demographic reporting from Tier 3 to Tier 1
  • Elevated reproducibility checklist from Tier 2 to Tier 1
  • Added IRB/ethics and demographics fields to data card template
  • Fixed all pip install references to uv add
  • Corrected workflow count (5, not 6) and hook count (9, not 8)

Generated 2026-02-08 by Claude Code with dual reviewer agent optimization. This document is a living roadmap -- priorities may shift based on publication timeline and clinical translation decisions.