Enhance evaluation and reliability criteria for models

Clouddelta · web-flow · commit ec615e005fd7 · 2026-04-30T01:26:54.000-05:00
Expanded sections on evaluation, reliability, and deployment of biomedical foundation models, emphasizing task-centered assessment, distribution shift, and operational feasibility.
diff --git a/content/06.deployment.md b/content/06.deployment.md
@@ -2,13 +2,19 @@
 
 ### Evaluation
 #### Task-Centered Utility over Generic Benchmarks
-Evaluation of biomedical foundation models should be task-centered rather than model-centered. The central question is not whether a model scores highly on a generic leaderboard, but whether it improves performance on the biomedical task that actually matters, such as diagnosis, prognosis, survival prediction, retrieval, report generation, or biomarker discovery [@agrawal2025evaluation; @wornow2023shaky]. This requires clearly specifying the intended use case, the target population, the relevant comparator, and the consequences of model errors. Emerging clinical benchmarking paradigms make this point clearly: models excelling on general leaderboards do not necessarily perform best on clinical decision-making tasks, and there remains no single widely accepted benchmark for clinical utility [@sandmann2025deepseek].
+Evaluation of biomedical foundation models should be task-centered rather than model-centered. The central question is not whether a model scores highly on a generic leaderboard, but whether it improves performance on the biomedical task that actually matters, such as diagnosis, prognosis, survival prediction, retrieval, report generation, or biomarker discovery [@agrawal2025evaluation; @wornow2023shaky]. This requires clearly specifying the intended use case, the target population, the relevant comparator, and the consequences of model errors. Recent clinical benchmarking studies, including evaluations of general-purpose models on clinical decision-support tasks, illustrate why task-specific assessment is needed in place of general-purpose leaderboards [@sandmann2025deepseek].
+
+#### Modality-Aware Evaluation
+
+Biomedical foundation models span heterogeneous modalities, including clinical text, electronic health records, radiology and pathology images, molecular structures, protein sequences, genomics, transcriptomics, perturbation screens, and single-cell measurements [@moor2023foundation; @agrawal2025evaluation]. Evaluation therefore cannot rely on a single benchmark format. Clinical language models may be assessed through question answering, summarization, and decision-support tasks; imaging models through classification, segmentation, retrieval, and external validation; EHR models through prognosis, phenotyping, and risk stratification; and molecular or protein models through structure prediction, binding prediction, variant effect prediction, target prioritization, or downstream experimental utility.
+
+This diversity means that biomedical foundation model evaluation should distinguish among clinical utility, biological validity, and experimental actionability. A model that performs well on retrospective clinical prediction may not generate useful biological hypotheses, while a model that captures molecular regularities may not be directly deployable in patient-facing settings.
 
 #### Moving Beyond Discriminative Metrics
-Evaluation must also go beyond discrimination alone. Metrics such as AUROC, AUPRC, F1, concordance index, and segmentation overlap are informative, but they are insufficient for high-stakes biomedical use. Methodological reviews in medical imaging emphasize that proper assessment must also incorporate calibration, uncertainty, and other task-specific criteria. A model can appear accurate on average while still producing poorly calibrated or clinically misleading outputs [@kocak2025evaluationmetrics]. The choice of metric should therefore follow the downstream use case: prediction tasks require calibration and risk stratification, retrieval and generation tasks demand expert judgment of quality, and discovery-oriented applications necessitate reproducibility and biological plausibility.
+Evaluation must also go beyond discrimination alone. Metrics such as AUROC, AUPRC, F1, concordance index, and segmentation overlap are informative, but they capture only part of model performance. AUROC summarizes how well a model ranks positive cases above negative cases across decision thresholds, whereas AUPRC is often more informative when clinically relevant positive cases are rare. Methodological reviews in medical imaging emphasize that proper assessment must also incorporate calibration, uncertainty, and other task-specific criteria. A model can appear accurate on average while still producing poorly calibrated or clinically misleading outputs [@kocak2025evaluationmetrics]. The choice of metric should therefore follow the downstream use case: prediction tasks require calibration and risk stratification, retrieval and generation tasks demand expert judgment of quality, and discovery-oriented applications necessitate reproducibility and biological plausibility.
 
 #### Addressing Benchmark Contamination
-Benchmark design directly impacts the validity of results. If test sets are contaminated by publicly available pretraining data, reported performance inevitably exaggerates true progress. Pathology benchmarking initiatives have addressed this concern by withholding test data and exposing only an automated evaluation pipeline, explicitly citing contamination risk as the rationale for not releasing test cohorts [@campanella2025clinicalbenchmark]. This issue is especially acute for foundation models, whose massive pretraining corpora make overlap difficult to rule out. Fair evaluation demands comparing foundation models against robust task-specific baselines and smaller domain-adapted models, strictly utilizing contamination-aware protocols [@campanella2025clinicalbenchmark; @agrawal2025evaluation]. Often, the most relevant comparator is not the largest available model, but the strongest clinically realistic alternative.
+Benchmark design directly impacts the validity of results. If test sets are contaminated by publicly available pretraining data, reported performance may exaggerate true progress, because the model can partially recall previously seen cases, reports, images, or benchmark answers rather than generalize to genuinely unseen biomedical scenarios. Pathology benchmarking initiatives have responded to related validity concerns by using automated evaluation pipelines for external model assessment, reducing the need to publicly circulate sensitive test cohorts [@campanella2025clinicalbenchmark]. This issue is especially acute for foundation models, whose massive pretraining corpora make overlap difficult to rule out. Fair evaluation generally requires comparing foundation models against robust task-specific baselines and smaller domain-adapted models under contamination-aware protocols [@campanella2025clinicalbenchmark; @agrawal2025evaluation]. Often, the most relevant comparator is not the largest available model, but the strongest clinically realistic alternative.
 
 #### Prospective Assessment and the Cost of Expert Review
 Retrospective benchmarking should not be treated as the endpoint of evaluation. Evidence becomes substantially stronger when evaluation extends toward workflow-aware analysis and prospective study. A prospective trial of open-source melanoma AI demonstrated that while the algorithmic system had the potential to improve dermatologist decision-making in clinical practice, larger randomized studies remained necessary before routine adoption [@marchetti2023proveai]. Furthermore, human-in-the-loop evaluation introduces significant financial and temporal bottlenecks due to the high cost of expert clinician annotation. While rigorous expert assessment is indispensable, exploring scalable solutions—such as LLM-as-a-judge frameworks validated by human spot-checking—may become necessary to sustain thorough evaluation pipelines. Ultimately, the value of an AI system is established only when predictive performance is connected to usability, workflow fit, and patient-relevant outcomes [@rosen2025clinicalimplementation].
@@ -18,10 +24,10 @@ Retrospective benchmarking should not be treated as the endpoint of evaluation.
 Reliability is a stricter requirement than benchmark performance. In the biomedical setting, it concerns whether model outputs remain trustworthy when deployment conditions depart from the development setting, and whether failures are sufficiently visible, bounded, and manageable. This matters critically for foundation models, which are often promoted as flexible, cross-modal systems—a flexibility that paradoxically complicates validation and oversight [@moor2023foundation; @agrawal2025evaluation]. Reliability should therefore be understood as a combination of robustness under shift, rigorous uncertainty quantification, and protection against clinically consequential failure modes.
 
 #### Distribution Shift and Shortcut Learning
-A defining reliability problem in biomedicine is distribution shift. Clinical data vary across hospitals, scanners, laboratories, note-writing styles, demographic groups, and calendar time, frequently causing strong internal performance to degrade upon external deployment. Ong Ly et al. demonstrated across 13 medical datasets that apparent model performance was overestimated by up to 20% on average because models exploited hidden data-acquisition biases rather than disease-relevant signals [@ong2024shortcut]. Related findings in medical imaging indicate that fairness assessed in the original distribution can break down under external generalization, warning that subgroup reliability cannot be assumed [@yang2024limits]. Therefore, evaluation must explicitly examine subgroup stability and resistance to shortcut learning, rather than presuming that massive scale or domain pretraining automatically resolves these vulnerabilities.
+A defining reliability problem in biomedicine is distribution shift. Clinical data vary across hospitals, scanners, laboratories, note-writing styles, demographic groups, and calendar time, frequently causing strong internal performance to degrade upon external deployment. Ong Ly et al. demonstrated across 13 medical datasets that apparent model performance was overestimated by up to 20% on average because models exploited hidden data-acquisition biases rather than disease-relevant signals [@ong2024shortcut]. Related findings in medical imaging indicate that fairness assessed in the original distribution can break down under external generalization, warning that subgroup reliability cannot be assumed [@yang2024limits]. Therefore, evaluation must explicitly examine subgroup stability and resistance to shortcut learning, rather than presuming that massive scale or domain pretraining automatically resolves these vulnerabilities.
 
 #### Uncertainty Quantification and Strategic Deferral
-Reliability inherently requires models to signal when they are likely to fail. In clinical settings, an incorrect answer delivered with high confidence is significantly more dangerous than no answer at all. Banerji et al. argue that clinical AI tools should communicate predictive uncertainty at the level of the individual patient, rather than relying solely on population-level summary statistics [@banerji2023uncertainty]. Similarly, frameworks for responsible deployment emphasize selective prediction, wherein models abstain, defer, or restrict outputs in settings where generalization is uncertain [@goetz2024generalization]. For biomedical foundation models, true reliability incorporates robust uncertainty quantification (UQ), abstention behaviors, and escalation pathways to human clinicians. A model unable to express uncertainty is not yet reliable enough for routine biomedical use.
+Reliability inherently requires models to signal when they are likely to fail. In clinical settings, an incorrect answer delivered with high confidence is significantly more dangerous than no answer at all. Banerji et al. argue that clinical AI tools should communicate predictive uncertainty at the level of the individual patient, rather than relying solely on population-level summary statistics [@banerji2023uncertainty]. Similarly, frameworks for responsible deployment emphasize selective prediction, wherein models abstain, defer, or restrict outputs in settings where generalization is uncertain [@goetz2024generalization]. For biomedical foundation models, true reliability incorporates robust uncertainty quantification (UQ), abstention behaviors, and escalation pathways to human clinicians. Models that cannot express uncertainty remain difficult to justify in routine high-stakes biomedical use.
 
 #### Generative Faithfulness and Safety
 These issues become acute for generative models. In tasks such as note summarization, report drafting, or clinical question answering, the central requirement is not surface-level plausibility, but factual faithfulness and clinical safety. Tam et al. reviewed 142 studies on human evaluation of healthcare LLMs, identifying substantial gaps in generalizability and advocating for standardized expert-centered assessment [@tam2024human]. More directly, Asgari et al. analyzed hallucinations and omissions in medical text summarization, finding that while omissions were more common, hallucinations constituted major errors with severe potential for downstream clinical harm [@asgari2025clinicalsafety]. Consequently, reliability in generative tasks cannot be established through automated metrics alone; it mandates structured failure-mode analysis and explicit assessment of potential patient harm.
@@ -33,6 +39,12 @@ Deployment is where the central question of this review becomes most concrete. E
 #### Workflow Integration and Human-AI Collaboration
 A primary requirement is that biomedical foundation models must actively support human work. Successful deployment depends on embedding the model into existing decision pathways without increasing cognitive or administrative burdens. A recent Nature Medicine study on decision support for colorectal cancer surgery utilized a stepwise design—moving from registry-based development to prospective clinical implementation with risk-tailored intervention bundles—explicitly linking predictions to actionable treatment workflows [@rosen2025clinicalimplementation]. Clinician-led deployment studies consistently show more meaningful impacts than technologist-led initiatives, underscoring the necessity of domain leadership [@li2025leadership]. Evaluative focus is shifting toward assessing human-AI collaboration through metrics like time savings and task completion rather than isolated accuracy [@wang2025leads].
 
+#### Deployment Beyond the Clinic: Closed-Loop Biomedical Discovery
+
+Deployment of biomedical foundation models is not limited to clinical decision support. In discovery-oriented settings, foundation models may be integrated into closed-loop experimental pipelines, where models propose hypotheses, prioritize molecules or perturbations, guide robotic experiments, and update predictions based on new assay results. These systems connect model inference with laboratory action, making deployment a problem of both computation and experimental operations.
+
+Closed-loop biomedical discovery introduces challenges that differ from clinical AI, including experimental batch effects, assay variability, robotic failure modes, reagent constraints, sample availability, cost-aware experimental design, and lab-specific protocols. Evaluation should therefore extend beyond retrospective prediction accuracy to include experimental yield, reproducibility across batches, reduction in search cost, feasibility of suggested experiments, and biological plausibility.
+
 #### Operational Feasibility and Regulatory Compliance
 Foundation models may appear theoretically attractive but practically infeasible due to infrastructure demands, latency, privacy constraints, and stringent regulatory requirements. Large multimodal systems require substantial computing power and complex interfacing with existing EHR, LIS, or PACS infrastructure. Utilizing external APIs may violate institutional data governance, while local deployment remains computationally expensive. Furthermore, deployment strategies must align with regulatory frameworks, such as the FDA's guidelines for Software as a Medical Device (SaMD) or equivalent international standards. Model updates create ongoing compliance challenges, as behavior shifts post-retraining require renewed local validation and regulatory review.
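The contrast drawn under "Moving Beyond Discriminative Metrics", between ranking metrics such as AUROC and AUPRC and the calibration of predicted risks, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the manuscript or its cited studies: the synthetic cohort, the Beta(1, 12) risk distribution, the fourth-root distortion used to mimic an overconfident model, and the helper names `auroc`, `average_precision`, and `expected_calibration_error`.

```python
import random


def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (mean precision at each positive)."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / sum(y_true)


def expected_calibration_error(y_true, probs, n_bins=10):
    """Bin-weighted mean gap between predicted probability and observed event rate."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    ece = 0.0
    for b in bins:
        if b:
            mean_p = sum(p for _, p in b) / len(b)
            rate = sum(y for y, _ in b) / len(b)
            ece += len(b) / len(y_true) * abs(mean_p - rate)
    return ece


# Hypothetical cohort: each patient has a true risk, and the outcome is drawn from it.
rng = random.Random(0)
risk = [rng.betavariate(1, 12) for _ in range(2000)]  # rare positive class
y = [1 if rng.random() < r else 0 for r in risk]

calibrated = risk                           # predicted probability equals true risk
overconfident = [r ** 0.25 for r in risk]   # monotone distortion: same ranking, inflated risks

for name, probs in [("calibrated", calibrated), ("overconfident", overconfident)]:
    print(
        f"{name:14s} AUROC={auroc(y, probs):.3f} "
        f"AP={average_precision(y, probs):.3f} "
        f"ECE={expected_calibration_error(y, probs):.3f}"
    )
```

Because the distortion is strictly monotone, both score sets rank patients identically, so AUROC and average precision do not change, while the expected calibration error grows sharply. This is one concrete way a model can look accurate on discrimination metrics and still produce misleading risk estimates.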