arXiv | HuggingFace | Blog | Cite | License
Patho-Bench is a Python library designed to benchmark foundation models for pathology.
This project was developed by the Mahmood Lab at Harvard Medical School and Brigham and Women's Hospital. This work was funded by NIH NIGMS R35GM138216.
Note
Please report any issues on GitHub and contribute by opening a pull request.
- Reproducibility: Canonical train-test splits for 42 tasks across 20 public datasets.
- Evaluation frameworks: Supports linear probing, prototyping (coming soon), retrieval, Cox survival prediction, and supervised fine-tuning.
- Scalability: Scales to thousands of experiments with automatic GPU load-balancing.
- February 2025: Patho-Bench is public!
- Create a virtual environment, e.g., conda create -n "pathobench" python=3.10, and activate it: conda activate pathobench
- Clone the repository: git clone https://github.com/mahmoodlab/Patho-Bench.git && cd Patho-Bench
- Install dependencies (including Trident): pip install -r requirements.txt
- Install Patho-Bench locally: pip install -e .
Additional packages may be required if you are loading specific pretrained models. Follow error messages for additional instructions.
Note
Patho-Bench works with encoders implemented in Trident; use Trident to extract patch embeddings for your WSIs prior to running Patho-Bench.
Note
Patho-Bench automatically downloads supported tasks from our HuggingFace repo. Our provided HuggingFace splits only include train and test assignments, not validation. If you want to use a validation set, you can manually reserve a portion of the training set for validation (val) after downloading the split. Note that some tasks have a small number of samples, which may make a validation set impractical.
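Reserving part of the training set as a validation fold can be done with a few lines of Python. This is a minimal sketch: the sample IDs and the (sample_id, fold) record layout below are illustrative assumptions, not the actual Patho-Bench split schema — adapt it to the columns of the split file you downloaded.

```python
import random

# Hypothetical split records: (sample_id, assignment).
# The real split file layout may differ; this is illustrative only.
rows = [(f"case_{i}", "train") for i in range(80)] + \
       [(f"case_{i}", "test") for i in range(80, 100)]

random.seed(0)
train_ids = [sid for sid, fold in rows if fold == "train"]
random.shuffle(train_ids)

# Reserve 20% of the training cases as a validation set.
n_val = int(0.2 * len(train_ids))
val_ids = set(train_ids[:n_val])

# Reassign the reserved cases from "train" to "val"; test cases are untouched.
resplit = [(sid, "val" if sid in val_ids else fold) for sid, fold in rows]
```

For tasks with very few training samples, consider skipping the validation set entirely, as the note above suggests.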
If you want to use custom splits, format them similarly to our HuggingFace splits.
Patho-Bench supports various evaluation frameworks:
- linprobe ➡️ Linear probing (using pre-pooled features)
- coxnet ➡️ Cox proportional hazards model for survival prediction (using pre-pooled features)
- protonet ➡️ Prototyping (using pre-pooled features) (Coming soon!)
- retrieval ➡️ Retrieval (using pre-pooled features)
- finetune ➡️ Supervised finetuning or training from scratch (using patch features)
Patho-Bench can be used in two ways:
- Basic: Importable classes and functions for easy integration into custom codebases
- Advanced: Large-scale benchmarking using automated scripts
Running any of the evaluation frameworks is straightforward (see example below). Each experiment takes general-purpose arguments for setting up the experiment, plus framework-specific arguments. For a detailed introduction, follow our end-to-end tutorial.
from patho_bench.ExperimentFactory import ExperimentFactory # Make sure you have installed Patho-Bench and this imports correctly
model_name = 'titan'
train_source = 'cptac_ccrcc'
task_name = 'BAP1_mutation'
# Initialize the experiment
experiment = ExperimentFactory.linprobe( # This is linear probing, but similar APIs are available for coxnet, protonet, retrieval, and finetune
model_name = model_name,
train_source = train_source,
test_source = None, # Leave as default (None) to automatically use the test split of the training source
task_name = task_name,
patch_embeddings_dirs = '/path/to/job_dir/20x_512px_0px_overlap/features_conch_v15', # Can be list of paths if patch features are split across multiple directories. See NOTE below.
pooled_embeddings_root = './_test_pooled_features',
splits_root = './_test_splits', # Splits are downloaded here from HuggingFace. You can also provide your own splits using the path_to_split and path_to_task_config arguments
combine_slides_per_patient = False, # Only relevant for patient-level tasks with multiple slides per patient. See NOTE below.
cost = 1,
balanced = False,
saveto = './_test_linprobe'
)
experiment.train()
experiment.test()
result = experiment.report_results(metric = 'macro-ovr-auc')
Note
Regarding the combine_slides_per_patient argument: if True, Patho-Bench performs early fusion by combining patches from all slides into a single bag prior to pooling. If False, it pools each slide individually and takes the mean of the slide-level features. The ideal value of this parameter depends on which pooling model you are using. For example, Titan requires this to be False because it uses spatial information (patch coordinates) during pooling. If a model doesn't use spatial information, you can usually set this to True, but it's best to consult the model documentation.
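The difference between the two modes can be seen with a toy example. Plain mean pooling stands in here for the real pooling model, which may be far more complex (e.g., Titan uses patch coordinates); the numbers are made up for illustration.

```python
# Toy 1-D "patch embeddings" for one patient with two slides of unequal size.
slide_a = [0.2, 0.4, 0.6, 0.8]   # 4 patches
slide_b = [1.0, 2.0]             # 2 patches

mean = lambda xs: sum(xs) / len(xs)

# combine_slides_per_patient=True: early fusion -- merge all patches into
# a single bag, then pool once.
early = mean(slide_a + slide_b)                 # 5.0 / 6 = 0.8333...

# combine_slides_per_patient=False: pool each slide individually, then
# take the mean of the slide-level features.
late = mean([mean(slide_a), mean(slide_b)])     # (0.5 + 1.5) / 2 = 1.0
```

With unequal patch counts, early fusion effectively weights slides by their number of patches, while late fusion weights each slide equally, which is why the two settings generally give different results.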
Note
Provide patch_embeddings_dirs so Patho-Bench knows where to find the patch embeddings for pooling. While Trident also supports pooling, it doesn't handle patient-level tasks with multiple slides per patient; Patho-Bench uses a generalized pooling function for multi-slide fusion. Patho-Bench requires Trident patch-level features, NOT slide-level features.
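Because patch_embeddings_dirs accepts a list when features are spread across several directories, a slide's feature file may live in any one of them. The sketch below shows one way to resolve a slide ID to its feature file across multiple directories; the directory names and the one-.h5-file-per-slide naming are assumptions for illustration, not a documented Patho-Bench contract.

```python
import os
import tempfile

# Build a fake layout: two feature directories, one .h5 file per slide.
# (Names are hypothetical; check your own Trident output paths.)
root = tempfile.mkdtemp()
dirs = [os.path.join(root, d) for d in ("features_part1", "features_part2")]
for d in dirs:
    os.makedirs(d)
open(os.path.join(dirs[0], "slide_001.h5"), "w").close()
open(os.path.join(dirs[1], "slide_002.h5"), "w").close()

def find_feature_file(slide_id, search_dirs):
    """Return the path to <slide_id>.h5 from the first directory containing it."""
    for d in search_dirs:
        path = os.path.join(d, f"{slide_id}.h5")
        if os.path.exists(path):
            return path
    return None
```

A quick sanity check like this before launching experiments can catch slides whose patch features were never extracted.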
Want to do large-scale benchmarking? See instructions for advanced usage.
If you find our work useful in your research or if you use parts of this code, please consider citing the following papers:
@article{zhang2025standardizing,
title={Accelerating Data Processing and Benchmarking of AI Models for Pathology},
author={Zhang, Andrew and Jaume, Guillaume and Vaidya, Anurag and Ding, Tong and Mahmood, Faisal},
journal={arXiv preprint arXiv:2502.06750},
year={2025}
}
@article{vaidya2025molecular,
title={Molecular-driven Foundation Model for Oncologic Pathology},
author={Vaidya, Anurag and Zhang, Andrew and Jaume, Guillaume and Song, Andrew H and Ding, Tong and Wagner, Sophia J and Lu, Ming Y and Doucet, Paul and Robertson, Harry and Almagro-Perez, Cristina and others},
journal={arXiv preprint arXiv:2501.16652},
year={2025}
}