Specialize & compress a small DINOv3 (ViT‑S/16) for the PCam histology dataset.
TL;DR:
- Strong PCam classifier from a DINOv3 ViT‑S/16 backbone (best: 0.980 AUROC / 0.920 Sens@95%Spec).
- Fine-tuning: LoRA nearly matches full fine-tuning; head-only training is not far behind.
- Pruning: Architecture-preserving compression (attention-head & MLP pruning + per-layer SVD) removes ~7% of parameters in this pass, but at a significant cost (AUROC drops by ~0.025).
- Quantization: Casting to bf16 halves the memory footprint with no significant performance drop.
- Easily reproducible with a Makefile and scripts (steps below).
- Tested on Apple M-series, GPU cluster (A100) with Slurm, logging with Weights & Biases.
We use AUROC as the main metric. All runs use 224×224 inputs and test-time augmentation (TTA) at evaluation.
| Method | Quantization | Parameters | GFLOPs | Memory (MB) | AUROC | AUPRC | Sens @95%Spec | ECE | Brier | Acc | NLL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Full fine‑tune (FT) | none (f32) | 21.60M | 8.67 | 86.5 | 0.9800 | 0.9820 | 0.9202 | 0.0209 | 0.0603 | 0.9202 | 0.2116 |
| - | bf16 | 21.60M | 8.67 | 43.3 (↓50.0%) | 0.9800 | 0.9820 | 0.9195 | 0.0215 | 0.0602 | 0.9200 | 0.2115 |
| LoRA (r=8; attn+MLP adapters) | none (f32) | 21.60M | 8.67 | 86.5 | 0.9746 (↓0.6%) | 0.9786 | 0.9091 | 0.0148 | 0.0631 | 0.9190 | 0.2229 |
| - | bf16 | 21.60M | 8.67 | 43.3 (↓50.0%) | 0.9746 (↓0.6%) | 0.9785 | 0.9082 | 0.0146 | 0.0631 | 0.9183 | 0.2230 |
| Linear probe (head-only) | none (f32) | 21.60M | 8.67 | 86.5 | 0.9714 (↓0.9%) | 0.9742 | 0.8808 | 0.0186 | 0.0654 | 0.9131 | 0.2274 |
| - | bf16 | 21.60M | 8.67 | 43.3 (↓50.0%) | 0.9714 (↓0.9%) | 0.9742 | 0.8793 | 0.0177 | 0.0659 | 0.9122 | 0.2289 |
| Full FT + compression (heads+MLP+SVD, τ=[0.89, 0.975, 0.975]) | none (f32) | 20.08M (↓7.0%) | 8.06 (↓7.0%) | 80.4 (↓7.0%) | 0.9537 (↓2.7%) | 0.9630 | 0.8452 | 0.0408 | 0.1046 | 0.8553 | 0.3385 |
| - | bf16 | 20.08M (↓7.0%) | 8.06 (↓7.0%) | 40.3 (↓53.4%) | 0.9532 (↓2.7%) | 0.9628 | 0.8439 | 0.0390 | 0.1046 | 0.8545 | 0.3388 |
| LoRA + compression (same τ) | none (f32) | 20.08M (↓7.0%) | 8.06 (↓7.0%) | 80.4 (↓7.0%) | 0.9513 (↓2.9%) | 0.9607 | 0.8423 | 0.0322 | 0.1064 | 0.8527 | 0.3445 |
| - | bf16 | 20.08M (↓7.0%) | 8.06 (↓7.0%) | 40.3 (↓53.4%) | 0.9521 (↓2.8%) | 0.9612 | 0.8451 | 0.0321 | 0.1058 | 0.8538 | 0.3427 |
| Head-only + compression (same τ) | none (f32) | 20.08M (↓7.0%) | 8.06 (↓7.0%) | 80.4 (↓7.0%) | 0.9307 (↓5.0%) | 0.9314 | 0.6870 | 0.0230 | 0.1046 | 0.8577 | 0.3452 |
| - | bf16 | 20.08M (↓7.0%) | 8.06 (↓7.0%) | 40.3 (↓53.4%) | 0.9316 (↓4.9%) | 0.9323 | 0.6889 | 0.0226 | 0.1038 | 0.8591 | 0.3427 |
GFLOPs estimated with ptflops; memory footprint reflects the state dict size.
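The footprint arithmetic behind the bf16 column is easy to sanity-check: the state dict stores each parameter at 4 bytes in f32 and 2 bytes in bf16, so checkpoint size is roughly parameter count times bytes per element. A minimal sketch (not the repo's `memory_utils.py`; the backbone/head split below is illustrative):

```python
# Sketch of the footprint arithmetic behind the table (not the repo's
# memory_utils.py). Assumption: footprint ~ sum(numel) * bytes per element.
BYTES_PER_ELEMENT = {"f32": 4, "bf16": 2}

def footprint_mb(param_counts, dtype="f32"):
    """Approximate state-dict size in MB for a {name: numel} mapping."""
    total = sum(param_counts.values()) * BYTES_PER_ELEMENT[dtype]
    return total / 1e6

# ~21.6M parameters for ViT-S/16 + head (the split here is illustrative):
params = {"backbone": 21_300_000, "head": 300_000}
print(footprint_mb(params, "f32"))   # ~86.4 MB
print(footprint_mb(params, "bf16"))  # ~43.2 MB: the ~50% drop in the table
```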
Takeaways:
- Full FT still leads (AUROC 0.980, Sens@95%Spec 0.920), but LoRA trails by only 0.005 AUROC with tighter calibration (ECE 0.0148 vs 0.0209).
- bf16 checkpoints halve memory requirements to ~43 MB with no measurable hit to accuracy, calibration, or sensitivity across the board.
- Linear probe remains a competitive lightweight option at 0.971 AUROC / 0.913 Acc, though sensitivity at high specificity lags the LoRA and full FT runs.
- Compression trims ~7% params/FLOPs but costs around 0.025 AUROC.
Note: for the LoRA and linear-probe runs, we also allowed training of the backbone layer norms and biases.
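Sens@95%Spec (reported throughout the table) is the sensitivity obtained after fixing the decision threshold so that 95% of negatives are correctly rejected. A short sketch of how it can be computed from scores, assuming arrays of labels and predicted probabilities (this is an illustration, not `eval_utils.py`'s exact code):

```python
import numpy as np

def sens_at_spec(y_true, scores, spec=0.95):
    """Sensitivity at fixed specificity: pick the threshold that `spec` of
    the negatives fall at or below, then measure recall on the positives."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    thr = np.quantile(scores[y_true == 0], spec)
    return float((scores[y_true == 1] > thr).mean())

# Perfectly separated toy scores -> sensitivity 1.0 at 95% specificity.
y = [0] * 100 + [1] * 100
s = [0.1] * 100 + [0.9] * 100
print(sens_at_spec(y, s))  # 1.0
```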
- Backbone & classifier: `DinoV3Backbone` wraps the `facebook/dinov3-vits16-pretrain-lvd1689m` encoder and feeds the `DinoV3PCam` head (`src/models/backbone_dinov3.py`).
- Data pipeline: `src/data/pcam_hf.py` loads PCam HDF5 splits with Hugging Face preprocessing, and `src/utils/data_utils.py` applies histology-friendly flips, 90° rotations, and color jitter.
- Training orchestration: `src/train/finetune.py` drives head-only probes, LoRA adapters, and full fine-tunes with cosine decay + warmup, Weights & Biases logging, and optional backbone LayerNorm/bias updates.
- LoRA adapters: `src/models/lora.py` exposes the modular `LoRALinear` layers plus merge/unmerge helpers so adapters can be toggled at evaluation or export time.
- Compression stack: `src/train/pruning.py` implements three pruning passes (attention-head, MLP-unit, and truncated SVD; see Appendix) and can chain them before optional quantization.
- Quantization & footprint: `src/utils/memory_utils.py` estimates memory footprint and supports bf16 quantization.
- Evaluation & TTA: `src/utils/eval_utils.py` computes AUROC/AUPRC, calibration, and sensitivity; heavy evaluations default to test-time augmentation (flips + 90° rotations), which reduces orientation sensitivity and stabilizes calibration on histology patches.
- Reporting: `src/utils/roc_plot.py` plots ROC curves from exported probabilities, and evaluation runs append predictions to `reports/results_probs.csv` for downstream analysis.
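The LoRA idea used above can be summarized in a few lines: a frozen weight W plus a trainable low-rank delta BA scaled by alpha/r, which can be merged into W for export so inference needs no extra matmuls. A numpy sketch of the math (not the repo's `LoRALinear`; shapes and the zero-init of B follow the usual LoRA conventions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 6, 2, 16

W = rng.normal(size=(d_out, d_in))        # frozen base weight (out x in)
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # up-projection, zero-init per LoRA
B += rng.normal(size=(d_out, r)) * 0.1    # ...pretend training updated it
scale = alpha / r

def lora_forward(x):
    """Base path plus scaled low-rank adapter path."""
    return x @ W.T + (x @ A.T) @ B.T * scale

# Merging folds the adapter into W, so exported weights run at base-model cost:
W_merged = W + scale * (B @ A)
x = rng.normal(size=(4, d_in))
assert np.allclose(lora_forward(x), x @ W_merged.T)
```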
Works on Apple MPS (except bf16 quantization) or a GPU cluster (A100). Uses Python 3.12+ with uv.
Prerequisites:
- Access to the DINOv3 weights on Hugging Face and a `huggingface-cli login`.
- If using a cluster (SLURM), adjust the cluster-specific variables at the top of the `Makefile` to suit your environment.
# 0) Setup
uv sync # install deps
make get-data # download PCam into src/data/pcam
# 1) Baselines (choose method: head_only | lora | fullft)
# Local: make baseline
# Cluster (SLURM): make sbaseline
make baseline METHOD=head_only EPOCHS=8 # linear probe
make baseline METHOD=lora # LoRA (q/k/v/o + MLP)
make baseline METHOD=fullft # full fine‑tune
# 2) Evaluate + compress a saved checkpoint
# Local: make eval
# Cluster (SLURM): make seval
# CHECKPOINT: absolute path or relative to checkpoints/saved/
# (methods: attention_heads, mlp_neurons, truncated_svd; combine with commas)
make eval CHECKPOINT=lora.pt \
PRUNE_TARGETS=all \
PRUNE_METHOD=attention_heads,mlp_neurons,truncated_svd \
PRUNE_AMOUNT=attention_heads=0.89,mlp_neurons=0.975,truncated_svd=0.975
# Exports per-sample predictions to reports/results_probs.csv.
# Optional: append QUANTIZE=bf16 to evaluate with bf16 weights (supported devices only).

Numerous hyperparameters/settings are available (LoRA r, alpha, dropout; separate learning rates for different parts of the network; label smoothing; grad clip; ...); see the Makefile and the appendix for a full list.
In addition to W&B logging, all evaluation runs append per-sample probabilities to reports/results_probs.csv for offline analysis.
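If you want to recompute the headline metric offline from the exported probabilities, AUROC has a simple rank-based form (the Mann-Whitney statistic). A sketch that assumes you have already loaded labels and scores as arrays (the CSV's column layout is not shown here, so loading is left out):

```python
import numpy as np

def auroc(y_true, scores):
    """Rank-based AUROC (Mann-Whitney): the probability that a random
    positive outscores a random negative. Ignores ties for brevity."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    ranks = np.empty(len(scores))
    ranks[scores.argsort()] = np.arange(1, len(scores) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```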
Steps to reproduce the results:
uv sync
make get-data
# Adjust the cluster-specific variables at the top of the Makefile
bash scripts/repro_finetuning.sh --wandb # remove flag for no W&B logging
# Place the checkpoints in checkpoints/saved/ with names head_only.pt, lora.pt, fullft.pt
bash scripts/repro_pruning.sh --wandb # remove flag for no W&B logging

- PatchCamelyon (PCam) is a binary classification dataset built from small patches extracted from histopathology whole-slide images (WSIs) of lymph node sections. Each patch is labeled positive if tumor (metastatic breast carcinoma) tissue is present in the central 32×32 px region; otherwise negative.
- Size & shape: 327,680 RGB patches of size 96×96 pixels.
- Splits: Train 262,144, val 32,768, test 32,768, each 50/50 class-balanced with no WSI overlap between splits to avoid leakage.
- Patches are sampled from CAMELYON16 WSIs. Slides were scanned at 40× (≈0.243 μm/px) across two centers and downsampled to 10× for patch selection.
- No leakage by construction. We keep the PCam split guarantee (no overlapping WSIs across train/val/test).
- Histology-friendly augmentations: we use flips, 90° rotations, and light color jitter; we enable test-time augmentation (TTA).
- Resolution. Our models train/evaluate at 224×224; PCam patches are upsampled from 96×96 in the preprocessing pipeline.
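Since histology patches have no canonical orientation, the flips and 90° rotations used for augmentation and TTA form the 8-element dihedral group. A sketch of the TTA averaging under that assumption (an illustration, not the repo's evaluation code):

```python
import numpy as np

def dihedral_views(img):
    """The 8 flip / 90-degree-rotation variants of a square patch."""
    views = []
    for k in range(4):
        rot = np.rot90(img, k)
        views.extend([rot, np.fliplr(rot)])
    return views

def tta_predict(predict_fn, img):
    """Average predictions over the 8 dihedral views."""
    return np.mean([predict_fn(v) for v in dihedral_views(img)], axis=0)

# Toy predictor: mean intensity, which is invariant to flips/rotations,
# so here TTA agrees with a single pass exactly.
img = np.random.default_rng(0).random((96, 96))
assert len(dihedral_views(img)) == 8
assert np.isclose(tta_predict(np.mean, img), img.mean())
```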
- Hardware tested: MacBook Air M4 (10C CPU / 10C GPU / 24GB RAM) and Slurm cluster with A100‑40GB.
- Frameworks: PyTorch 2.x, torchvision, Hugging Face Transformers.
- Data: PCam HDF5 official splits (no WSI leakage by construction).
- Code: MIT (see `LICENSE`).
- Backbone / fine-tuned weights: subject to Meta's DINOv3 license (HF gated). Do not redistribute weights without complying with the DINOv3 terms.
- Data: PCam (CC0) per the PCam repository.
src/
data/pcam_hf.py # PCam HDF5 dataset w/ HF preprocessing
models/
backbone_dinov3.py # DINOv3 backbone + classifier head
lora.py # LoRA modules and injection utilities
train/
finetune.py # training loops + W&B logging
pruning.py # attention/MLP pruning & SVD compression
utils/
eval_utils.py # metrics, eval, timing, FLOPs helpers
roc_plot.py # ROC plotting from exported probabilities
memory_utils.py # state_dict sizing + quantization helpers
scripts/
download_pcam.py # dataset fetcher
repro_finetuning.sh # baseline fine-tuning recipes
repro_pruning.sh # evaluation/compression recipes
We use three graph-safe compression passes that keep external I/O shapes intact while shrinking internal channels/rank. Each uses an energy keep-fraction $\tau \in (0, 1]$.

Attention-head pruning. Let a block have $H$ heads. Score head $h$ by the energy of its output-projection slice $W_h$ (PyTorch weight layout: out × in): $e_h = \lVert W_h \rVert_F^2$. Sort the heads by descending $e_h$ and keep the smallest set $\mathcal{K}$ with $\sum_{h \in \mathcal{K}} e_h \ge \tau \sum_h e_h$; the q/k/v/o projections are then sliced to the kept heads.

MLP-unit pruning. For an MLP with hidden width $d_{\mathrm{ff}}$, score unit $j$ by $e_j = \lVert u_j \rVert_2^2 + \lVert d_j \rVert_2^2$, where $u_j$ is row $j$ of the up-projection and $d_j$ is column $j$ of the down-projection. Sort the units by descending $e_j$, keep the smallest set reaching $\tau$ of the total energy, then subselect rows of the up-projection and the matching columns of the down-projection.

Truncated SVD. For a linear $W \in \mathbb{R}^{m \times n}$ with singular values $\sigma_1 \ge \sigma_2 \ge \dots$, pick the smallest rank $r$ such that

$$\frac{\sum_{i \le r} \sigma_i^2}{\sum_i \sigma_i^2} \ge \tau,$$

then use the rank-$r$ factorization $W \approx (U_{:, \le r} \Sigma_{\le r})\, V_{:, \le r}^\top$, i.e. two stacked linears with $r(m + n)$ parameters. We apply this only if it reduces parameters, i.e. $r(m + n) < mn$.
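The truncated-SVD pass can be sketched in a few lines (a numpy illustration of the rank selection and factorization, not the repo's `pruning.py`):

```python
import numpy as np

def svd_compress(W, tau=0.975):
    """Replace W (out x in) with a rank-r factorization that keeps a tau
    fraction of squared-singular-value energy -- only if it saves params."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    energy = np.cumsum(S**2) / np.sum(S**2)
    r = int(np.searchsorted(energy, tau)) + 1   # smallest rank reaching tau
    m, n = W.shape
    if r * (m + n) >= m * n:                    # no parameter saving: skip
        return None
    A = U[:, :r] * S[:r]                        # (m, r)
    B = Vt[:r]                                  # (r, n)
    return A, B                                 # x @ W.T -> (x @ B.T) @ A.T

# Known singular values 10, 5, 1, 0.1: rank 2 already holds >97.5% energy.
W = np.zeros((8, 6))
W[[0, 1, 2, 3], [0, 1, 2, 3]] = [10.0, 5.0, 1.0, 0.1]
A, B = svd_compress(W)
print(A.shape, B.shape)  # (8, 2) (2, 6): 28 params instead of 48
```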
- `--method` (default `head_only`; choices `head_only`, `fullft`, `lora`) – selects which parameters train.
- `--lr_head` / `--lr_backbone` / `--lr_lora` – override optimizer learning rates for the classifier head, backbone, or LoRA adapters.
- `--lora_r` (default `8`), `--lora_alpha` (default `16`), `--lora_dropout` (default `0.0`) – configure LoRA rank, scaling, and dropout.
- `--lora_targets` (default `q_proj,k_proj,v_proj,o_proj`) – comma-separated linear layers to wrap with LoRA; add `up_proj`, `down_proj` when adapters should cover MLPs.
- `--lora_include_mlp` – boolean flag to inject LoRA modules into MLP projections automatically.
- `--train_log_every_steps` (default `50`) – logging cadence for training metrics (`0` disables).
- `--val_eval_frac` (default `0.25`) – fraction of an epoch between mid-epoch validation passes.
- `--val_mid_epoch` / `--val_epoch_end` – toggle mid-epoch and end-of-epoch validation loops.
- `--val_heavy_mid` / `--val_heavy_end` / `--val_heavy_zero` – enable full metric suites during the corresponding validation passes.
- `--batch_size` (default `128`) – training mini-batch size; `--val_batch_size` is shared but can still be overridden for training runs.
- `--epochs` (default `10`), `--lr` (default `1e-3`), `--weight_decay` (default `1e-4`), `--seed` (default `1337`) – core optimization controls.
- `--max_train_batches` (default `0`) – cap the number of training batches per epoch (`0` means full epoch).
- `--skip_bench` / `--lat_warmup` (default `20`) / `--lat_iters` (default `100`) – control latency/GFLOPs benchmarking during training.
- `--save_dir` (default `checkpoints`), `--save_last`, `--save_best` – configure checkpoint persistence.
- `--train_norms_bias` (choices `none`, `norms`, `bias`, `both`) and `--lr_norms_bias` – opt into training backbone LayerNorms/biases for head-only or LoRA modes and set their learning rate.
- `--warmup_steps` (default `200`) – number of linear warmup steps before cosine decay.
- `--grad_clip` (default `1.0`) – gradient norm clip (`0` disables).
- `--label_smoothing` (default `0.0`) – apply cross-entropy label smoothing.
- `--val_epoch0` – evaluate the checkpoint before any training updates.
- `--aug_histology` – enable the histology-specific augmentation recipe.
- `--select_metric` (default `auroc`; choices `val_loss`, `auroc`, `sens95`, `nll`, `brier`, `ece`, `acc`) – metric used to pick the best checkpoint.
- `--checkpoint` (required) – checkpoint path; relative paths resolve inside `checkpoints/saved`.
- `--prune_method` (default `none`) – comma-separated sequence of compression passes (`truncated_svd`, `attention_heads`, `mlp_neurons`).
- `--prune_amount` (default `0.95`) – energy keep-threshold; accepts a single float or per-method overrides like `attention_heads=0.9,truncated_svd=0.85`.
- `--prune_targets` (default `q_proj,k_proj,v_proj,o_proj,up_proj,down_proj`) – layer name substrings for SVD compression (`all` or `*` selects every known target).
- `--wandb_run_name` – display name for evaluation runs logged to W&B.
- `--results_csv` (default `reports/results_probs.csv`) – file that collects per-sample predictions.
- `--results_run_name` – optional identifier stored alongside exported probabilities.
- `--compute_flops` – enable GFLOPs reporting with `ptflops`.
- `--quantize` (default `none`; choices `none`, `int8`, `bf16`) – post-pruning quantization mode.
- `--data_dir` (default `src/data/pcam`) – location of the PCam dataset; overridable via `make baseline` / `make seval` variables.
- `--model_id` (default `facebook/dinov3-vits16-pretrain-lvd1689m` for training; optional override for pruning) – Hugging Face backbone identifier.
- `--resolution` (default `224` for training; inherits from the checkpoint when unspecified during pruning) – input image size in pixels.
- `--val_batch_size` (default `256`) – evaluation batch size.
- `--num_workers` (default `4`) – DataLoader subprocess count.
- `--tta_eval` – toggle test-time augmentation during heavy evaluation passes.
- `--max_eval_batches` (default `0`) – limit validation/test batches processed per epoch run (`0` = full set).
- `--wandb`, `--wandb_project` (default `dinov3-pcam-compress`), `--wandb_entity`, `--wandb_mode` (choices `online`, `offline`, inherit env) – control Weights & Biases logging.
