Paper title: Revisiting the LiRA Membership Inference Attack Under Realistic Assumptions
Requested Badges:
- Available
- Functional
- Reproduced
Najeeb Jebreel, Mona Khalil, David Sánchez, and Josep Domingo-Ferrer, "Revisiting the LiRA Membership Inference Attack Under Realistic Assumptions", Proceedings on Privacy Enhancing Technologies (PoPETs), 2026. Preprint: https://arxiv.org/abs/2603.07567
This artifact contains the full implementation of the experimental pipeline described in the paper. It covers all five stages of the workflow: shadow-model training, LiRA attack (four variants plus a global-threshold baseline), per-run post-analysis, cross-run reproducibility analysis, and paper-figure generation. All 10 paper benchmarks (Table 1) are manifest-driven and reproducible from a single command.
This artifact is purely academic research software. It implements the LiRA membership inference attack against ML models trained on public benchmark datasets (CIFAR-10, CIFAR-100, GTSRB, Purchase-100). No real personal data is involved. Running the code poses no security or privacy risk to the reviewer's machine.
Minimum: A CUDA-capable GPU (≥ 8 GB VRAM) is required for all image-dataset benchmarks. The Purchase-100 tabular benchmarks can run on CPU but benefit from a GPU.
Authors' machine (paper results):
- GPU: NVIDIA GeForce RTX 4080, 16 GB VRAM
- OS: Ubuntu 20.04 LTS (WSL2 on Windows 11 Education)
- CUDA: 12.6 / Driver 560.94
- RAM: 32 GB
- CPU: Intel Core i7-12700 (12 cores / 20 threads)
Option A — Conda (recommended for bare-metal Linux/WSL2):
| Component | Version used |
|---|---|
| OS | Ubuntu 20.04 LTS (WSL2) |
| Python | 3.11 |
| PyTorch | 2.5.1 |
| torchvision | 0.20.1 |
| numpy | 1.26.4 |
| scipy | 1.14.1 |
| pandas | 2.2.3 |
| seaborn | 0.13.2 |
| scikit-learn | 1.5.2 |
| timm | 1.0.15 |
| matplotlib | 3.9.2 |
| Pillow | 10.4.0 |
| pyyaml | 6.0.2 |
| tqdm | 4.67.1 |
All versions are pinned in requirements-lock.txt. The full pinned environment is reproducible
via environment.yml (see Set up the environment).
Option B — Docker (alternative; requires NVIDIA Container Toolkit):
| Component | Version used |
|---|---|
| Base image | nvidia/cuda:12.6.3-runtime-ubuntu22.04 |
| Miniconda | Miniconda3-py311_24.11.1-0 |
| Python / packages | same as Option A (installed from environment.yml) |
| NVIDIA Container Toolkit | ≥ 1.14 (see install guide) |
A Dockerfile is provided at the repository root. It installs Miniconda and the pinned environment.yml
on top of the CUDA 12.6 base image. Runtime data directories (data/, experiments/,
analysis_results/) are bind-mounted at run time so artefacts survive container exit.
Overall human time: ~30 minutes (environment setup + launching benchmark commands).
Overall compute time: varies by benchmark; see the table in
Experiments. Full reproduction of all 10 benchmarks requires roughly 10–14 days
of sequential GPU compute on an RTX 4080. Reviewers are encouraged to start with purchase100_baseline (≤ ~16 h), then
rerun the core image benchmark (e.g. cifar10_baseline, ~33–34 h) for deeper verification.
Actual times can be shorter when early stopping triggers.
Disk space:
- Code + environment: ~2 GB
- One benchmark run (e.g. cifar10_baseline): ~10–15 GB
- All 10 benchmarks: ~100–150 GB
The artifact is publicly available at:
https://github.com/najeebjebreel/lira_analysis
The link above points to the main branch. After the artifact evaluation is finalized, a stable
commit tag will be collected by the artifact chairs.
Option A — Conda (bare-metal Linux/WSL2):
# 1. Clone the repository
git clone https://github.com/najeebjebreel/lira_analysis.git
cd lira_analysis
# 2. Create the pinned Conda environment
conda env create -f environment.yml
conda activate lira-repro
# 3. Stage the Purchase-100 dataset (image datasets download automatically)
# Download features_labels.npy from: https://drive.proton.me/urls/25C1HJ14S8#3uJjfOAAPblu
# Then:
mkdir -p data/purchase
mv features_labels.npy data/purchase/
The environment.yml installs Python 3.11 and all packages from requirements-lock.txt in a
single step. After activation, all python *.py invocations are available.
Option B — Docker:
Prerequisites: Docker Engine + NVIDIA Container Toolkit.
# 1. Clone the repository
git clone https://github.com/najeebjebreel/lira_analysis.git
cd lira_analysis
# 2. Build the image (one-time; ~10–15 min, ~6 GB)
docker build -t lira-analysis:main .
# 3. Stage the Purchase-100 dataset (if needed — see Option A step 3 above)
# 4. Launch an interactive shell inside the container
docker run --gpus all --rm -it \
-v "$(pwd)/data:/workspace/data" \
-v "$(pwd)/experiments:/workspace/experiments" \
-v "$(pwd)/analysis_results:/workspace/analysis_results" \
lira-analysis:main bash
Once inside the container, all python commands below work unchanged
(the entry point activates the lira-repro conda environment automatically).
All outputs written to /workspace/data, /workspace/experiments, and
/workspace/analysis_results persist on the host via the bind mounts.
Step 1 — Dry-run (no data or training required; verifies config generation and CLI):
python scripts/run_benchmark.py --benchmark cifar10_baseline --dry-run
Note: the configs/generated/ directory does not exist on a fresh clone; it is created automatically at runtime when a benchmark is first executed.
Expected output: three [run] lines (train → attack → analysis) with no errors:
[run] cwd=...
[run] ... python train.py --config .../configs/generated/cifar10_baseline.train.yaml
[run] cwd=...
[run] ... python attack.py --config .../configs/generated/cifar10_baseline.attack.yaml
[run] cwd=...
[run] ... python .../comprehensive_analysis/run_analysis.py --exp-path ...
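One way to check the dry-run output mechanically is to save it to a file and confirm that the three stage commands appear in the expected order. The sketch below assumes the `[run]` line format shown in the expected output above; the helper name `stages_in_order` is hypothetical and not part of the artifact.

```python
def stages_in_order(log_text):
    """Return True if the dry-run log lists train, attack, analysis in that order."""
    # Script names are taken from the expected output shown above.
    markers = ("train.py", "attack.py", "run_analysis.py")
    seen = []
    for line in log_text.splitlines():
        # Only command lines matter; "[run] cwd=..." lines are skipped.
        if not line.startswith("[run]") or line.startswith("[run] cwd="):
            continue
        for marker in markers:
            if marker in line:
                seen.append(marker)
    return seen == list(markers)


sample_log = """\
[run] cwd=...
[run] ... python train.py --config .../configs/generated/cifar10_baseline.train.yaml
[run] cwd=...
[run] ... python attack.py --config .../configs/generated/cifar10_baseline.attack.yaml
[run] cwd=...
[run] ... python .../comprehensive_analysis/run_analysis.py --exp-path ...
"""
print(stages_in_order(sample_log))  # prints True
```

In practice the log text would come from redirecting the dry-run command's standard output to a file and reading that file back.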
Step 2 — Full-pipeline smoke test (~5 minutes; CIFAR-10 downloads automatically, no manual staging required):
python scripts/run_benchmark.py --benchmark cifar10_baseline \
--override training.epochs=1 training.end_shadow_model_idx=9
This runs all three pipeline phases (train → attack → analysis) with 10 shadow models × 1 epoch. Expected outputs:
- experiments/cifar10/resnet18/cifar10_baseline/model_{0..9}/best_model.pth: 10 checkpoints
- experiments/cifar10/resnet18/cifar10_baseline/attack_results_leave_one_out_summary.csv: attack metrics
- analysis_results/cifar10/resnet18/cifar10_baseline/summary_statistics_two_modes.csv: analysis output
Note: With only 10 shadow models the LiRA scores are degenerate (AUC ≈ 50%) — this is expected and handled gracefully. The smoke test only verifies that the full pipeline runs without errors.
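The expected smoke-test outputs can also be checked with a few lines of Python. This is a minimal sketch that only tests for file existence at the paths listed above; the function name `missing_smoke_outputs` is hypothetical and the script is not shipped with the artifact.

```python
from pathlib import Path


def missing_smoke_outputs(repo_root=".", n_models=10):
    """List the expected smoke-test files that are absent under repo_root."""
    root = Path(repo_root)
    exp = root / "experiments" / "cifar10" / "resnet18" / "cifar10_baseline"
    ana = root / "analysis_results" / "cifar10" / "resnet18" / "cifar10_baseline"
    # One checkpoint per shadow model (model_0 ... model_9 in the smoke test).
    expected = [exp / f"model_{i}" / "best_model.pth" for i in range(n_models)]
    expected.append(exp / "attack_results_leave_one_out_summary.csv")
    expected.append(ana / "summary_statistics_two_modes.csv")
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    missing = missing_smoke_outputs(".")
    print("all outputs present" if not missing else f"missing: {missing}")
```

Run it from the repository root after the smoke test finishes; an empty list means all 12 files were written.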
All results below are produced by the five-phase pipeline described in README.md.
The paper's claims are anchored in Section 5 and the corresponding tables and figures.
Under anti-overfitting (AOF) regularisation and transfer-learning (TL) fine-tuning, the true positive rate (TPR) of all four LiRA variants drops substantially compared to the baseline. This holds across all datasets (CIFAR-10, CIFAR-100, GTSRB, Purchase-100). The independent variable is the training regime (baseline / AOF / TL); the dependent variable is TPR at fixed low FPR thresholds (0.001% and 0.1%).
When thresholds are calibrated on shadow models only (the realistic threat model), TPR and positive predictive value (PPV) under skewed priors (1%, 10% membership prior) are significantly lower than under the optimistic leave-one-out setting. The independent variable is the calibration mode (target-and-shadow vs. shadow-only); the dependent variables are TPR and PPV at 0.001% and 0.1% FPR.
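To make the quantities in this claim concrete, the sketch below calibrates a threshold on shadow-model non-member scores, applies it to a target model, and converts the resulting rates into PPV via Bayes' rule. It is an illustration under stated assumptions, not the artifact's code: higher scores are assumed to indicate membership, the decision rule is assumed to be `score > threshold`, and all function names and toy scores are hypothetical.

```python
import math


def calibrate_threshold(nonmember_scores, target_fpr):
    """Pick a threshold so that at most target_fpr of the given
    non-member scores lie strictly above it."""
    ranked = sorted(nonmember_scores, reverse=True)
    # Number of false positives tolerated at the target FPR.
    allowed = min(int(target_fpr * len(ranked)), len(ranked) - 1)
    return ranked[allowed]


def rates(member_scores, nonmember_scores, threshold):
    """Empirical TPR and FPR of the rule score > threshold."""
    tpr = sum(s > threshold for s in member_scores) / len(member_scores)
    fpr = sum(s > threshold for s in nonmember_scores) / len(nonmember_scores)
    return tpr, fpr


def ppv(tpr, fpr, prior):
    """P(member | flagged) for a given membership prior (Bayes' rule)."""
    flagged = prior * tpr + (1.0 - prior) * fpr
    return prior * tpr / flagged if flagged > 0 else math.nan


# Toy data: threshold comes from shadow models, rates from the target model.
shadow_out = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
threshold = calibrate_threshold(shadow_out, target_fpr=0.1)
tpr, fpr = rates([0.5, 0.95, 1.2, 1.5], [0.2, 0.4, 0.6, 0.97], threshold)
for prior in (0.01, 0.1, 0.5):
    print(prior, round(ppv(tpr, fpr, prior), 4))
```

Because the threshold is fixed before the target model is seen, the achieved FPR on the target can differ from the nominal one, and a small prior then pulls PPV down even when TPR is unchanged.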
The set of samples classified as vulnerable at low FPR (0.001%) changes substantially across independent training runs: Jaccard similarity between pairs of runs is low and the intersection shrinks rapidly. Rankings of samples by LiRA score, however, are more stable. The independent variable is the number of runs compared; the dependent variables are Jaccard index, intersection size, and union size.
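The three dependent variables can be illustrated on toy data. The sketch below treats each run as a set of sample indices flagged as vulnerable and computes the mean pairwise Jaccard index plus the sizes of the intersection and union over all runs. How reproducibility_analysis.py aggregates these quantities is not described here, so the averaging over pairs is an assumption and the function names are hypothetical.

```python
from itertools import combinations


def mean_pairwise_jaccard(run_sets):
    """Average Jaccard index over all pairs of runs."""
    if len(run_sets) < 2:
        raise ValueError("need at least two runs")
    values = []
    for a, b in combinations(run_sets, 2):
        union = a | b
        # Two empty sets are treated as identical.
        values.append(len(a & b) / len(union) if union else 1.0)
    return sum(values) / len(values)


def intersection_union_sizes(run_sets):
    """Sizes of the intersection and union across all runs."""
    return len(set.intersection(*run_sets)), len(set.union(*run_sets))


# Toy example: sample indices flagged as vulnerable in three runs.
runs = [{1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 5, 6}]
for k in (2, 3):
    subset = runs[:k]
    print(k, round(mean_pairwise_jaccard(subset), 3), intersection_union_sizes(subset))
```

Adding the third run shrinks the intersection from 3 to 2 samples while the union grows from 5 to 6, which is the pattern the claim describes at larger scale.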
Samples with a high loss ratio (train loss / test loss, averaged over shadow models) are more consistently identified as vulnerable. The independent variable is the loss ratio decile; the dependent variable is mean online LiRA TPR.
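A decile analysis of this kind can be sketched as follows: sort samples by loss ratio, split them into ten equal-sized bins, and average a per-sample detection indicator within each bin. This is only an illustration on toy data; the inputs, the binning rule, and the name `mean_by_decile` are assumptions and may differ from what loss_ratio_tpr.py does.

```python
import math


def mean_by_decile(loss_ratios, detected):
    """Mean of `detected` (0/1 flags or detection frequencies) within each
    loss-ratio decile, ordered from lowest to highest loss ratio."""
    if len(loss_ratios) != len(detected):
        raise ValueError("inputs must have the same length")
    order = sorted(range(len(loss_ratios)), key=lambda i: loss_ratios[i])
    n = len(order)
    means = []
    for d in range(10):
        # Integer slicing keeps bin sizes within one sample of each other.
        idx = order[d * n // 10:(d + 1) * n // 10]
        means.append(sum(detected[i] for i in idx) / len(idx) if idx else math.nan)
    return means


# Toy data: 20 samples, only the higher-loss-ratio half is ever detected.
ratios = [float(i) for i in range(20)]
flags = [0] * 10 + [1] * 10
print(mean_by_decile(ratios, flags))
```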
The table below maps every paper item to the primary script and its expected output path.
| Paper item | Claim | Script | Expected output path |
|---|---|---|---|
| Table 1 | benchmark configurations | benchmark YAMLs | configs/benchmarks/ |
| Table 2 | model utility and loss ratio | train.py | experiments/<dataset>/<arch>/<run>/train_test_stats.csv, shadow_metrics_aggregate.csv |
| Tables 3–6 | optimistic LiRA performance (baseline / AOF / TL) | attack.py | experiments/<dataset>/<arch>/<run>/attack_results_leave_one_out_summary.csv |
| Tables 7–10 | realistic shadow thresholds, skewed priors (0.001% FPR) | run_analysis.py | analysis_results/<dataset>/<arch>/<run>/summary_statistics_two_modes.csv, summary_shadow_prior0p01.tex, … |
| Table 11 | reproducibility of ranking-based sets (median gap) | reproducibility_analysis.py | analysis_results/tables/topq_tail_table_median_gap_rank_scores.tex |
| Table 12 | realistic shadow thresholds, skewed priors (0.1% FPR) | run_analysis.py | same .tex outputs as Tables 7–10 (0.1% FPR columns) |
| Table 13 (appendix) | reproducibility of ranking-based sets (mean gap + FPR-thresholded) | reproducibility_analysis.py | analysis_results/tables/topq_tail_table_mean_gap_rank_scores.tex |
| Table 14 (appendix) | ablation: WideResNet-28-2 on CIFAR-10 | run_analysis.py | analysis_results/cifar10/wrn28-2/<run>/summary_shadow_prior*.tex |
| Table 15 (appendix) | CIFAR-10 under TL (EfficientNetV2) | run_analysis.py | analysis_results/cifar10/<efficientnetv2>/<run>/summary_shadow_prior*.tex (requires cifar10_tl benchmark) |
| Figure 1 | threshold variability at 0.001% vs 0.1% FPR | threshold_distribution.py | analysis_results/figures/thresh_boxplot_single.pdf, thresh_boxplot_multi.pdf |
| Figure 2 | reproducibility, stability, coverage at 0.001% FPR (TP≥1) | reproducibility_analysis.py | analysis_results/figures/jaccard_noleg_0p001pct.pdf, intersection_0p001pct.pdf, union_0p001pct.pdf |
| Figure 3 | zero-FP reproducibility heatmap, 0.001% FPR | reproducibility_analysis.py | analysis_results/figures/tpgeq_x_0fp_identical_heatmaps_0p001pct.pdf |
| Figure 4 | top-9 vulnerable samples across runs and training variations | run_analysis.py, compose_top_vulnerable.py | analysis_results/figures/topk_vulnerable_images/<dataset>/<arch>/<run>/top9_vulnerable_online_shadow_0p001pct.png (assembled manually) |
| Figure 5 | rank displacement under run-to-run variability | reproducibility_analysis.py | analysis_results/figures/inside_run_displacement_simple/ |
| Figure 6 | reproducibility, stability, coverage at 0.1% FPR (TP≥1) | reproducibility_analysis.py | analysis_results/figures/jaccard_noleg_0p1pct.pdf, intersection_0p1pct.pdf, union_0p1pct.pdf |
| Figure 7 | loss ratio vs online LiRA TPR correlation | loss_ratio_tpr.py | analysis_results/figures/lossratio_tpr.pdf |
| Figure 8 | in/out score and ratio distributions across benchmarks | plot_benchmark_distribution.py | analysis_results/figures/sample_inout_score.pdf, sample_inout_ratio.pdf |
| Figure 9 (appendix) | all-positives reproducibility heatmap, 0.001% FPR | reproducibility_analysis.py | analysis_results/figures/tpgeq_x_identical_heatmaps_0p001pct.pdf |
| Figure 10 (appendix) | all-positives reproducibility heatmap, 0.1% FPR | reproducibility_analysis.py | analysis_results/figures/tpgeq_x_identical_heatmaps_0p1pct.pdf |
| Figure 11 (appendix) | top-16 vulnerable samples: 3 seeds × 2 FPR settings | run_analysis.py --top-k 16 --nrow 4 (per seed), then assemble manually | analysis_results/cifar10/resnet18/<run>/top16_vulnerable_online_shadow_{0p001pct,0p1pct}.png |
| Figure 12 (appendix) | zero-FP reproducibility heatmap, 0.1% FPR | reproducibility_analysis.py | analysis_results/figures/tpgeq_x_0fp_identical_heatmaps_0p1pct.pdf |
| Figure 13 (appendix) | distribution of median gap vulnerability scores | reproducibility_analysis.py | analysis_results/figures/bin_boxplot_scores_median_gap_rank_scores.pdf |
| Figure 14 (appendix) | top-1 vulnerable sample median gap score across runs | reproducibility_analysis.py | analysis_results/figures/top1_across_runs/top1_across_runs_boxplots_median_gap_rank_scores.pdf |
| Figure 15 (appendix) | top-1 sample identity across runs | reproducibility_analysis.py | analysis_results/figures/top1_across_runs/top1_samples_grid_median_gap_rank_scores.pdf |
- Time: ~5 human-minutes + ≤ ~16 compute-hours (RTX 4080)
- Storage: ~2 GB
This is the recommended entry point for reviewers with limited time or compute. It runs the full end-to-end pipeline (train → attack → analysis) for the Purchase-100 baseline FCN benchmark (tabular data, 1 augmentation, up to 100 epochs × 256 shadow models).
python scripts/run_benchmark.py --benchmark purchase100_baseline
Expected outputs under experiments/purchase/fcn/purchase100_baseline/:
- shadow_metrics_aggregate.csv: per-model train/test accuracy and loss ratio
- attack_results_leave_one_out_summary.csv: AUC, TPR@0.001% FPR, TPR@0.1% FPR → Tables 3, 5
- membership_labels.npy, online_scores_leave_one_out.npy, and other score arrays
Expected outputs under analysis_results/purchase/fcn/purchase100_baseline/:
- summary_statistics_two_modes.csv: TPR/PPV under both calibration modes and all three priors → Tables 7, 9
- samples_vulnerability_ranked_*.csv: per-sample vulnerability scores
Expected numerical values (from paper Tables 6 and 10 — Purchase-100 Baseline):
Table 6 — optimistic (leave-one-out) thresholds, π = 50%:
| Attack | AUC (%) | TPR @ 0.001% FPR (%) | TPR @ 0.1% FPR (%) |
|---|---|---|---|
| LiRA (online) | 70.16 ± 0.29 | 0.523 ± 0.243 | 4.491 ± 0.281 |
| LiRA (online, fixed var) | 69.52 ± 0.28 | 0.180 ± 0.110 | 3.089 ± 0.188 |
| LiRA (offline) | 55.11 ± 0.48 | 0.007 ± 0.007 | 0.500 ± 0.077 |
| LiRA (offline, fixed var) | 56.11 ± 0.51 | 0.022 ± 0.017 | 0.713 ± 0.078 |
| Global threshold | 54.83 ± 0.15 | 0.001 ± 0.001 | 0.100 ± 0.015 |
Table 10 — realistic (shadow-only) thresholds at 0.001% FPR:
| Attack | TPR' (%) | FPR' (%) | PPV @ π=1% | PPV @ π=10% | PPV @ π=50% |
|---|---|---|---|---|---|
| LiRA (online) | 0.516 ± 0.047 | 0.001 ± 0.001 | 89.93 ± 11.15 | 98.84 ± 1.37 | 99.87 ± 0.16 |
| LiRA (online, fixed var) | 0.159 ± 0.015 | 0.001 ± 0.001 | 78.87 ± 22.27 | 96.70 ± 3.80 | 99.61 ± 0.47 |
| LiRA (offline) | 0.004 ± 0.002 | 0.001 ± 0.001 | 55.82 ± 48.27 | 66.20 ± 37.65 | 87.37 ± 16.56 |
| LiRA (offline, fixed var) | 0.018 ± 0.004 | 0.001 ± 0.001 | 57.27 ± 43.60 | 80.46 ± 21.53 | 96.32 ± 4.79 |
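When comparing a rerun against these tables, a simple tolerance check can help separate run-to-run noise from a real discrepancy. The sketch below tests whether a reproduced AUC lies within k standard deviations of the mean reported in Table 6. The choice k = 2 is an arbitrary assumption, not a criterion taken from the paper, and the function names are hypothetical; the reproduced values in the usage line are made up for illustration.

```python
# (mean, std) of AUC in percent, copied from Table 6 above.
PAPER_AUC = {
    "LiRA (online)": (70.16, 0.29),
    "LiRA (online, fixed var)": (69.52, 0.28),
    "LiRA (offline)": (55.11, 0.48),
    "LiRA (offline, fixed var)": (56.11, 0.51),
    "Global threshold": (54.83, 0.15),
}


def within_band(value, mean, std, k=2.0):
    """True if value lies within k standard deviations of mean."""
    return abs(value - mean) <= k * std


def check_auc(reproduced, k=2.0):
    """Map each attack name to whether its reproduced AUC is within the band."""
    return {
        name: within_band(value, *PAPER_AUC[name], k=k)
        for name, value in reproduced.items()
    }


# Hypothetical reproduced values, for illustration only.
print(check_auc({"LiRA (online)": 70.40, "Global threshold": 56.0}))
```

The same pattern applies to the TPR and PPV columns, and to the CIFAR-10 tables, by swapping in the corresponding means and standard deviations.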
- Time: ~10 human-minutes + ~33–34 compute-hours (RTX 4080; ~29 h training ≈ 7 min/model, plus 15–17% attack overhead with 18 augmentations)
- Storage: ~10–15 GB
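The compute-time estimate above can be reproduced with simple arithmetic. The sketch below assumes the attack overhead is a fraction of training time and that training time scales linearly with the number of shadow models; both are simplifying assumptions, and the function names are hypothetical.

```python
def total_hours(train_hours, attack_overhead):
    """Training time plus attack time, with the attack expressed as a
    fraction of training time."""
    return train_hours * (1.0 + attack_overhead)


def scaled_train_hours(full_train_hours, n_models, full_n_models=256):
    """Training time for fewer shadow models, assuming linear scaling."""
    return full_train_hours * n_models / full_n_models


# ~29 h of training for 256 models and 15-17% attack overhead, as stated above.
for overhead in (0.15, 0.17):
    full = total_hours(29.0, overhead)
    mini = total_hours(scaled_train_hours(29.0, 64), overhead)
    print(f"overhead={overhead:.0%}: full ~{full:.1f} h, 64 models ~{mini:.1f} h")
```

With these inputs the full run lands at roughly 33–34 h and a 64-model run at roughly 8–9 h; early stopping and hardware differences will shift both.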
This reproduces the primary benchmark used throughout the paper. It covers Main Results 1–4 for the CIFAR-10 / ResNet-18 / baseline setting, and its outputs feed into the cross-run reproducibility figures (Experiments 3 and 4) when run under multiple seeds.
python scripts/run_benchmark.py --benchmark cifar10_baseline
The benchmark manifest sets experiment_dir to experiments/cifar10/resnet18/cifar10_baseline.
For a custom named run (e.g. when running multiple seeds), use:
python train.py --config configs/train_image.yaml \
--override seed=1 experiment.run_name=seed1
python attack.py --config configs/attack.yaml \
--override experiment.checkpoint_dir=experiments/cifar10/resnet18/seed1
python comprehensive_analysis/run_analysis.py \
--exp-path experiments/cifar10/resnet18/seed1 \
--target-fprs 0.00001 0.001 --priors 0.01 0.1 0.5
Key outputs to verify against the paper (Tables 3–4, 7–8):
- experiments/cifar10/resnet18/seed1/attack_results_leave_one_out_summary.csv → Tables 3–4
- analysis_results/cifar10/resnet18/seed1/summary_statistics_two_modes.csv → Tables 7–8
- experiments/cifar10/resnet18/seed1/roc_curve_single.pdf → visual sanity check
Expected numerical values (from paper Tables 3 and 7 — CIFAR-10 Baseline):
Table 3 — optimistic (leave-one-out) thresholds, π = 50%:
| Attack | AUC (%) | TPR @ 0.001% FPR (%) | TPR @ 0.1% FPR (%) |
|---|---|---|---|
| LiRA (online) | 76.48 ± 0.32 | 3.956 ± 1.061 | 10.268 ± 0.555 |
| LiRA (online, fixed var) | 76.28 ± 0.31 | 2.876 ± 1.064 | 9.135 ± 0.508 |
| LiRA (offline) | 55.58 ± 0.92 | 0.762 ± 0.348 | 3.262 ± 0.338 |
| LiRA (offline, fixed var) | 56.64 ± 0.89 | 0.948 ± 0.526 | 4.540 ± 0.424 |
| Global threshold | 59.97 ± 0.32 | 0.003 ± 0.004 | 0.097 ± 0.027 |
Table 7 — realistic (shadow-only) thresholds at 0.001% FPR:
| Attack | TPR' (%) | FPR' (%) | PPV @ π=1% | PPV @ π=10% | PPV @ π=50% |
|---|---|---|---|---|---|
| LiRA (online) | 3.990 ± 0.161 | 0.002 ± 0.003 | 94.73 ± 6.10 | 99.46 ± 0.65 | 99.94 ± 0.07 |
| LiRA (online, fixed var) | 2.912 ± 0.142 | 0.002 ± 0.003 | 93.10 ± 8.03 | 99.26 ± 0.91 | 99.92 ± 0.10 |
| LiRA (offline) | 0.713 ± 0.052 | 0.002 ± 0.003 | 81.31 ± 20.20 | 97.24 ± 3.33 | 99.67 ± 0.40 |
| LiRA (offline, fixed var) | 0.918 ± 0.068 | 0.003 ± 0.005 | 81.13 ± 21.29 | 97.03 ± 4.06 | 99.64 ± 0.52 |
Key qualitative claims to verify: (i) LiRA performance degrades significantly under AOF and AOF+TL (core claim, Table 3); (ii) PPV drops sharply when AOF/TL is combined with shadow-based threshold calibration and skewed priors (Table 7).
Simplified option — Mini CIFAR-10 (~7–9 h on RTX 4080):
For reviewers who cannot afford the full ~33–34 h run, use 64 shadow models instead of 256. This validates the full pipeline end-to-end and qualitatively reproduces the attack ordering, but the numerical values (especially TPR at very low FPR) will differ from the paper:
python train.py --config configs/train_image.yaml \
--override seed=1 experiment.run_name=seed1_mini \
training.end_shadow_model_idx=63
python attack.py --config configs/attack.yaml \
--override experiment.checkpoint_dir=experiments/cifar10/resnet18/seed1_mini
python comprehensive_analysis/run_analysis.py \
--exp-path experiments/cifar10/resnet18/seed1_mini \
--target-fprs 0.00001 0.001 --priors 0.01 0.1 0.5
The reproducibility script (comprehensive_analysis/reproducibility_analysis.py) auto-discovers all Phase 3 outputs under --analysis-root; any number of runs is
supported. Choose the tier that fits your available compute:
Option A — Partial rerun with 3–4 seeds (~100–160 compute-hours on RTX 4080)
Run Experiment 2 for 3–4 seeds, then feed those results into the analysis. This is enough to verify the trend (Jaccard decreasing with k, intersection shrinking).
# Train + attack + per-run analysis for seeds 1–4
for i in 1 2 3 4; do
python train.py --config configs/train_image.yaml \
--override seed=${i} experiment.run_name=seed${i}
python attack.py --config configs/attack.yaml \
--override experiment.checkpoint_dir=experiments/cifar10/resnet18/seed${i}
python comprehensive_analysis/run_analysis.py \
--exp-path experiments/cifar10/resnet18/seed${i} \
--target-fprs 0.00001 0.001 --priors 0.01 0.1 0.5
done
# Reproducibility analysis over the runs
python comprehensive_analysis/reproducibility_analysis.py \
--analysis-root analysis_results/cifar10/resnet18 \
--skip-rank
The --skip-rank flag omits the rank-stability section, which requires experiment
directories with raw score files (not just analysis outputs).
Option B — Full paper reproduction (12 seeds + 4 variants, ~500+ compute-hours)
Run all 12 seeds of cifar10_baseline plus the four training-variation runs
(bs512_drp0.2, mixup_drp0.15, tl, wrn28-2/seed42), then:
python comprehensive_analysis/reproducibility_analysis.py
Expected outputs:
analysis_results/figures/:
- jaccard_noleg_0p001pct.pdf, intersection_0p001pct.pdf, union_0p001pct.pdf → Figure 2
- tpgeq_x_0fp_identical_heatmaps_0p001pct.pdf → Figure 3
- jaccard_noleg_0p1pct.pdf, intersection_0p1pct.pdf, union_0p1pct.pdf → Figure 6
- tpgeq_x_identical_heatmaps_0p001pct.pdf → Figure 9 (appendix)
- tpgeq_x_identical_heatmaps_0p1pct.pdf → Figure 10 (appendix)
- tpgeq_x_0fp_identical_heatmaps_0p1pct.pdf → Figure 12 (appendix)
analysis_results/tables/:
- reproducibility_<N>runs_<M>variants_0p001pct.csv: Jaccard/intersection/union (N=12, M=4 for full paper set)
analysis_results/figures/ (rank-stability additions):
- inside_run_displacement_simple/ → Figure 5
- bin_boxplot_scores_median_gap_rank_scores.pdf → Figure 13 (appendix)
- top1_across_runs/top1_across_runs_boxplots_median_gap_rank_scores.pdf → Figure 14 (appendix)
- top1_across_runs/top1_samples_grid_median_gap_rank_scores.pdf → Figure 15 (appendix)
analysis_results/tables/ (rank-stability additions):
- topq_tail_table_median_gap_rank_scores.tex → Table 11 (appendix)
- topq_tail_table_mean_gap_rank_scores.tex → Table 13 (appendix)
Verify generated PDFs and CSVs against the values reported in the paper.
- Time: ~5 human-minutes + < 15 compute-minutes (requires Phase 3 outputs to be present)
All scripts read Phase 3/4 outputs and write PDFs to analysis_results/figures/.
Figure 1 — threshold variability boxplots:
python comprehensive_analysis/threshold_distribution.py
Expected: analysis_results/figures/thresh_boxplot_single.pdf, thresh_boxplot_multi.pdf
Figure 4 — collect per-run top-vulnerable grids (requires Phase 3 grids; panel assembled manually):
python comprehensive_analysis/compose_top_vulnerable.py
Expected: analysis_results/figures/topk_vulnerable_images/<dataset>/<arch>/<run>/top9_vulnerable_*.png
Figure 7 — loss ratio vs TPR correlation:
python comprehensive_analysis/loss_ratio_tpr.py
Expected: analysis_results/figures/lossratio_tpr.pdf
Figure 8 — per-benchmark score and ratio distributions:
python comprehensive_analysis/plot_benchmark_distribution.py \
--config configs/figure_panels/figure8_scores.yaml
python comprehensive_analysis/plot_benchmark_distribution.py \
--config configs/figure_panels/figure8_ratios.yaml
Expected: analysis_results/figures/sample_inout_score.pdf, sample_inout_ratio.pdf
Figure 11 (appendix) — top-16 vulnerable samples: 3 seeds × 2 FPR settings (manual assembly):
# Regenerate top-16 grids for seeds 1–3
for i in 1 2 3; do
python comprehensive_analysis/run_analysis.py \
--exp-path experiments/cifar10/resnet18/seed${i} \
--target-fprs 0.00001 0.001 --top-k 16 --nrow 4
done
This produces six PNG grids:
- analysis_results/cifar10/resnet18/seed{1,2,3}/top16_vulnerable_online_shadow_0p001pct.png
- analysis_results/cifar10/resnet18/seed{1,2,3}/top16_vulnerable_online_shadow_0p1pct.png
Arrange them in a 2 × 3 layout (rows = FPR threshold, columns = seed) using any image editor or
LaTeX \includegraphics to reproduce Figure 11 as shown in the paper.
Verify generated PDFs against the figures in the paper.
The table below summarises reproducibility status for each paper item, followed by details on items that are not fully reproducible within a typical evaluation timeframe.
| Paper item | Reproducible? | Reason / compute cost |
|---|---|---|
| Tables 3–10, Figures 2–7 (CIFAR-10 baseline) | Yes, with sufficient compute | One seed: ~33–34 h; mini (64 models): ~8–9 h |
| Tables 3–10 (Purchase-100) | Yes | ≤ ~16 h end-to-end (upper bound; early stopping reduces this) |
| Tables 3–6, Figures 2–3 (CIFAR-100, GTSRB) | Partial | ~38–39 h per CIFAR-100 benchmark; omit if time-constrained |
| Figure 8 (all-benchmark score distributions) | Partial | Requires all 10 benchmarks; a subset gives a partial panel |
| Figure 11 (top-16 grid, 3 seeds × 2 FPR) | Partial — manual step | Grids generated by script; final layout requires manual image arrangement |
| Table 15 (CIFAR-10 TL, EfficientNetV2) | Not in quick path | Requires cifar10_tl benchmark (~8–13 h) |
| Figures 2, 3, 5, 6 (full reproducibility, 12 seeds) | Partial | 3–4 seeds reproduce the trend; full 12-seed set needs ~500+ compute-hours |
- Full-paper rerun is GPU-intensive. Reproducing all 10 benchmarks from scratch requires roughly 10–14 days of sequential compute on an RTX 4080. The quick-check path (purchase100_baseline, ≤ ~16 h; cifar10_baseline, ~33–34 h) is sufficient to verify the pipeline end-to-end and cross-check the paper's primary numerical claims.
- Purchase-100 requires manual dataset staging. The dataset file must be downloaded from the link in data/Readme.md and placed under data/purchase/ before running any Purchase benchmark. All image datasets (CIFAR-10, CIFAR-100, GTSRB) download automatically on first use.
- Figure 8 requires multiple benchmarks. The panel is composed from all 10 benchmarks (see configs/figure_panels/). Reviewers can run only a subset to get a partial figure.
- Figure 11 (appendix) requires a top-k 16 rerun and manual assembly. The default run_analysis.py generates top9_* grids. Reproducing Figure 11 requires rerunning Phase 3 for seeds 1–3 with --top-k 16 --nrow 4, then arranging the six resulting PNG files in a 2 × 3 grid manually or via LaTeX.
- Table 15 (appendix) requires the cifar10_tl benchmark. Reproducing it requires running cifar10_tl (~8–13 h) and then run_analysis.py on the resulting experiment directory.
- Tested on WSL2 / Ubuntu 20.04 only. The artifact should work on any Linux system with CUDA, but has not been tested on macOS or native Windows. The Docker image has been built and tested on the same WSL2 host; reviewers on other Linux distributions can use it as a portable alternative to bare-metal Conda setup.
The pipeline is fully config-driven and designed for extension:
- New datasets: Add a dataset loader in utils/data_utils.py and a benchmark manifest YAML under configs/benchmarks/.
- New architectures: Register the model in utils/model_utils.py; no other changes are needed.
- New attack variants: Add a scorer in attacks/lira.py; the post-analysis scripts pick up any score file matching the naming convention in OUTPUTS.md.
- Partial reruns: The --stages, --skip-existing, and --override flags on python scripts/run_benchmark.py allow fine-grained control over which pipeline stages are (re-)executed.
- Sharded training: The training.start_shadow_model_idx / training.end_shadow_model_idx overrides allow shadow-model training to be distributed across multiple machines and merged afterward.
The benchmark manifest system (configs/benchmarks/) is a lightweight, self-contained contract
format that can be adapted to replicate the evaluation protocol of this paper on entirely new
settings.