
Artifact Appendix

Paper title: Revisiting the LiRA Membership Inference Attack Under Realistic Assumptions

Requested Badges:

  • Available
  • Functional
  • Reproduced

Description

Paper Reference

Najeeb Jebreel, Mona Khalil, David Sánchez, and Josep Domingo-Ferrer, "Revisiting the LiRA Membership Inference Attack Under Realistic Assumptions", Proceedings on Privacy Enhancing Technologies (PoPETs), 2026. Preprint: https://arxiv.org/abs/2603.07567

Artifact Summary

This artifact contains the full implementation of the experimental pipeline described in the paper. It covers all five stages of the workflow: shadow-model training, LiRA attack (four variants plus a global-threshold baseline), per-run post-analysis, cross-run reproducibility analysis, and paper-figure generation. All 10 paper benchmarks (Table 1) are manifest-driven and reproducible from a single command.

Security/Privacy Issues and Ethical Concerns

This artifact is purely academic research software. It implements the LiRA membership inference attack against ML models trained on public benchmark datasets (CIFAR-10, CIFAR-100, GTSRB, Purchase-100). No real personal data is involved. Running the code poses no security or privacy risk to the reviewer's machine.


Basic Requirements

Hardware Requirements

Minimum: A CUDA-capable GPU (≥ 8 GB VRAM) is required for all image-dataset benchmarks. The Purchase-100 tabular benchmarks can run on CPU but benefit from a GPU.

Authors' machine (paper results):

  • GPU: NVIDIA GeForce RTX 4080, 16 GB VRAM
  • OS: Ubuntu 20.04 LTS (WSL2 on Windows 11 Education)
  • CUDA: 12.6 / Driver 560.94
  • RAM: 32 GB
  • CPU: Intel Core i7-12700 (10 cores / 20 threads)

Software Requirements

Option A — Conda (recommended for bare-metal Linux/WSL2):

| Component | Version used |
| --- | --- |
| OS | Ubuntu 20.04 LTS (WSL2) |
| Python | 3.11 |
| PyTorch | 2.5.1 |
| torchvision | 0.20.1 |
| numpy | 1.26.4 |
| scipy | 1.14.1 |
| pandas | 2.2.3 |
| seaborn | 0.13.2 |
| scikit-learn | 1.5.2 |
| timm | 1.0.15 |
| matplotlib | 3.9.2 |
| Pillow | 10.4.0 |
| pyyaml | 6.0.2 |
| tqdm | 4.67.1 |

All versions are pinned in requirements-lock.txt. The full pinned environment is reproducible via environment.yml (see Set up the environment).

Option B — Docker (alternative; requires NVIDIA Container Toolkit):

| Component | Version used |
| --- | --- |
| Base image | nvidia/cuda:12.6.3-runtime-ubuntu22.04 |
| Miniconda | Miniconda3-py311_24.11.1-0 |
| Python / packages | same as Option A (installed from environment.yml) |
| NVIDIA Container Toolkit | ≥ 1.14 (see install guide) |

A Dockerfile is provided at the repository root. It installs Miniconda and the pinned environment.yml on top of the CUDA 12.6 base image. Runtime data directories (data/, experiments/, analysis_results/) are bind-mounted at run time so artefacts survive container exit.

Estimated Time and Storage Consumption

Overall human time: ~30 minutes (environment setup + launching benchmark commands).

Overall compute time: varies by benchmark; see the table in Experiments. Full reproduction of all 10 benchmarks requires roughly 10–14 days of sequential GPU compute on an RTX 4080. Reviewers are encouraged to start with purchase100_baseline (≤ ~16 h), then run the core image benchmark (e.g. cifar10_baseline, ~33–34 h) for deeper verification. Actual times can be lower when early stopping triggers.

Disk space:

  • Code + environment: ~2 GB
  • One benchmark run (e.g. cifar10_baseline): ~10–15 GB
  • All 10 benchmarks: ~100–150 GB

Environment

Accessibility

The artifact is publicly available at:

https://github.com/najeebjebreel/lira_analysis

The link above points to the main branch. After the artifact evaluation is finalized, a stable commit tag will be provided to the artifact chairs.

Set up the Environment

Option A — Conda (bare-metal Linux/WSL2):

# 1. Clone the repository
git clone https://github.com/najeebjebreel/lira_analysis.git
cd lira_analysis

# 2. Create the pinned Conda environment
conda env create -f environment.yml
conda activate lira-repro

# 3. Stage the Purchase-100 dataset (image datasets download automatically)
#    Download features_labels.npy from: https://drive.proton.me/urls/25C1HJ14S8#3uJjfOAAPblu
#    Then:
mkdir -p data/purchase
mv features_labels.npy data/purchase/

The environment.yml installs Python 3.11 and all packages from requirements-lock.txt in a single step. After activation, every python *.py command shown below can be run directly.
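
Before launching any long run, it is worth confirming that PyTorch sees the GPU. This is a generic check, not one of the artifact's scripts:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

The second value should print True on a correctly configured CUDA host.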


Option B — Docker:

Prerequisites: Docker Engine + NVIDIA Container Toolkit.

# 1. Clone the repository
git clone https://github.com/najeebjebreel/lira_analysis.git
cd lira_analysis

# 2. Build the image (one-time; ~10–15 min, ~6 GB)
docker build -t lira-analysis:main .

# 3. Stage the Purchase-100 dataset (if needed — see Option A step 3 above)

# 4. Launch an interactive shell inside the container
docker run --gpus all --rm -it \
    -v $(pwd)/data:/workspace/data \
    -v $(pwd)/experiments:/workspace/experiments \
    -v $(pwd)/analysis_results:/workspace/analysis_results \
    lira-analysis:main bash

Once inside the container, all python commands below work unchanged (the entry point activates the lira-repro conda environment automatically). All outputs written to /workspace/data, /workspace/experiments, and /workspace/analysis_results persist on the host via the bind mounts.

Testing the Environment

Step 1 — Dry-run (no data or training required; verifies config generation and CLI):

python scripts/run_benchmark.py --benchmark cifar10_baseline --dry-run

Note: the configs/generated/ directory does not exist on a fresh clone; it is created automatically at runtime when a benchmark is first executed.

Expected output: three [run] lines (train → attack → analysis) with no errors:

[run] cwd=...
[run] ... python train.py --config .../configs/generated/cifar10_baseline.train.yaml
[run] cwd=...
[run] ... python attack.py --config .../configs/generated/cifar10_baseline.attack.yaml
[run] cwd=...
[run] ... python .../comprehensive_analysis/run_analysis.py --exp-path ...

Step 2 — Full-pipeline smoke test (~5 minutes; CIFAR-10 downloads automatically, no manual staging required):

python scripts/run_benchmark.py --benchmark cifar10_baseline \
  --override training.epochs=1 training.end_shadow_model_idx=9

This runs all three pipeline phases (train → attack → analysis) with 10 shadow models × 1 epoch. Expected outputs:

  • experiments/cifar10/resnet18/cifar10_baseline/model_{0..9}/best_model.pth — 10 checkpoints
  • experiments/cifar10/resnet18/cifar10_baseline/attack_results_leave_one_out_summary.csv — attack metrics
  • analysis_results/cifar10/resnet18/cifar10_baseline/summary_statistics_two_modes.csv — analysis output

Note: With only 10 shadow models the LiRA scores are degenerate (AUC ≈ 50%) — this is expected and handled gracefully. The smoke test only verifies that the full pipeline runs without errors.
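
For a quick numerical sanity check beyond exit codes, the attack summary can be printed with pandas (a generic snippet, not an artifact script; the column layout is whatever attack.py writes):

import pandas as pd

# Path produced by the smoke test above
df = pd.read_csv("experiments/cifar10/resnet18/cifar10_baseline/attack_results_leave_one_out_summary.csv")
print(df)  # for the 10-model smoke test, expect AUC values near 50%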


Artifact Evaluation

Main Results and Claims

All results below are produced by the five-phase pipeline described in README.md. The paper's claims are anchored in Section 5 and the corresponding tables and figures.

Main Result 1: AOF and TL reduce LiRA success (Tables 3–6, Figure 8)

Under anti-overfitting (AOF) regularisation and transfer-learning (TL) fine-tuning, the true positive rate (TPR) of all four LiRA variants drops substantially compared to the baseline. This holds across all datasets (CIFAR-10, CIFAR-100, GTSRB, Purchase-100). The independent variable is the training regime (baseline / AOF / TL); the dependent variable is TPR at fixed low FPR thresholds (0.001% and 0.1%).

Main Result 2: Shadow-only threshold calibration degrades attack precision (Tables 7–10)

When thresholds are calibrated on shadow models only (the realistic threat model), TPR and positive predictive value (PPV) under skewed priors (1%, 10% membership prior) are significantly lower than under the optimistic leave-one-out setting. The independent variable is the calibration mode (target-and-shadow vs. shadow-only); the dependent variables are TPR and PPV at 0.001% and 0.1% FPR.
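
For reference, PPV under a membership prior π follows from Bayes' rule: PPV = π·TPR / (π·TPR + (1−π)·FPR). A minimal sketch (not part of the artifact; note the paper averages per-run PPVs, so a point estimate computed from the mean TPR'/FPR' only approximates the reported means):

def ppv(tpr, fpr, pi):
    """Positive predictive value under membership prior pi (all rates as fractions)."""
    return pi * tpr / (pi * tpr + (1 - pi) * fpr)

# Purchase-100 online LiRA means from Table 10: TPR' = 0.516%, FPR' = 0.001%
print(ppv(0.00516, 0.00001, 0.10))  # ~0.98, compare 98.84 +/- 1.37 reported at pi = 10%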

Main Result 3: Thresholded vulnerable sets are unstable across runs (Figures 2, 3, 6)

The set of samples classified as vulnerable at low FPR (0.001%) changes substantially across independent training runs: Jaccard similarity between pairs of runs is low and the intersection shrinks rapidly. Rankings of samples by LiRA score, however, are more stable. The independent variable is the number of runs compared; the dependent variables are Jaccard index, intersection size, and union size.
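
To make the comparison concrete, here is a minimal illustration of the set statistics involved (toy indices only; the artifact's reproducibility_analysis.py implements the full version):

# Vulnerable sets from two independent runs, as sets of sample indices
run_a = {3, 17, 42, 99, 123}
run_b = {17, 42, 256, 301}

inter = run_a & run_b
union = run_a | run_b
jaccard = len(inter) / len(union)  # low values indicate unstable vulnerable sets
print(len(inter), len(union), jaccard)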

Main Result 4: Loss ratio correlates with LiRA success (Figure 7)

Samples with a high loss ratio (train loss / test loss, averaged over shadow models) are more consistently identified as vulnerable. The independent variable is the loss ratio decile; the dependent variable is mean online LiRA TPR.
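
A sketch of the decile computation, using synthetic stand-in arrays (the variable names are illustrative, not the artifact's):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
loss_ratio = rng.lognormal(size=10_000)  # stand-in for per-sample train/test loss ratio
tpr = rng.random(10_000)                 # stand-in for per-sample online LiRA TPR

decile = pd.qcut(loss_ratio, 10, labels=False)  # decile index 0..9 per sample
print(pd.Series(tpr).groupby(decile).mean())    # mean TPR per loss-ratio decile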

Paper-to-Artifact Map

The table below maps every paper item to the primary script and its expected output path.

| Paper item | Claim | Script | Expected output path |
| --- | --- | --- | --- |
| Table 1 | benchmark configurations | benchmark YAMLs | configs/benchmarks/ |
| Table 2 | model utility and loss ratio | train.py | experiments/<dataset>/<arch>/<run>/train_test_stats.csv, shadow_metrics_aggregate.csv |
| Tables 3–6 | optimistic LiRA performance (baseline / AOF / TL) | attack.py | experiments/<dataset>/<arch>/<run>/attack_results_leave_one_out_summary.csv |
| Tables 7–10 | realistic shadow thresholds, skewed priors (0.001% FPR) | run_analysis.py | analysis_results/<dataset>/<arch>/<run>/summary_statistics_two_modes.csv, summary_shadow_prior0p01.tex, … |
| Table 11 | reproducibility of ranking-based sets (median gap) | reproducibility_analysis.py | analysis_results/tables/topq_tail_table_median_gap_rank_scores.tex |
| Table 12 | realistic shadow thresholds, skewed priors (0.1% FPR) | run_analysis.py | same .tex outputs as Tables 7–10 (0.1% FPR columns) |
| Table 13 (appendix) | reproducibility of ranking-based sets (mean gap + FPR-thresholded) | reproducibility_analysis.py | analysis_results/tables/topq_tail_table_mean_gap_rank_scores.tex |
| Table 14 (appendix) | ablation: WideResNet-28-2 on CIFAR-10 | run_analysis.py | analysis_results/cifar10/wrn28-2/<run>/summary_shadow_prior*.tex |
| Table 15 (appendix) | CIFAR-10 under TL (EfficientNetV2) | run_analysis.py | analysis_results/cifar10/<efficientnetv2>/<run>/summary_shadow_prior*.tex (requires cifar10_tl benchmark) |
| Figure 1 | threshold variability at 0.001% vs 0.1% FPR | threshold_distribution.py | analysis_results/figures/thresh_boxplot_single.pdf, thresh_boxplot_multi.pdf |
| Figure 2 | reproducibility, stability, coverage — 0.001% FPR (TP≥1) | reproducibility_analysis.py | analysis_results/figures/jaccard_noleg_0p001pct.pdf, intersection_0p001pct.pdf, union_0p001pct.pdf |
| Figure 3 | zero-FP reproducibility heatmap — 0.001% FPR | reproducibility_analysis.py | analysis_results/figures/tpgeq_x_0fp_identical_heatmaps_0p001pct.pdf |
| Figure 4 | top-9 vulnerable samples across runs and training variations | run_analysis.py, compose_top_vulnerable.py | analysis_results/figures/topk_vulnerable_images/<dataset>/<arch>/<run>/top9_vulnerable_online_shadow_0p001pct.png (assembled manually) |
| Figure 5 | rank displacement under run-to-run variability | reproducibility_analysis.py | analysis_results/figures/inside_run_displacement_simple/ |
| Figure 6 | reproducibility, stability, coverage — 0.1% FPR (TP≥1) | reproducibility_analysis.py | analysis_results/figures/jaccard_noleg_0p1pct.pdf, intersection_0p1pct.pdf, union_0p1pct.pdf |
| Figure 7 | loss ratio vs online LiRA TPR correlation | loss_ratio_tpr.py | analysis_results/figures/lossratio_tpr.pdf |
| Figure 8 | in/out score and ratio distributions across benchmarks | plot_benchmark_distribution.py | analysis_results/figures/sample_inout_score.pdf, sample_inout_ratio.pdf |
| Figure 9 (appendix) | all-positives reproducibility heatmap — 0.001% FPR | reproducibility_analysis.py | analysis_results/figures/tpgeq_x_identical_heatmaps_0p001pct.pdf |
| Figure 10 (appendix) | all-positives reproducibility heatmap — 0.1% FPR | reproducibility_analysis.py | analysis_results/figures/tpgeq_x_identical_heatmaps_0p1pct.pdf |
| Figure 11 (appendix) | top-16 vulnerable samples: 3 seeds × 2 FPR settings | run_analysis.py --top-k 16 --nrow 4 (per seed), then assemble manually | analysis_results/cifar10/resnet18/<run>/top16_vulnerable_online_shadow_{0p001pct,0p1pct}.png |
| Figure 12 (appendix) | zero-FP reproducibility heatmap — 0.1% FPR | reproducibility_analysis.py | analysis_results/figures/tpgeq_x_0fp_identical_heatmaps_0p1pct.pdf |
| Figure 13 (appendix) | distribution of median gap vulnerability scores | reproducibility_analysis.py | analysis_results/figures/bin_boxplot_scores_median_gap_rank_scores.pdf |
| Figure 14 (appendix) | top-1 vulnerable sample median gap score across runs | reproducibility_analysis.py | analysis_results/figures/top1_across_runs/top1_across_runs_boxplots_median_gap_rank_scores.pdf |
| Figure 15 (appendix) | top-1 sample identity across runs | reproducibility_analysis.py | analysis_results/figures/top1_across_runs/top1_samples_grid_median_gap_rank_scores.pdf |

Experiments

Experiment 1: Quick Check — purchase100_baseline

  • Time: ~5 human-minutes + ≤ ~16 compute-hours (RTX 4080)
  • Storage: ~2 GB

This is the recommended entry point for reviewers with limited time or compute. It runs the full end-to-end pipeline (train → attack → analysis) for the Purchase-100 baseline FCN benchmark (tabular data, 1 augmentation, up to 100 epochs × 256 shadow models).

python scripts/run_benchmark.py --benchmark purchase100_baseline

Expected outputs under experiments/purchase/fcn/purchase100_baseline/:

  • shadow_metrics_aggregate.csv — per-model train/test accuracy and loss ratio
  • attack_results_leave_one_out_summary.csv — AUC, TPR@0.001% FPR, TPR@0.1% FPR → Tables 3, 5
  • membership_labels.npy, online_scores_leave_one_out.npy, and other score arrays
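
The raw arrays allow recomputing headline metrics independently of the provided CSVs. A minimal sketch, assuming membership_labels.npy holds 0/1 labels aligned one-to-one with the 1-D score array (the exact array semantics are documented in OUTPUTS.md; scikit-learn is in the pinned environment):

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

base = "experiments/purchase/fcn/purchase100_baseline/"
labels = np.load(base + "membership_labels.npy")
scores = np.load(base + "online_scores_leave_one_out.npy")

print("AUC:", roc_auc_score(labels, scores))  # compare against Table 6 (~70% for online LiRA)
fpr, tpr, _ = roc_curve(labels, scores)
for target in (1e-5, 1e-3):  # 0.001% and 0.1% FPR
    # largest achieved FPR not exceeding the target
    print(f"TPR @ {target:.0e} FPR:", tpr[np.searchsorted(fpr, target, side="right") - 1])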

Expected outputs under analysis_results/purchase/fcn/purchase100_baseline/:

  • summary_statistics_two_modes.csv — TPR/PPV under both calibration modes and all three priors → Tables 7, 9
  • samples_vulnerability_ranked_*.csv — per-sample vulnerability scores

Expected numerical values (from paper Tables 6 and 10 — Purchase-100 Baseline):

Table 6 — optimistic (leave-one-out) thresholds, π = 50%:

| Attack | AUC (%) | TPR @ 0.001% FPR (%) | TPR @ 0.1% FPR (%) |
| --- | --- | --- | --- |
| LiRA (online) | 70.16 ± 0.29 | 0.523 ± 0.243 | 4.491 ± 0.281 |
| LiRA (online, fixed var) | 69.52 ± 0.28 | 0.180 ± 0.110 | 3.089 ± 0.188 |
| LiRA (offline) | 55.11 ± 0.48 | 0.007 ± 0.007 | 0.500 ± 0.077 |
| LiRA (offline, fixed var) | 56.11 ± 0.51 | 0.022 ± 0.017 | 0.713 ± 0.078 |
| Global threshold | 54.83 ± 0.15 | 0.001 ± 0.001 | 0.100 ± 0.015 |

Table 10 — realistic (shadow-only) thresholds at 0.001% FPR:

| Attack | TPR' (%) | FPR' (%) | PPV @ π=1% | PPV @ π=10% | PPV @ π=50% |
| --- | --- | --- | --- | --- | --- |
| LiRA (online) | 0.516 ± 0.047 | 0.001 ± 0.001 | 89.93 ± 11.15 | 98.84 ± 1.37 | 99.87 ± 0.16 |
| LiRA (online, fixed var) | 0.159 ± 0.015 | 0.001 ± 0.001 | 78.87 ± 22.27 | 96.70 ± 3.80 | 99.61 ± 0.47 |
| LiRA (offline) | 0.004 ± 0.002 | 0.001 ± 0.001 | 55.82 ± 48.27 | 66.20 ± 37.65 | 87.37 ± 16.56 |
| LiRA (offline, fixed var) | 0.018 ± 0.004 | 0.001 ± 0.001 | 57.27 ± 43.60 | 80.46 ± 21.53 | 96.32 ± 4.79 |

Experiment 2: Core CIFAR-10 Benchmark — cifar10_baseline

  • Time: ~10 human-minutes + ~33–34 compute-hours (RTX 4080: ~29 h of training at ≈ 7 min/model, plus 15–17% attack overhead with 18 augmentations)
  • Storage: ~10–15 GB

This reproduces the primary benchmark used throughout the paper. It covers Main Results 1–4 for the CIFAR-10 / ResNet-18 / baseline setting, and its outputs feed into the cross-run reproducibility figures (Experiments 3 and 4) when run under multiple seeds.

python scripts/run_benchmark.py --benchmark cifar10_baseline

The benchmark manifest sets experiment_dir to experiments/cifar10/resnet18/cifar10_baseline. For a custom named run (e.g. when running multiple seeds), use:

python train.py --config configs/train_image.yaml \
  --override seed=1 experiment.run_name=seed1
python attack.py --config configs/attack.yaml \
  --override experiment.checkpoint_dir=experiments/cifar10/resnet18/seed1
python comprehensive_analysis/run_analysis.py \
  --exp-path experiments/cifar10/resnet18/seed1 \
  --target-fprs 0.00001 0.001 --priors 0.01 0.1 0.5
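
Note that --target-fprs takes fractions: 0.00001 and 0.001 correspond to the paper's 0.001% and 0.1% FPR operating points, and --priors 0.01 0.1 0.5 to the 1%, 10%, and 50% membership priors used in the tables.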

Key outputs to verify against the paper (Tables 3–4, 7–8):

  • experiments/cifar10/resnet18/seed1/attack_results_leave_one_out_summary.csv → Tables 3–4
  • analysis_results/cifar10/resnet18/seed1/summary_statistics_two_modes.csv → Tables 7–8
  • experiments/cifar10/resnet18/seed1/roc_curve_single.pdf → visual sanity check

Expected numerical values (from paper Tables 3 and 7 — CIFAR-10 Baseline):

Table 3 — optimistic (leave-one-out) thresholds, π = 50%:

| Attack | AUC (%) | TPR @ 0.001% FPR (%) | TPR @ 0.1% FPR (%) |
| --- | --- | --- | --- |
| LiRA (online) | 76.48 ± 0.32 | 3.956 ± 1.061 | 10.268 ± 0.555 |
| LiRA (online, fixed var) | 76.28 ± 0.31 | 2.876 ± 1.064 | 9.135 ± 0.508 |
| LiRA (offline) | 55.58 ± 0.92 | 0.762 ± 0.348 | 3.262 ± 0.338 |
| LiRA (offline, fixed var) | 56.64 ± 0.89 | 0.948 ± 0.526 | 4.540 ± 0.424 |
| Global threshold | 59.97 ± 0.32 | 0.003 ± 0.004 | 0.097 ± 0.027 |

Table 7 — realistic (shadow-only) thresholds at 0.001% FPR:

| Attack | TPR' (%) | FPR' (%) | PPV @ π=1% | PPV @ π=10% | PPV @ π=50% |
| --- | --- | --- | --- | --- | --- |
| LiRA (online) | 3.990 ± 0.161 | 0.002 ± 0.003 | 94.73 ± 6.10 | 99.46 ± 0.65 | 99.94 ± 0.07 |
| LiRA (online, fixed var) | 2.912 ± 0.142 | 0.002 ± 0.003 | 93.10 ± 8.03 | 99.26 ± 0.91 | 99.92 ± 0.10 |
| LiRA (offline) | 0.713 ± 0.052 | 0.002 ± 0.003 | 81.31 ± 20.20 | 97.24 ± 3.33 | 99.67 ± 0.40 |
| LiRA (offline, fixed var) | 0.918 ± 0.068 | 0.003 ± 0.005 | 81.13 ± 21.29 | 97.03 ± 4.06 | 99.64 ± 0.52 |

Key qualitative claims to verify: (i) LiRA significantly degrades with AOF/AOF+TL (core claim, Table 3); (ii) PPV drops sharply with AOF/TL + shadow-based threshold calibration + skewed priors (Table 7).


Simplified option — Mini CIFAR-10 (~7–9 h on RTX 4080):

For reviewers who cannot afford the full ~33–34 h run, use 64 shadow models instead of 256. This validates the full pipeline end-to-end and qualitatively reproduces the attack ordering, but the numerical values (especially TPR at very low FPR) will differ from the paper:

python train.py --config configs/train_image.yaml \
  --override seed=1 experiment.run_name=seed1_mini \
             training.end_shadow_model_idx=63
python attack.py --config configs/attack.yaml \
  --override experiment.checkpoint_dir=experiments/cifar10/resnet18/seed1_mini
python comprehensive_analysis/run_analysis.py \
  --exp-path experiments/cifar10/resnet18/seed1_mini \
  --target-fprs 0.00001 0.001 --priors 0.01 0.1 0.5

Experiment 3: Reproducibility and Rank Stability (Figures 2, 3, 5, 6)

The script auto-discovers all Phase 3 outputs under --analysis-root — any number of runs is supported. Choose the tier that fits your available compute:


Option A — Partial rerun with 3–4 seeds (~100–160 compute-hours on RTX 4080)

Run Experiment 2 for 3–4 seeds, then feed those results into the analysis. This is enough to verify the trend (Jaccard decreasing with k, intersection shrinking).

# Train + attack + per-run analysis for seeds 1–4
for i in 1 2 3 4; do
  python train.py --config configs/train_image.yaml \
    --override seed=${i} experiment.run_name=seed${i}
  python attack.py --config configs/attack.yaml \
    --override experiment.checkpoint_dir=experiments/cifar10/resnet18/seed${i}
  python comprehensive_analysis/run_analysis.py \
    --exp-path experiments/cifar10/resnet18/seed${i} \
    --target-fprs 0.00001 0.001 --priors 0.01 0.1 0.5
done

# Reproducibility analysis over the runs
python comprehensive_analysis/reproducibility_analysis.py \
  --analysis-root analysis_results/cifar10/resnet18 \
  --skip-rank

The --skip-rank flag omits the rank-stability section, which requires experiment directories with raw score files (not just analysis outputs).


Option B — Full paper reproduction (12 seeds + 4 variants, ~500+ compute-hours)

Run all 12 seeds of cifar10_baseline plus the four training-variation runs (bs512_drp0.2, mixup_drp0.15, tl, wrn28-2/seed42), then:

python comprehensive_analysis/reproducibility_analysis.py

Expected outputs:

analysis_results/figures/:

  • jaccard_noleg_0p001pct.pdf, intersection_0p001pct.pdf, union_0p001pct.pdf → Figure 2
  • tpgeq_x_0fp_identical_heatmaps_0p001pct.pdf → Figure 3
  • jaccard_noleg_0p1pct.pdf, intersection_0p1pct.pdf, union_0p1pct.pdf → Figure 6
  • tpgeq_x_identical_heatmaps_0p001pct.pdf → Figure 9 (appendix)
  • tpgeq_x_identical_heatmaps_0p1pct.pdf → Figure 10 (appendix)
  • tpgeq_x_0fp_identical_heatmaps_0p1pct.pdf → Figure 12 (appendix)

analysis_results/tables/:

  • reproducibility_<N>runs_<M>variants_0p001pct.csv — Jaccard/intersection/union (N=12, M=4 for full paper set)

analysis_results/figures/ (rank-stability additions):

  • inside_run_displacement_simple/ → Figure 5
  • bin_boxplot_scores_median_gap_rank_scores.pdf → Figure 13 (appendix)
  • top1_across_runs/top1_across_runs_boxplots_median_gap_rank_scores.pdf → Figure 14 (appendix)
  • top1_across_runs/top1_samples_grid_median_gap_rank_scores.pdf → Figure 15 (appendix)

analysis_results/tables/ (rank-stability additions):

  • topq_tail_table_median_gap_rank_scores.tex → Table 11 (appendix)
  • topq_tail_table_mean_gap_rank_scores.tex → Table 13 (appendix)

Verify generated PDFs and CSVs against the values reported in the paper.

Experiment 4: Paper Figure Scripts (Figures 1, 4, 7, 8)

  • Time: ~5 human-minutes + < 15 compute-minutes (requires Phase 3 outputs to be present)

All scripts read Phase 3/4 outputs and write PDFs to analysis_results/figures/.

Figure 1 — threshold variability boxplots:

python comprehensive_analysis/threshold_distribution.py

Expected: analysis_results/figures/thresh_boxplot_single.pdf, thresh_boxplot_multi.pdf

Figure 4 — collect per-run top-vulnerable grids (requires Phase 3 grids; panel assembled manually):

python comprehensive_analysis/compose_top_vulnerable.py

Expected: analysis_results/figures/topk_vulnerable_images/<dataset>/<arch>/<run>/top9_vulnerable_*.png

Figure 7 — loss ratio vs TPR correlation:

python comprehensive_analysis/loss_ratio_tpr.py

Expected: analysis_results/figures/lossratio_tpr.pdf

Figure 8 — per-benchmark score and ratio distributions:

python comprehensive_analysis/plot_benchmark_distribution.py \
  --config configs/figure_panels/figure8_scores.yaml
python comprehensive_analysis/plot_benchmark_distribution.py \
  --config configs/figure_panels/figure8_ratios.yaml

Expected: analysis_results/figures/sample_inout_score.pdf, sample_inout_ratio.pdf

Figure 11 (appendix) — top-16 vulnerable samples: 3 seeds × 2 FPR settings (manual assembly):

# Regenerate top-16 grids for seeds 1–3
for i in 1 2 3; do
  python comprehensive_analysis/run_analysis.py \
    --exp-path experiments/cifar10/resnet18/seed${i} \
    --target-fprs 0.00001 0.001 --top-k 16 --nrow 4
done

This produces six PNG grids:

  • analysis_results/cifar10/resnet18/seed{1,2,3}/top16_vulnerable_online_shadow_0p001pct.png
  • analysis_results/cifar10/resnet18/seed{1,2,3}/top16_vulnerable_online_shadow_0p1pct.png

Arrange them in a 2 × 3 layout (rows = FPR threshold, columns = seed) using any image editor or LaTeX \includegraphics to reproduce Figure 11 as shown in the paper.
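
If you prefer a scripted assembly, here is a minimal Pillow sketch (Pillow is in the pinned environment, but this script is not part of the artifact; it assumes all six grids have identical dimensions):

from PIL import Image

# Rows = FPR setting, columns = seed
paths = [f"analysis_results/cifar10/resnet18/seed{s}/top16_vulnerable_online_shadow_{f}.png"
         for f in ("0p001pct", "0p1pct") for s in (1, 2, 3)]
tiles = [Image.open(p) for p in paths]
w, h = tiles[0].size
panel = Image.new("RGB", (3 * w, 2 * h), "white")
for i, tile in enumerate(tiles):
    panel.paste(tile, ((i % 3) * w, (i // 3) * h))
panel.save("analysis_results/figures/figure11_panel.png")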

Verify generated PDFs against the figures in the paper.


Limitations

The table below summarises reproducibility status for each paper item, followed by details on items that are not fully reproducible within a typical evaluation timeframe.

| Paper item | Reproducible? | Reason / compute cost |
| --- | --- | --- |
| Tables 3–10, Figures 2–7 (CIFAR-10 baseline) | Yes, with sufficient compute | One seed: ~33–34 h; mini (64 models): ~8–9 h |
| Tables 3–10 (Purchase-100) | Yes | ≤ ~16 h end-to-end (upper bound; early stopping reduces this) |
| Tables 3–6, Figures 2–3 (CIFAR-100, GTSRB) | Partial | ~38–39 h per CIFAR-100 benchmark; omit if time-constrained |
| Figure 8 (all-benchmark score distributions) | Partial | Requires all 10 benchmarks; a subset gives a partial panel |
| Figure 11 (top-16 grid, 3 seeds × 2 FPR) | Partial — manual step | Grids generated by script; final layout requires manual image arrangement |
| Table 15 (CIFAR-10 TL, EfficientNetV2) | Not in quick path | Requires cifar10_tl benchmark (~8–13 h) |
| Figures 2, 3, 5, 6 (full reproducibility, 12 seeds) | Partial | 3–4 seeds reproduce the trend; full 12-seed set needs ~500+ compute-hours |

  • Full-paper rerun is GPU-intensive. Reproducing all 10 benchmarks from scratch requires roughly 10–14 days of sequential compute on an RTX 4080. The quick-check path (purchase100_baseline, ≤ ~16 h; cifar10_baseline, ~33–34 h) is sufficient to verify the pipeline end-to-end and cross-check the paper's primary numerical claims.

  • Purchase-100 requires manual dataset staging. The dataset file must be downloaded from the link in data/Readme.md and placed under data/purchase/ before running any Purchase benchmark. All image datasets (CIFAR-10, CIFAR-100, GTSRB) download automatically on first use.

  • Figure 8 requires multiple benchmarks. The panel is composed from all 10 benchmarks (see configs/figure_panels/). Reviewers can run only a subset to get a partial figure.

  • Figure 11 (appendix) requires a top-k 16 rerun and manual assembly. The default run_analysis.py generates top9_* grids. Reproducing Figure 11 requires rerunning Phase 3 for seeds 1–3 with --top-k 16 --nrow 4, then arranging the six resulting PNG files in a 2 × 3 grid manually or via LaTeX.

  • Table 15 (appendix) requires the cifar10_tl benchmark. Reproducing it requires running cifar10_tl (~8–13 h) and then run_analysis.py on the resulting experiment directory.

  • Tested on WSL2 / Ubuntu 20.04 only. The artifact should work on any Linux system with CUDA, but has not been tested on macOS or native Windows. The Docker image has been built and tested on the same WSL2 host; reviewers on other Linux distributions can use it as a portable alternative to bare-metal Conda setup.


Notes on Reusability

The pipeline is fully config-driven and designed for extension:

  • New datasets: Add a dataset loader in utils/data_utils.py and a benchmark manifest YAML under configs/benchmarks/.
  • New architectures: Register the model in utils/model_utils.py; no other changes are needed.
  • New attack variants: Add a scorer in attacks/lira.py; the post-analysis scripts pick up any score file matching the naming convention in OUTPUTS.md.
  • Partial reruns: The --stages, --skip-existing, and --override flags on python scripts/run_benchmark.py allow fine-grained control over which pipeline stages are (re-)executed.
  • Sharded training: The training.start_shadow_model_idx / training.end_shadow_model_idx overrides allow shadow-model training to be distributed across multiple machines and merged afterward.
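
For example, the 256 shadow models of a benchmark could be split across two machines using only the documented overrides (the index range is inclusive, matching the smoke test above), then the checkpoint directories merged before running attack.py:

# Machine 1: shadow models 0–127
python train.py --config configs/train_image.yaml \
  --override training.start_shadow_model_idx=0 training.end_shadow_model_idx=127

# Machine 2: shadow models 128–255
python train.py --config configs/train_image.yaml \
  --override training.start_shadow_model_idx=128 training.end_shadow_model_idx=255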

The benchmark manifest system (configs/benchmarks/) is a lightweight, self-contained contract format that can be adapted to replicate the evaluation protocol of this paper on entirely new settings.