This file maps paper-facing runs and figures to their expected output paths after running the pipeline.
These are the exact benchmark definitions used by the script runner:
- configs/benchmarks/cifar10_baseline.yaml
- configs/benchmarks/cifar10_aof.yaml
- configs/benchmarks/cifar10_tl.yaml
- configs/benchmarks/cifar100_baseline.yaml
- configs/benchmarks/cifar100_aof.yaml
- configs/benchmarks/cifar100_tl.yaml
- configs/benchmarks/gtsrb_baseline.yaml
- configs/benchmarks/gtsrb_tl.yaml
- configs/benchmarks/purchase100_baseline.yaml
- configs/benchmarks/purchase100_aof.yaml
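For orientation, a minimal sketch of driving the script runner over these configs; the entry-point name `scripts/run_benchmark.py` is a hypothetical placeholder, not the actual runner path.

```python
# Minimal sketch only: loop the script runner over every benchmark config.
# "scripts/run_benchmark.py" is an assumed entry-point name.
import subprocess
from pathlib import Path

for cfg in sorted(Path("configs/benchmarks").glob("*.yaml")):
    # each run is expected to leave its results under experiments/<dataset>/<arch>/<run>/
    subprocess.run(["python", "scripts/run_benchmark.py", "--config", str(cfg)], check=True)
```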
The paper used the following post-analysis runs:
- analysis_results/cifar10/resnet18/seed1
- analysis_results/cifar10/resnet18/seed2
- analysis_results/cifar10/resnet18/seed3
- analysis_results/cifar10/resnet18/seed4
- analysis_results/cifar10/resnet18/seed5
- analysis_results/cifar10/resnet18/seed6
- analysis_results/cifar10/resnet18/seed7
- analysis_results/cifar10/resnet18/seed8
- analysis_results/cifar10/resnet18/seed9
- analysis_results/cifar10/resnet18/seed10
- analysis_results/cifar10/resnet18/seed11
- analysis_results/cifar10/resnet18/seed12
- analysis_results/cifar10/resnet18/mixup_drp0.15 mapped as +1 different (MixUp, DRP=0.15)
- analysis_results/cifar10/resnet18/bs512_drp0.2 mapped as +1 different (BS=512, DRP=0.2)
- analysis_results/cifar10/resnet18/tl mapped as +1 different (TL)
- analysis_results/cifar10/wrn28-2/seed42 mapped as +1 different (Arch)
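The same run-to-label mapping written out as plain Python constants (e.g. for labeling plots); the paths and labels are copied verbatim from the list above.

```python
# Run-to-label mapping taken directly from the list above.
SEED_RUNS = [f"analysis_results/cifar10/resnet18/seed{i}" for i in range(1, 13)]
EXTRA_RUNS = {
    "analysis_results/cifar10/resnet18/mixup_drp0.15": "MixUp, DRP=0.15",
    "analysis_results/cifar10/resnet18/bs512_drp0.2": "BS=512, DRP=0.2",
    "analysis_results/cifar10/resnet18/tl": "TL",
    "analysis_results/cifar10/wrn28-2/seed42": "Arch",
}
```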
- script: comprehensive_analysis/threshold_distribution.py
- inputs:
- per_model_metrics_two_modes.csv from seed1 to seed12
- outputs:
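A sketch of gathering the per-seed metric tables listed above before running threshold_distribution.py; the combined filename and the added `seed` column are assumptions.

```python
# Sketch: collect per_model_metrics_two_modes.csv from seed1..seed12 into one table.
import pandas as pd
from pathlib import Path

BASE = Path("analysis_results/cifar10/resnet18")
frames = []
for i in range(1, 13):
    df = pd.read_csv(BASE / f"seed{i}" / "per_model_metrics_two_modes.csv")
    df["seed"] = i  # remember which run each row came from
    frames.append(df)

pd.concat(frames, ignore_index=True).to_csv(BASE / "per_model_metrics_all_seeds.csv", index=False)
```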
- script: comprehensive_analysis/reproducibility_analysis.py
- analysis-result inputs:
- seed1 to seed12
- mixup_drp0.15
- bs512_drp0.2
- tl
- wrn28-2/seed42
- experiment-array inputs for rank stability:
- experiments/cifar10/resnet18/seed1
- experiments/cifar10/resnet18/seed2
- experiments/cifar10/resnet18/seed3
- experiments/cifar10/resnet18/seed4
- experiments/cifar10/resnet18/seed5
- experiments/cifar10/resnet18/seed6
- experiments/cifar10/resnet18/seed7
- experiments/cifar10/resnet18/seed8
- experiments/cifar10/resnet18/seed9
- experiments/cifar10/resnet18/seed10
- experiments/cifar10/resnet18/seed11
- experiments/cifar10/resnet18/seed12
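A sketch of the rank-stability idea applied to the experiment arrays above: compare per-example rankings across the 12 seeds with pairwise Spearman correlation. The per-run filename (`per_example_scores.csv`) and its columns are assumptions, not the actual inputs of reproducibility_analysis.py.

```python
# Sketch: pairwise Spearman rank correlation of per-example scores across seeds.
from itertools import combinations
from pathlib import Path

import pandas as pd
from scipy.stats import spearmanr

BASE = Path("experiments/cifar10/resnet18")
scores = {}
for i in range(1, 13):
    df = pd.read_csv(BASE / f"seed{i}" / "per_example_scores.csv")  # assumed filename
    scores[f"seed{i}"] = df.set_index("example_id")["score"]        # assumed columns

rhos = []
for a, b in combinations(scores, 2):
    rho, _ = spearmanr(scores[a], scores[b].reindex(scores[a].index))
    rhos.append(rho)

print(f"mean pairwise Spearman rho over {len(rhos)} seed pairs: {sum(rhos) / len(rhos):.3f}")
```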
- per-run grids are generated by comprehensive_analysis/run_analysis.py
- collected into one folder by comprehensive_analysis/compose_top_vulnerable.py
- collected output:
- analysis_results/figures/topk_vulnerable_images/<dataset>/<arch>/<run>/
- paper panel assembled manually from the collected images
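A sketch of the collection step into the folder layout above; the per-run grid filename (`topk_grid.png`) is an assumption, and compose_top_vulnerable.py remains the authoritative implementation.

```python
# Sketch: gather per-run top-k grids under analysis_results/figures/topk_vulnerable_images/.
import shutil
from pathlib import Path

SRC = Path("analysis_results")
DEST = Path("analysis_results/figures/topk_vulnerable_images")

for grid in SRC.glob("*/*/*/topk_grid.png"):  # <dataset>/<arch>/<run>/topk_grid.png (assumed name)
    dataset, arch, run = grid.parts[-4], grid.parts[-3], grid.parts[-2]
    out_dir = DEST / dataset / arch / run
    out_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(grid, out_dir / grid.name)
```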
- script: comprehensive_analysis/loss_ratio_tpr.py
- input:
- analysis_results/loss_ratio.csv (assembled manually from per-run summaries)
- output:
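The loss_ratio.csv above was assembled by hand for the paper; a sketch of an equivalent scripted assembly from per-run summaries follows, with the per-run filename (`loss_ratio_summary.csv`) assumed.

```python
# Sketch: stack per-run loss-ratio summaries into analysis_results/loss_ratio.csv.
import pandas as pd
from pathlib import Path

rows = []
for summary in Path("analysis_results").glob("*/*/*/loss_ratio_summary.csv"):  # assumed name
    df = pd.read_csv(summary)
    df["run"] = "/".join(summary.parts[1:-1])  # e.g. cifar10/resnet18/seed1
    rows.append(df)

pd.concat(rows, ignore_index=True).to_csv("analysis_results/loss_ratio.csv", index=False)
```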
- script: comprehensive_analysis/plot_benchmark_distribution.py
- explicit panel manifests:
- output:
Figure 8 note:
- the exact source experiment directories are recorded in the panel manifests
- Figure 8 requires running all 10 benchmarks; a subset produces a partial panel
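A small pre-flight check reflecting the note above: confirm results exist for all 10 benchmark configs before plotting, since a subset only yields a partial panel. How benchmark config names map onto experiment directories is an assumption here.

```python
# Sketch: warn about missing benchmark runs before assembling the Figure 8 panel.
from pathlib import Path

benchmarks = sorted(p.stem for p in Path("configs/benchmarks").glob("*.yaml"))
# assumed convention: each benchmark leaves at least one directory whose name
# starts with the config stem somewhere under experiments/
missing = [b for b in benchmarks if not any(Path("experiments").rglob(f"{b}*"))]

if missing:
    print("Partial panel only; missing runs for:", ", ".join(missing))
else:
    print(f"All {len(benchmarks)} benchmarks present; full panel can be plotted.")
```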