| 1 | +# Ensemble size ablation |
| 2 | + |
| 3 | +First-pass defaults focus on `n_members=16` under two batch-size |
| 4 | +regimes. For the current submission pass, the active scripts are pared |
| 5 | +down to just three `eff_bs1024` runs on `gray_scott`, |
| 6 | +`gpe_laser_only_wake`, and `advection_diffusion`; the CNS entries and |
| 7 | +`fixed_bs32` combo are left commented for later reuse. All runs inherit |
| 8 | +from the matching per-dataset |
| 9 | +`local_hydra/local_experiment/epd/<dataset>/crps_vit_azula_large.yaml`; |
| 10 | +the ablation is a pure CLI override on `model.n_members` + |
| 11 | +`datamodule.batch_size`, so no new experiment configs are needed. |
| 12 | + |
| 13 | +## Knob map |
| 14 | + |
| 15 | +Main baseline is `bs_crps=32 × n_members=8 × 4 GPUs = 1024 global
| 16 | +effective` (i.e. `256 effective per-GPU`), where `bs_crps` is the per-GPU `datamodule.batch_size`.
| 17 | + |
| 18 | +### Fixed batch size = 32/GPU (same as baseline) |
| 19 | + |
| 20 | +Keep `datamodule.batch_size=32` and set `n_members=16`. |
| 21 | +This doubles effective batch vs baseline. |
| 22 | + |
| 23 | +| n_members | bs_per_gpu | effective per-GPU | effective global | |
| 24 | +|---:|---:|---:|---:| |
| 25 | +| 16 | 32 | 512 | 2048 | |
| 26 | + |
| 27 | +### Fixed global effective batch = 1024 (matches baseline compute budget) |
| 28 | + |
| 29 | +Keep `bs_crps × n_members × 4 GPUs = 1024`. With `n_members=16`, this
| 30 | +requires `bs_per_gpu = 1024 / (16 × 4) = 16`.
| 31 | + |
| 32 | +| n_members | bs_per_gpu | effective per-GPU | effective global | |
| 33 | +|---:|---:|---:|---:| |
| 34 | +| 16 | 16 | 256 | 1024 | |
| 35 | + |
| 36 | +## Dataset coverage |
| 37 | + |
| 38 | +| dataset | `fixed_bs32` | `eff_bs1024` | |
| 39 | +|---|---:|---:| |
| 40 | +| `conditioned_navier_stokes` | yes | yes | |
| 41 | +| `gray_scott` | no | yes | |
| 42 | +| `gpe_laser_only_wake` | no | yes | |
| 43 | +| `advection_diffusion` | no | yes | |
| 44 | + |
| 45 | +This keeps the original CNS pilot in reserve while the active submit |
| 46 | +scripts target only the three compute-matched (`1024` effective global |
| 47 | +batch) CRPS ablations on the other comparison datasets. |
| 48 | + |
| 49 | +## Files |
| 50 | + |
| 51 | +| file | purpose | |
| 52 | +|---|---| |
| 53 | +| `submit_ensemble_timing.sh` | 5-epoch timing for the three active `eff_bs1024` runs (`gray_scott`, `gpe_laser_only_wake`, `advection_diffusion`) → `timing.ckpt` per run | |
| 54 | +| `submit_ensemble_large.sh` | 24h production runs for the same three active runs, using cached or timing-derived cosine schedules | |
| 55 | +| `eval/submit_eval_crps_ambient.sh` | ambient eval for the current `m=16` CRPS run set (CNS `fixed_bs32` pilot plus all available `eff_bs1024` runs), with conservative `eval.batch_size=4` and explicit `eval.n_members=10` to match the comparison-study eval regime | |
| 56 | + |
| 57 | +## Extending the sweep |
| 58 | + |
| 59 | +Add more lines to `COMBOS` in both submit scripts. Invariants are checked |
| 60 | +per regime so bad tuples fail fast before any submission: |
| 61 | + |
| 62 | +- `fixed_bs32`: require `bs_per_gpu=32`; vary `n_members`. |
| 63 | +- `eff_bs1024`: require `bs_per_gpu × n_members × 4 GPUs = 1024`. |
| 64 | + |
| 65 | +Dataset coverage is controlled separately via `REGIMES_BY_DATASET` in |
| 66 | +each submit script, so extending `eff_bs1024` without broadening |
| 67 | +`fixed_bs32` is a one-line change per dataset. |
| 68 | + |
| 69 | +## Eval placement |
| 70 | + |
| 71 | +Ensemble-size eval now lives under `slurm_scripts/ablations/ensemble_size/eval/` |
| 72 | +rather than `slurm_scripts/comparison/eval/`. The reason is organizational: |
| 73 | +the run set is still partly ablation-only (`fixed_bs32`) even though the |
| 74 | +`eff_bs1024` subset may later graduate into the main comparison baseline. |
| 75 | + |
| 76 | +If that promotion happens, move the promoted run dirs into a comparison-level |
| 77 | +eval script and leave only the genuinely ablation-only runs here. |
| 78 | + |
| 79 | +## Scheduling |
| 80 | + |
| 81 | +`submit_ensemble_large.sh` first checks `COSINE_EPOCHS_BY_COMBO`. If a |
| 82 | +key is missing, it looks for the matching timing run |
| 83 | +`outputs/*/crps_<dataset>_<regime>_m<n_members>/timing.ckpt` and derives |
| 84 | +`trainer.max_epochs` on the fly with: |
| 85 | + |
| 86 | +`uv run autocast time-epochs --from-checkpoint <path>/timing.ckpt -b 24 -m 0.02` |
| 87 | + |
| 88 | +That means the added `gray_scott`, `gpe_laser_only_wake`, and |
| 89 | +`advection_diffusion` `eff_bs1024` runs become submit-ready as soon as |
| 90 | +their timing jobs finish, without another script edit. |
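
The batch-size arithmetic in the knob map and the per-regime invariants listed under "Extending the sweep" can be sketched in a few lines of Python. This is a minimal illustration, not the check the submit scripts actually run: the function names (`effective_batch`, `check_combo`, `run_name`) are hypothetical, and the only values taken from the README are the 4-GPU count, the `1024` target, the `fixed_bs32` / `eff_bs1024` regime names, and the `crps_<dataset>_<regime>_m<n_members>` run-directory pattern.

```python
N_GPUS = 4            # GPU count used throughout the knob map
TARGET_GLOBAL = 1024  # baseline: 32 per GPU x 8 members x 4 GPUs


def effective_batch(bs_per_gpu, n_members, n_gpus=N_GPUS):
    """Return (effective per-GPU, effective global) batch sizes."""
    per_gpu = bs_per_gpu * n_members
    return per_gpu, per_gpu * n_gpus


def check_combo(regime, bs_per_gpu, n_members):
    """Raise ValueError if a tuple violates its regime's invariant (hypothetical helper)."""
    if regime == "fixed_bs32":
        # Per-GPU batch stays at the baseline value; n_members is free to vary.
        if bs_per_gpu != 32:
            raise ValueError(f"fixed_bs32 requires bs_per_gpu=32, got {bs_per_gpu}")
    elif regime == "eff_bs1024":
        # Global effective batch must match the baseline.
        _, global_eff = effective_batch(bs_per_gpu, n_members)
        if global_eff != TARGET_GLOBAL:
            raise ValueError(
                f"eff_bs1024 requires global effective batch {TARGET_GLOBAL}, got {global_eff}"
            )
    else:
        raise ValueError(f"unknown regime: {regime}")


def run_name(dataset, regime, n_members):
    """Build the run-directory name following the pattern in the Scheduling section."""
    return f"crps_{dataset}_{regime}_m{n_members}"


if __name__ == "__main__":
    print(effective_batch(32, 8))    # baseline
    print(effective_batch(32, 16))   # fixed_bs32, m=16
    print(effective_batch(16, 16))   # eff_bs1024, m=16
    check_combo("eff_bs1024", 16, 16)
    print(run_name("gray_scott", "eff_bs1024", 16))
```

Under these assumptions, a tuple such as `("eff_bs1024", 32, 16)` is rejected before submission because its global effective batch is 2048 rather than 1024.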