@@ -5,8 +5,8 @@ submitter only targets a study-specific ablation run set, keep it under
 `slurm_scripts/ablations/<name>/eval/` until that run set is promoted into the
 main comparison.
 
-Six submission scripts cover ambient and cached-latent checkpoints produced
-under `outputs/2026-04-18/` and `outputs/2026-04-19/`. Each script iterates
+Four submission scripts cover ambient and cached-latent checkpoints produced
+under `outputs/2026-04-18/` and `outputs/2026-04-20/`. Each script iterates
 `--dry-run` first, then submits for real.
 
 All comparison eval submitters explicitly pass `eval.n_members=10` for now so
@@ -16,10 +16,8 @@ comparison numbers do not silently drift if the global eval default changes.
 | --- | --- | --- | --- |
 | `submit_eval_crps_ambient.sh` | `outputs/2026-04-18/crps_*` (4 primary + 2 CNS ablations) | default (auto → ambient) | 8 |
 | `submit_eval_fm_ambient.sh` | `outputs/2026-04-18/diff_*` ambient (4 datasets) | default (auto → ambient) | 4 |
-| `submit_eval_crps_latent.sh` | `outputs/2026-04-19/crps_*` cached-latent (CNS so far) | `ambient` | 8 |
-| `submit_eval_fm_latent.sh` | `outputs/2026-04-18/diff_*` cached-latent (4 datasets) | `ambient` | 4 |
-| `submit_eval_crps_latent_rollout_latent.sh` | same runs as `submit_eval_crps_latent.sh` | `latent` (writes to `eval_latent/`) | 8 |
-| `submit_eval_fm_latent_rollout_latent.sh` | same runs as `submit_eval_fm_latent.sh` | `latent` (writes to `eval_latent/`) | 4 |
+| `submit_eval_crps_latent.sh` | `outputs/2026-04-20/crps_*` cached-latent (CNS so far) | default (`auto -> encode_once`) | 8 |
+| `submit_eval_fm_latent.sh` | `outputs/2026-04-20/diff_*` cached-latent (4 datasets) | default (`auto -> encode_once`) | 4 |
 
 ## Batch-size rationale
 
@@ -29,30 +27,22 @@ for 25 steps on 64×64 fields:
 - **CRPS** (single forward per step) handles `eval.batch_size=8` fine.
 - **FM / diffusion** integrates `flow_ode_steps=50` per rollout step, so
   ambient fits `eval.batch_size=4`; drop to 2 if OOM.
-- **Cached-latent in ambient mode** still encodes/decodes at every step
-  but the processor forward is cheaper (64 tokens vs 256 for
-  ambient-patch4), so the CRPS variant matches ambient CRPS at 8 and the
-  FM variant matches ambient FM at 4. Can try bumping up if there's
-  headroom.
-- **Cached-latent in latent mode** avoids per-step AE encode/decode and is
-  typically cheaper. We keep 8 (CRPS) / 4 (FM) for consistency across
-  comparisons; increase only after confirming cluster headroom.
+- **Cached-latent via `auto -> encode_once`** encodes once up front,
+  decodes per step, and scores against raw ground truth. It is cheaper
+  than the ambient ablation while still being faithful for processor-only
+  evaluation, so the CRPS variant stays at 8 and the FM variant stays at 4
+  for easy comparison with the ambient scripts.
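
Editor's sketch (not part of the diff): the compute gap between the CRPS and FM variants can be made concrete with a little arithmetic. The helper name `processor_forwards` is hypothetical. The sketch assumes the "25 steps" in the rationale are rollout steps, that each ODE step costs one processor forward (true for an Euler-style solver, not necessarily for others), and that each of the `eval.n_members=10` ensemble members needs its own forward pass. It counts forward passes only and says nothing about peak memory.

```python
# Hypothetical helper, not taken from the repository: counts processor
# forward passes for one initial condition under the assumptions above.
def processor_forwards(rollout_steps: int, forwards_per_step: int, n_members: int) -> int:
    """Total processor forward passes for one rollout of one sample."""
    return rollout_steps * forwards_per_step * n_members


ROLLOUT_STEPS = 25   # assumed meaning of "25 steps" in the rationale
N_MEMBERS = 10       # eval.n_members=10
FLOW_ODE_STEPS = 50  # flow_ode_steps=50

# CRPS: a single forward per rollout step.
crps_forwards = processor_forwards(ROLLOUT_STEPS, 1, N_MEMBERS)
# FM / diffusion: one forward per ODE step, 50 ODE steps per rollout step.
fm_forwards = processor_forwards(ROLLOUT_STEPS, FLOW_ODE_STEPS, N_MEMBERS)

print(crps_forwards, fm_forwards, fm_forwards // crps_forwards)  # 250 12500 50
```

Under these assumptions the FM variant does 50 times as many processor forwards per sample as the CRPS variant, which is at least consistent with running it at a smaller batch size.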
 
 ## eval.mode for cached latents
 
-The cached-latent scripts use the `eval.mode` selector that landed via
-[PR #327](https://github.com/alan-turing-institute/autocast/pull/327) and is
-now available in-tree. `eval.mode=ambient` forces full
-`encoder → processor → decoder` rollout, so the decode/encode drift is
-included in the metrics, the only fair regime for cross-comparison with
-ambient CRPS/FM baselines that roll out in data space natively. Latent-only
-rollout (`eval.mode=latent`) is faster and is useful as an additional
-diagnostic view when written to a separate subdir (`eval_latent/`).
-
-When `eval.mode=ambient` is set on a cached-latents datamodule, the eval
-script auto-substitutes the raw datamodule from
-`<cache_dir>/autoencoder_config.yaml`, and the AE weights are supplied via
-`autoencoder_checkpoint=<ae.ckpt>` (hard-coded per run in each script).
+The cached-latent comparison scripts now rely on the default
+`eval.mode=auto`, which resolves to `encode_once` for processor-only
+cached-latent runs when `autoencoder_checkpoint=<ae.ckpt>` is supplied.
+That behavior landed in
+[PR #339](https://github.com/alan-turing-institute/autocast/pull/339).
+It keeps metrics in raw data space while avoiding the extra decode/encode
+drift charged by the explicit ambient ablation. That is now the only
+comparison-suite path we keep under `slurm_scripts/comparison/eval/`.
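
Editor's sketch (not part of the diff): the resolution rule described in the added paragraph and in the table can be written as a small decision function. This is an illustration under stated assumptions, not the code from PR #339. The function name `resolve_eval_mode` and its parameters are hypothetical, and the source does not say what `auto` does for a cached-latent run without an autoencoder checkpoint, so the sketch raises an error for that case instead of guessing.

```python
from typing import Optional


def resolve_eval_mode(
    mode: str,
    cached_latents: bool,
    autoencoder_checkpoint: Optional[str],
) -> str:
    """Hypothetical resolver mirroring the rule described in the README text."""
    if mode != "auto":
        # An explicit mode is assumed to pass through unchanged.
        return mode
    if not cached_latents:
        # Ambient checkpoints: the table lists "default (auto -> ambient)".
        return "ambient"
    if autoencoder_checkpoint is not None:
        # Processor-only cached-latent run with AE weights supplied.
        return "encode_once"
    # Not specified in the source; placeholder behaviour for this sketch only.
    raise ValueError("auto mode for cached latents without an AE checkpoint is unspecified here")


print(resolve_eval_mode("auto", cached_latents=True, autoencoder_checkpoint="ae.ckpt"))
```

Keeping the unspecified branch as an explicit error makes the gap visible instead of silently picking a mode the real implementation may not use.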
 
 ## Submission order
 
@@ -61,5 +51,4 @@ checkpoint. There are no branch prerequisites for the cached-latent scripts.
 
 Dry-run everything first, review the printed sbatch commands, then re-run
 without `RUN_DRY_STATES` edits to submit. Outputs land under each run's
-`eval/` (ambient rollout) or `eval_latent/` (latent rollout) subdirectory
-(`evaluation_metrics.csv`, rollout videos, etc.).
+`eval/` subdirectory (`evaluation_metrics.csv`, rollout videos, etc.).