Commit 724223d

Merge pull request #335 from alan-turing-institute/2026-04-19/eval-scripts
Update eval scripts and default metrics

2 parents: 9830e1d + 5f0537d
13 files changed: 607 additions & 3 deletions

.gitignore (2 additions & 1 deletion)

@@ -18,4 +18,5 @@ core
 .clinerules
 .cursorrules
 .github/copilot-instructions.md
-.agent
+.agent
+.codex
slurm_scripts/comparison/eval/README.md (57 additions & 0 deletions)

# Eval scripts for the comparison study

Six submission scripts cover ambient and cached-latent checkpoints produced
under `outputs/2026-04-18/` and `outputs/2026-04-19/`. Each script runs with
`--dry-run` first, then submits for real.

| script | runs covered | eval.mode | eval.batch_size |
|---|---|---|---|
| `submit_eval_crps_ambient.sh` | `outputs/2026-04-18/crps_*` (4 primary + 2 CNS ablations) | default (auto → ambient) | 8 |
| `submit_eval_fm_ambient.sh` | `outputs/2026-04-18/diff_*` ambient (4 datasets) | default (auto → ambient) | 4 |
| `submit_eval_crps_latent.sh` | `outputs/2026-04-19/crps_*` cached-latent (CNS so far) | `ambient` | 8 |
| `submit_eval_fm_latent.sh` | `outputs/2026-04-18/diff_*` cached-latent (4 datasets) | `ambient` | 4 |
| `submit_eval_crps_latent_rollout_latent.sh` | same runs as `submit_eval_crps_latent.sh` | `latent` (writes to `eval_latent/`) | 8 |
| `submit_eval_fm_latent_rollout_latent.sh` | same runs as `submit_eval_fm_latent.sh` | `latent` (writes to `eval_latent/`) | 4 |
## Batch-size rationale

Empirically, the knobs are tight because eval rolls out with n_members=10
for 25 steps on 64×64 fields:

- **CRPS** (single forward per step) handles `eval.batch_size=8` fine.
- **FM / diffusion** integrates `flow_ode_steps=50` per rollout step, so
  ambient fits `eval.batch_size=4`; drop to 2 if OOM.
- **Cached-latent in ambient mode** still encodes/decodes at every step,
  but the processor forward is cheaper (64 tokens vs 256 for
  ambient-patch4), so the CRPS variant matches ambient CRPS at 8 and the
  FM variant matches ambient FM at 4. Try bumping up if there's headroom.
- **Cached-latent in latent mode** avoids per-step AE encode/decode and is
  typically cheaper. We keep 8 (CRPS) / 4 (FM) for consistency across
  comparisons; increase only after confirming cluster headroom.
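The rationale above can be put into back-of-envelope numbers. A minimal sketch, where "work" (processor forward passes × tokens) is an illustrative proxy for GPU load rather than a measured cost:

```python
# Back-of-envelope work model for one eval sample (illustrative only).
ROLLOUT_STEPS = 25    # rollout length
N_MEMBERS = 10        # ensemble members
AMBIENT_TOKENS = 256  # 64x64 fields, patch 4 -> (64/4)^2 tokens
LATENT_TOKENS = 64    # cached-latent processor

def total_work(ode_substeps: int, tokens: int) -> int:
    """Processor forward passes x tokens over the whole rollout."""
    return ROLLOUT_STEPS * N_MEMBERS * ode_substeps * tokens

crps_ambient = total_work(ode_substeps=1, tokens=AMBIENT_TOKENS)  # single forward/step
fm_ambient = total_work(ode_substeps=50, tokens=AMBIENT_TOKENS)   # flow_ode_steps=50
fm_latent = total_work(ode_substeps=50, tokens=LATENT_TOKENS)

# FM does 50x the CRPS processor work, hence eval.batch_size 4 vs 8;
# the latent processor sees 4x fewer tokens, hence the headroom notes.
print(fm_ambient // crps_ambient, fm_ambient // fm_latent)  # 50 4
```

This ignores AE encode/decode cost and activation memory, so it explains the ordering of the batch sizes, not their absolute values.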
## eval.mode for cached latents

The cached-latent scripts use the `eval.mode` selector that landed via
[PR #327](https://github.com/alan-turing-institute/autocast/pull/327) and is
now available in-tree. `eval.mode=ambient` forces a full
`encoder → processor → decoder` rollout, so decode/encode drift is
included in the metrics; this is the only fair regime for cross-comparison
with the ambient CRPS/FM baselines, which roll out natively in data space.
Latent-only rollout (`eval.mode=latent`) is faster and is useful as an
additional diagnostic view when written to a separate subdir (`eval_latent/`).

When `eval.mode=ambient` is set on a cached-latents datamodule, the eval
script auto-substitutes the raw datamodule from
`<cache_dir>/autoencoder_config.yaml`, and the AE weights are supplied via
`autoencoder_checkpoint=<ae.ckpt>` (hard-coded per run in each script).
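The drift argument can be seen in a toy numerical model. This is purely illustrative: `encode`/`decode` below are stand-ins for the trained AE (a slightly lossy scale), not the project's actual autoencoder.

```python
# Toy model of why eval.mode=ambient includes AE round-trip drift:
# an imperfect AE loses a little signal on every encode/decode pass.

def encode(x: float) -> float:
    return 0.99 * x   # stand-in for a slightly lossy learned encoder

def decode(z: float) -> float:
    return z          # decoder is exact here; the loss sits in encode

def process(z: float) -> float:
    return z          # identity processor isolates the AE effect

x = 1.0

# eval.mode=ambient: encoder -> processor -> decoder at EVERY rollout step,
# so the round-trip error compounds over the 25-step rollout.
ambient = x
for _ in range(25):
    ambient = decode(process(encode(ambient)))

# eval.mode=latent: encode once, stay in latent space, decode at the end.
z = encode(x)
for _ in range(25):
    z = process(z)
latent = decode(z)

print(round(ambient, 3), round(latent, 3))  # 0.778 0.99
```

Ambient-mode metrics see the compounded round-trip error (0.99^25), latent-mode metrics only one round-trip, which is why ambient mode is the fair comparison regime and latent mode is a diagnostic.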
## Submission order

These scripts are all independent; each run is eval'd against its own
checkpoint. There are no branch prerequisites for the cached-latent scripts.

Dry-run everything first, review the printed sbatch commands, then re-run
without editing `RUN_DRY_STATES` to submit. Outputs land under each run's
`eval/` (ambient rollout) or `eval_latent/` (latent rollout) subdirectory
(`evaluation_metrics.csv`, rollout videos, etc.).
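Once jobs finish, the per-run CSVs can be gathered for side-by-side comparison. A hedged sketch: the helper name is mine, and it assumes `evaluation_metrics.csv` has one header row of metric names over one row of values, which is not a documented format of the eval job.

```python
# Gather <run_dir>/<subdir>/evaluation_metrics.csv files into one dict.
# Assumes a one-header-row, one-value-row CSV; adjust to the real layout.
import csv
from pathlib import Path

def collect_metrics(run_dirs, subdir="eval"):
    """Return {run_dir: {metric_name: value}} for runs that have been eval'd."""
    results = {}
    for run_dir in run_dirs:
        path = Path(run_dir) / subdir / "evaluation_metrics.csv"
        if not path.is_file():
            continue  # eval job not finished (or not submitted) yet
        with path.open(newline="") as f:
            row = next(csv.DictReader(f))
        results[run_dir] = {name: float(value) for name, value in row.items()}
    return results
```

Pass `subdir="eval_latent"` to read the latent-rollout diagnostics instead of the ambient ones.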
submit_eval_crps_ambient.sh (57 additions & 0 deletions)

#!/bin/bash

set -euo pipefail
# Evaluate CRPS-in-ambient EPD runs trained on 2026-04-18.
# Covers 4 primary runs (permute_concat across all 4 datasets) plus two CNS
# ablations: AE-ambient (DC encoder/decoder, frozen) and identity+global_cond.
# All are EPD checkpoints (encoder_processor_decoder.ckpt); eval uses the
# resolved_config.yaml written alongside each run, so the trained architecture
# is reproduced exactly for eval.
#
# Batch size: CRPS eval fits 8/GPU comfortably (ambient 64x64, n_members=10,
# single forward pass per rollout step — no ODE).

EVAL_BATCH_SIZE=8
TIMEOUT_MIN=240
RUN_DRY_STATES=("true" "false")
EVAL_METRICS="[mse,mae,nmse,nmae,rmse,nrmse,vmse,vrmse,linf,psrmse,psrmse_low,psrmse_mid,psrmse_high,psrmse_tail,pscc,pscc_low,pscc_mid,pscc_high,pscc_tail,crps,fcrps,afcrps,energy,ssr,winkler]"

# Run dirs (absolute paths work; relative paths resolved from repo root).
RUN_DIRS=(
    # CRPS ambient (permute_concat) — 4 datasets
    "outputs/2026-04-18/crps_gs64_vit_azula_large_0f89f06_779325a"
    "outputs/2026-04-18/crps_gpe64_vit_azula_large_0f89f06_d337bd8"
    "outputs/2026-04-18/crps_cns64_vit_azula_large_0f89f06_5b7332b"
    "outputs/2026-04-18/crps_ad64_vit_azula_large_0f89f06_4667606"
    # CNS ablations
    "outputs/2026-04-18/crps_cns64_vit_azula_large_0f89f06_cf53b48" # identity+global_cond
    "outputs/2026-04-18/crps_cns64_vit_azula_large_0f89f06_e7e60d9" # AE-ambient (DC encoder/decoder)
)

for run_dir in "${RUN_DIRS[@]}"; do
    if [[ ! -f "${run_dir}/resolved_config.yaml" ]]; then
        echo "Skipping ${run_dir}: resolved_config.yaml missing" >&2
        continue
    fi

    for run_dry in "${RUN_DRY_STATES[@]}"; do
        dry_run_arg=()
        run_label="slurm"
        if [[ "${run_dry}" == "true" ]]; then
            dry_run_arg=(--dry-run)
            run_label="slurm --dry-run"
        fi

        echo "Submitting CRPS-ambient eval"
        echo "  mode: ${run_label}"
        echo "  run_dir: ${run_dir}"
        echo "  eval.batch_size: ${EVAL_BATCH_SIZE}"
        echo "  eval.metrics: ${EVAL_METRICS}"

        uv run autocast eval --mode slurm "${dry_run_arg[@]}" \
            --workdir "${run_dir}" \
            eval.metrics="${EVAL_METRICS}" \
            eval.batch_size="${EVAL_BATCH_SIZE}" \
            hydra.launcher.timeout_min="${TIMEOUT_MIN}"
    done
done
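The two-pass dry-run/submit loop shared by all of these scripts boils down to the pattern below (a standalone sketch; `submit` is an `echo` stand-in for the real `uv run autocast eval` command):

```shell
#!/bin/bash
set -euo pipefail

# First pass previews the sbatch commands with --dry-run, second pass
# submits for real. Set RUN_DRY_STATES=("true") to preview only.
RUN_DRY_STATES=("true" "false")

submit() {
    # Stand-in for: uv run autocast eval --mode slurm "$@" ...
    echo "autocast eval --mode slurm" "$@"
}

for run_dry in "${RUN_DRY_STATES[@]}"; do
    dry_run_arg=()
    if [[ "${run_dry}" == "true" ]]; then
        dry_run_arg=(--dry-run)
    fi
    submit "${dry_run_arg[@]}"
done
```

Building the flag as an array (`dry_run_arg=(--dry-run)` vs `()`) keeps the quoted expansion `"${dry_run_arg[@]}"` from injecting an empty argument on the real-submission pass.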
submit_eval_crps_latent.sh (73 additions & 0 deletions)

#!/bin/bash

set -euo pipefail
# Evaluate CRPS cached-latent processor runs (2026-04-19) in AMBIENT mode.
#
# eval.mode=ambient forces encoder->processor->decoder rollout at every
# step, so decode/encode drift is included in the metrics. This makes the
# latent-space CRPS numbers directly comparable with the ambient CRPS and
# FM baselines (see slurm_scripts/comparison/eval/README.md).
#
# The eval.mode selector landed via PR #327 and is now in-tree. When ambient
# is requested on a cached-latents datamodule, eval auto-substitutes the raw
# datamodule from <cache_dir>/autoencoder_config.yaml; the trained AE weights
# are supplied via autoencoder_checkpoint.
#
# Batch size: cached-latent eval pays the ambient AE encode/decode per step
# but processor forward is cheap (64 tokens vs 256 for ambient-patch4), so
# 8/GPU fits comfortably — same as pure-ambient CRPS.

EVAL_BATCH_SIZE=8
TIMEOUT_MIN=240
RUN_DRY_STATES=("true" "false")
EVAL_METRICS="[mse,mae,nmse,nmae,rmse,nrmse,vmse,vrmse,linf,psrmse,psrmse_low,psrmse_mid,psrmse_high,psrmse_tail,pscc,pscc_low,pscc_mid,pscc_high,pscc_tail,crps,fcrps,afcrps,energy,ssr,winkler]"

# (run_dir, autoencoder_checkpoint) pairs. Extend as more cached-latent CRPS
# runs land (gs, gpe, ad) — the AE paths are the same as training.
RUN_DIRS=(
    "outputs/2026-04-19/crps_cns64_vit_azula_large_58712c4_71ba7be"
)
declare -A AE_CKPT=(
    ["outputs/2026-04-19/crps_cns64_vit_azula_large_58712c4_71ba7be"]="$HOME/autocast/outputs/2026-04-17/ae_cns64_3a7999b_b9c29f8/autoencoder.ckpt"
)

for run_dir in "${RUN_DIRS[@]}"; do
    ae_ckpt="${AE_CKPT[$run_dir]:-}"
    if [[ -z "${ae_ckpt}" ]]; then
        echo "Skipping ${run_dir}: no autoencoder_checkpoint mapping" >&2
        continue
    fi
    if [[ ! -f "${run_dir}/resolved_config.yaml" ]]; then
        echo "Skipping ${run_dir}: resolved_config.yaml missing" >&2
        continue
    fi
    if [[ ! -f "${ae_ckpt}" ]]; then
        echo "Skipping ${run_dir}: AE checkpoint missing at ${ae_ckpt}" >&2
        continue
    fi

    for run_dry in "${RUN_DRY_STATES[@]}"; do
        dry_run_arg=()
        run_label="slurm"
        if [[ "${run_dry}" == "true" ]]; then
            dry_run_arg=(--dry-run)
            run_label="slurm --dry-run"
        fi

        echo "Submitting CRPS cached-latent eval (mode=ambient)"
        echo "  mode: ${run_label}"
        echo "  run_dir: ${run_dir}"
        echo "  autoencoder_checkpoint: ${ae_ckpt}"
        echo "  eval.batch_size: ${EVAL_BATCH_SIZE}"
        echo "  eval.metrics: ${EVAL_METRICS}"

        uv run autocast eval --mode slurm "${dry_run_arg[@]}" \
            --workdir "${run_dir}" \
            eval.checkpoint=processor.ckpt \
            ++eval.mode=ambient \
            +autoencoder_checkpoint="${ae_ckpt}" \
            eval.metrics="${EVAL_METRICS}" \
            eval.batch_size="${EVAL_BATCH_SIZE}" \
            hydra.launcher.timeout_min="${TIMEOUT_MIN}"
    done
done
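The "extend as more runs land" note can be followed mechanically: each new run dir needs a paired AE entry. A hypothetical sketch, where the gs run-dir hash is a PLACEHOLDER (only the AE path is a real one from the FM script; the run itself has not landed):

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical extension of the run/AE mapping. The gs run-dir hash below
# is a PLACEHOLDER, not a real checkpoint; the AE path reuses the AE that
# produced the cached latents at training time.
RUN_DIRS=(
    "outputs/2026-04-19/crps_cns64_vit_azula_large_58712c4_71ba7be"
    "outputs/2026-04-19/crps_gs64_vit_azula_large_58712c4_PLACEHOLDER"  # hypothetical
)
declare -A AE_CKPT=(
    ["outputs/2026-04-19/crps_cns64_vit_azula_large_58712c4_71ba7be"]="$HOME/autocast/outputs/2026-04-17/ae_cns64_3a7999b_b9c29f8/autoencoder.ckpt"
    ["outputs/2026-04-19/crps_gs64_vit_azula_large_58712c4_PLACEHOLDER"]="$HOME/autocast/outputs/2026-04-17/ae_gs64_3a7999b_ed36b8e/autoencoder.ckpt"
)

# Sanity check: every run dir needs an AE mapping, or the eval loop skips it.
for run_dir in "${RUN_DIRS[@]}"; do
    if [[ -z "${AE_CKPT[$run_dir]:-}" ]]; then
        echo "unmapped: ${run_dir}" >&2
    fi
done
```

Keeping the mapping in one `declare -A` block next to `RUN_DIRS` makes a missing pair a loud skip at submit time rather than a silent wrong-AE eval.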
submit_eval_crps_latent_rollout_latent.sh (71 additions & 0 deletions)

#!/bin/bash

set -euo pipefail
# Evaluate CRPS cached-latent processor runs (2026-04-19) in LATENT mode.
#
# eval.mode=latent rolls out only in latent space and writes results to
# eval_latent/ so ambient-vs-latent comparisons can coexist per run.
#
# The eval.mode selector landed via PR #327 and is now in-tree. We still pass
# autoencoder_checkpoint to load the trained AE for eval setup/final decode.
#
# Batch size: latent rollout avoids per-step AE encode/decode, so 8/GPU is a
# conservative setting and matches the ambient-compare CRPS script.

EVAL_BATCH_SIZE=8
TIMEOUT_MIN=240
RUN_DRY_STATES=("true" "false")
EVAL_SUBDIR="eval_latent"
EVAL_METRICS="[mse,mae,nmse,nmae,rmse,nrmse,vmse,vrmse,linf,psrmse,psrmse_low,psrmse_mid,psrmse_high,psrmse_tail,pscc,pscc_low,pscc_mid,pscc_high,pscc_tail,crps,fcrps,afcrps,energy,ssr,winkler]"

# (run_dir, autoencoder_checkpoint) pairs. Extend as more cached-latent CRPS
# runs land (gs, gpe, ad) — the AE paths are the same as training.
RUN_DIRS=(
    "outputs/2026-04-19/crps_cns64_vit_azula_large_58712c4_71ba7be"
)
declare -A AE_CKPT=(
    ["outputs/2026-04-19/crps_cns64_vit_azula_large_58712c4_71ba7be"]="$HOME/autocast/outputs/2026-04-17/ae_cns64_3a7999b_b9c29f8/autoencoder.ckpt"
)

for run_dir in "${RUN_DIRS[@]}"; do
    ae_ckpt="${AE_CKPT[$run_dir]:-}"
    if [[ -z "${ae_ckpt}" ]]; then
        echo "Skipping ${run_dir}: no autoencoder_checkpoint mapping" >&2
        continue
    fi
    if [[ ! -f "${run_dir}/resolved_config.yaml" ]]; then
        echo "Skipping ${run_dir}: resolved_config.yaml missing" >&2
        continue
    fi
    if [[ ! -f "${ae_ckpt}" ]]; then
        echo "Skipping ${run_dir}: AE checkpoint missing at ${ae_ckpt}" >&2
        continue
    fi

    for run_dry in "${RUN_DRY_STATES[@]}"; do
        dry_run_arg=()
        run_label="slurm"
        if [[ "${run_dry}" == "true" ]]; then
            dry_run_arg=(--dry-run)
            run_label="slurm --dry-run"
        fi

        echo "Submitting CRPS cached-latent eval (mode=latent)"
        echo "  mode: ${run_label}"
        echo "  run_dir: ${run_dir}"
        echo "  autoencoder_checkpoint: ${ae_ckpt}"
        echo "  eval.batch_size: ${EVAL_BATCH_SIZE}"
        echo "  eval.metrics: ${EVAL_METRICS}"
        echo "  output_dir: ${run_dir}/${EVAL_SUBDIR}"

        uv run autocast eval --mode slurm "${dry_run_arg[@]}" \
            --workdir "${run_dir}" \
            eval.checkpoint=processor.ckpt \
            ++eval.mode=latent \
            +autoencoder_checkpoint="${ae_ckpt}" \
            eval.metrics="${EVAL_METRICS}" \
            eval.batch_size="${EVAL_BATCH_SIZE}" \
            hydra.sweep.dir="${run_dir}/${EVAL_SUBDIR}" \
            hydra.launcher.timeout_min="${TIMEOUT_MIN}"
    done
done
submit_eval_fm_ambient.sh (50 additions & 0 deletions)

#!/bin/bash

set -euo pipefail
# Evaluate FM-in-ambient (flow matching, identity encoder) EPD runs from
# 2026-04-18 across all 4 datasets. Eval reuses resolved_config.yaml so
# flow_ode_steps (=50), hid_channels, and backbone match training.
#
# Batch size: diffusion rollout is ODE-integrated (flow_ode_steps=50) per
# rollout step, so ambient 64x64 × n_members=10 × 50 ODE substeps is the
# tightest of the three. 4/GPU fits; drop to 2 if OOM.

EVAL_BATCH_SIZE=4
TIMEOUT_MIN=360
RUN_DRY_STATES=("true" "false")
EVAL_METRICS="[mse,mae,nmse,nmae,rmse,nrmse,vmse,vrmse,linf,psrmse,psrmse_low,psrmse_mid,psrmse_high,psrmse_tail,pscc,pscc_low,pscc_mid,pscc_high,pscc_tail,crps,fcrps,afcrps,energy,ssr,winkler]"

RUN_DIRS=(
    "outputs/2026-04-18/diff_gs64_flow_matching_vit_0f89f06_6e3a299"
    "outputs/2026-04-18/diff_gpe64_flow_matching_vit_0f89f06_3b3604d"
    "outputs/2026-04-18/diff_cns64_flow_matching_vit_0f89f06_483bb70"
    "outputs/2026-04-18/diff_ad64_flow_matching_vit_0f89f06_725d44a"
)

for run_dir in "${RUN_DIRS[@]}"; do
    if [[ ! -f "${run_dir}/resolved_config.yaml" ]]; then
        echo "Skipping ${run_dir}: resolved_config.yaml missing" >&2
        continue
    fi

    for run_dry in "${RUN_DRY_STATES[@]}"; do
        dry_run_arg=()
        run_label="slurm"
        if [[ "${run_dry}" == "true" ]]; then
            dry_run_arg=(--dry-run)
            run_label="slurm --dry-run"
        fi

        echo "Submitting FM-ambient eval"
        echo "  mode: ${run_label}"
        echo "  run_dir: ${run_dir}"
        echo "  eval.batch_size: ${EVAL_BATCH_SIZE}"
        echo "  eval.metrics: ${EVAL_METRICS}"

        uv run autocast eval --mode slurm "${dry_run_arg[@]}" \
            --workdir "${run_dir}" \
            eval.metrics="${EVAL_METRICS}" \
            eval.batch_size="${EVAL_BATCH_SIZE}" \
            hydra.launcher.timeout_min="${TIMEOUT_MIN}"
    done
done
submit_eval_fm_latent.sh (77 additions & 0 deletions)

#!/bin/bash

set -euo pipefail
# Evaluate FM cached-latent processor runs (2026-04-18) in AMBIENT mode.
#
# eval.mode=ambient forces encoder->processor->decoder rollout at every
# step, so decode/encode drift is included in the metrics — the
# apples-to-apples regime for comparison with the ambient FM baseline.
#
# The eval.mode selector landed via PR #327 and is now in-tree. When ambient
# is requested on a cached-latents datamodule, eval auto-substitutes the raw
# datamodule from <cache_dir>/autoencoder_config.yaml; the trained AE weights
# are supplied via autoencoder_checkpoint.
#
# Batch size: ambient rollout pays encode/decode every step plus 50 ODE
# substeps through the processor. Cached-latent processor forward is lighter
# (64 tokens vs 256 for ambient FM), so 4/GPU is a safe start; the tight
# spot is the same ODE + AE stack so it mirrors FM-ambient.

EVAL_BATCH_SIZE=4
TIMEOUT_MIN=360
RUN_DRY_STATES=("true" "false")
EVAL_METRICS="[mse,mae,nmse,nmae,rmse,nrmse,vmse,vrmse,linf,psrmse,psrmse_low,psrmse_mid,psrmse_high,psrmse_tail,pscc,pscc_low,pscc_mid,pscc_high,pscc_tail,crps,fcrps,afcrps,energy,ssr,winkler]"

RUN_DIRS=(
    "outputs/2026-04-18/diff_gs64_flow_matching_vit_0f89f06_f6e8f51"
    "outputs/2026-04-18/diff_gpe64_flow_matching_vit_0f89f06_b954f94"
    "outputs/2026-04-18/diff_cns64_flow_matching_vit_0f89f06_0e1c64b"
    "outputs/2026-04-18/diff_ad64_flow_matching_vit_0f89f06_df2137c"
)
declare -A AE_CKPT=(
    ["outputs/2026-04-18/diff_gs64_flow_matching_vit_0f89f06_f6e8f51"]="$HOME/autocast/outputs/2026-04-17/ae_gs64_3a7999b_ed36b8e/autoencoder.ckpt"
    ["outputs/2026-04-18/diff_gpe64_flow_matching_vit_0f89f06_b954f94"]="$HOME/autocast/outputs/2026-04-17/ae_gpe64_3a7999b_31e1c9f/autoencoder.ckpt"
    ["outputs/2026-04-18/diff_cns64_flow_matching_vit_0f89f06_0e1c64b"]="$HOME/autocast/outputs/2026-04-17/ae_cns64_3a7999b_b9c29f8/autoencoder.ckpt"
    ["outputs/2026-04-18/diff_ad64_flow_matching_vit_0f89f06_df2137c"]="$HOME/autocast/outputs/2026-04-17/ae_ad64_3a7999b_1a1e300/autoencoder.ckpt"
)

for run_dir in "${RUN_DIRS[@]}"; do
    ae_ckpt="${AE_CKPT[$run_dir]:-}"
    if [[ -z "${ae_ckpt}" ]]; then
        echo "Skipping ${run_dir}: no autoencoder_checkpoint mapping" >&2
        continue
    fi
    if [[ ! -f "${run_dir}/resolved_config.yaml" ]]; then
        echo "Skipping ${run_dir}: resolved_config.yaml missing" >&2
        continue
    fi
    if [[ ! -f "${ae_ckpt}" ]]; then
        echo "Skipping ${run_dir}: AE checkpoint missing at ${ae_ckpt}" >&2
        continue
    fi

    for run_dry in "${RUN_DRY_STATES[@]}"; do
        dry_run_arg=()
        run_label="slurm"
        if [[ "${run_dry}" == "true" ]]; then
            dry_run_arg=(--dry-run)
            run_label="slurm --dry-run"
        fi

        echo "Submitting FM cached-latent eval (mode=ambient)"
        echo "  mode: ${run_label}"
        echo "  run_dir: ${run_dir}"
        echo "  autoencoder_checkpoint: ${ae_ckpt}"
        echo "  eval.batch_size: ${EVAL_BATCH_SIZE}"
        echo "  eval.metrics: ${EVAL_METRICS}"

        uv run autocast eval --mode slurm "${dry_run_arg[@]}" \
            --workdir "${run_dir}" \
            eval.checkpoint=processor.ckpt \
            ++eval.mode=ambient \
            +autoencoder_checkpoint="${ae_ckpt}" \
            eval.metrics="${EVAL_METRICS}" \
            eval.batch_size="${EVAL_BATCH_SIZE}" \
            hydra.launcher.timeout_min="${TIMEOUT_MIN}"
    done
done
