Commit de86747

Merge pull request #339 from alan-turing-institute/2026-04-20/encode-once-eval
Add encode_once eval mode and auto dispatcher
2 parents a14f458 + 2cda8b8 commit de86747

4 files changed

Lines changed: 597 additions & 158 deletions

src/autocast/configs/eval/README.md

Lines changed: 110 additions & 36 deletions
@@ -44,9 +44,11 @@ python -m autocast.scripts.eval.encoder_processor_decoder \
4444
All eval configs support these parameters:
4545

4646
- `checkpoint`: Path to model checkpoint (required for evaluation)
47-
- `mode`: Evaluation regime (`auto` | `ambient` | `latent`). Controls the
48-
**rollout space**, not just the metrics space. See
49-
[Ambient vs latent rollout](#ambient-vs-latent-rollout) below.
47+
- `mode`: Evaluation regime (`auto` (default) | `encode_once` | `ambient` |
48+
`latent`). Controls the **rollout space**, not just the metrics space.
49+
`auto` dispatches to a concrete mode at run time based on the checkpoint
50+
and datamodule, so omitting the flag gives the fair default for every
51+
run. See [Evaluation modes](#evaluation-modes) below.
5052
- `metrics`: List of metrics to compute (default includes mse/mae/rmse/vrmse,
5153
power spectrum scores `psrmse*`, cross-correlation spectrum scores `pscc*`,
5254
and ensemble scores `crps`, `fcrps`, `afcrps`, `energy`, `ssr`; `variogram`
@@ -83,50 +85,122 @@ process so Fabric DDP initialises automatically — no extra flags needed.
8385
- `max_rollout_steps`: Maximum number of rollout steps
8486
- `free_running_only`: Whether to disable teacher forcing
8587

86-
## Ambient vs latent rollout
87-
88-
Processor checkpoints trained on cached latents can be evaluated in two
89-
qualitatively different regimes. The `eval.mode` knob makes the choice
90-
explicit and surfaces clear errors when the rest of the config is
91-
inconsistent with the request.
92-
93-
- `eval.mode=auto` (default) preserves historical behavior: the script picks
94-
a path based on `(checkpoint type, datamodule batch type,
95-
autoencoder_checkpoint)`.
96-
- `eval.mode=ambient` forces full `encoder -> processor -> decoder` rollout.
97-
Each rollout step decodes to ambient fields and re-encodes on the next
98-
step, so decode/encode drift is included in the metrics. **This is the
99-
apples-to-apples regime for comparing against baselines that natively roll
100-
out in data space (e.g. a CRPS comparison against a non-autoencoder
101-
model).** Requires `autoencoder_checkpoint=<ae.ckpt>` and a raw-Batch
102-
datamodule. When the current datamodule yields `EncodedBatch` (cached
103-
latents), eval auto-substitutes the datamodule from
104-
`<cache_dir>/autoencoder_config.yaml` saved by `autocast cache-latents`.
105-
Pass `datamodule=...` explicitly to override the default.
106-
- `eval.mode=latent` forces latent-space rollout: the processor's predicted
107-
latent is fed back as the next latent input; the encoder is invoked only
108-
once. Metrics are decoded to data space via the decoder saved alongside
109-
the cached latents when available, otherwise they are reported in latent
110-
space. Requires an `EncodedBatch` / cached-latents datamodule.
111-
112-
### Running the ambient ablation
88+
## Evaluation modes
89+
90+
The `eval.mode` knob controls the **rollout space** and what the metrics
91+
compare against. The three concrete modes give the same answer on
92+
single-step (windowed test) metrics; they only diverge during free-running
93+
rollout. `auto` is a dispatcher that picks one of the concrete modes at
94+
run time.
95+
96+
| mode | encoder runs | processor rolls out in | decoder runs | ground truth used | when to use |
97+
| ------------- | ------------------- | ---------------------- | ------------ | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------ |
98+
| `encode_once` | **once** (step 0) | latent space | per step | raw `batch.output_fields` (denormalized) | fair processor-only eval that avoids decode/encode drift but still scores against real ground truth. |
99+
| `ambient` | per rollout step | data space (re-encoded each step) | per step | raw `batch.output_fields` (denormalized) | apples-to-apples comparisons with pure-ambient baselines (e.g. CRPS vs. a non-autoencoder model). |
100+
| `latent` | once (step 0) | latent space | only for metrics (or skipped via `latent_space_metrics=true`) | **decoded cached latents** (autoencoder reconstruction of ground truth) | measure the processor against what the autoencoder sees -- isolates processor error but hides AE reconstruction error. |
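
The table above can be illustrated with a toy scalar model. This is a minimal sketch, not the autocast implementation: the function names, the floor-based lossy "encoder", and the multiplicative "processor" are all hypothetical stand-ins chosen so the arithmetic is easy to follow. It shows that `encode_once` and `ambient` agree on the first step and drift apart later, and that `latent` scores against an autoencoder reconstruction of the ground truth rather than the raw values.

```python
import math
from typing import Callable, List, Sequence

Fn = Callable[[float], float]


def rollout_encode_once(x0: float, steps: int, encode: Fn, process: Fn, decode: Fn) -> List[float]:
    """Encode at step 0 only, roll out in latent space, decode every step."""
    z = encode(x0)
    preds = []
    for _ in range(steps):
        z = process(z)            # the latent prediction is fed back directly
        preds.append(decode(z))   # decoding is for scoring only
    return preds


def rollout_ambient(x0: float, steps: int, encode: Fn, process: Fn, decode: Fn) -> List[float]:
    """Encode -> process -> decode at every step; the decoded value is re-encoded."""
    x = x0
    preds = []
    for _ in range(steps):
        x = decode(process(encode(x)))  # decode/encode drift compounds here
        preds.append(x)
    return preds


def latent_mode_targets(truth: Sequence[float], encode: Fn, decode: Fn) -> List[float]:
    """Targets for `latent` mode: ground truth passed through the autoencoder."""
    return [decode(encode(x)) for x in truth]


# Hypothetical toy components (assumptions for illustration only).
def lossy_encode(x: float) -> float:
    return float(math.floor(x))   # lossy: decode(encode(x)) != x in general


def identity(x: float) -> float:
    return x


def process(z: float) -> float:
    return z * 1.5


encode_once_preds = rollout_encode_once(2.9, 3, lossy_encode, process, identity)
ambient_preds = rollout_ambient(2.9, 3, lossy_encode, process, identity)
print(encode_once_preds)  # [3.0, 4.5, 6.75]
print(ambient_preds)      # [3.0, 4.5, 6.0]
```

With a lossless (identity) encoder and decoder the two rollouts coincide at every step, which mirrors the stateless encoder/decoder case described for `auto`.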
101+
102+
### `auto` (default)
103+
104+
`eval.mode=auto` dispatches to the faithful concrete mode for the current
105+
run:
106+
107+
- **Full EPD checkpoints** (including processor runs with stateless
108+
encoder/decoder baked in, e.g. `permute_concat` + `identity`) -> `ambient`.
109+
`encode_once` and `ambient` are numerically identical here; `auto`
110+
picks `ambient` to keep logs quiet. Passing `eval.mode=encode_once`
111+
explicitly on such a run still works but emits a warning.
112+
- **Processor trained on cached latents + autoencoder available**
113+
(either via `autoencoder_checkpoint=<ae.ckpt>` or via
114+
`<cache_dir>/autoencoder_config.yaml`) -> `encode_once`. Strictly fairer
115+
than `ambient` (no drift penalty) **and** than `latent` (AE
116+
reconstruction error is visible against raw ground truth).
117+
- **Processor trained on cached latents, autoencoder not reachable**
118+
-> `latent`. The only faithful option when you can decode but not
119+
re-encode. If no decoder can be built either, `auto` does not silently
120+
fall through to latent-only metrics -- it fails fast so you either fix
121+
the autoencoder path or opt in explicitly via
122+
`eval.mode=latent eval.latent_space_metrics=true`.
123+
124+
The resolved mode is logged at INFO as `eval.mode=auto resolved to <X>`.
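
The dispatch rules above can be summarised as a small decision function. This is a sketch under stated assumptions: `resolve_auto_mode` and its three boolean inputs are hypothetical names, and the real dispatcher inspects the checkpoint and datamodule rather than taking flags. Only the resulting mode names, the fail-fast behaviour, and the log message format come from the text above.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_auto_mode(
    *,
    full_epd_checkpoint: bool,
    autoencoder_available: bool,
    decoder_available: bool,
) -> str:
    """Pick a concrete eval mode following the documented `auto` rules.

    All three arguments are illustrative flags, not real autocast parameters.
    """
    if full_epd_checkpoint:
        # encode_once and ambient are numerically identical here.
        mode = "ambient"
    elif autoencoder_available:
        # Processor on cached latents, and the full autoencoder can be built.
        mode = "encode_once"
    elif decoder_available:
        # Can decode but not re-encode.
        mode = "latent"
    else:
        # Never fall through silently to latent-only metrics.
        raise ValueError(
            "eval.mode=auto: no decoder can be built; fix the autoencoder path "
            "or pass eval.mode=latent eval.latent_space_metrics=true explicitly"
        )
    logger.info("eval.mode=auto resolved to %s", mode)
    return mode
```

Keeping the failure case as an exception rather than a fallback matches the stated intent that latent-only metrics must be an explicit opt-in.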
125+
126+
### Explicit modes
127+
128+
#### Ambient: apples-to-apples with pure-ambient baselines
129+
130+
`eval.mode=ambient` forces full `encoder -> processor -> decoder` at every
131+
rollout step. The decoded field is re-encoded as the next step's input, so
132+
autoencoder decode/encode drift compounds into the metrics. This is the
133+
right regime when the baseline model operates natively in data space and you
134+
want to charge the autoencoder for any error it introduces. Requires
135+
`autoencoder_checkpoint=<ae.ckpt>` and a raw-Batch datamodule. When the
136+
current datamodule yields `EncodedBatch` (cached latents), eval
137+
auto-substitutes the datamodule from `<cache_dir>/autoencoder_config.yaml`
138+
saved by `autocast cache-latents` (pass `datamodule=...` explicitly to
139+
override).
140+
141+
#### Latent: measure the processor against the AE's view of the world
142+
143+
`eval.mode=latent` forces latent-space rollout: the processor's predicted
144+
latent is fed back as the next latent input and the encoder is never
145+
invoked past step 0. Metrics are decoded to data space via the decoder
146+
saved alongside the cached latents and **compared against decoded cached
147+
latents** -- i.e. an autoencoder reconstruction of ground truth, not the
148+
raw fields. Use this when you want to isolate the processor's rollout
149+
quality in its own training distribution and explicitly accept that AE
150+
reconstruction error is hidden from the metric.
151+
152+
A reachable decoder is required; if the cache directory's
153+
`autoencoder_config.yaml` or checkpoint is missing the run fails fast
154+
rather than silently falling back to computing metrics in raw latent
155+
space (those numbers were never comparable across runs).
156+
157+
##### Dev sense-check: latent-only metrics
158+
159+
Sometimes you want to iterate on a small processor paired with a large /
160+
expensive autoencoder and skip the decoder entirely. Pass
161+
`eval.mode=latent eval.latent_space_metrics=true` to opt in:
162+
163+
```bash
164+
autocast eval --workdir <processor_workdir> \
165+
eval.mode=latent \
166+
eval.latent_space_metrics=true \
167+
eval.checkpoint=<processor.ckpt>
168+
```
169+
170+
This skips the decoder lookup and compares processor predictions against
171+
cached latents directly in the autoencoder's raw latent space. Treat the
172+
numbers as a cheap sanity check only: they are **not comparable across
173+
runs** (latent space is basis-dependent) and physics-aware metrics
174+
(`psrmse*`, `pscc*`, `variogram`) are not meaningful. The flag is
175+
rejected for any other `eval.mode` because the raw-space modes (`auto`,
176+
`ambient`, `encode_once`) require a decoder by definition.
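
The rejection rule for the flag can be sketched as a one-step validation. The function name `check_latent_space_metrics` and the error wording are assumptions made for illustration; only the rule itself (the flag is honored solely with an explicit `eval.mode=latent`) comes from the text above.

```python
VALID_MODES = ("auto", "encode_once", "ambient", "latent")


def check_latent_space_metrics(mode: str, latent_space_metrics: bool) -> None:
    """Raise if `latent_space_metrics` is combined with a raw-space mode.

    Hypothetical helper: it mirrors the documented rule, not the actual code.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown eval.mode: {mode!r}")
    if latent_space_metrics and mode != "latent":
        # auto, ambient and encode_once all score in data space,
        # so they need a decoder and cannot skip it.
        raise ValueError(
            f"eval.latent_space_metrics=true requires eval.mode=latent, got {mode!r}"
        )


check_latent_space_metrics("latent", True)   # accepted: explicit opt-in
check_latent_space_metrics("auto", False)    # accepted: flag left at its default
```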
177+
178+
### Running the ablations
113179

114180
Given an autoencoder checkpoint and a processor checkpoint trained on its
115-
cached latents, a minimal invocation is:
181+
cached latents:
116182

117183
```bash
118-
# Ambient (encoder -> processor -> decoder at every rollout step)
184+
# Default: auto -> encode_once here (fair processor-only eval, raw ground truth).
185+
autocast eval --workdir <processor_workdir> \
186+
eval.checkpoint=<processor.ckpt> \
187+
autoencoder_checkpoint=<autoencoder.ckpt>
188+
189+
# Apples-to-apples with pure-ambient baselines (charges AE drift).
119190
autocast eval --workdir <processor_workdir> \
120191
eval.mode=ambient \
121192
eval.checkpoint=<processor.ckpt> \
122193
autoencoder_checkpoint=<autoencoder.ckpt>
123194

124-
# Latent (processor rollout stays in latent space; decoded only for metrics)
195+
# Processor-only latent view; no raw ground truth, hides AE reconstruction error.
125196
autocast eval --workdir <processor_workdir> \
126197
eval.mode=latent \
127198
eval.checkpoint=<processor.ckpt>
128199
```
129200

130-
The ambient run will differ from the latent run by exactly the
131-
decode/encode drift accumulated over rollout steps, which is the relevant
132-
delta when comparing against purely-ambient baselines.
201+
The three runs differ on rollout metrics as follows:
202+
203+
- `ambient - encode_once` = decode/encode drift accumulated over rollout
204+
steps (charged to the autoencoder).
205+
- `encode_once - latent` = visibility of AE reconstruction error against the
206+
raw field (absent from `latent`, included in `encode_once`).
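
The two differences above can be computed per rollout step once the three runs have produced the same metric. The sketch below assumes you have already collected per-step values into plain lists; the helper name `rollout_deltas`, the dictionary keys, and the numbers in the usage example are made-up placeholders, not results from any real run.

```python
from typing import Dict, List, Sequence


def rollout_deltas(
    ambient: Sequence[float],
    encode_once: Sequence[float],
    latent: Sequence[float],
) -> Dict[str, List[float]]:
    """Split per-step metric gaps into the two documented components."""
    if not (len(ambient) == len(encode_once) == len(latent)):
        raise ValueError("all three runs must cover the same rollout steps")
    return {
        # ambient - encode_once: decode/encode drift charged to the autoencoder
        "decode_encode_drift": [a - e for a, e in zip(ambient, encode_once)],
        # encode_once - latent: AE reconstruction error visible against raw fields
        "ae_reconstruction_visibility": [e - l for e, l in zip(encode_once, latent)],
    }


# Placeholder per-step metric values, for illustration only.
deltas = rollout_deltas(
    ambient=[0.5, 1.0],
    encode_once=[0.25, 0.5],
    latent=[0.125, 0.25],
)
print(deltas["decode_encode_drift"])           # [0.25, 0.5]
print(deltas["ae_reconstruction_visibility"])  # [0.125, 0.25]
```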

src/autocast/configs/eval/default.yaml

Lines changed: 10 additions & 18 deletions
@@ -2,26 +2,18 @@
22
# Path to checkpoint for evaluation (required for eval)
33
checkpoint: null
44

5-
# Evaluation mode selector (controls rollout space, not just metrics space).
6-
#
7-
# auto (default) infer from checkpoint type + batch type + autoencoder_checkpoint.
8-
# Preserves historical behavior.
9-
# ambient Force full encoder -> processor -> decoder rollout. Each rollout step
10-
# decodes and re-encodes, so decode/encode drift is included in the
11-
# metrics -- this is the apples-to-apples regime for comparing against
12-
# models that natively roll out in ambient/data space (e.g. CRPS baselines).
13-
# Requires `autoencoder_checkpoint=<ae.ckpt>` and a raw-Batch datamodule.
14-
# When the datamodule yields EncodedBatch (cached latents), the eval
15-
# script auto-substitutes the datamodule from
16-
# `<cache_dir>/autoencoder_config.yaml` written by `autocast cache-latents`.
17-
# Pass `datamodule=...` explicitly to override that default.
18-
# latent Force latent-space rollout (processor predictions are fed back as
19-
# latents; encoder is not re-invoked). Metrics are decoded to data
20-
# space via the decoder saved alongside the cached latents if
21-
# available, otherwise computed in latent space. Requires an
22-
# EncodedBatch datamodule (cached latents).
5+
# Evaluation mode selector; controls rollout space, not just metrics space.
6+
# Values: auto (default dispatcher) | encode_once | ambient | latent.
7+
# See `autocast/configs/eval/README.md` for the full comparison and the
8+
# auto-dispatch rules; the resolved mode is logged at INFO.
239
mode: auto
2410

11+
# Dev sense-check: compute metrics directly in raw latent space. Only
12+
# honored with an explicit `eval.mode=latent` and skips the decoder
13+
# entirely. Numbers are not comparable across runs and physics-aware
14+
# metrics are not meaningful -- leave as `false` for real evals.
15+
latent_space_metrics: false
16+
2517
# Evaluation metrics to compute
2618
metrics:
2719
- mse
