Merged
146 changes: 110 additions & 36 deletions src/autocast/configs/eval/README.md
@@ -44,9 +44,11 @@ python -m autocast.scripts.eval.encoder_processor_decoder \
All eval configs support these parameters:

- `checkpoint`: Path to model checkpoint (required for evaluation)
- `mode`: Evaluation regime (`auto` | `ambient` | `latent`). Controls the
**rollout space**, not just the metrics space. See
[Ambient vs latent rollout](#ambient-vs-latent-rollout) below.
- `mode`: Evaluation regime (`auto` (default) | `encode_once` | `ambient` |
`latent`). Controls the **rollout space**, not just the metrics space.
`auto` dispatches to a concrete mode at run time based on the checkpoint
and datamodule, so omitting the flag gives the fair default for every
run. See [Evaluation modes](#evaluation-modes) below.
- `metrics`: List of metrics to compute (default includes mse/mae/rmse/vrmse,
power spectrum scores `psrmse*`, cross-correlation spectrum scores `pscc*`,
and ensemble scores `crps`, `fcrps`, `afcrps`, `energy`, `ssr`; `variogram`
@@ -83,50 +85,122 @@ process so Fabric DDP initialises automatically — no extra flags needed.
- `max_rollout_steps`: Maximum number of rollout steps
- `free_running_only`: Whether to disable teacher forcing

## Ambient vs latent rollout

Processor checkpoints trained on cached latents can be evaluated in two
qualitatively different regimes. The `eval.mode` knob makes the choice
explicit and surfaces clear errors when the rest of the config is
inconsistent with the request.

- `eval.mode=auto` (default) preserves historical behavior: the script picks
a path based on `(checkpoint type, datamodule batch type,
autoencoder_checkpoint)`.
- `eval.mode=ambient` forces full `encoder -> processor -> decoder` rollout.
Each rollout step decodes to ambient fields and re-encodes on the next
step, so decode/encode drift is included in the metrics. **This is the
apples-to-apples regime for comparing against baselines that natively roll
out in data space (e.g. a CRPS comparison against a non-autoencoder
model).** Requires `autoencoder_checkpoint=<ae.ckpt>` and a raw-Batch
datamodule. When the current datamodule yields `EncodedBatch` (cached
latents), eval auto-substitutes the datamodule from
`<cache_dir>/autoencoder_config.yaml` saved by `autocast cache-latents`.
Pass `datamodule=...` explicitly to override the default.
- `eval.mode=latent` forces latent-space rollout: the processor's predicted
latent is fed back as the next latent input; the encoder is invoked only
once. Metrics are decoded to data space via the decoder saved alongside
the cached latents when available, otherwise they are reported in latent
space. Requires an `EncodedBatch` / cached-latents datamodule.

### Running the ambient ablation
## Evaluation modes

The `eval.mode` knob controls the **rollout space** and what the metrics
compare against. The three concrete modes give the same answer on
single-step (windowed test) metrics; they only diverge during free-running
rollout. `auto` is a dispatcher that picks one of the concrete modes at
run time.

| mode | encoder runs | processor rolls out in | decoder runs | ground truth used | when to use |
| ------------- | ------------------- | ---------------------- | ------------ | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------ |
| `encode_once` | **once** (step 0) | latent space | per step | raw `batch.output_fields` (denormalized) | fair processor-only eval that avoids decode/encode drift but still scores against real ground truth. |
| `ambient` | per rollout step | data space (re-encoded each step) | per step | raw `batch.output_fields` (denormalized) | apples-to-apples comparisons with pure-ambient baselines (e.g. CRPS vs. a non-autoencoder model). |
| `latent` | once (step 0) | latent space | only for metrics (or skipped via `latent_space_metrics=true`) | **decoded cached latents** (autoencoder reconstruction of ground truth) | measure the processor against what the autoencoder sees -- isolates processor error but hides AE reconstruction error. |
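
The rollout differences in the table can be sketched in a few lines of
illustrative Python. This is control flow only; `encoder`, `processor`, and
`decoder` are stand-in callables, not the real autocast modules:

```python
def rollout(mode, encoder, processor, decoder, x0, n_steps):
    """Sketch of the three concrete eval-mode rollout loops.

    encoder/processor/decoder are hypothetical stand-ins; only the
    control flow mirrors the table above.
    """
    preds = []
    if mode == "ambient":
        x = x0
        for _ in range(n_steps):
            z = processor(encoder(x))  # encoder runs at every step
            x = decoder(z)             # decoded field is re-encoded next step
            preds.append(x)
    elif mode in ("encode_once", "latent"):
        z = encoder(x0)                # encoder runs once, at step 0
        for _ in range(n_steps):
            z = processor(z)           # predicted latent is fed straight back
            # decode only to score metrics; `latent` can skip this entirely
            # when eval.latent_space_metrics=true
            preds.append(decoder(z))
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return preds
```

With a lossless stand-in autoencoder the `ambient` and `encode_once` loops
coincide; a lossy decoder makes them agree on the first step and then drift
apart, which is exactly the single-step vs. free-running behaviour described
above.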

### `auto` (default)

`eval.mode=auto` dispatches to the faithful concrete mode for the current
run:

- **Full EPD checkpoints** (including processor runs with stateless
encoder/decoder baked in, e.g. `permute_concat` + `identity`) -> `ambient`.
`encode_once` and `ambient` are numerically identical here; `auto`
picks `ambient` to keep logs quiet. Passing `eval.mode=encode_once`
explicitly on such a run still works but emits a warning.
- **Processor trained on cached latents + autoencoder available**
(either via `autoencoder_checkpoint=<ae.ckpt>` or via
`<cache_dir>/autoencoder_config.yaml`) -> `encode_once`. Strictly fairer
than `ambient` (no drift penalty) **and** than `latent` (AE
reconstruction error is visible against raw ground truth).
- **Processor trained on cached latents, autoencoder not reachable**
-> `latent`. The only faithful option when you can decode but not
re-encode. If no decoder can be built either, `auto` does not silently
fall through to latent-only metrics -- it fails fast so you either fix
the autoencoder path or opt in explicitly via
`eval.mode=latent eval.latent_space_metrics=true`.

The resolved mode is logged at INFO as `eval.mode=auto resolved to <X>`.
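
The dispatch rules above can be summarised as a small decision function
(illustrative only; the boolean flags are hypothetical stand-ins for the
checkpoint and datamodule inspection autocast actually performs):

```python
def resolve_auto_mode(checkpoint_is_full_epd: bool,
                      trained_on_cached_latents: bool,
                      autoencoder_reachable: bool,
                      decoder_buildable: bool) -> str:
    """Sketch of the eval.mode=auto dispatch rules described above."""
    if checkpoint_is_full_epd:
        return "ambient"      # encode_once would be numerically identical
    if trained_on_cached_latents and autoencoder_reachable:
        return "encode_once"  # fairest: no drift, AE error still visible
    if trained_on_cached_latents and decoder_buildable:
        return "latent"       # can decode but not re-encode
    raise RuntimeError(
        "no decoder available: fix the autoencoder path or opt in with "
        "eval.mode=latent eval.latent_space_metrics=true"
    )
```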

### Explicit modes

#### Ambient: apples-to-apples with pure-ambient baselines

`eval.mode=ambient` forces full `encoder -> processor -> decoder` at every
rollout step. The decoded field is re-encoded as the next step's input, so
autoencoder decode/encode drift compounds into the metrics. This is the
right regime when the baseline model operates natively in data space and you
want to charge the autoencoder for any error it introduces. Requires
`autoencoder_checkpoint=<ae.ckpt>` and a raw-Batch datamodule. When the
current datamodule yields `EncodedBatch` (cached latents), eval
auto-substitutes the datamodule from `<cache_dir>/autoencoder_config.yaml`
saved by `autocast cache-latents` (pass `datamodule=...` explicitly to
override).

#### Latent: measure the processor against the AE's view of the world

`eval.mode=latent` forces latent-space rollout: the processor's predicted
latent is fed back as the next latent input and the encoder is never
invoked past step 0. Metrics are decoded to data space via the decoder
saved alongside the cached latents and **compared against decoded cached
latents** -- i.e. an autoencoder reconstruction of ground truth, not the
raw fields. Use this when you want to isolate the processor's rollout
quality in its own training distribution and explicitly accept that AE
reconstruction error is hidden from the metric.

A reachable decoder is required; if the cache directory's
`autoencoder_config.yaml` or checkpoint is missing, the run fails fast
rather than silently falling back to computing metrics in raw latent
space (those numbers were never comparable across runs).

##### Dev sense-check: latent-only metrics

Sometimes you want to iterate on a small processor paired with a large /
expensive autoencoder and skip the decoder entirely. Pass
`eval.mode=latent eval.latent_space_metrics=true` to opt in:

```bash
autocast eval --workdir <processor_workdir> \
eval.mode=latent \
eval.latent_space_metrics=true \
eval.checkpoint=<processor.ckpt>
```

This skips the decoder lookup and compares processor predictions against
cached latents directly in the autoencoder's raw latent space. Treat the
numbers as a cheap sanity check only: they are **not comparable across
runs** (latent space is basis-dependent) and physics-aware metrics
(`psrmse*`, `pscc*`, `variogram`) are not meaningful. The flag is
rejected for any other `eval.mode` because the raw-space modes (`auto`,
`ambient`, `encode_once`) require a decoder by definition.
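
The flag rule reduces to a one-line check, sketched here with a
hypothetical helper (not the real autocast validation code):

```python
def validate_latent_space_metrics(mode: str, latent_space_metrics: bool) -> None:
    """Reject eval.latent_space_metrics=true for any mode except `latent`,
    since the raw-space modes need a decoder by definition."""
    if latent_space_metrics and mode != "latent":
        raise ValueError(
            f"eval.latent_space_metrics=true requires eval.mode=latent; "
            f"mode={mode!r} needs a decoder"
        )
```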

### Running the ablations

Given an autoencoder checkpoint and a processor checkpoint trained on its
cached latents, a minimal invocation is:
cached latents:

```bash
# Ambient (encoder -> processor -> decoder at every rollout step)
# Default: auto -> encode_once here (fair processor-only eval, raw ground truth).
autocast eval --workdir <processor_workdir> \
eval.checkpoint=<processor.ckpt> \
autoencoder_checkpoint=<autoencoder.ckpt>

# Apples-to-apples with pure-ambient baselines (charges AE drift).
autocast eval --workdir <processor_workdir> \
eval.mode=ambient \
eval.checkpoint=<processor.ckpt> \
autoencoder_checkpoint=<autoencoder.ckpt>

# Latent (processor rollout stays in latent space; decoded only for metrics)
# Processor-only latent view; no raw ground truth, hides AE reconstruction error.
autocast eval --workdir <processor_workdir> \
eval.mode=latent \
eval.checkpoint=<processor.ckpt>
```

The ambient run will differ from the latent run by exactly the
decode/encode drift accumulated over rollout steps, which is the relevant
delta when comparing against purely-ambient baselines.
The three runs differ on rollout metrics as follows:

- `ambient - encode_once` = decode/encode drift accumulated over rollout
steps (charged to the autoencoder).
- `encode_once - latent` = visibility of AE reconstruction error against the
raw field (absent from `latent`, included in `encode_once`).
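
Given a per-mode rollout metric (say, final-step RMSE), the two deltas can
be read off directly. The numbers below are made up for illustration:

```python
# Hypothetical final-step RMSE from the three runs above.
rmse = {"ambient": 0.42, "encode_once": 0.31, "latent": 0.25}

# Decode/encode drift accumulated over rollout, charged to the autoencoder.
drift = rmse["ambient"] - rmse["encode_once"]

# AE reconstruction error: visible in encode_once, hidden from latent.
recon_visibility = rmse["encode_once"] - rmse["latent"]
```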
28 changes: 10 additions & 18 deletions src/autocast/configs/eval/default.yaml
@@ -2,26 +2,18 @@
# Path to checkpoint for evaluation (required for eval)
checkpoint: null

# Evaluation mode selector (controls rollout space, not just metrics space).
#
# auto (default) infer from checkpoint type + batch type + autoencoder_checkpoint.
# Preserves historical behavior.
# ambient Force full encoder -> processor -> decoder rollout. Each rollout step
# decodes and re-encodes, so decode/encode drift is included in the
# metrics -- this is the apples-to-apples regime for comparing against
# models that natively roll out in ambient/data space (e.g. CRPS baselines).
# Requires `autoencoder_checkpoint=<ae.ckpt>` and a raw-Batch datamodule.
# When the datamodule yields EncodedBatch (cached latents), the eval
# script auto-substitutes the datamodule from
# `<cache_dir>/autoencoder_config.yaml` written by `autocast cache-latents`.
# Pass `datamodule=...` explicitly to override that default.
# latent Force latent-space rollout (processor predictions are fed back as
# latents; encoder is not re-invoked). Metrics are decoded to data
# space via the decoder saved alongside the cached latents if
# available, otherwise computed in latent space. Requires an
# EncodedBatch datamodule (cached latents).
# Evaluation mode selector; controls rollout space, not just metrics space.
# Values: auto (default dispatcher) | encode_once | ambient | latent.
# See `autocast/configs/eval/README.md` for the full comparison and the
# auto-dispatch rules; the resolved mode is logged at INFO.
mode: auto

# Dev sense-check: compute metrics directly in raw latent space. Only
# honored with an explicit `eval.mode=latent` and skips the decoder
# entirely. Numbers are not comparable across runs and physics-aware
# metrics are not meaningful -- leave as `false` for real evals.
latent_space_metrics: false

# Evaluation metrics to compute
metrics:
- mse