@@ -44,9 +44,11 @@ python -m autocast.scripts.eval.encoder_processor_decoder \
 All eval configs support these parameters:
 
 - `checkpoint`: Path to model checkpoint (required for evaluation)
-- `mode`: Evaluation regime (`auto` | `ambient` | `latent`). Controls the
-  **rollout space**, not just the metrics space. See
-  [Ambient vs latent rollout](#ambient-vs-latent-rollout) below.
+- `mode`: Evaluation regime (`auto` (default) | `encode_once` | `ambient` |
+  `latent`). Controls the **rollout space**, not just the metrics space.
+  `auto` dispatches to a concrete mode at run time based on the checkpoint
+  and datamodule, so omitting the flag gives the fair default for every
+  run. See [Evaluation modes](#evaluation-modes) below.
 - `metrics`: List of metrics to compute (default includes mse/mae/rmse/vrmse,
   power spectrum scores `psrmse*`, cross-correlation spectrum scores `pscc*`,
   and ensemble scores `crps`, `fcrps`, `afcrps`, `energy`, `ssr`; `variogram`
@@ -83,50 +85,122 @@ process so Fabric DDP initialises automatically — no extra flags needed.
 - `max_rollout_steps`: Maximum number of rollout steps
 - `free_running_only`: Whether to disable teacher forcing
 
-## Ambient vs latent rollout
-
-Processor checkpoints trained on cached latents can be evaluated in two
-qualitatively different regimes. The `eval.mode` knob makes the choice
-explicit and surfaces clear errors when the rest of the config is
-inconsistent with the request.
-
-- `eval.mode=auto` (default) preserves historical behavior: the script picks
-  a path based on `(checkpoint type, datamodule batch type,
-  autoencoder_checkpoint)`.
-- `eval.mode=ambient` forces full `encoder -> processor -> decoder` rollout.
-  Each rollout step decodes to ambient fields and re-encodes on the next
-  step, so decode/encode drift is included in the metrics. **This is the
-  apples-to-apples regime for comparing against baselines that natively roll
-  out in data space (e.g. a CRPS comparison against a non-autoencoder
-  model).** Requires `autoencoder_checkpoint=<ae.ckpt>` and a raw-Batch
-  datamodule. When the current datamodule yields `EncodedBatch` (cached
-  latents), eval auto-substitutes the datamodule from
-  `<cache_dir>/autoencoder_config.yaml` saved by `autocast cache-latents`.
-  Pass `datamodule=...` explicitly to override the default.
-- `eval.mode=latent` forces latent-space rollout: the processor's predicted
-  latent is fed back as the next latent input; the encoder is invoked only
-  once. Metrics are decoded to data space via the decoder saved alongside
-  the cached latents when available, otherwise they are reported in latent
-  space. Requires an `EncodedBatch` / cached-latents datamodule.
-
-### Running the ambient ablation
+## Evaluation modes
+
+The `eval.mode` knob controls the **rollout space** and what the metrics
+compare against. The three concrete modes give the same answer on single-
+step (windowed test) metrics; they only diverge during free-running
+rollout. `auto` is a dispatcher that picks one of the concrete modes at
+run time.
+
+| mode | encoder runs | processor rolls out in | decoder runs | ground truth used | when to use |
+| ------------- | ------------------- | ---------------------- | ------------ | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------ |
+| `encode_once` | **once** (step 0) | latent space | per step | raw `batch.output_fields` (denormalized) | fair processor-only eval that avoids decode/encode drift but still scores against real ground truth. |
+| `ambient` | per rollout step | data space (re-encoded each step) | per step | raw `batch.output_fields` (denormalized) | apples-to-apples comparisons with pure-ambient baselines (e.g. CRPS vs. a non-autoencoder model). |
+| `latent` | once (step 0) | latent space | only for metrics (or skipped via `latent_space_metrics=true`) | **decoded cached latents** (autoencoder reconstruction of ground truth) | measure the processor against what the autoencoder sees -- isolates processor error but hides AE reconstruction error. |
+
+### `auto` (default)
+
+`eval.mode=auto` dispatches to the faithful concrete mode for the current
+run:
+
+- **Full EPD checkpoints** (including processor runs with stateless
+  encoder/decoder baked in, e.g. `permute_concat` + `identity`) -> `ambient`.
+  `encode_once` and `ambient` are numerically identical here; `auto`
+  picks `ambient` to keep logs quiet. Passing `eval.mode=encode_once`
+  explicitly on such a run still works but emits a warning.
+- **Processor trained on cached latents + autoencoder available**
+  (either via `autoencoder_checkpoint=<ae.ckpt>` or via
+  `<cache_dir>/autoencoder_config.yaml`) -> `encode_once`. Strictly fairer
+  than `ambient` (no drift penalty) **and** than `latent` (AE
+  reconstruction error is visible against raw ground truth).
+- **Processor trained on cached latents, autoencoder not reachable**
+  -> `latent`. The only faithful option when you can decode but not
+  re-encode. If no decoder can be built either, `auto` does not silently
+  fall through to latent-only metrics -- it fails fast so you either fix
+  the autoencoder path or opt in explicitly via
+  `eval.mode=latent eval.latent_space_metrics=true`.
+
+The resolved mode is logged at INFO as `eval.mode=auto resolved to <X>`.
+
+### Explicit modes
+
+#### Ambient: apples-to-apples with pure-ambient baselines
+
+`eval.mode=ambient` forces full `encoder -> processor -> decoder` at every
+rollout step. The decoded field is re-encoded as the next step's input, so
+autoencoder decode/encode drift compounds into the metrics. This is the
+right regime when the baseline model operates natively in data space and you
+want to charge the autoencoder for any error it introduces. Requires
+`autoencoder_checkpoint=<ae.ckpt>` and a raw-Batch datamodule. When the
+current datamodule yields `EncodedBatch` (cached latents), eval
+auto-substitutes the datamodule from `<cache_dir>/autoencoder_config.yaml`
+saved by `autocast cache-latents` (pass `datamodule=...` explicitly to
+override).
+
+#### Latent: measure the processor against the AE's view of the world
+
+`eval.mode=latent` forces latent-space rollout: the processor's predicted
+latent is fed back as the next latent input and the encoder is never
+invoked past step 0. Metrics are decoded to data space via the decoder
+saved alongside the cached latents and **compared against decoded cached
+latents** -- i.e. an autoencoder reconstruction of ground truth, not the
+raw fields. Use this when you want to isolate the processor's rollout
+quality in its own training distribution and explicitly accept that AE
+reconstruction error is hidden from the metric.
+
+A reachable decoder is required; if the cache directory's
+`autoencoder_config.yaml` or checkpoint is missing the run fails fast
+rather than silently falling back to computing metrics in raw latent
+space (those numbers were never comparable across runs).
+
+##### Dev sense-check: latent-only metrics
+
+Sometimes you want to iterate on a small processor paired with a large /
+expensive autoencoder and skip the decoder entirely. Pass
+`eval.mode=latent eval.latent_space_metrics=true` to opt in:
+
+```bash
+autocast eval --workdir <processor_workdir> \
+  eval.mode=latent \
+  eval.latent_space_metrics=true \
+  eval.checkpoint=<processor.ckpt>
+```
+
+This skips the decoder lookup and compares processor predictions against
+cached latents directly in the autoencoder's raw latent space. Treat the
+numbers as a cheap sanity check only: they are **not comparable across
+runs** (latent space is basis-dependent) and physics-aware metrics
+(`psrmse*`, `pscc*`, `variogram`) are not meaningful. The flag is
+rejected for any other `eval.mode` because the raw-space modes (`auto`,
+`ambient`, `encode_once`) require a decoder by definition.
+
+### Running the ablations
 
 Given an autoencoder checkpoint and a processor checkpoint trained on its
-cached latents, a minimal invocation is:
+cached latents:
 
 ```bash
-# Ambient (encoder -> processor -> decoder at every rollout step)
+# Default: auto -> encode_once here (fair processor-only eval, raw ground truth).
+autocast eval --workdir <processor_workdir> \
+  eval.checkpoint=<processor.ckpt> \
+  autoencoder_checkpoint=<autoencoder.ckpt>
+
+# Apples-to-apples with pure-ambient baselines (charges AE drift).
 autocast eval --workdir <processor_workdir> \
   eval.mode=ambient \
   eval.checkpoint=<processor.ckpt> \
   autoencoder_checkpoint=<autoencoder.ckpt>
 
-# Latent (processor rollout stays in latent space; decoded only for metrics)
+# Processor-only latent view; no raw ground truth, hides AE reconstruction error.
 autocast eval --workdir <processor_workdir> \
   eval.mode=latent \
   eval.checkpoint=<processor.ckpt>
 ```
 
-The ambient run will differ from the latent run by exactly the
-decode/encode drift accumulated over rollout steps, which is the relevant
-delta when comparing against purely-ambient baselines.
+The three runs differ on rollout metrics as follows:
+
+- `ambient - encode_once` = decode/encode drift accumulated over rollout
+  steps (charged to the autoencoder).
+- `encode_once - latent` = visibility of AE reconstruction error against the
+  raw field (absent from `latent`, included in `encode_once`).
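
Reading aid (not part of the diff): the `auto` dispatch rules added in this change can be sketched in Python roughly as below. This is a hypothetical illustration rather than code from the repository. `EvalMode`, `resolve_auto_mode`, and the three boolean parameters are invented names, and the real implementation inspects the checkpoint, datamodule, and cache directory instead of taking flags. Only the branch order and the fail-fast behaviour are taken from the documentation text.

```python
from enum import Enum


class EvalMode(str, Enum):
    """Concrete evaluation modes named in the docs (hypothetical type)."""

    ENCODE_ONCE = "encode_once"
    AMBIENT = "ambient"
    LATENT = "latent"


def resolve_auto_mode(is_full_epd: bool, can_encode: bool, can_decode: bool) -> EvalMode:
    """Map a run's capabilities to a concrete mode, following the documented rules.

    All three parameters are assumptions made for this sketch.
    """
    if is_full_epd:
        # Full EPD checkpoint: ambient and encode_once coincide; docs pick ambient.
        return EvalMode.AMBIENT
    if can_encode and can_decode:
        # Cached-latent processor with a reachable autoencoder.
        return EvalMode.ENCODE_ONCE
    if can_decode:
        # Can decode but not re-encode: latent is the only faithful option.
        return EvalMode.LATENT
    # No decoder at all: fail fast instead of silently using latent-only metrics.
    raise ValueError(
        "auto cannot resolve a mode without a decoder; fix the autoencoder path "
        "or pass eval.mode=latent eval.latent_space_metrics=true"
    )


if __name__ == "__main__":
    mode = resolve_auto_mode(is_full_epd=False, can_encode=True, can_decode=True)
    print(f"eval.mode=auto resolved to {mode.value}")
```

The branch order matters in this sketch: the full-EPD check comes first so that such runs never reach the `encode_once` branch, matching the statement that `auto` picks `ambient` there to keep logs quiet.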