Add eval.mode selector for ambient vs latent rollout #327
Merged
sgreenbury merged 2 commits into main on Apr 20, 2026
Processor checkpoints trained on cached latents were evaluated with rollout happening entirely in latent space, with the decoder applied only once before metrics. This made comparisons against data-space baselines (e.g. CRPS against a non-autoencoder model) unfair because decode/encode drift accumulated across rollout steps was hidden. Add an explicit eval.mode config (auto | ambient | latent) that forces the rollout regime and validates the resolved code path against the requested mode. When ambient is requested on a cached_latents datamodule, substitute the raw-data datamodule from autoencoder_config.yaml saved by autocast cache-latents so the encoder sees matching fields and normalization. Auto remains the default and preserves historical behavior.
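A minimal sketch of how the mode resolution and validation described above might look. The names `EvalMode`, `resolve_eval_mode`, `yields_encoded_batches` and `has_autoencoder_checkpoint` are hypothetical and not taken from the autocast codebase. The mapping used for auto on raw data (ambient) is also an assumption; the commit message only states that auto preserves historical behavior.

```python
from enum import Enum


class EvalMode(str, Enum):
    """Hypothetical enum mirroring the three eval.mode values."""

    AUTO = "auto"
    AMBIENT = "ambient"
    LATENT = "latent"


def resolve_eval_mode(
    requested: str,
    yields_encoded_batches: bool,
    has_autoencoder_checkpoint: bool,
) -> EvalMode:
    """Map the requested mode onto the rollout regime that will actually run.

    Raises instead of silently falling back when the request cannot be honoured.
    """
    mode = EvalMode(requested)  # unknown strings raise ValueError here

    if mode is EvalMode.AUTO:
        # Assumption: cached latents roll out in latent space (the historical
        # behaviour described above); raw data rolls out in data space.
        return EvalMode.LATENT if yields_encoded_batches else EvalMode.AMBIENT

    if (
        mode is EvalMode.AMBIENT
        and yields_encoded_batches
        and not has_autoencoder_checkpoint
    ):
        # Ambient rollout needs raw fields and an encoder/decoder pair.
        raise ValueError(
            "eval.mode=ambient requested, but only cached latents are "
            "available and no autoencoder checkpoint was found"
        )

    return mode
```

Failing loudly on the ambient-without-autoencoder case is what keeps an ablation from quietly reporting latent numbers under an ambient label.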
Pin that EncoderProcessorDecoder.rollout invokes the encoder once per rollout step by counting encode calls via a wrapped PermuteConcat. This guards eval.mode=ambient against a silent regression where a refactor collapses the rollout into a latent-only loop: in that case ambient and latent eval would report identical numbers, so the ablation (the whole reason the mode exists) would be meaningless.
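The idea behind the pinned test can be illustrated with a toy stand-in. This is a sketch under stated assumptions, not the test added in this PR: the real test wraps PermuteConcat and calls EncoderProcessorDecoder.rollout, whereas `CountingEncoder`, `lossy_encode`, `process`, `decode`, `ambient_rollout` and `latent_rollout` below are all hypothetical scalar placeholders.

```python
import math
from typing import Callable, List


class CountingEncoder:
    """Wrap an encoder callable and count how many times it is invoked."""

    def __init__(self, inner: Callable[[float], float]) -> None:
        self.inner = inner
        self.calls = 0

    def __call__(self, x: float) -> float:
        self.calls += 1
        return self.inner(x)


def lossy_encode(x: float) -> float:
    # Toy lossy encoder: flooring discards information, like a real autoencoder.
    return float(math.floor(x))


def process(z: float) -> float:
    # Toy processor: one step of latent dynamics.
    return z * 1.5


def decode(z: float) -> float:
    # Toy decoder: identity back to data space.
    return z


def ambient_rollout(encode: Callable[[float], float], x0: float, n_steps: int) -> List[float]:
    """Re-encode the decoded state at every step (data-space rollout)."""
    x, outputs = x0, []
    for _ in range(n_steps):
        x = decode(process(encode(x)))
        outputs.append(x)
    return outputs


def latent_rollout(encode: Callable[[float], float], x0: float, n_steps: int) -> List[float]:
    """Encode once, then step purely in latent space and decode for output."""
    z, outputs = encode(x0), []
    for _ in range(n_steps):
        z = process(z)
        outputs.append(decode(z))
    return outputs


ambient_encoder = CountingEncoder(lossy_encode)
ambient = ambient_rollout(ambient_encoder, 3.0, 3)

latent_encoder = CountingEncoder(lossy_encode)
latent = latent_rollout(latent_encoder, 3.0, 3)
```

In this toy, the ambient path calls the encoder once per step and the latent path calls it once in total. The two trajectories agree on the first step and diverge afterwards, which is the drift the ablation is meant to expose. If a refactor collapsed the ambient loop into the latent one, the call count would drop to one and the trajectories would match.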
Summary

- Add an eval.mode config key (auto | ambient | latent, default auto) to autocast eval so users can force the rollout regime used when evaluating an EPD stack against cached latents vs raw data.
- When eval.mode=ambient is requested but the datamodule yields EncodedBatch (cached latents), auto-swap in the raw-data datamodule saved alongside the cache by autocast cache-latents (<cache_dir>/autoencoder_config.yaml). An explicit datamodule=... override still wins. Passing cached latents for latent/auto continues to work unchanged.
- Validate the resolved code path against the requested mode: a mismatch (e.g. ambient but we only have cached latents and no AE checkpoint) raises instead of silently falling back.
- Add a test pinning that EncoderProcessorDecoder.rollout invokes the encoder once per rollout step (via a counting PermuteConcat wrapper). This pins the contract eval.mode=ambient rests on: each step decodes and re-encodes, so decode/encode drift is included in the metrics.
- Document the modes in src/autocast/configs/eval/README.md, including an ambient-vs-latent ablation recipe.

Why

Historically, autocast eval on a processor checkpoint + cached latents rolled out entirely in latent space and only decoded at the end for metrics. That isn't apples-to-apples with models that natively roll out in data space (e.g. CRPS baselines). eval.mode=ambient allows a like-for-like comparison and makes ambient-vs-latent an easy ablation; auto preserves existing behaviour.

Test plan

- pytest tests/models/test_encoder_processor_decoder.py tests/scripts/test_eval_encoder_processor_decoder.py (58 passed locally)
- ruff check, ruff format, pyright via pre-commit
- Run eval.mode=ambient vs latent on a small cached-latents dataset to confirm the two paths produce different numbers (out of scope for this PR)
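The datamodule substitution rule from the Summary (explicit override wins, otherwise ambient on cached latents swaps in the raw-data config saved next to the cache) could be sketched as below. `choose_datamodule_source` and its return labels are hypothetical, and raising when the saved config is missing is an assumption made for illustration. Only the file name autoencoder_config.yaml comes from the PR text.

```python
from pathlib import Path
from typing import Optional, Tuple


def choose_datamodule_source(
    mode: str,
    yields_encoded_batches: bool,
    cache_dir: str,
    explicit_override: Optional[str] = None,
) -> Tuple[str, Optional[str]]:
    """Decide which datamodule config evaluation should use.

    Returns a (kind, value) pair, where kind is one of:
      "override"       -> value is the user-supplied datamodule name
      "raw_from_cache" -> value is the path to the saved autoencoder config
      "as_configured"  -> value is None; keep the datamodule already set up
    """
    if explicit_override is not None:
        # An explicit datamodule override always takes precedence.
        return ("override", explicit_override)

    if mode == "ambient" and yields_encoded_batches:
        # Ambient rollout needs raw fields, so look for the config that was
        # saved alongside the latent cache.
        saved = Path(cache_dir) / "autoencoder_config.yaml"
        if not saved.is_file():
            # Assumption: fail loudly rather than fall back to latents.
            raise FileNotFoundError(f"no saved raw-data config at {saved}")
        return ("raw_from_cache", str(saved))

    # latent/auto on cached latents, or any mode on raw data: no swap needed.
    return ("as_configured", None)
```

Returning a tagged pair rather than a bare path keeps the three outcomes distinguishable for logging, so a run can record which datamodule it actually evaluated on.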