
Add eval.mode selector for ambient vs latent rollout #327

Merged
sgreenbury merged 2 commits into main from add-eval-modes
Apr 20, 2026

sgreenbury commented on Apr 17, 2026

Summary

  • Add an explicit eval.mode config key (auto | ambient | latent, default auto) to autocast eval so users can force the rollout regime used when evaluating an EPD stack against cached latents vs raw data.
  • When eval.mode=ambient is requested but the datamodule yields EncodedBatch (cached latents), auto-swap in the raw-data datamodule saved alongside the cache by autocast cache-latents (<cache_dir>/autoencoder_config.yaml). An explicit datamodule=... override still wins. Passing cached latents for latent / auto continues to work unchanged.
  • Validate loudly: unknown modes error early, and a resolved eval path that doesn't match the requested mode (e.g. ambient but we only have cached latents and no AE checkpoint) raises instead of silently falling back.
  • Add unit tests for mode normalization, path resolution, the validation branches, and the datamodule auto-swap (happy path + missing-config / missing-data-path errors).
  • Add an end-to-end invariant test that EncoderProcessorDecoder.rollout invokes the encoder once per rollout step (via a counting PermuteConcat wrapper). This pins the contract eval.mode=ambient rests on: each step decodes and re-encodes, so decode/encode drift is included in the metrics.
  • Document the new knob in src/autocast/configs/eval/README.md, including an ambient-vs-latent ablation recipe.
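The normalization and validation rules in the list above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: normalize_mode and check_resolved_path are hypothetical names, the error messages are made up, and the whitespace/lower-casing handling is an assumption about what "mode normalization" covers.

```python
from __future__ import annotations

VALID_MODES = ("auto", "ambient", "latent")


def normalize_mode(raw: str | None) -> str:
    """Return a validated mode string; None falls back to the 'auto' default."""
    # Assumption: normalization means trimming whitespace and lower-casing.
    mode = "auto" if raw is None else str(raw).strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown eval.mode {raw!r}; expected one of {VALID_MODES}")
    return mode


def check_resolved_path(requested: str, resolved: str) -> str:
    """Fail loudly when the resolved rollout path contradicts the request.

    `resolved` is the regime the available inputs actually support
    ("ambient" or "latent"); "auto" accepts whichever was resolved.
    """
    if requested != "auto" and requested != resolved:
        raise ValueError(
            f"eval.mode={requested} was requested but the resolved path is "
            f"{resolved}; refusing to fall back silently"
        )
    return resolved
```

With these helpers, check_resolved_path("ambient", "latent") raises, which mirrors the "ambient but only cached latents and no AE checkpoint" case, while check_resolved_path("auto", "latent") simply returns "latent".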

Why

Historically, autocast eval on a processor checkpoint plus cached latents has rolled out entirely in latent space and decoded only at the end for metrics. That isn't apples-to-apples with models that natively roll out in data space (e.g. CRPS baselines), because the decode/encode drift that would accumulate across rollout steps never shows up in the metrics. eval.mode=ambient makes the rollout regime selectable and turns ambient-vs-latent into an easy ablation; auto preserves existing behaviour.
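To make the difference concrete, here is a toy scalar example. It is only a sketch: the encode, decode, and process functions are hypothetical stand-ins with a deliberately imperfect autoencoder, and nothing here uses the real EncoderProcessorDecoder API.

```python
# Toy scalar "EPD" stack; every function here is a hypothetical stand-in.
def encode(x: float) -> float:
    return x / 2.0


def decode(z: float) -> float:
    # Imperfect reconstruction: each decode adds a small bias.
    return 2.0 * z + 0.01


def process(z: float) -> float:
    # One time step in latent space.
    return z + 1.0


def rollout_latent(x0: float, n_steps: int) -> float:
    """Encode once, step in latent space, decode once at the end."""
    z = encode(x0)
    for _ in range(n_steps):
        z = process(z)
    return decode(z)


def rollout_ambient(x0: float, n_steps: int) -> float:
    """Decode and re-encode at every step, so autoencoder error compounds."""
    x = x0
    for _ in range(n_steps):
        x = decode(process(encode(x)))
    return x


latent_final = rollout_latent(0.0, n_steps=3)    # bias enters once
ambient_final = rollout_ambient(0.0, n_steps=3)  # bias enters at every step
```

In this toy setup the latent rollout ends near 6.01 while the ambient rollout ends near 6.03: the reconstruction bias is paid once in the first case and three times in the second, which is the drift a latent-only eval leaves out of the metrics.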

Test plan

  • pytest tests/models/test_encoder_processor_decoder.py tests/scripts/test_eval_encoder_processor_decoder.py (58 passed locally)
  • ruff check, ruff format, pyright via pre-commit
  • CI green on this PR
  • Follow-up: spot-check one real EPD checkpoint with eval.mode=ambient vs latent on a small cached-latents dataset to confirm the two paths produce different numbers (out of scope for this PR)

Processor checkpoints trained on cached latents were evaluated with
rollout happening entirely in latent space, with the decoder applied
only once before metrics. This made comparisons against data-space
baselines (e.g. CRPS against a non-autoencoder model) unfair because
decode/encode drift accumulated across rollout steps was hidden.

Add an explicit eval.mode config (auto | ambient | latent) that
forces the rollout regime and validates the resolved code path
against the requested mode. When ambient is requested on a
cached_latents datamodule, substitute the raw-data datamodule from
autoencoder_config.yaml saved by autocast cache-latents so the
encoder sees matching fields and normalization. Auto remains the
default and preserves historical behavior.
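The datamodule substitution described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the merged code: pick_ambient_datamodule is a hypothetical name, the "datamodule" and "data_path" keys are guesses at the layout of autoencoder_config.yaml, and the YAML loader is injected so the example stays dependency-free.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Callable, Mapping


def pick_ambient_datamodule(
    cache_dir: str | Path,
    explicit_override: Mapping[str, Any] | None,
    load_config: Callable[[Path], Mapping[str, Any]],
) -> Mapping[str, Any]:
    """Choose the datamodule config to use for an ambient rollout.

    `load_config` is any callable that parses a YAML file into a mapping.
    """
    # An explicit datamodule=... override always wins.
    if explicit_override is not None:
        return explicit_override

    config_path = Path(cache_dir) / "autoencoder_config.yaml"
    if not config_path.is_file():
        raise FileNotFoundError(
            f"eval.mode=ambient needs {config_path} to recover the raw-data datamodule"
        )

    config = load_config(config_path)
    # Assumed keys: the real config layout may differ.
    datamodule = config.get("datamodule")
    if not datamodule or not datamodule.get("data_path"):
        raise ValueError(f"{config_path} has no usable raw-data datamodule entry")
    return datamodule
```

Both failure branches raise rather than returning the cached-latents datamodule, in keeping with the "validate loudly" behaviour of this change.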

Pin that EncoderProcessorDecoder.rollout invokes the encoder once per
rollout step by counting encode calls via a wrapped PermuteConcat.
This guards eval.mode=ambient against a silent regression where a
refactor collapses the rollout into a latent-only loop: in that case
ambient and latent eval would report identical numbers, so the
ablation (the whole reason the mode exists) would be meaningless.
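The idea behind the invariant test can be sketched with plain callables. This is not the actual test: CountingEncoder and the two toy rollout functions are hypothetical, and the real test wraps PermuteConcat and calls EncoderProcessorDecoder.rollout instead.

```python
class CountingEncoder:
    """Wrap any encoder-like callable and count how often it is invoked."""

    def __init__(self, inner):
        self.inner = inner
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.inner(x)


def toy_ambient_rollout(x, encoder, processor, decoder, n_steps):
    """Stand-in for the pinned contract: encode, process, decode every step."""
    outputs = []
    for _ in range(n_steps):
        x = decoder(processor(encoder(x)))
        outputs.append(x)
    return outputs


def toy_collapsed_rollout(x, encoder, processor, decoder, n_steps):
    """The regression being guarded against: a latent-only loop."""
    z = encoder(x)
    outputs = []
    for _ in range(n_steps):
        z = processor(z)
        outputs.append(decoder(z))
    return outputs


def step(z):
    return z + 1.0


ambient_encoder = CountingEncoder(lambda x: x * 0.5)
toy_ambient_rollout(1.0, ambient_encoder, step, lambda z: z * 2.0, n_steps=4)

collapsed_encoder = CountingEncoder(lambda x: x * 0.5)
toy_collapsed_rollout(1.0, collapsed_encoder, step, lambda z: z * 2.0, n_steps=4)

# One encode per step in the ambient loop; a single encode if it collapses.
assert ambient_encoder.calls == 4
assert collapsed_encoder.calls == 1
```

Asserting on the call count rather than on output values keeps the check independent of model weights: it fails as soon as the loop structure changes, even if the numbers happen to look plausible.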
sgreenbury merged commit 9830e1d into main on Apr 20, 2026
3 checks passed
sgreenbury deleted the add-eval-modes branch on April 20, 2026 15:35