Add encode_once eval mode and auto dispatcher #339
Merged
sgreenbury merged 3 commits into main on Apr 21, 2026
Introduce a third eval path that encodes once, rolls out in latent space, then decodes against raw denormalized ground truth. This isolates processor error from autoencoder encode/decode drift while still scoring against real (not AE-reconstructed) truth. eval.mode now accepts auto|ambient|encode_once|latent, defaulting to auto. The auto dispatcher resolves to ambient for full EPD checkpoints, encode_once for processor-only runs that can build a decoder, and latent otherwise. encode_once on an EPD run is aliased back to ambient with a warning since the decoder is already in the loop.
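The dispatch rules in this commit message can be sketched as a small pure function. This is an illustrative simplification, not the code in the PR: the function name `resolve_eval_mode` and the two boolean inputs `is_full_epd` and `can_build_decoder` are assumptions made for the example, whereas the real dispatcher is tested against a (processor_only, batch_type, ae_ckpt) matrix.

```python
import logging
import warnings

logger = logging.getLogger(__name__)

EVAL_MODES = ("auto", "ambient", "encode_once", "latent")


def resolve_eval_mode(mode: str, is_full_epd: bool, can_build_decoder: bool) -> str:
    """Resolve a requested eval mode to a concrete one (hypothetical helper)."""
    if mode not in EVAL_MODES:
        raise ValueError(f"Unknown eval.mode {mode!r}; expected one of {EVAL_MODES}")

    if mode == "auto":
        if is_full_epd:
            resolved = "ambient"  # full EPD checkpoint: decoder already in the loop
        elif can_build_decoder:
            resolved = "encode_once"  # processor-only run with a usable decoder
        else:
            resolved = "latent"  # nothing to decode with
        logger.info("eval.mode=auto resolved to %s", resolved)
        return resolved

    if mode == "encode_once" and is_full_epd:
        # No separate latent rollout exists to isolate, so alias to ambient.
        warnings.warn(
            "encode_once on a full EPD run is aliased to ambient", stacklevel=2
        )
        return "ambient"

    return mode
```

Explicitly requested concrete modes pass through unchanged apart from the `encode_once` alias, so only `auto` depends on what the run can build.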
Previously eval.mode=latent silently fell back to computing metrics directly in the autoencoder's raw latent space whenever no decoder could be built from the cached-latents directory. Those numbers look like evaluation results but are not comparable across runs (latent space is basis-dependent) and physics-aware metrics are not meaningful there. Remove the silent fallback. If eval.mode=latent resolves to the latent-only path, fail fast with a message pointing at the new opt-in flag eval.latent_space_metrics (default false). Set it to true alongside eval.mode=latent to skip the decoder entirely as a cheap dev sense-check for iterating on a small processor paired with an expensive autoencoder; a prominent warning flags the caveat. The flag is rejected for auto / ambient / encode_once because those modes require a decoder by definition.
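A minimal sketch of the fail-fast and opt-in behaviour described above, assuming a single hypothetical helper `check_latent_space_metrics`. The PR's test plan names two separate private functions for this logic. The signature, exception types, and message wording here are assumptions for illustration only.

```python
import warnings


def check_latent_space_metrics(
    mode: str, latent_space_metrics: bool, has_decoder: bool
) -> bool:
    """Return True when metrics should be computed in raw latent space.

    Hypothetical helper: `mode` is the requested eval.mode and `has_decoder`
    says whether a decoder could be built for this run.
    """
    if latent_space_metrics:
        if mode != "latent":
            # auto / ambient / encode_once require a decoder by definition.
            raise ValueError(
                "eval.latent_space_metrics=true is only valid with eval.mode=latent"
            )
        warnings.warn(
            "Metrics are computed in raw latent space; they are not comparable "
            "across runs and physics-aware metrics are not meaningful.",
            stacklevel=2,
        )
        return True

    if mode == "latent" and not has_decoder:
        # Previously a silent fallback; now an explicit failure.
        raise RuntimeError(
            "eval.mode=latent could not build a decoder. Fix the autoencoder "
            "path or set eval.latent_space_metrics=true for a dev sense-check."
        )
    return False
```

Keeping the opt-in check ahead of the decoder check means a misplaced flag is reported even on runs where a decoder happens to be available.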
Collapse narrative comments and docstrings that duplicated the eval README, inline the resolve-auto branches, and drop the unused RESOLVABLE_EVAL_MODES constant. Shorten user-facing error and warning strings to describe the current behaviour only. Fold two parametrized no-op tests into the nearest behavioural test so coverage is preserved while the suite is less noisy.
Summary
Add a new evaluation mode, `encode_once`, that gives a fair apples-to-apples comparison for processors trained in latent space. The processor rolls out entirely in its native latent space (no decode/encode drift charged to it), but metrics are computed against the original raw denormalized ground truth -- so latent-rollout models can be scored directly against pure-ambient baselines without either side getting an unfair penalty or advantage.

- New `encode_once` mode: encoder runs once on raw inputs, processor rolls out in latent space, decoder runs per step; metrics compare decoded predictions against denormalized raw `batch.output_fields`.
- `eval.mode=auto` (new default) dispatches to the faithful concrete mode per run:
  - `ambient` for full EPD checkpoints
  - `encode_once` for processor-only runs that can build a decoder
  - `latent` otherwise

  Resolved mode is logged at INFO.
- `encode_once` on a full EPD aliases back to `ambient` with a warning (no separate latent rollout to isolate).
- `eval.latent_space_metrics` flag replaces the previous silent "no decoder, compute in raw latent space" fallback. When `eval.mode=latent` cannot build a decoder the run now fails fast and asks the user to either fix the AE path or set `eval.latent_space_metrics=true` for a dev sense check. Rejected for `auto`/`ambient`/`encode_once` (those modes require a decoder by definition).
- Docs, docstrings, and test parametrizations streamlined in a follow-up refactor commit.

See `src/autocast/configs/eval/README.md` for the full mode table and auto-dispatch rules.

Test plan

- `ruff check` + `ruff format --check` clean
- `pyright` clean
- `pytest tests/scripts/test_eval_encoder_processor_decoder.py` -- 60 passed, including new coverage for:
  - `auto` dispatch across the (processor_only, batch_type, ae_ckpt) matrix
  - `encode_once` path resolution and `_resolve_eval_path` matrix
  - `_validate_latent_space_metrics_flag` rejection on auto/ambient/encode_once
  - `_require_decoder_unless_latent_metrics_opt_in` fail-fast vs. opt-in warning paths
  - `_maybe_swap_to_ambient_datamodule` swap for `encode_once`
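For readers unfamiliar with the `encode_once` data flow described in the summary, here is a toy, dependency-free sketch. Everything in it is an assumption for illustration: the function `encode_once_rollout`, the stand-in encoder, processor, and decoder, and the use of plain lists in place of tensors. It only shows the shape of the loop (encode once, step in latent space, decode per step, score against raw truth), not the project's implementation.

```python
from typing import Callable, List

State = List[float]


def encode_once_rollout(
    x0: State,
    encode: Callable[[State], State],
    process: Callable[[State], State],
    decode: Callable[[State], State],
    n_steps: int,
) -> List[State]:
    """Encode once, roll out in latent space, decode each step for scoring."""
    z = encode(x0)  # the encoder runs exactly once, on the raw input
    predictions = []
    for _ in range(n_steps):
        z = process(z)  # the latent state is never re-encoded between steps
        predictions.append(decode(z))  # decoded copy is used for metrics only
    return predictions


def mse(pred: State, truth: State) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


# Toy stand-ins: a lossless "autoencoder" and a constant-shift processor.
def toy_encode(x: State) -> State:
    return [v / 2.0 for v in x]


def toy_decode(z: State) -> State:
    return [v * 2.0 for v in z]


def toy_process(z: State) -> State:
    return [v + 0.5 for v in z]


x0 = [0.0, 2.0]
raw_truth = [[1.0, 3.0], [2.0, 4.0]]  # raw (not AE-reconstructed) targets
preds = encode_once_rollout(x0, toy_encode, toy_process, toy_decode, n_steps=2)
errors = [mse(p, t) for p, t in zip(preds, raw_truth)]
```

Because the decoded output is never fed back through the encoder, any error in `errors` comes from the processor and a single encode/decode pass rather than from repeated round trips.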