Add encode_once eval mode and auto dispatcher #339

Merged
sgreenbury merged 3 commits into main from 2026-04-20/encode-once-eval
Apr 21, 2026

@sgreenbury sgreenbury commented Apr 21, 2026

Summary

Add a new evaluation mode, encode_once, that gives a fair apples-to-apples comparison for processors trained in latent space. The processor rolls out entirely in its native latent space (no decode/encode drift charged to it), but metrics are computed against the original raw denormalized ground truth -- so latent-rollout models can be scored directly against pure-ambient baselines without either side getting an unfair penalty or advantage.
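The rollout described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code from this PR: `encode_once_rollout`, `mse`, and the toy encoder, processor, and decoder are all hypothetical stand-ins, and plain Python lists take the place of tensors.

```python
from typing import Callable, List

Vector = List[float]
Fn = Callable[[Vector], Vector]


def encode_once_rollout(
    encoder: Fn, processor: Fn, decoder: Fn, raw_input: Vector, n_steps: int
) -> List[Vector]:
    """Hypothetical sketch: encode once, roll out in latent space, decode per step."""
    z = encoder(raw_input)  # the only encoder call in the whole rollout
    predictions: List[Vector] = []
    for _ in range(n_steps):
        z = processor(z)  # the latent state is never decoded and re-encoded
        predictions.append(decoder(z))  # the decoded copy is used for metrics only
    return predictions


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between one decoded prediction and one raw target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


# Toy stand-ins (assumptions for illustration only).
encoder_calls: List[int] = []


def toy_encoder(x: Vector) -> Vector:
    encoder_calls.append(1)  # record each call so the "once" claim can be checked
    return [v / 2.0 for v in x]


def toy_processor(z: Vector) -> Vector:
    return [v + 1.0 for v in z]


def toy_decoder(z: Vector) -> Vector:
    return [v * 2.0 for v in z]


preds = encode_once_rollout(
    toy_encoder, toy_processor, toy_decoder, raw_input=[2.0, 4.0], n_steps=2
)

# Metrics are taken against raw ground truth, not AE-reconstructed truth.
raw_truth = [[4.0, 6.0], [6.0, 9.0]]
per_step_mse = [mse(p, t) for p, t in zip(preds, raw_truth)]
```

In this sketch the processor's state `z` only ever passes through `processor`, so any encode/decode round-trip error stays out of the rollout, while the score is still taken in the raw (ambient) space.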

  • New encode_once mode: encoder runs once on raw inputs, processor rolls out in latent space, decoder runs per step; metrics compare decoded predictions against denormalized raw batch.output_fields.

  • eval.mode=auto (new default) dispatches to the faithful concrete mode per run:

    • full EPD / stateless AE -> ambient
    • processor-only + autoencoder reachable -> encode_once
    • processor-only + cached latents, no AE -> latent

    Resolved mode is logged at INFO. encode_once on a full EPD aliases back to ambient with a warning (no separate latent rollout to isolate).

  • eval.latent_space_metrics flag replaces the previous silent "no decoder, compute in raw latent space" fallback. When eval.mode=latent cannot build a decoder, the run now fails fast and asks the user to either fix the AE path or set eval.latent_space_metrics=true for a dev sense check. The flag is rejected for auto/ambient/encode_once (those modes require a decoder by definition).

  • Docs, docstrings, and test parametrizations streamlined in a follow-up refactor commit.

See src/autocast/configs/eval/README.md for the full mode table and auto-dispatch rules.
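As a rough sketch of the auto-dispatch rules listed in the summary, the decision can be written as a small pure function. The names here (`EvalMode`, `resolve_eval_mode`, and the two boolean inputs) are hypothetical and simplify the real inputs, which the test plan describes as a (processor_only, batch_type, ae_ckpt) matrix; the README remains the authoritative reference.

```python
import logging
from enum import Enum

logger = logging.getLogger(__name__)


class EvalMode(str, Enum):
    AUTO = "auto"
    AMBIENT = "ambient"
    ENCODE_ONCE = "encode_once"
    LATENT = "latent"


def resolve_eval_mode(
    requested: str, *, processor_only: bool, autoencoder_reachable: bool
) -> EvalMode:
    """Hypothetical sketch of the dispatch rules; not the PR's implementation."""
    mode = EvalMode(requested)  # raises ValueError for an unknown mode string

    if mode is EvalMode.AUTO:
        if not processor_only:
            resolved = EvalMode.AMBIENT  # full EPD: decoder is already in the loop
        elif autoencoder_reachable:
            resolved = EvalMode.ENCODE_ONCE  # processor-only, AE can be loaded
        else:
            resolved = EvalMode.LATENT  # processor-only, cached latents, no AE
        logger.info("eval.mode=auto resolved to %s", resolved.value)
        return resolved

    if mode is EvalMode.ENCODE_ONCE and not processor_only:
        # No separate latent rollout to isolate on a full EPD checkpoint.
        logger.warning("encode_once on a full EPD run; using ambient instead")
        return EvalMode.AMBIENT

    return mode
```

Keeping the resolution as a pure function of a few flags is one way to make the dispatch matrix easy to parametrize in tests.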

Test plan

  • ruff check + ruff format --check clean
  • pyright clean
  • pytest tests/scripts/test_eval_encoder_processor_decoder.py -- 60 passed, including new coverage for
    • auto dispatch across the (processor_only, batch_type, ae_ckpt) matrix
    • encode_once path resolution and _resolve_eval_path matrix
    • _validate_latent_space_metrics_flag rejection on auto/ambient/encode_once
    • _require_decoder_unless_latent_metrics_opt_in fail-fast vs. opt-in warning paths
    • _maybe_swap_to_ambient_datamodule swap for encode_once

Introduce a third eval path that encodes once, rolls out in latent space,
then decodes against raw denormalized ground truth. This isolates
processor error from autoencoder encode/decode drift while still scoring
against real (not AE-reconstructed) truth.

eval.mode now accepts auto|ambient|encode_once|latent, defaulting to auto.
The auto dispatcher resolves to ambient for full EPD checkpoints,
encode_once for processor-only runs that can build a decoder, and latent
otherwise. encode_once on an EPD run is aliased back to ambient with a
warning since the decoder is already in the loop.

Previously eval.mode=latent silently fell back to computing metrics
directly in the autoencoder's raw latent space whenever no decoder could
be built from the cached-latents directory. Those numbers look like
evaluation results but are not comparable across runs (latent space is
basis-dependent) and physics-aware metrics are not meaningful there.

Remove the silent fallback. If eval.mode=latent resolves to the latent-
only path, fail fast with a message pointing at the new opt-in flag
eval.latent_space_metrics (default false). Set it to true alongside
eval.mode=latent to skip the decoder entirely as a cheap dev sense-check
for iterating on a small processor paired with an expensive autoencoder;
a prominent warning flags the caveat. The flag is rejected for auto /
ambient / encode_once because those modes require a decoder by
definition.
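
The fail-fast and opt-in behaviour described in this commit message could look roughly like the sketch below. The function names and signatures are assumptions made for illustration (the helpers named in the test plan are private and their signatures are not shown here), and the message strings are placeholders rather than the PR's actual wording.

```python
import warnings
from typing import Optional


def validate_latent_space_metrics_flag(mode: str, latent_space_metrics: bool) -> None:
    """Hypothetical check: the opt-in flag only makes sense with eval.mode=latent."""
    if latent_space_metrics and mode != "latent":
        raise ValueError(
            f"eval.latent_space_metrics=true is only valid with eval.mode=latent, "
            f"got eval.mode={mode}"
        )


def use_latent_space_metrics(
    decoder: Optional[object], latent_space_metrics: bool
) -> bool:
    """Hypothetical helper: return True if metrics should skip the decoder.

    Fails fast when no decoder is available and the user has not opted in.
    """
    if decoder is not None:
        return False  # normal path: decode, then score in raw space
    if not latent_space_metrics:
        raise RuntimeError(
            "No decoder could be built. Fix the autoencoder path or set "
            "eval.latent_space_metrics=true for a dev sense check."
        )
    warnings.warn(
        "Computing metrics in raw latent space; results are not comparable "
        "across runs.",
        stacklevel=2,
    )
    return True
```

Raising by default and warning on opt-in keeps a dev-only shortcut available without letting latent-space numbers pass for comparable evaluation results.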

Collapse narrative comments and docstrings that duplicated the eval
README, inline the resolve-auto branches, and drop the unused
RESOLVABLE_EVAL_MODES constant. Shorten user-facing error and warning
strings to describe the current behaviour only. Fold two parametrized
no-op tests into the nearest behavioural test so coverage is preserved
while the suite is less noisy.
@sgreenbury sgreenbury merged commit de86747 into main Apr 21, 2026
3 checks passed
@sgreenbury sgreenbury deleted the 2026-04-20/encode-once-eval branch April 21, 2026 16:09