Update eval scripts and default metrics #335
Merged
sgreenbury merged 11 commits into main · Apr 20, 2026
Conversation
This pull request introduces a comprehensive set of evaluation scripts and configuration updates to support systematic comparisons. It adds six SLURM submission scripts for running evaluations in both ambient and latent rollout modes, expands the set of evaluation metrics across all relevant config files, and provides detailed documentation for running and interpreting these evaluations.
Major changes include:
1. New Evaluation Scripts for SLURM Batch Submission

Adds six scripts under `slurm_scripts/comparison/eval/` to automate evaluation of CRPS and FM models in both ambient and cached-latent (latent and ambient rollout) settings. Each script supports dry-run and real submission modes, handles batch size and metrics configuration, and manages AE checkpoint mapping for latent models.
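As a rough illustration of the dry-run/submission pattern these scripts follow, here is a minimal Python sketch. The actual scripts are shell scripts under `slurm_scripts/comparison/eval/`; the eval entry point, the Hydra-style override keys, and the SLURM resource flags below are assumptions made for illustration only.

```python
"""Illustrative sketch only: the real submission scripts are shell scripts;
the entry point, override keys, and SLURM flags here are hypothetical."""
import shlex
import subprocess
import sys


def submit_eval(checkpoint: str, metrics: list[str], batch_size: int, dry_run: bool = True) -> None:
    """Build an sbatch command for one evaluation run and either print or submit it."""
    # Hypothetical eval command with Hydra-style overrides; the keys are illustrative.
    eval_cmd = (
        "python -m autocast.scripts.eval.encoder_processor_decoder "
        f"checkpoint={checkpoint} "
        f"eval.batch_size={batch_size} "
        f"eval.metrics=[{','.join(metrics)}]"
    )
    sbatch_cmd = ["sbatch", "--time=04:00:00", "--gres=gpu:1", f"--wrap={eval_cmd}"]
    if dry_run:
        # Dry-run mode: show the command that would be submitted, but do nothing.
        print("DRY RUN:", " ".join(shlex.quote(part) for part in sbatch_cmd))
        return
    # Real submission: hand the job to SLURM.
    subprocess.run(sbatch_cmd, check=True)


if __name__ == "__main__":
    submit_eval(
        checkpoint="outputs/crps/last.ckpt",
        metrics=["nmse", "nmae", "nrmse", "vmse", "linf", "winkler"],
        batch_size=4,
        dry_run="--submit" not in sys.argv,
    )
```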
2. Documentation for Evaluation Procedures

Adds a `README.md` in the evaluation scripts directory. This documentation explains the rationale for batch sizes, the use of `eval.mode`, the structure and independence of the scripts, and instructions for dry-run and submission.
3. Expanded Evaluation Metrics in Configurations

Updates the evaluation configuration files (`default.yaml`, `encoder_processor_decoder.yaml`, `processor.yaml`) to include a more comprehensive set of metrics (e.g. `nmse`, `nmae`, `nrmse`, `vmse`, `linf`, `winkler`), ensuring consistency and richer evaluation outputs across model types and rollout modes.
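Because the same metrics list now appears in several config files, a small consistency check can be handy. The sketch below assumes the configs live under a `configs/` directory and expose the list under an `eval.metrics` key; both the paths and the key layout are assumptions, not the actual repository layout.

```python
"""Sketch: verify the expanded metrics list is present in every config.
The file paths and the eval.metrics key are assumptions about the layout."""
from pathlib import Path

import yaml

CONFIG_FILES = [
    Path("configs/default.yaml"),
    Path("configs/encoder_processor_decoder.yaml"),
    Path("configs/processor.yaml"),
]

EXPECTED_METRICS = {"nmse", "nmae", "nrmse", "vmse", "linf", "winkler"}


def metrics_in(config_path: Path) -> set[str]:
    """Load one YAML config and return the metrics it declares."""
    with config_path.open() as handle:
        config = yaml.safe_load(handle)
    # Hypothetical location of the metrics list inside each config.
    return set(config.get("eval", {}).get("metrics", []))


if __name__ == "__main__":
    for path in CONFIG_FILES:
        missing = EXPECTED_METRICS - metrics_in(path)
        status = "ok" if not missing else f"missing {sorted(missing)}"
        print(f"{path}: {status}")
```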
4. Minor Codebase Improvement

Adds `import contextlib` in `src/autocast/scripts/eval/encoder_processor_decoder.py` to support context management in the evaluation CLI (a generic sketch of this pattern appears at the end of this description).

These changes collectively enable robust, reproducible, and well-documented evaluation of both ambient and latent models, facilitating apples-to-apples comparison and more informative benchmarking.
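Regarding item 4, the snippet below is a generic sketch of the kind of pattern `contextlib` typically enables in an evaluation CLI: conditionally entering a context manager via `contextlib.nullcontext`. It is not the actual code from `src/autocast/scripts/eval/encoder_processor_decoder.py`, and the profiling use case is an assumption.

```python
"""Generic sketch of the contextlib pattern referenced in item 4; this is not
the code from src/autocast/scripts/eval/encoder_processor_decoder.py."""
import contextlib

import torch


def run_eval_step(model: torch.nn.Module, batch: torch.Tensor, profile: bool = False) -> torch.Tensor:
    """Run one forward pass, optionally under the PyTorch profiler."""
    # Fall back to a no-op context when profiling is disabled.
    profiler = torch.profiler.profile() if profile else contextlib.nullcontext()
    with profiler, torch.no_grad():
        return model(batch)
```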