
Update eval scripts and default metrics #335

Merged
sgreenbury merged 11 commits into main from 2026-04-19/eval-scripts on Apr 20, 2026

Conversation

@sgreenbury
Contributor

This pull request introduces a comprehensive set of evaluation scripts and configuration updates to support systematic comparisons between ambient and latent models. It adds six SLURM submission scripts for running evaluations in both ambient and latent rollout modes, expands the set of evaluation metrics across all relevant config files, and provides detailed documentation for running and interpreting these evaluations.

Major changes include:

1. New Evaluation Scripts for SLURM Batch Submission

  • Added six bash scripts under slurm_scripts/comparison/eval/ to automate evaluation of CRPS and FM models in both ambient and cached-latent (latent and ambient rollout) settings. Each script supports dry-run and real submission modes, handles batch size and metrics configuration, and manages AE checkpoint mapping for latent models.
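
  To make the shape of these scripts concrete, below is a minimal sketch of the dry-run/submit pattern they follow. The CLI flags, checkpoint paths, and module invocation shown here are illustrative assumptions, not the actual contents of the scripts under slurm_scripts/comparison/eval/.

  ```bash
  #!/bin/bash
  # Illustrative sketch only: flag names, paths, and checkpoint locations below
  # are assumptions, not the actual script contents.
  set -euo pipefail

  # Dry-run by default; pass "submit" as the first argument to actually sbatch.
  MODE="${1:-dry-run}"

  BATCH_SIZE=16
  METRICS="nmse,nmae,nrmse,vmse,linf,winkler"

  # Hypothetical mapping from latent model runs to their AE checkpoints.
  declare -A AE_CHECKPOINTS=(
    ["fm_latent_run1"]="/checkpoints/ae/run1.ckpt"
    ["crps_latent_run1"]="/checkpoints/ae/run1.ckpt"
  )

  for RUN in "${!AE_CHECKPOINTS[@]}"; do
    CMD="python -m autocast.scripts.eval.encoder_processor_decoder"
    CMD+=" --checkpoint /checkpoints/${RUN}.ckpt --ae-checkpoint ${AE_CHECKPOINTS[$RUN]}"
    CMD+=" --batch-size ${BATCH_SIZE} --metrics ${METRICS} --eval-mode latent_rollout"

    if [[ "${MODE}" == "submit" ]]; then
      sbatch --job-name="eval-${RUN}" --wrap="${CMD}"
    else
      echo "[dry-run] ${CMD}"
    fi
  done
  ```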

2. Documentation for Evaluation Procedures

  • Added a detailed README.md in the evaluation scripts directory. This documentation explains the rationale for the chosen batch sizes, the use of eval.mode, the structure and independence of the scripts, and gives instructions for dry runs and real submission.

3. Expanded Evaluation Metrics in Configurations

  • Updated all relevant evaluation config files (default.yaml, encoder_processor_decoder.yaml, processor.yaml) to include a more comprehensive set of metrics (e.g. nmse, nmae, nrmse, vmse, linf, and winkler), ensuring consistency and richer evaluation outputs across model types and rollout modes.
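
  The metric names above are taken from the PR; as a rough sketch, the corresponding block in a config such as default.yaml might look like the following (the key names and nesting are assumptions, not the actual file contents):

  ```yaml
  # Hypothetical excerpt: key names and nesting are assumptions; only the
  # metric names themselves come from the PR description.
  eval:
    metrics:
      - nmse
      - nmae
      - nrmse
      - vmse
      - linf
      - winkler
  ```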

4. Minor Codebase Improvement

  • Added a missing contextlib import in src/autocast/scripts/eval/encoder_processor_decoder.py to support context management in the evaluation CLI.

These changes collectively enable robust, reproducible, and well-documented evaluation of both ambient and latent models, facilitating apples-to-apples comparison and more informative benchmarking.

sgreenbury merged commit 724223d into main on Apr 20, 2026
3 checks passed
sgreenbury deleted the 2026-04-19/eval-scripts branch on April 20, 2026 at 16:11
