add Winkler metric #310

Merged
ContiPaolo merged 2 commits into main from 286-winkler
Apr 2, 2026

@ContiPaolo ContiPaolo commented Apr 2, 2026

Closes #286 by implementing the Winkler metric.

Summary of Changes

  • Added Winkler Interval Score (WinklerScore) as a new ensemble metric in ensemble.py.
  • Registered the metric in the metric registry and evaluation scripts.
  • Added unit tests for correctness and parameter validation.
  • Documented the metric with explicit shape conventions and mathematical definition.

Winkler Interval Score Definition

The Winkler interval score jointly evaluates the sharpness and coverage of central prediction intervals. For a significance level $\alpha \in (0, 1)$, the score for a single observation is:

$$W_\alpha = (u - l) + \frac{2}{\alpha}(l - y)\mathbf{1}(y < l) + \frac{2}{\alpha}(y - u)\mathbf{1}(y > u)$$

where:

  • $l$ is the lower bound of the central $(1-\alpha)$ prediction interval (ensemble quantile at $\alpha/2$)
  • $u$ is the upper bound (ensemble quantile at $1-\alpha/2$)
  • $y$ is the observed value

Lower values are better: narrow intervals are rewarded, and misses are penalized in proportion to their distance outside the interval.
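The definition above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the `WinklerScore` class added in this PR: the function name `winkler_score`, the `ensemble_axis` parameter, and the use of `np.quantile` with its default linear interpolation are all assumptions made for the example.

```python
import numpy as np


def winkler_score(ensemble, y, alpha=0.1, ensemble_axis=-1):
    """Per-point Winkler interval score (hypothetical helper, not the PR's API).

    ensemble: array of ensemble members; `ensemble_axis` indexes the members.
    y:        observations, broadcastable against `ensemble` with that axis removed.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")

    # Bounds of the central (1 - alpha) interval from ensemble quantiles.
    lower = np.quantile(ensemble, alpha / 2, axis=ensemble_axis)
    upper = np.quantile(ensemble, 1 - alpha / 2, axis=ensemble_axis)

    # Interval width plus penalties that are non-zero only for misses:
    # clip(l - y, 0) equals (l - y) * 1(y < l), and likewise for the upper side.
    below = np.clip(lower - y, 0.0, None)
    above = np.clip(y - upper, 0.0, None)
    return (upper - lower) + (2.0 / alpha) * (below + above)


# Three identical 5-member ensembles; with alpha = 0.5 the interval is [1, 3].
ens = np.tile(np.arange(5.0), (3, 1))   # shape (3, 5), members on the last axis
obs = np.array([2.0, 0.0, 5.0])         # inside, below, above the interval
print(winkler_score(ens, obs, alpha=0.5))  # [ 2.  6. 10.]
```

The observation inside the interval pays only the width (2), while the two misses pay the width plus $2/\alpha = 4$ times their distance outside the interval.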

The implementation supports flexible reduction over spatial/temporal dimensions and returns the mean Winkler score by default.
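As a rough illustration of what reducing per-point scores can look like, the sketch below averages a score tensor over chosen axes. The helper `reduce_winkler`, its `reduce_axes` argument, and the $(B, T, S, C)$ layout are assumptions for this example; the actual reduction options are those documented in `ensemble.py`.

```python
import numpy as np


def reduce_winkler(scores, reduce_axes=None):
    """Average per-point scores over `reduce_axes` (None averages everything).

    Hypothetical helper; the real metric's reduction arguments may differ.
    """
    return np.mean(scores, axis=reduce_axes)


# Stand-in per-point scores with shape (B, T, S, C) = (2, 3, 4, 5).
scores = np.arange(2 * 3 * 4 * 5, dtype=float).reshape(2, 3, 4, 5)

overall = reduce_winkler(scores)                 # scalar mean over all axes
per_channel = reduce_winkler(scores, (0, 1, 2))  # shape (C,)
per_sample = reduce_winkler(scores, (1, 2))      # shape (B, C): mean over time and space
```

Keeping the batch or channel axis makes it possible to see whether a poor mean score is driven by a few samples or a single variable.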

Copilot AI review requested due to automatic review settings April 2, 2026 12:35

Copilot AI left a comment

Pull request overview

Adds a new ensemble evaluation metric (Winkler Interval Score) to quantify prediction interval sharpness/coverage tradeoffs, integrating it into the metrics registry, evaluation CLI defaults, and unit tests.

Changes:

  • Implemented WinklerScore as a new BTSCMMetric ensemble metric based on ensemble quantiles.
  • Registered the metric in autocast.metrics exports and the encoder/processor/decoder eval script metric registry/default list.
  • Added unit tests covering a hand-computed reference case and alpha parameter validation.
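To illustrate what a hand-computed reference case can look like, here is a worked example with made-up numbers (they are not taken from the PR's tests). Assume $\alpha = 0.2$, so $2/\alpha = 10$, and interval bounds $l = 2$, $u = 8$:

$$
\begin{aligned}
y = 5 \ (\text{inside}): \quad & W_{0.2} = (8 - 2) = 6 \\
y = 9.5 \ (\text{above}): \quad & W_{0.2} = (8 - 2) + 10\,(9.5 - 8) = 6 + 15 = 21 \\
y = 1 \ (\text{below}): \quad & W_{0.2} = (8 - 2) + 10\,(2 - 1) = 6 + 10 = 16
\end{aligned}
$$

Giving the bounds directly keeps the arithmetic independent of the quantile interpolation rule; a test built from raw ensemble members would also need to pin down that rule.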

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/autocast/metrics/ensemble.py | Adds WinklerScore implementation and its docstring/shape conventions. |
| src/autocast/metrics/__init__.py | Exports and registers WinklerScore in ALL_ENSEMBLE_METRICS. |
| src/autocast/scripts/eval/encoder_processor_decoder.py | Makes winkler available in the eval CLI and adds it to default metrics. |
| tests/metrics/test_ensemble.py | Adds correctness and input-validation tests for WinklerScore. |

Comment thread src/autocast/metrics/ensemble.py
@ContiPaolo

@sgreenbury Comments on computational cost: Winkler (#286) vs Variogram (#285)

  • The Variogram score can become computationally and memory intensive when applied to the full $(B, T, S, C)$ tensor, because it computes all pairwise differences across the selected vector dimensions, leading to $O(D^2)$ scaling (where $D$ is the flattened spatial/temporal/channel size).
  • The Winkler score does not suffer from this problem. It only computes two quantiles (lower/upper) along the ensemble axis $M$, then applies elementwise operations to produce a tensor of shape $(B, T, S..., C)$. There are no pairwise operations across spatial/temporal/channel axes.
  • A very small $\alpha$ can make Winkler values numerically large when observations miss the interval (penalties scale as $2/\alpha$), but this affects only the magnitude of the score, not memory or computational scaling.
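The scaling argument in the points above can be made concrete with a back-of-the-envelope element count. This is only a counting sketch under stated assumptions: the tensor shape is hypothetical, the function names are invented for illustration, and it assumes a naive variogram implementation that materialises every pairwise difference at once (a chunked implementation would use less memory).

```python
from math import prod


def pairwise_elements(shape, vector_axes):
    """Entries needed to hold all pairwise differences over `vector_axes`.

    D is the flattened size of the vector axes; each remaining index
    (e.g. each batch element) needs a D x D array.
    """
    d = prod(shape[a] for a in vector_axes)
    rest = prod(s for a, s in enumerate(shape) if a not in vector_axes)
    return rest * d * d


def elementwise_elements(shape):
    """Entries in one elementwise result of the same shape, i.e. O(D) per batch item."""
    return prod(shape)


# Hypothetical (B, T, S, C) shape: 4 samples, 10 steps, 64 * 64 grid points, 3 channels.
shape = (4, 10, 64 * 64, 3)
print(pairwise_elements(shape, vector_axes=(1, 2, 3)))  # grows as D**2
print(elementwise_elements(shape))                      # grows as D
```

For this shape $D = 10 \cdot 4096 \cdot 3 = 122{,}880$, so the pairwise count exceeds the elementwise count by a factor of $D$ per sample, which is why only the Variogram score needs care on full tensors.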


@sgreenbury sgreenbury left a comment

LGTM!

@ContiPaolo ContiPaolo merged commit 45d9ddc into main Apr 2, 2026
3 checks passed
@ContiPaolo ContiPaolo deleted the 286-winkler branch April 2, 2026 14:13

Development

Successfully merging this pull request may close these issues.

Add metric for capturing interval sharpness

3 participants