add Winkler metric by ContiPaolo · Pull Request #310 · alan-turing-institute/autocast

ContiPaolo · 2026-04-02T12:35:40Z

Closes #286 by implementing the Winkler metric.

Summary of Changes

Added the Winkler Interval Score (`WinklerScore`) as a new ensemble metric in `ensemble.py`.
Registered the metric in the metric registry and evaluation scripts.
Added unit tests for correctness and parameter validation.
Documented the metric with explicit shape conventions and mathematical definition.

Winkler Interval Score Definition

The Winkler interval score evaluates the sharpness and coverage of central prediction intervals. For significance level $\alpha \in (0, 1)$, the score for a single point is:

$$W_\alpha = (u - l) + \frac{2}{\alpha}(l - y)\mathbf{1}(y < l) + \frac{2}{\alpha}(y - u)\mathbf{1}(y > u)$$

where:

$l$ is the lower bound of the central $(1-\alpha)$ prediction interval (ensemble quantile at $\alpha/2$)
$u$ is the upper bound (ensemble quantile at $1-\alpha/2$)
$y$ is the observed value

Lower values are better: narrow intervals are rewarded, and misses are penalized in proportion to their distance outside the interval.
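
As a quick arithmetic check with illustrative numbers (chosen here, not taken from the PR), take $\alpha = 0.2$, $l = 2$, $u = 8$. An observation inside the interval pays only the interval width, while one outside also pays the scaled miss distance:

$$y = 5:\quad W_{0.2} = (8 - 2) = 6$$

$$y = 10:\quad W_{0.2} = (8 - 2) + \frac{2}{0.2}(10 - 8) = 6 + 10 \cdot 2 = 26$$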

The implementation supports flexible reduction over spatial/temporal dimensions and returns the mean Winkler score by default.
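
The definition and the default mean reduction can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code merged in this PR: the helper names `quantile`, `winkler_point`, and `winkler_mean` are hypothetical, and the linearly interpolated quantile is an assumption about how ensemble quantiles might be estimated.

```python
from statistics import fmean


def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list (assumed estimator)."""
    position = q * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] + fraction * (sorted_values[high] - sorted_values[low])


def winkler_point(members, y, alpha):
    """Winkler interval score for one observation y and its ensemble members."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in the open interval (0, 1)")
    ordered = sorted(members)
    lower = quantile(ordered, alpha / 2)      # l: ensemble quantile at alpha/2
    upper = quantile(ordered, 1 - alpha / 2)  # u: ensemble quantile at 1 - alpha/2
    score = upper - lower                     # sharpness term (interval width)
    if y < lower:
        score += (2 / alpha) * (lower - y)    # penalty for missing below
    elif y > upper:
        score += (2 / alpha) * (y - upper)    # penalty for missing above
    return score


def winkler_mean(ensembles, observations, alpha=0.1):
    """Mean Winkler score over all points, mirroring a 'mean' reduction."""
    return fmean(winkler_point(m, y, alpha) for m, y in zip(ensembles, observations))


members = list(range(11))  # hypothetical 11-member ensemble: 0, 1, ..., 10
print(winkler_point(members, 5, alpha=0.2))
print(winkler_mean([members, members], [5, 11], alpha=0.2))
```

With members 0 through 10 and $\alpha = 0.2$ the interval is $[1, 9]$, so an observation at 5 scores the width 8, and an observation at 11 scores $8 + 10 \cdot 2 = 28$.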

Copilot

Pull request overview

Adds a new ensemble evaluation metric (Winkler Interval Score) to quantify prediction interval sharpness/coverage tradeoffs, integrating it into the metrics registry, evaluation CLI defaults, and unit tests.

Changes:

Implemented WinklerScore as a new BTSCMMetric ensemble metric based on ensemble quantiles.
Registered the metric in autocast.metrics exports and the encoder/processor/decoder eval script metric registry/default list.
Added unit tests covering a hand-computed reference case and alpha parameter validation.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `src/autocast/metrics/ensemble.py` | Adds `WinklerScore` implementation and its docstring/shape conventions. |
| `src/autocast/metrics/__init__.py` | Exports and registers `WinklerScore` in `ALL_ENSEMBLE_METRICS`. |
| `src/autocast/scripts/eval/encoder_processor_decoder.py` | Makes `winkler` available in the eval CLI and adds it to default metrics. |
| `tests/metrics/test_ensemble.py` | Adds correctness and input-validation tests for `WinklerScore`. |

ContiPaolo · 2026-04-02T12:53:50Z

@sgreenbury Comments on computational cost: Winkler (#286) vs. Variogram (#285)

The Variogram score can become computationally and memory intensive when applied to the full $(B, T, S, C)$ tensor, because it computes all pairwise differences across the selected vector dimensions, leading to $O(D^2)$ scaling (where $D$ is the flattened spatial/temporal/channel size).
The Winkler score does not suffer from this problem. It only computes two quantiles (lower/upper) along the ensemble axis $M$, then applies elementwise operations to produce a tensor of shape $(B, T, S..., C)$. There are no pairwise operations across spatial/temporal/channel axes.
The only scenario where Winkler can produce numerically large values is if $\alpha$ is set very small (since penalties scale as $2/\alpha$), but this does not affect memory or computational scaling.
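
The scaling argument can be made concrete with a vectorized sketch. It assumes NumPy as a stand-in for the project's tensor library and an ensemble laid out as $(B, T, S, C, M)$ with the member axis last; the function name `winkler_elementwise` and the array sizes are hypothetical.

```python
import numpy as np


def winkler_elementwise(ensemble, truth, alpha=0.1):
    """Winkler score per grid point; `ensemble` is assumed to have the member axis last."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    # Two quantiles along the ensemble axis: result has shape (2, B, T, S, C).
    lower, upper = np.quantile(ensemble, [alpha / 2, 1 - alpha / 2], axis=-1)
    width = upper - lower
    below = np.clip(lower - truth, 0.0, None)  # nonzero only where y < l
    above = np.clip(truth - upper, 0.0, None)  # nonzero only where y > u
    return width + (2.0 / alpha) * (below + above)


rng = np.random.default_rng(0)
B, T, S, C, M = 2, 3, 16, 1, 50  # illustrative sizes only
ensemble = rng.normal(size=(B, T, S, C, M))
truth = rng.normal(size=(B, T, S, C))

scores = winkler_elementwise(ensemble, truth, alpha=0.1)
D = T * S * C  # flattened temporal/spatial/channel size per sample
print(scores.shape)   # same shape as `truth`: no pairwise axis is created
print(scores.mean())  # scalar after a mean reduction
print(D, D * D)       # elementwise terms vs. pairwise terms per sample
```

The intermediate arrays never exceed the size of the input ensemble, whereas a pairwise construction over the flattened dimensions would need on the order of $D^2$ terms per sample.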

sgreenbury

LGTM!

add Winkler metric

6073dd2

Copilot AI review requested due to automatic review settings April 2, 2026 12:35

Copilot started reviewing on behalf of ContiPaolo April 2, 2026 12:36

Copilot AI reviewed Apr 2, 2026

Comment thread src/autocast/metrics/ensemble.py

clean documentation

1d6a083

sgreenbury approved these changes Apr 2, 2026

ContiPaolo merged commit 45d9ddc into main Apr 2, 2026
3 checks passed

ContiPaolo deleted the 286-winkler branch April 2, 2026 14:13

ContiPaolo merged 2 commits into main from 286-winkler

3 participants
