[BUG] DummyProbaRegressor._predict_var returns std dev (σ) instead of variance (σ²) #975

@maniktyagi04

Summary

DummyProbaRegressor._predict_var returns the standard deviation (σ) of the training labels instead of the variance (σ²). Because σ and σ² differ in both unit and magnitude, any downstream use of predict_var on this estimator produces numerically incorrect results.


Steps to Reproduce

import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from skpro.regression.dummy import DummyProbaRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
X = X.iloc[:50]
y = pd.DataFrame(y.iloc[:50])
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=42)

reg = DummyProbaRegressor(strategy="empirical")
reg.fit(X_train, y_train)

y_pred_var = reg.predict_var(X_test)

expected_variance = np.var(y_train.values)
std_dev            = np.std(y_train.values)

print(f"predict_var returned: {y_pred_var.values[0, 0]:.4f}")
print(f"std dev (wrong):      {std_dev:.4f}")
print(f"variance (correct):   {expected_variance:.4f}")

Expected output:

predict_var returned: 3057.xxxx   # matches variance
std dev (wrong):      55.xxxx
variance (correct):   3057.xxxx

Actual output (buggy):

predict_var returned: 55.xxxx    # matches std dev — WRONG
std dev (wrong):      55.xxxx
variance (correct):   3057.xxxx

Root Cause

In skpro/regression/dummy.py:

# _fit — self._sigma stores std dev, not variance
self._sigma = np.std(y.values)   # ← σ

# _predict_var — BUG: fills with self._sigma (std dev) instead of variance
y_pred = pd.DataFrame(
    np.ones(X_n_rows) * self._sigma,   # ← should be self._sigma**2 or self._var
    index=X_ind, columns=self._y_columns
)

The method _predict_var is documented to return the variance, but it fills the output with self._sigma, which stores the standard deviation. The same bug affects both the 'empirical' and 'normal' strategies.


Expected Behaviour

predict_var should return values equal to np.var(y_train.values), the variance of the training labels, matching the definition in BaseProbaRegressor.


Fix Direction

  1. In _fit, store self._var = np.var(y.values) alongside the existing self._sigma.
  2. In _predict_var, replace self._sigma with self._var.

This is a 2-line fix in a single file (skpro/regression/dummy.py, 182 lines total).
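
The two steps above can be sketched on a minimal stand-in class. MiniDummyRegressor is a hypothetical name used only for illustration; it is not skpro's DummyProbaRegressor, and the real _fit/_predict_var signatures may differ. The sketch assumes a pandas DataFrame y and uses np.var with its default ddof=0, which matches the ddof=0 default of the existing np.std call, so self._var equals self._sigma**2.

```python
import numpy as np
import pandas as pd


class MiniDummyRegressor:
    """Hypothetical stand-in for illustration, NOT skpro's DummyProbaRegressor."""

    def fit(self, X, y):
        self._y_columns = y.columns
        self._sigma = np.std(y.values)  # standard deviation, kept as before
        self._var = np.var(y.values)    # step 1: store the variance explicitly
        return self

    def predict_var(self, X):
        # step 2: fill with the variance, not the standard deviation
        values = np.full((len(X), len(self._y_columns)), self._var)
        return pd.DataFrame(values, index=X.index, columns=self._y_columns)


# Toy data: labels [2, 4, 4, 6] have mean 4 and population variance 2
X = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]})
y = pd.DataFrame({"target": [2.0, 4.0, 4.0, 6.0]})

reg = MiniDummyRegressor().fit(X, y)
pred_var = reg.predict_var(X)
print(pred_var)
```

Storing self._var next to self._sigma, rather than squaring self._sigma at prediction time, keeps any other code that reads self._sigma unchanged.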


Environment

  • skpro version: current main (commit c83950a)
  • Python 3.12

Good First Issue?

Yes, this is a straightforward, beginner-friendly fix:

  • Single file, 2-line change
  • No algorithmic complexity
  • Needs 1 new test to verify correctness
  • Good entry point for understanding the skpro regressor API (_fit/_predict_var pattern)
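
The new test mentioned above could look roughly like the helper below. This is a sketch, not the project's actual test layout: check_predict_var_is_variance is a hypothetical name, and the helper only assumes that the estimator exposes fit and predict_var as used in the reproduction script.

```python
import numpy as np


def check_predict_var_is_variance(reg, X_train, y_train, X_test):
    """Fit reg and assert predict_var equals the training-label variance."""
    reg.fit(X_train, y_train)
    y_pred_var = reg.predict_var(X_test)

    # population variance (ddof=0), as stated under Expected Behaviour
    expected = np.var(y_train.values)

    # one prediction row per test row
    assert len(y_pred_var) == len(X_test)
    # every entry should carry the same constant variance
    np.testing.assert_allclose(y_pred_var.values, expected)


# Intended use, assuming skpro is installed (left as comments so the
# helper itself stays free of that dependency):
#
#   from skpro.regression.dummy import DummyProbaRegressor
#   for strategy in ["empirical", "normal"]:
#       reg = DummyProbaRegressor(strategy=strategy)
#       check_predict_var_is_variance(reg, X_train, y_train, X_test)
```

With the current code this check should fail for any training set whose variance differs from its standard deviation, and pass once _predict_var is filled with the variance.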
