Summary
DummyProbaRegressor._predict_var returns the standard deviation (σ) of the training labels instead of the variance (σ²). Because σ and σ² differ in both units and magnitude, any downstream use of predict_var on this estimator produces numerically incorrect results.
Steps to Reproduce
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from skpro.regression.dummy import DummyProbaRegressor
X, y = load_diabetes(return_X_y=True, as_frame=True)
X = X.iloc[:50]
y = pd.DataFrame(y.iloc[:50])
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=42)
reg = DummyProbaRegressor(strategy="empirical")
reg.fit(X_train, y_train)
y_pred_var = reg.predict_var(X_test)
expected_variance = np.var(y_train.values)
std_dev = np.std(y_train.values)
print(f"predict_var returned: {y_pred_var.values[0, 0]:.4f}")
print(f"std dev (wrong): {std_dev:.4f}")
print(f"variance (correct): {expected_variance:.4f}")
Expected output:
predict_var returned: 3057.xxxx # matches variance
std dev (wrong): 55.xxxx
variance (correct): 3057.xxxx
Actual output (buggy):
predict_var returned: 55.xxxx # matches std dev — WRONG
std dev (wrong): 55.xxxx
variance (correct): 3057.xxxx
Root Cause
In skpro/regression/dummy.py:
# _fit — self._sigma stores std dev, not variance
self._sigma = np.std(y.values) # ← σ
# _predict_var — BUG: fills with self._sigma (std dev) instead of variance
y_pred = pd.DataFrame(
    np.ones(X_n_rows) * self._sigma,  # ← should be self._sigma**2 or self._var
    index=X_ind, columns=self._y_columns
)
The method _predict_var is documented to return variance, but uses self._sigma which stores the standard deviation. The same bug affects both 'empirical' and 'normal' strategies.
Expected Behaviour
predict_var should return values equal to np.var(y_train.values) — the variance of the training labels — matching the definition in BaseProbaRegressor.
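One detail worth pinning down when comparing against this expected value: np.var uses ddof=0 by default, while pandas' var defaults to ddof=1, so the two only agree when ddof is set explicitly. A minimal sketch with made-up labels (not the diabetes data) illustrates both this and the σ versus σ² gap:

```python
import numpy as np
import pandas as pd

# Made-up labels for illustration only; not taken from the diabetes dataset.
y = pd.DataFrame({"target": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

sigma = np.std(y.values)  # population std dev (ddof=0), equals 2.0 for this data
var_np = np.var(y.values)  # population variance (ddof=0), equals 4.0 for this data

var_pd_default = y["target"].var()  # pandas default is ddof=1 (sample variance)
var_pd_ddof0 = y["target"].var(ddof=0)  # population variance, agrees with np.var

print(f"std dev:               {sigma:.4f}")
print(f"np.var:                {var_np:.4f}")
print(f"pandas var (ddof=1):   {var_pd_default:.4f}")
print(f"pandas var (ddof=0):   {var_pd_ddof0:.4f}")
```

A check written against y_train.var() rather than np.var(y_train.values) would therefore fail even after the fix, unless ddof=0 is passed.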
Fix Direction
- In _fit, store self._var = np.var(y.values) alongside the existing self._sigma.
- In _predict_var, replace self._sigma with self._var.
This is a 2-line fix in a single file (skpro/regression/dummy.py, 182 lines total).
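The two changes can be sketched on a stand-in class. SketchDummyRegressor below is hypothetical and is not the skpro implementation; it only mirrors the _fit/_predict_var pattern quoted under Root Cause, with simplified signatures that are assumptions:

```python
import numpy as np
import pandas as pd


class SketchDummyRegressor:
    """Hypothetical stand-in, NOT the skpro class; mirrors only the pattern."""

    def _fit(self, X, y):
        self._y_columns = y.columns
        self._sigma = np.std(y.values)  # existing attribute: standard deviation
        self._var = np.var(y.values)  # proposed addition: variance
        return self

    def _predict_var(self, X):
        # Fill with the stored variance rather than the standard deviation.
        return pd.DataFrame(
            np.ones(len(X)) * self._var,
            index=X.index,
            columns=self._y_columns,
        )


# Made-up data for illustration only.
X = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]})
y = pd.DataFrame({"target": [1.0, 3.0, 5.0, 7.0]})

reg = SketchDummyRegressor()._fit(X, y)
pred_var = reg._predict_var(X)
print(pred_var)
```

Keeping self._sigma untouched means any other method that reads it is unaffected; only _predict_var switches to the new attribute.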
Environment
- skpro version: current main (commit c83950a)
- Python 3.12
Good First Issue?
Yes — this is a straightforward, beginner-friendly fix:
- Single file, 2-line change
- No algorithmic complexity
- Needs 1 new test to verify correctness
- Good entry point for understanding the skpro regressor API (_fit/_predict_var pattern)
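One possible shape for the new test is a duck-typed check that works on anything exposing fit and predict_var. The helper name check_predict_var_is_variance and the _FakeRegressor class are hypothetical, introduced here only so the sketch runs without skpro installed:

```python
import numpy as np
import pandas as pd


def check_predict_var_is_variance(reg, X_train, y_train, X_test):
    """Return True if reg.predict_var equals the variance of the training labels."""
    reg.fit(X_train, y_train)
    pred = reg.predict_var(X_test)
    return bool(np.allclose(pred.values, np.var(y_train.values)))


class _FakeRegressor:
    """Hypothetical stand-in used only to exercise the check; not part of skpro."""

    def __init__(self, buggy):
        self.buggy = buggy

    def fit(self, X, y):
        self._cols = y.columns
        self._sigma = np.std(y.values)
        return self

    def predict_var(self, X):
        # The buggy variant reproduces the reported behaviour (returns std dev).
        value = self._sigma if self.buggy else self._sigma**2
        return pd.DataFrame(
            np.full((len(X), 1), value), index=X.index, columns=self._cols
        )


# Made-up data for illustration only.
X = pd.DataFrame({"x": np.arange(6.0)})
y = pd.DataFrame({"target": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]})

buggy_ok = check_predict_var_is_variance(_FakeRegressor(buggy=True), X, y, X)
fixed_ok = check_predict_var_is_variance(_FakeRegressor(buggy=False), X, y, X)
print(f"buggy passes: {buggy_ok}, fixed passes: {fixed_ok}")
```

In the actual test, the fake would presumably be replaced by DummyProbaRegressor(strategy="empirical") and DummyProbaRegressor(strategy="normal"), for example via a pytest parametrization, so that both affected strategies are covered.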