fix: broadcast errors using lazy n_samples and da.where in r2_score #1013

Open · wants to merge 1 commit into main from fix/broadcast-shape-nan

Conversation

@wietzesuijker wietzesuijker commented Mar 3, 2025

Closes #1012

First PR here. Curious to hear your feedback.

Problem
After updating to Dask 2025.2.0, tests fail with ValueError: cannot broadcast shape (nan,) to shape (nan,), caused by changes in how unknown chunk sizes are handled.

Solution

  • Track sample count in fit() using X.shape[0].
  • Use n_samples to derive the rechunking block size so test data aligns one-to-one with training blocks, preventing broadcast mismatches.
  • Refactor r2_score() to use da.where() for correct broadcasting.
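To illustrate the third bullet, here is a minimal sketch of the da.where() pattern for an R² computation that stays lazy (this is not the actual dask-ml implementation; the function name and the zero-denominator handling are illustrative assumptions):

```python
import numpy as np
import dask.array as da

def r2_score_sketch(y_true, y_pred):
    """Sketch of an R^2 that stays lazy even with unknown chunk sizes.

    da.where() is used instead of boolean indexing, which would
    require known chunk sizes and could force computation.
    """
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    score = 1 - ss_res / ss_tot
    # If ss_tot is 0 the score is undefined; mask it lazily with
    # da.where rather than branching on a computed value.
    return da.where(ss_tot == 0, 0.0, score)

y_true = da.from_array(np.array([3.0, -0.5, 2.0, 7.0]), chunks=2)
y_pred = da.from_array(np.array([2.5, 0.0, 2.0, 8.0]), chunks=2)
score = float(r2_score_sketch(y_true, y_pred).compute())  # ≈ 0.9486
```

The result is itself a lazy dask array, so nothing is computed until the caller asks for it.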

Testing
Tests added to ensure:

  • BlockwiseVotingRegressor supports various chunking patterns.
  • r2_score() works correctly with arrays that have different chunk configurations.

@wietzesuijker wietzesuijker force-pushed the fix/broadcast-shape-nan branch from d7f7b86 to 17d3a02 Compare March 23, 2025 15:40
@wietzesuijker wietzesuijker changed the title fix(ensemble, metrics): compute chunk sizes and refactor r2_score wit… prevent broadcasting errors with unknown chunk sizes Mar 23, 2025
@TomAugspurger
Member

Thanks.

I'm not entirely sure what the best action is, but I think we ought to avoid anything that triggers computation unnecessarily, including len.

Can you say a bit more about why getting n_samples is needed in blockwise?

- Add _safe_rechunk helper for safe rechunking with error handling.
- Set _n_samples using X.shape[0] in fit() to avoid eager evaluation from len(X).
- Use n_samples to derive the rechunking block size so test data aligns one-to-one with training blocks, preventing broadcast mismatches.
- Update _predict()/_collect_probas() accordingly.
- Refactor r2_score() to use da.where() for correct broadcasting.
- Resolves "cannot broadcast shape (nan,) to shape (nan,)" errors.
@wietzesuijker wietzesuijker force-pushed the fix/broadcast-shape-nan branch from 17d3a02 to e98c538 Compare March 29, 2025 21:25
@wietzesuijker wietzesuijker changed the title prevent broadcasting errors with unknown chunk sizes fix: broadcast errors using lazy n_samples and da.where in r2_score Mar 29, 2025
@wietzesuijker
Author

Thanks @TomAugspurger. n_samples (now obtained via X.shape[0]) lets us determine the rechunking size without forcing computation. It splits the test data into one block per trained estimator, ensuring alignment and preventing broadcast errors. Combined with the da.where() update in r2_score, these changes maintain laziness and correct behavior with mismatched chunks.

@TomAugspurger
Member

I'm probably missing something, but why do we care that the size of the test dataset matches the size of the training dataset (_n_samples)? I'd expect us to just care that the number of samples in X_train and y_train match, and separately that the number of samples in X_test and y_test match.

@wietzesuijker
Author

why do we care that the size of the test dataset matches the size of the training dataset (_n_samples)?

The goal is not for the test dataset to match the training dataset's overall size. The focus is ensuring each estimator, trained on a specific data block, receives a matching block from the test set. X.shape[0] is used (as n_samples) to compute the optimal test data block size, dividing the test set into blocks equal to the number of estimators. This aligns predictions and prevents broadcast errors, regardless of training and test dataset sizes.
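Under that scheme, the block size reduces to plain integer arithmetic (a sketch with illustrative numbers; the variable names and counts are assumptions, not values from the PR):

```python
import math

# Illustrative numbers: training produced one estimator per block.
n_estimators = 4   # number of trained estimators
n_test = 90        # rows in the test set (e.g. from X_test.shape[0])

# Split the test set into exactly n_estimators blocks so each
# estimator receives one matching block at predict time.
block_size = math.ceil(n_test / n_estimators)        # 23
chunks = [block_size] * (n_test // block_size)       # [23, 23, 23]
if n_test % block_size:
    chunks.append(n_test % block_size)               # [23, 23, 23, 21]
```

The chunk sizes sum back to n_test and their count equals n_estimators, which is what keeps blockwise predictions aligned.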

Development

Successfully merging this pull request may close these issues.

Tests failing with ValueError: cannot broadcast shape (nan,) to shape (nan,)
2 participants