Skip to content

(Requires design discussion) handling all NaNs - #703 #776

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 6 commits into
base: develop
Choose a base branch
from

Conversation

JinghanFu
Copy link
Contributor

Please work through the following checklists. Delete anything that isn't relevant.

Development for new xarray-based metrics

  • Works with n-dimensional data and includes reduce_dims, preserve_dims, and weights args.
  • Typehints added
  • Add error handling
  • Imported into the API
  • Works with both xr.DataArrays and xr.Datasets if possible

Docstrings

  • Docstrings complete and follow Napoleon (google) style
  • Maths equation added
  • Reference to paper/webpage is in docstring. The preferred referencing style for journal articles is APA (7th edition)
  • Code example added

Testing of new xarray-based metrics

  • 100% unit test coverage
  • Test that metric is compatible with dask.
  • Test that metrics work with inputs that contain NaNs
  • Test that broadcasting with xarray works
  • Test both reduce and preserve dims arguments work
  • Test that errors are raised as expected
  • Test that it works with both xr.DataArrays and xr.Datasets

Tutorial notebook

  • Short introduction to why you would use that metric and what it tells you
  • A link to a reference
  • A "things to try next" section at the end
  • Add notebook to Tutorial_Gallery.ipynb
  • Optional - a detailed discussion of how the metric works at the end of the notebook

Documentation

@tennlee
Copy link
Collaborator

tennlee commented Nov 26, 2024

@nicholasloveday I think this PR is interesting from the perspective of what kinds of conditions should raise a user warning. Perhaps for all our methods, anything which is all NaN should raise a user warning? Correct setting of NaNs is intended behaviour, so doesn't indicate a program failure. On the other hand, something being entirely NaN could be an indication of a data issue or user error, and a warning could be useful. We should probably consider adopting something consistent across the codebase. Take a look at what @JinghanFu has done in this example and have a think. This work was done because @durgals mentioned it following his KGE work as being useful. @durgals you might also like to make a comment about what kinds of user warning should perhaps be raised across all our methods. Thanks @JinghanFu for putting together a concrete example for us to discuss.

@JinghanFu
Copy link
Contributor Author

@tennlee Thank you very much for your guidance throughout the day!! Looking forward to joining the whole PyCon next year and contributing more!

@tennlee tennlee added this to the Wishlist milestone Dec 5, 2024
@nikeethr
Copy link
Collaborator

nikeethr commented Mar 15, 2025

Im guessing this is from a sprint of sorts. But in the interest of getting PRs through before they are stale. I have a strong opinion on how this should be handled see issue for detail.

At least one dimension MUST be reduced for ANY operation that requires accumulation of a computation along a dimension (such as a sum) for a successful computation. Dividing by variance is one common example of this. Which is common to quite a few scores actually. MSE for example doesn’t require this - so it’s fine to not raise anything, although it’s still silly and I actually wouldn’t mind if it threw an exception for attempting use dask to compute an identity function in an overly contrived way.

This should be an EARLY check on preserve and reduce dimensions rather than a post score check, because I already know the result without doing any scoring - it’s nan…

We don’t want to mix up failure states where the user input is invalid for a score vs a score returning an invalid result. The latter could be a warning but the former MUST be an exception because that fundamentally invalid input for the score.

having said that even if the score were to return all NaNs given a valid input id still be expect an error because it seems to me something fundamentally unhandled within the computation. The only exception to this is to raise a warning for partially computed results if they are still valid and there’s no trivial way of avoiding it (such as divide by 0 on some elements but not others)

#815 currently handles it the way I’d like it to. I also put some comments in #703

@nikeethr
Copy link
Collaborator

nikeethr commented Mar 15, 2025

Also minor comments though I understand this may have been done as a proof of concept:

  • even though “never” is strong and I typically don’t like to use it - in this case it’s warranted - never assign a default value to an empty (or any) mutable structure. The original implementation with None was correct.
  • try not to use math when we already have access to numpy/xarray’s port of numpys ufuncs
  • I’m guessing the gobals are just for PoC we just need to handle them appropriately within internal scope.

@nikeethr nikeethr marked this pull request as draft March 24, 2025 02:25
@nikeethr
Copy link
Collaborator

This implementation needs to be reconsidered - I'm leaving it open but converting to draft.

@nikeethr nikeethr changed the title 703 warn user when pearsonr returns nan + a flag enable or disable the warning message (default is disable) (Requires design discussion) handling all NaNs - #703 Mar 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Review docstring and possibly raise a userwarning for pearsonr based on preserve_dims=all
3 participants