Use DatasetGroupBy.quantile for DatasetGroupBy.median for multiple groups when using dask arrays #9935

Open
@adriaat

Description

Is your feature request related to a problem?

I am grouping data in a Dataset and computing statistics. I wanted to take the median over groups defined by two variables, but I got the following error:

>>> ds.groupby(['x', 'y']).median()
# NotImplementedError: The da.nanmedian function only works along an axis or a subset of axes.  The full algorithm is difficult to do in parallel

while ds.groupby(['x']).median() works without any problem.

I noticed that this only happens when the variables are backed by dask arrays: with numpy arrays there is no problem. Replacing .median() with .quantile(0.5) also avoids the error. See below:

import dask.array as da
import numpy as np
import xarray as xr

rng = da.random.default_rng(0)
ds = xr.Dataset(
    {'a': (('x', 'y'), rng.random((10, 10)))},
    coords={'x': np.arange(5).repeat(2), 'y': np.arange(5).repeat(2)}
)

# Raises:
# NotImplementedError: The da.nanmedian function only works along an axis or a subset of axes.  The full algorithm is difficult to do in parallel
try:
    ds.groupby(['x', 'y']).median()
except NotImplementedError as e:
    print(e)

# No problems with the following:
ds.groupby(['x']).median()
ds.groupby(['x', 'y']).quantile(0.5)
ds.compute().groupby(['x', 'y']).median()  # compute() loads the dask arrays into numpy arrays first

Describe the solution you'd like

A straightforward solution seems to be to have DatasetGroupBy.median() fall back to DatasetGroupBy.quantile(0.5) when grouping by multiple variables on dask-backed data.

Describe alternatives you've considered

No response

Additional context

My xr.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:06:46) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 6.8.0-49-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2024.10.0
pandas: 2.2.3
numpy: 1.26.4
scipy: 1.14.1
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.4.1
h5py: 3.12.1
zarr: 2.18.3
cftime: 1.6.4.post1
nc_time_axis: None
iris: None
bottleneck: 1.4.2
dask: 2024.11.2
distributed: None
matplotlib: 3.9.2
cartopy: 0.24.0
seaborn: 0.13.2
numbagg: None
fsspec: 2024.10.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 75.5.0
pip: 24.3.1
conda: None
pytest: None
mypy: None
IPython: 8.29.0
sphinx: 7.4.7
