initial support for Dask DataFrames in obsm/varm #1880
ilia-kats wants to merge 1 commit into scverse:main
Conversation
Force-pushed from 8a4761f to 2c3b39e
Force-pushed from 2c3b39e to 982f882
The minimum_versions test is failing due to an incompatibility between the old Dask version and Python 3.11.9 specifically; I'm not sure what to do here.
Codecov Report

```
@@            Coverage Diff             @@
##             main    #1880      +/-   ##
==========================================
- Coverage   86.11%   83.60%    -2.51%
==========================================
  Files          40       40
  Lines        6242     6258       +16
==========================================
- Hits         5375     5232      -143
- Misses        867     1026      +159
```
@ilia-kats Furthermore, we have reports from when I had previously tried to do this: there were several issues that made things quite unusable.
I like that the PR is basically write-only, but I want to understand more.
So my use case is that I have several AnnData objects that I need to concatenate, and I want to do that as lazily as possible, without allocating memory. What I'm doing is converting everything in the AnnData objects to Dask equivalents and then running the concatenation.
This was one of our stumbling blocks. It required a full pass over the data the last time we checked, which kind of defeats the purpose. The PR I linked to above has full lazy-concatenation features.
That is good to know; I'll wait for that to be merged then, I suppose.
So I've looked at it. In the very simplest case, it looks like calling `.shape` just builds a lazy expression:

```python
datest = dd.from_pandas(test)
datest2 = dd.from_pandas(test2)
concat = dd.concat([datest, datest2], axis=0, ignore_index=True)
concat.shape[0].optimize().pprint()
```

```
Fused(25743):
| FloorDiv: right=1
| Add:
|   Literal: value=40
|   Add: left=0
|     Literal: value=10
```
@ilia-kats I assumed you wanted the Dask DataFrames for their lazy-loading capabilities, in which case memory shouldn't be such an issue. Perhaps we can talk offline: ilan.gold@helmholtz-munich.de. I am around all day today.
But why keep the dataframes around once they are converted? |
Also, the code you posted is only applicable to non-extension arrays; pandas extension arrays should be zero-copy: https://github.com/pydata/xarray/blob/dd446d7d9c5f208cedc18b4b02fcf380a5ba7217/xarray/core/dataset.py#L7272-L7277. This includes:
- https://pandas.pydata.org/docs/reference/api/pandas.arrays.NumpyExtensionArray.html (i.e., numpy)
- https://pandas.pydata.org/docs/reference/api/pandas.arrays.ArrowExtensionArray.html (i.e., arrow)
- https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html (categoricals)
- https://github.com/pandas-dev/pandas/blob/v2.2.3/pandas/core/arrays/string_.py#L275-L657 (string arrays)

and likely others.
Here's a quick code snippet:

```python
import xarray as xr
import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
ds = xr.Dataset.from_dataframe(iris)
assert ds["sepal_length"].data is iris["sepal_length"].array.to_numpy()
```
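A complementary way to check the zero-copy claim without relying on object identity is `np.shares_memory`. A minimal sketch using only pandas and numpy (the Series here is illustrative, not from the thread): for numpy-backed dtypes, `to_numpy()` returns the underlying buffer without copying.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

# For numpy-backed extension arrays, to_numpy() (copy=False by default)
# hands back the underlying ndarray, so both views share memory.
a = s.array.to_numpy()
b = s.to_numpy()
assert np.shares_memory(a, b)
```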
Thanks, I should have read the xarray code more carefully. I guess I'll go with that then.
@ilia-kats No problem, please reach out if you feel you have more needs. Zulip is the best place for longer discussions :)
Actually, it turns out xarray also doesn't work:

```python
import anndata as ad  # import added for completeness
import pandas as pd
import xarray as xr

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
ds = xr.Dataset.from_dataframe(iris)
adata = ad.AnnData(var=pd.DataFrame(index=iris.index))
adata.varm["test"] = ds
```

results in

```
AttributeError                            Traceback (most recent call last)
Cell In[5], line 1
----> 1 adata.varm["test"] = ds

File /data/ilia/anndata/src/anndata/_core/aligned_mapping.py:214, in AlignedActual.__setitem__(self, key, value)
    213 def __setitem__(self, key: str, value: Value):
--> 214     value = self._validate_value(value, key)
    215     self._data[key] = value

File /data/ilia/anndata/src/anndata/_core/aligned_mapping.py:277, in AxisArraysBase._validate_value(self, val, key)
    275     msg = "Index.equals and pd.testing.assert_index_equal disagree"
    276     raise AssertionError(msg)
--> 277 return super()._validate_value(val, key)

File /data/ilia/anndata/src/anndata/_core/aligned_mapping.py:79, in AlignedMappingBase._validate_value(self, val, key)
     72     warn_once(
     73         "Support for Awkward Arrays is currently experimental. "
     74         "Behavior may change in the future. Please report any issues you may encounter!",
     75         ExperimentalFeatureWarning,
     76         # stacklevel=3,
     77     )
     78 for i, axis in enumerate(self.axes):
---> 79     if self.parent.shape[axis] == axis_len(val, i):
     80         continue
     81 right_shape = tuple(self.parent.shape[a] for a in self.axes)

File /usr/lib/python3.11/functools.py:909, in singledispatch.<locals>.wrapper(*args, **kw)
    905 if not args:
    906     raise TypeError(f'{funcname} requires at least '
    907                     '1 positional argument')
--> 909 return dispatch(args[0].__class__)(*args, **kw)

File /data/ilia/anndata/src/anndata/utils.py:115, in axis_len(x, axis)
    108 @singledispatch
    109 def axis_len(x, axis: Literal[0, 1]) -> int | None:
    110     """\
    111     Return the size of an array in dimension `axis`.
    112
    113     Returns None if `x` is an awkward array with variable length in the requested dimension.
    114     """
--> 115     return x.shape[axis]

File /data/ilia/envs/famo/lib/python3.11/site-packages/xarray/core/common.py:305, in AttrAccessMixin.__getattr__(self, name)
    303 with suppress(KeyError):
    304     return source[name]
--> 305 raise AttributeError(
    306     f"{type(self).__name__!r} object has no attribute {name!r}"
    307 )

AttributeError: 'Dataset' object has no attribute 'shape'
```

The same happens for `adata = ad.AnnData(var=pd.DataFrame(index=iris.index), varm={"test": ds})`.
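The traceback above boils down to `xr.Dataset` not exposing a `.shape` attribute (it is a dict-like container of variables), while the individual arrays it holds do. A minimal sketch illustrating this, assuming xarray is installed (the variable names are illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"x": ("dim0", np.arange(5))})

# Dataset has no .shape, which is exactly the AttributeError above
assert not hasattr(ds, "shape")
# ...but each DataArray inside it does
assert ds["x"].shape == (5,)
```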
Ah, I think I'm supposed to use
That makes sense @ilia-kats. If you would like to contribute a feature that allows for the conversion (i.e., setting on an
Sure, I can give it a shot. However, I think that this PR (simply allowing Dask DataFrames) is less invasive than the conversion would be.
I notice that some of the docstrings still say things like "One-dimensional annotation of variables/features (
@xinaesthete Could you explain a bit? This PR is quite old, and Dask does not have an amazing maintenance status at the moment. If someone wants to do #2043, we could support them just fine. The TL;DR on that issue is that we use a limited subset of the APIs dataframes offer, and formalizing this as a runtime protocol would allow you to create wrappers around the dataframe of interest (polars, cuDF, etc.) without us having to implement something new for every dataframe implementation.
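The runtime-protocol idea can be sketched with the standard library alone. Everything below is hypothetical (the attribute names `shape`/`columns` and the `ListFrame` wrapper are illustrative, not the actual #2043 design): any object providing the required surface passes the `isinstance` check without inheriting from anything.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class DataFrameLike(Protocol):
    """Hypothetical minimal dataframe surface, in the spirit of #2043."""

    @property
    def shape(self) -> tuple[int, int]: ...

    def __getitem__(self, key: str): ...


class ListFrame:
    """Toy wrapper over rows of tuples; no dataframe library required."""

    def __init__(self, rows, columns):
        self._rows = list(rows)
        self._columns = list(columns)

    @property
    def shape(self) -> tuple[int, int]:
        return (len(self._rows), len(self._columns))

    def __getitem__(self, key: str):
        i = self._columns.index(key)
        return [row[i] for row in self._rows]


w = ListFrame([(1, "a"), (2, "b")], ["x", "y"])
assert isinstance(w, DataFrameLike)  # structural check, no inheritance
assert w.shape == (2, 2)
assert w["x"] == [1, 2]
```

The point is that a consumer can type against `DataFrameLike` and accept pandas, polars, or cuDF objects via thin wrappers, instead of special-casing each library.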
Not a big deal, and I think we'd been on a very old version of AnnData until recently, and we had some things that had been expecting |
Yeah, this could be clarified, but I can also see why it is that way. It matches e.g. I would be open to changing these to read "key-value store of ... annotation of ...". Would that make more sense?
Yes, we have a ... I would like to remove it, though, and put it in a separate package, and then allow general DF APIs; hence #2043.
This PR adds support for Dask DataFrames in `.obsm`/`.varm`.