Please describe your wishes and possible alternatives to achieve the desired result.
At the moment, everything stored in a `numpy.ndarray` is treated identically when writing to disk (with some exceptions, such as `rec-array` and `string-array`):
https://github.com/scverse/anndata/blob/main/src/anndata/_io/specs/methods.py#L414
However, zarr now supports fully custom data types. I don't think this is much of a problem yet: in theory, you could put something custom or a datetime in `obsm` or `X`, but I don't think that is worth worrying about at the moment.
Beyond those pathological cases, I think the one place this could happen in practice is `obs` with datetimes:
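```python
import anndata as ad
import pandas as pd

# Build a one-column obs DataFrame holding hourly timestamps.
arr = pd.date_range("2018-01-01", periods=5, freq="h").to_numpy()
ts = pd.Series(arr, index=range(len(arr)))
df = pd.DataFrame({"dt": ts})

# Writing obs with a datetime column currently raises IORegistryError.
ad.AnnData(obs=df).write_zarr("foo.zarr")
```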
although this currently errors out with:
```
IORegistryError: No method registered for writing <class 'pandas.core.arrays.datetimes.DatetimeArray'> into <class 'zarr.core.group.Group'>
Error raised while writing key 'dt' of <class 'zarr.core.group.Group'> to /obs
```
So it may not be possible to do this directly, although I think it can be done with `write_elem`, even for custom dtypes:
```python
import zarr

# Reuses `ad` and `ts` from the snippet above.
z = zarr.open("hooray2.zarr")
ad.io.write_elem(z, "arr", ts.values)
```
which will write out a zarr dtype of:
```json
"data_type": {
  "name": "numpy.datetime64",
  "configuration": {
    "unit": "ns",
    "scale_factor": 1
  }
},
```
I think this is instructive as the goal: we should be able to write out `obs` with datetimes, because datetime is an officially supported zarr dtype, and we should do it with confidence!
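As a small illustration of why the two write paths above behave differently, here is a sketch that only inspects container types and writes nothing. It assumes, based on the error message, that the `obs` writer ends up dispatching on the column's backing extension array, whereas `ts.values` hands `write_elem` a plain NumPy array. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

ts = pd.Series(pd.date_range("2018-01-01", periods=5, freq="D"))
df = pd.DataFrame({"dt": ts})

# The column's backing extension array: the type named in the IORegistryError.
extension_array = df["dt"].array

# The plain NumPy representation of the same data, comparable to `ts.values`.
numpy_array = df["dt"].to_numpy()

print(type(extension_array).__name__)
print(type(numpy_array).__name__, numpy_array.dtype.kind)
```

If that assumption holds, supporting datetimes in `obs` may come down to registering a writer for the extension array type (or converting it to NumPy first) rather than teaching the array writer anything new.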
More pressingly, https://hackmd.io/@zarr/SyFKW_A6xg?utm_source=preview-mode&utm_medium=rec will come at some point, so we should probably begin future-proofing for arrow-in-zarr. It will be implemented via https://zarr.readthedocs.io/en/stable/user-guide/data_types/#integral
Some action items:
1. Add `rec-array` to the spec (it isn't there).
2. Figure out a mechanism for handling custom zarr dtypes generically.
3. Decide that we will go one by one, i.e., we handle only certain dtypes.
4. Update the `array` spec to be generic over the underlying type, as opposed to numeric.

Note that either way, this will begin a divergence from `hdf5`: some things will not round-trip. How to handle or mark this is also a decision point.

If we go with neither option 2 nor option 3, we should probably start checking that `array` types are numeric.
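A minimal sketch of what such a numeric check could look like, assuming it operates on the NumPy dtype before writing. `is_plain_numeric` and `NUMERIC_KINDS` are hypothetical names, not part of anndata, and whether booleans should count as numeric is an assumption made here. The check uses `dtype.kind` rather than `np.issubdtype(..., np.number)` because `np.timedelta64` sits under `np.signedinteger` in NumPy's scalar hierarchy and would otherwise pass.

```python
import numpy as np

# Hypothetical helper, not part of anndata.
# Kinds: b = bool, i = signed int, u = unsigned int, f = float, c = complex.
NUMERIC_KINDS = frozenset("biufc")


def is_plain_numeric(arr: np.ndarray) -> bool:
    """Return True if the array's dtype is a plain numeric (or boolean) dtype."""
    return arr.dtype.kind in NUMERIC_KINDS


numeric = np.arange(3, dtype="float32")
datetimes = np.array(["2018-01-01"], dtype="datetime64[ns]")
durations = np.array([1], dtype="timedelta64[s]")

for name, arr in [("numeric", numeric), ("datetimes", datetimes), ("durations", durations)]:
    print(name, arr.dtype, is_plain_numeric(arr))
```

A writer could call a check like this up front and raise a clear error for anything else, rather than silently writing a dtype that `hdf5` cannot round-trip.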
xref #2043
cc @ivirshup @keller-mark