Handle new zarr dtypes (both custom and datetime) #2238

@ilan-gold

Please describe your wishes and possible alternatives to achieve the desired result.

At the moment, everything that is in a numpy.ndarray is treated identically when writing to disk (with some exceptions like rec-array and string-array):

https://github.com/scverse/anndata/blob/main/src/anndata/_io/specs/methods.py#L414

However, zarr now supports fully custom data types. I don't think this is much of a problem yet: in theory, you could put a custom or datetime array in obsm or X, but I don't think that is something to worry about right now.

Beyond those pathological cases, I think the one place this could realistically happen is obs with datetimes:

import anndata as ad
import pandas as pd

arr = pd.date_range("2018-01-01", periods=5, freq="h").to_numpy()
ts = pd.Series(arr, index=range(len(arr)))
df = pd.DataFrame({"dt": ts})
ad.AnnData(obs=df).write_zarr("foo.zarr")

although this errors out with

IORegistryError: No method registered for writing <class 'pandas.core.arrays.datetimes.DatetimeArray'> into <class 'zarr.core.group.Group'>
Error raised while writing key 'dt' of <class 'zarr.core.group.Group'> to /obs

so it's possible that one cannot do this directly, although with write_elem I think it could be done, even for custom dtypes:

import zarr
z = zarr.open("hooray2.zarr")
ad.io.write_elem(z, "arr", ts.values)

which will write out a zarr dtype of

"data_type": {
    "name": "numpy.datetime64",
    "configuration": {
      "unit": "ns",
      "scale_factor": 1
    }
  },
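To make the link between the NumPy dtype and the metadata above concrete, here is a minimal sketch. The helper `describe_datetime_dtype` is hypothetical (it is not part of anndata or zarr), and it assumes that `scale_factor` corresponds to the step count NumPy reports through `np.datetime_data`. It also checks that the values survive a trip through plain int64, which is the kind of lossless roundtrip a writer would need to guarantee.

```python
import numpy as np


def describe_datetime_dtype(dtype) -> dict:
    """Hypothetical helper: mirror the zarr-style metadata for a datetime64 dtype.

    Assumption: "scale_factor" maps to NumPy's step count for the unit.
    """
    dtype = np.dtype(dtype)
    if dtype.kind != "M":  # "M" is NumPy's kind code for datetime64
        raise TypeError(f"expected a datetime64 dtype, got {dtype}")
    unit, count = np.datetime_data(dtype)
    return {
        "name": "numpy.datetime64",
        "configuration": {"unit": unit, "scale_factor": count},
    }


# Five hourly timestamps stored at nanosecond resolution, as in the pandas example.
hours = np.arange(
    "2018-01-01T00", "2018-01-01T05", dtype="datetime64[h]"
).astype("datetime64[ns]")

meta = describe_datetime_dtype(hours.dtype)

# Roundtrip through int64 (nanoseconds since the epoch) and back.
as_int = hours.astype("int64")
restored = as_int.astype("datetime64[ns]")
```

The unit has to travel with the integers: without it, the int64 values cannot be interpreted as timestamps again.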

So I think this is instructive for the goal we want: we should be able to write out obs with datetime because it is an officially supported zarr dtype, and we should do it with confidence!

More pressingly, https://hackmd.io/@zarr/SyFKW_A6xg?utm_source=preview-mode&utm_medium=rec will come at some point so we should probably begin future proofing for arrow-in-zarr. It will be implemented via https://zarr.readthedocs.io/en/stable/user-guide/data_types/#integral

Some action items:

  • Add rec-array to the spec (it isn't there)
  • Plan the next step after some discussion. Some possibilities:
      1. Figure out a mechanism for how to handle custom zarr dtypes generically
      2. Decide that we will go one by one, i.e., we handle only certain ones
      3. Update the array spec to be generic over underlying type as opposed to numeric. Note that either way, this will begin a divergence from hdf5: some things will not be able to be roundtripped. This is also a decision point: how to handle/mark this sort of thing.
  • Either implement the mechanism or add in the new supported types we want (i.e., arrow) as they come
  • Handle datetime dtypes as well
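
As a rough illustration of the difference between options 1 and 2, here is a sketch of a registry keyed by zarr data type name. Every name in it (`WRITERS`, `register_writer`, `generic_writer`, `dispatch`) is hypothetical and unrelated to anndata's actual IO registry. The `allow_generic` flag stands in for the design decision: with it off, only explicitly registered dtypes are written (one by one); with it on, unknown dtypes fall through to a generic path.

```python
from typing import Any, Callable

# Hypothetical registry: zarr data type name -> writer function.
WRITERS: dict[str, Callable[[Any], str]] = {}


def register_writer(dtype_name: str):
    """Register a writer for one explicitly supported data type."""
    def decorator(func: Callable[[Any], str]) -> Callable[[Any], str]:
        WRITERS[dtype_name] = func
        return func
    return decorator


@register_writer("numpy.datetime64")
def write_datetime(values: Any) -> str:
    # Placeholder body: a real writer would create the on-disk array.
    return f"datetime writer handled {len(values)} values"


def generic_writer(values: Any) -> str:
    # Placeholder for a mechanism that handles any custom dtype.
    return f"generic writer handled {len(values)} values"


def dispatch(dtype_name: str, values: Any, *, allow_generic: bool) -> str:
    """Pick a writer; refuse unknown dtypes unless the generic path is enabled."""
    writer = WRITERS.get(dtype_name)
    if writer is not None:
        return writer(values)
    if allow_generic:
        return generic_writer(values)
    raise TypeError(f"no writer registered for data type {dtype_name!r}")
```

Either way, an explicit failure for unregistered dtypes is easier to reason about than silently writing something that cannot be read back.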

If we do not go with option 3 or 2, we should probably start checking that array dtypes are numeric.
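
A numeric check could be sketched along these lines. The function `is_numeric_dtype` is hypothetical, and treating booleans as numeric is an assumption made here for illustration. It tests `dtype.kind` rather than `np.issubdtype(..., np.number)` because NumPy's `timedelta64` is a subclass of `np.signedinteger` and would otherwise slip through.

```python
import numpy as np

# NumPy kind codes: b=bool, i=signed int, u=unsigned int, f=float, c=complex.
# Assumption: booleans count as numeric for the purposes of this check.
NUMERIC_KINDS = frozenset("biufc")


def is_numeric_dtype(arr: np.ndarray) -> bool:
    """Hypothetical check: True only for bool, integer, float, and complex arrays."""
    return arr.dtype.kind in NUMERIC_KINDS


floats = np.zeros(3, dtype="float32")
stamps = np.array(["2018-01-01"], dtype="datetime64[ns]")
deltas = np.array([1, 2], dtype="timedelta64[s]")
strings = np.array(["a", "b"])
```

With this check, `floats` passes while `stamps`, `deltas`, and `strings` are rejected.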

xref #2043

cc @ivirshup @keller-mark
