Please make sure these conditions are met
Report
Code:
import anndata
import numpy as np
adata = anndata.AnnData()
adata.uns["x"] = {str(i): np.array(str(i), dtype="object") for i in range(20000)}
# %%time
adata.write_h5ad("/tmp/anndata.h5ad")
# %%time
anndata.read_h5ad("/tmp/anndata.h5ad")
On my machine, this takes 7s to write and 4s to load for a dictionary with only 20k elements.
How hard would it be to make this (significantly) faster?
Additional context
In scirpy, I use dicts of arrays (one index referring to $n$ cells) to store clonotype clusters. The dictionary is not (necessarily) aligned to one of the axes, so it lives in uns. Now that we have sped up the clonotype clustering steps, saving the object has become a major bottleneck, as this dict can have several hundred thousand keys.
We could possibly change the dictionary to something more efficient, but that would mean breaking our data format. Therefore I first wanted to check if it can be made faster on the anndata side.
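To make "something more efficient" concrete, here is a minimal sketch of one possible layout. It assumes each dict value is a 1-D (or scalar) array, and it is not scirpy's or anndata's actual format. The whole dict is flattened into three arrays (keys, offsets, concatenated values), so that uns would hold a handful of arrays instead of one entry per key. The names `pack_dict` and `unpack_dict` are hypothetical helpers, and whether this actually removes the bottleneck has not been benchmarked here.

```python
import numpy as np


def pack_dict(d):
    """Flatten a dict of arrays into three arrays (hypothetical helper).

    Assumption: every value is a scalar or 1-D array of a compatible dtype.
    """
    # Promote scalars / 0-d arrays to 1-D so they can be concatenated.
    arrays = [np.atleast_1d(v) for v in d.values()]
    keys = np.array(list(d.keys()), dtype=object)
    lengths = np.fromiter((len(a) for a in arrays), dtype=np.int64, count=len(arrays))
    # offsets[i]:offsets[i + 1] is the slice of `values` belonging to keys[i].
    offsets = np.zeros(len(arrays) + 1, dtype=np.int64)
    np.cumsum(lengths, out=offsets[1:])
    values = np.concatenate(arrays) if arrays else np.empty(0, dtype=np.int64)
    return {"keys": keys, "offsets": offsets, "values": values}


def unpack_dict(packed):
    """Rebuild the original dict of arrays from the packed layout."""
    keys, offsets, values = packed["keys"], packed["offsets"], packed["values"]
    return {k: values[offsets[i]:offsets[i + 1]] for i, k in enumerate(keys)}


# Small example: three "clusters" of cell indices, one of them empty.
clusters = {
    "0": np.array([1, 2, 3]),
    "1": np.array([4]),
    "2": np.array([], dtype=np.int64),
}
packed = pack_dict(clusters)
restored = unpack_dict(packed)
```

The packed dict could then be assigned to something like adata.uns["x"] in place of the original mapping. The downside is the one already mentioned: it changes the data format, so existing files would need a conversion step.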
CC @felixpetschko
Versions
-----
anndata 0.9.2
numpy 1.24.4
session_info 1.0.0
-----
asciitree NA
asttokens NA
awkward 2.6.4
awkward_cpp NA
backcall 0.2.0
cloudpickle 2.2.1
comm 0.1.4
cython_runtime NA
dask 2023.8.1
dateutil 2.8.2
debugpy 1.6.8
decorator 5.1.1
entrypoints 0.4
executing 1.2.0
fasteners 0.18
fsspec 2023.6.0
h5py 3.9.0
importlib_metadata NA
ipykernel 6.25.0
jedi 0.19.0
...
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
Linux-6.10.5-arch1-1-x86_64-with-glibc2.40
-----
Session information updated at 2024-09-21 14:49