Report
See anndata/src/anndata/experimental/merge.py, line 223 in 8e9eb88:
number_non_zero = sum(len(d.group["indices"]) for d in datasets)
.experimental.concat_on_disk is an awesome feature. Thanks for writing it!
I found in practice that the dtype inference for indptr in the final object can fail in curious ways. It seems to be overly aggressive with casting to int32, which leads to a traceback when merging objects large enough to require int64.
Honestly, I can't tell from first principles why the existing code doesn't work, but I can confirm that when I hardcoded int64 for the output object, everything completed successfully.
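To spell out the arithmetic I would expect here: the last entry of a CSR `indptr` equals the total number of stored values, so the output dtype has to be chosen from the sum over all inputs, not from any single input. A minimal sketch in plain Python follows; `choose_indptr_dtype` and the example counts are hypothetical and are not anndata's actual implementation.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit indptr entry can hold


def choose_indptr_dtype(nnz_per_dataset):
    """Return the name of the smallest dtype able to index the concatenated matrix.

    The final indptr entry equals the total number of stored values, so the
    total across all inputs decides whether 32 bits are enough.
    """
    total_nnz = sum(nnz_per_dataset)
    return "int64" if total_nnz > INT32_MAX else "int32"


# Hypothetical counts: each input fits in int32 on its own,
# but the concatenated result does not.
nnz_counts = [1_500_000_000, 1_500_000_000]
print(choose_indptr_dtype(nnz_counts))
```

If the inferred dtype ever comes out as int32 for inputs like these, the later append would be expected to overflow, which matches what I see.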
Code:
import anndata

# adata_paths contains objects with >10e6 observations and >1e4 features
# on average probably ~80-90% sparse
# (adata_paths, out_path and keys are defined elsewhere)
anndata.experimental.concat_on_disk(
    in_files=adata_paths,
    out_file=out_path,
    max_loaded_elems=int(1e10),
    axis=0,
    join="inner",
    label="concat_batch",
    keys=keys,
    index_unique="::",
)
Traceback:
  File "/efs/home/jacob/mambaforge/envs/scpy/lib/python3.10/site-packages/anndata/_core/sparse_dataset.py", line 499, in append
    raise OverflowError(
OverflowError: This array was written with a 32 bit intptr, but is now large enough to require 64 bit values. Please recreate the array with a 64 bit indptr.
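For context, here is my understanding of why an append can hit this error even when every individual input is small enough. It is a simplified sketch in plain Python, not the code in `sparse_dataset.py`; `append_indptr` is a hypothetical helper. When rows are appended, the new block's `indptr` is shifted by the number of values already stored, and the shifted values are what must fit in the on-disk dtype.

```python
INT32_MAX = 2**31 - 1


def append_indptr(indptr, new_indptr, max_value=INT32_MAX):
    """Append a row block's indptr to an existing indptr (CSR, row-wise).

    new_indptr starts at 0, so each of its entries is shifted by the number
    of values already stored (the last entry of the existing indptr).
    """
    offset = indptr[-1]
    shifted = [offset + p for p in new_indptr[1:]]
    if shifted and shifted[-1] > max_value:
        raise OverflowError("appended indptr no longer fits in 32 bits")
    return indptr + shifted


# Small example: two rows (2 and 3 values), then two more rows (1 and 3 values).
print(append_indptr([0, 2, 5], [0, 1, 4]))
```

In this toy version the only fix is the one the error message suggests: create the output with a 64-bit `indptr` from the start.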
Versions
IPython 8.28.0
anndata 0.11.0rc3.dev3+g8e9eb88.d20241010
session_info 1.0.0
asttokens NA
cython_runtime NA
dateutil 2.9.0.post0
decorator 5.1.1
executing 2.1.0
h5py 3.12.1
jedi 0.19.1
natsort 8.4.0
numpy 2.1.2
packaging 24.1
pandas 2.2.3
parso 0.8.4
prompt_toolkit 3.0.48
pure_eval 0.2.3
pygments 2.18.0
pytz 2024.2
scipy 1.14.1
six 1.16.0
stack_data 0.6.3
traitlets 5.14.3
wcwidth 0.2.13
Python 3.12.7 | packaged by conda-forge | (main, Oct 4 2024, 16:05:46) [GCC 13.3.0]
Linux-6.2.0-1018-aws-x86_64-with-glibc2.35
Session information updated at 2024-10-10 03:56