.experimental.concat_on_disk fails to properly infer indptr dtype for the final object #1709

@jacobkimmel

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

See:

number_non_zero = sum(len(d.group["indices"]) for d in datasets)

.experimental.concat_on_disk is an awesome feature. Thanks for writing it!

I found in practice that the dtype inference for indptr in the final object can fail in curious ways. It seems to be overly aggressive with casting to int32, which leads to a traceback when merging objects large enough to require int64.
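To make the expectation concrete, here is a minimal sketch of the dtype choice I would expect for the merged indptr. It is not anndata's actual code: `pick_indptr_dtype` and `INT32_MAX` are hypothetical names, and the only assumption is that the last indptr entry equals the total number of stored non-zeros, so that total has to fit in the chosen integer type.

```python
# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1


def pick_indptr_dtype(nnz_per_file):
    """Hypothetical helper: return the name of an indptr dtype wide enough
    to index the combined non-zeros of all inputs."""
    # Python ints do not overflow, so the sum itself is safe here.
    total_nnz = sum(int(n) for n in nnz_per_file)
    # The final indptr entry equals total_nnz, so it must fit in the dtype.
    return "int64" if total_nnz > INT32_MAX else "int32"
```

The returned name can be passed to `numpy.dtype`. With two inputs of 1.5e9 non-zeros each, every input fits in int32 on its own, but the merged object needs int64.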

Honestly, I can't tell from first principles why the existing code doesn't work, but I can confirm that when I hardcoded int64 for the output object, everything completed successfully.

Code:

import anndata

# adata_paths, out_path, and keys are defined elsewhere.
# The files in adata_paths contain objects with >10e6 observations and >1e4 features,
# on average probably ~80-90% sparse.
anndata.experimental.concat_on_disk(
    in_files=adata_paths,
    out_file=out_path,
    max_loaded_elems=int(1e10),
    axis=0,
    join="inner",
    label="concat_batch",
    keys=keys,
    index_unique="::",
)
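For a sense of scale, here is a rough back-of-the-envelope estimate of the number of stored non-zeros, using the figures from the comment above as lower bounds. `estimate_nnz` is a hypothetical helper, and reading ">10e6" literally as 1e7 observations is an assumption; the real counts will differ.

```python
# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1


def estimate_nnz(n_obs, n_vars, density):
    """Approximate stored non-zeros in an n_obs x n_vars sparse matrix,
    where density is the fraction of entries that are non-zero."""
    return round(n_obs * n_vars * density)


# ~90% sparse means roughly 10% of the entries are stored.
approx_nnz = estimate_nnz(10e6, 1e4, 0.1)
needs_int64 = approx_nnz > INT32_MAX
```

Under these assumptions the estimate is on the order of 1e10 non-zeros, well above the int32 limit of about 2.1e9, so a 32-bit indptr cannot address the merged matrix.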

Traceback:

  File "/efs/home/jacob/mambaforge/envs/scpy/lib/python3.10/site-packages/anndata/_core/sparse_dataset.py", line 499, in append
    raise OverflowError(
OverflowError: This array was written with a 32 bit intptr, but is now large enough to require 64 bit values. Please recreate the array with a 64 bit indptr.

Versions


IPython 8.28.0
anndata 0.11.0rc3.dev3+g8e9eb88.d20241010
session_info 1.0.0

asttokens NA
cython_runtime NA
dateutil 2.9.0.post0
decorator 5.1.1
executing 2.1.0
h5py 3.12.1
jedi 0.19.1
natsort 8.4.0
numpy 2.1.2
packaging 24.1
pandas 2.2.3
parso 0.8.4
prompt_toolkit 3.0.48
pure_eval 0.2.3
pygments 2.18.0
pytz 2024.2
scipy 1.14.1
six 1.16.0
stack_data 0.6.3
traitlets 5.14.3
wcwidth 0.2.13

Python 3.12.7 | packaged by conda-forge | (main, Oct 4 2024, 16:05:46) [GCC 13.3.0]
Linux-6.2.0-1018-aws-x86_64-with-glibc2.35

Session information updated at 2024-10-10 03:56
