feat: Usage of H5py Virtual Datasets for concat_on_disk #2032

Open

selmanozleyen wants to merge 25 commits into scverse:main from selmanozleyen:feature/concat_hdf5_virtual_datasets

Conversation

@selmanozleyen (Member) commented Jul 17, 2025

When there is no need for reindexing and the backend is HDF5, we can just use virtual datasets instead. This requires no in-memory copies; the output file just links to the original file locations. I was able to concat the Tahoe datasets (314GB in total) in a few minutes, and the result was a 12GB .h5ad file.
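
For readers unfamiliar with HDF5 virtual datasets, here is a minimal sketch of the mechanism (not the PR's actual code; the file names and the `X/data` layout are assumptions): the merged file stores only references into the source files.

```python
import h5py

source_files = ["a.h5ad", "b.h5ad"]  # hypothetical inputs with CSR X

# Look up how long each source's 1-D data array is.
lengths = []
for path in source_files:
    with h5py.File(path, "r") as f:
        lengths.append(f["X/data"].shape[0])

# Map each source region into one virtual layout; no data is copied.
layout = h5py.VirtualLayout(shape=(sum(lengths),), dtype="f4")
offset = 0
for path, n in zip(source_files, lengths):
    layout[offset : offset + n] = h5py.VirtualSource(path, "X/data", shape=(n,))
    offset += n

with h5py.File("merged.h5ad", "w") as f:
    f.create_virtual_dataset("X/data", layout)
```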

Other notes:

  • The indptr arrays totaled 780 MB across all the Tahoe files, so I just concatenate those in memory instead (see the sketch after this list).
  • Added a TODO for letting users specify compression args, since with the default approach the output file grew too large compared to the total size of the inputs.
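
A sketch of why indptr cannot itself be virtual (a hypothetical helper, assuming a CSR layout under `X/`): each file's indptr must be shifted by the running nnz total, so the stored values change and a plain link will not do.

```python
import h5py
import numpy as np

def concat_indptrs(paths):  # hypothetical helper, not the PR's code
    parts, nnz_offset = [], 0
    for i, path in enumerate(paths):
        with h5py.File(path, "r") as f:
            indptr = f["X/indptr"][...]
        # Drop the leading 0 of every file after the first to avoid overlap,
        # and shift by the number of nonzeros seen so far.
        parts.append(indptr[1 if i else 0:] + nnz_offset)
        nnz_offset += int(indptr[-1])
    return np.concatenate(parts)
```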

TODOs:

  • Tests added (tests already cover this case)
  • Release note added (or unnecessary)
  • Write about this feature in the docstrings

@codecov codecov bot commented Jul 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.71%. Comparing base (c6f6f54) to head (543ca00).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2032      +/-   ##
==========================================
- Coverage   86.64%   84.71%   -1.93%     
==========================================
  Files          46       46              
  Lines        7218     7276      +58     
==========================================
- Hits         6254     6164      -90     
- Misses        964     1112     +148     
Files with missing lines               Coverage Δ
src/anndata/_core/sparse_dataset.py    93.71% <100.00%> (+0.92%) ⬆️
src/anndata/experimental/merge.py      91.83% <100.00%> (+0.84%) ⬆️

... and 7 files with indirect coverage changes

@ilan-gold ilan-gold added this to the 0.12.1 milestone Jul 17, 2025
@ilan-gold (Contributor) left a comment


  1. I think this should be put behind an argument so it is opt-in (a sketch of what that could look like follows this list).
  2. A single test should be enough to ensure the results match.
  3. When the dataset is read back in and a backing file has been deleted, does HDF5 raise an error? Or will it error out oddly somewhere inside anndata once you try to access the data?
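
A hypothetical sketch of the opt-in API suggested in point 1; the flag name `use_virtual_datasets` is an assumption for illustration, not the PR's actual parameter:

```python
from anndata.experimental import concat_on_disk

concat_on_disk(
    in_files=["a.h5ad", "b.h5ad"],
    out_file="merged.h5ad",
    use_virtual_datasets=True,  # assumed name; off by default
)
```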

@ilan-gold ilan-gold modified the milestones: 0.12.1, 0.12.2 Jul 23, 2025
@selmanozleyen (Member, Author) commented Jul 29, 2025

> I think this should be put behind an argument so it is opt-in
> A single test should be enough to ensure results match

These are done.

> does h5 raise an error?

According to the requirements documentation: https://support.hdfgroup.org/releases/hdf5/documentation/rfc/HDF5-VDS-requirements-use-cases-2014-12-10.pdf

> It will be the user's responsibility to maintain the consistency of a VDS. If source files are
> unavailable, the library will report an error or use fill values when doing I/O on the VDS
> elements mapped to the elements in the missing source files.

From my experience, it seems to fall back to a cache or something similar: when I delete the original files, it doesn't throw an error and the resulting file looks as if nothing changed. This behaviour isn't discussed much in https://docs.h5py.org.
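
A minimal way to probe this behaviour yourself (a sketch, not code from the PR): build a VDS over one source file, delete the source, and see whether reads raise or return fill values.

```python
import os
import h5py
import numpy as np

with h5py.File("src.h5", "w") as f:
    f.create_dataset("x", data=np.arange(5, dtype="i4"))

layout = h5py.VirtualLayout(shape=(5,), dtype="i4")
layout[:] = h5py.VirtualSource("src.h5", "x", shape=(5,))
with h5py.File("vds.h5", "w") as f:
    # fillvalue is what HDF5 substitutes for missing/unmapped regions
    f.create_virtual_dataset("x", layout, fillvalue=-1)

os.remove("src.h5")
with h5py.File("vds.h5", "r") as f:
    print(f["x"][...])  # may print [-1 -1 -1 -1 -1] instead of raising
```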

@ilan-gold (Contributor) commented

@selmanozleyen Do you want my review on this again or are the comments still somewhat unaddressed?

@selmanozleyen (Member, Author) commented

> Do you want my review on this again or are the comments still somewhat unaddressed?

I found the behavior undefined/unpredictable when the source files are deleted. I couldn't find a way to overcome that because it's not stated clearly in the docs. Sometimes it errors, sometimes it doesn't, as seen in the CI tests.

If we are fine with just documenting this and merging, I will remove the failure assertions and then it will be ready to merge.

@ilan-gold (Contributor) left a comment

> I found the behavior undefined/unpredictable when the source files are deleted. I couldn't find a way to overcome that because it's not stated clearly in the docs.

Can we ask the developers of h5py?

> If we are fine with just documenting this and merging, I will remove the failure assertions and then it will be ready to merge.

I would like to at least open an issue with the h5py people and give it a day or two (I have some changes requested here anyway).

)


def test_anndatas_virtual_concat_missing_file(
A contributor commented:

We shouldn't have multiple different functions that do almost identical things. Please refactor.
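
A hypothetical shape for that refactor (illustrative names only, not the PR's code): collapse the near-duplicate tests into one parametrized test.

```python
import pytest

@pytest.mark.parametrize("delete_source_file", [False, True])
def test_anndatas_virtual_concat(tmp_path, delete_source_file):
    # Build inputs, run concat_on_disk with virtual datasets enabled,
    # optionally delete a source file, then assert on the result.
    ...
```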

@selmanozleyen (Member, Author) commented

done

Comment on lines +247 to +249
init_elem, # TODO: user should be able to specify dataset kwargs
dataset_kwargs=dict(indptr_dtype=indptr_dtype),
)
A contributor commented:

I see this is the TODO, but I'm not sure of its relevance to

> Added a TODO for letting users specify compression args, since with the default approach the output file grew too large compared to the total size of the inputs.

or, in general, why the resultant dataset is 12GB. Is this obs and var, or?

@selmanozleyen (Member, Author) commented

I am not sure, but should I do an analysis on this? In general, passing dataset kwargs sounds like a good idea to me, given that this function is already aimed at advanced users and larger scales.
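
For concreteness, a sketch of what forwarding user-supplied dataset kwargs could look like, using anndata's existing `write_elem`; whether and how `concat_on_disk` exposes this is exactly the open TODO.

```python
import h5py
from scipy import sparse
from anndata.io import write_elem

X = sparse.random(100, 50, format="csr")
with h5py.File("out.h5ad", "w") as f:
    # dataset_kwargs are forwarded to the underlying create_dataset calls
    write_elem(f, "X", X, dataset_kwargs={"compression": "gzip", "compression_opts": 4})
```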

@ilan-gold ilan-gold modified the milestones: 0.12.2, 0.12.3, 0.12.4 Oct 15, 2025
@ilan-gold ilan-gold modified the milestones: 0.12.4, 0.12.5 Oct 27, 2025
@ilan-gold ilan-gold modified the milestones: 0.12.5, 0.12.7 Nov 6, 2025
@flying-sheep flying-sheep modified the milestones: 0.12.7, 0.12.8 Dec 16, 2025
@selmanozleyen selmanozleyen force-pushed the feature/concat_hdf5_virtual_datasets branch from 729c0f1 to aa468c8 on January 19, 2026, 12:59
@selmanozleyen selmanozleyen changed the title Usage of H5py Virtual Datasets for concat_on_disk perf: Usage of H5py Virtual Datasets for concat_on_disk Jan 19, 2026
@selmanozleyen selmanozleyen changed the title perf: Usage of H5py Virtual Datasets for concat_on_disk feat: Usage of H5py Virtual Datasets for concat_on_disk Jan 19, 2026