feat: Usage of H5py Virtual Datasets for concat_on_disk #2032

Open

selmanozleyen wants to merge 25 commits into scverse:main from selmanozleyen:feature/concat_hdf5_virtual_datasets

Conversation

@selmanozleyen (Member) commented Jul 17, 2025

When there is no need for reindexing and the backend is HDF5, we can just use virtual datasets instead. This requires no in-memory copies; the output file just links to the original file locations. I was able to concat the Tahoe datasets (314GB in total) in a few minutes, and the result was a 12GB .h5ad file.
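
For readers unfamiliar with HDF5 virtual datasets, here is a minimal sketch of the mechanism (not the PR's actual code; the file names and the `X/data` layout are assumptions): the merged file stores only references into the source files.

```python
import h5py

source_files = ["a.h5ad", "b.h5ad"]  # hypothetical inputs with CSR X

# Look up how long each source's 1-D data array is.
lengths = []
for path in source_files:
    with h5py.File(path, "r") as f:
        lengths.append(f["X/data"].shape[0])

# Map each source region into one virtual layout; no data is copied.
layout = h5py.VirtualLayout(shape=(sum(lengths),), dtype="f4")
offset = 0
for path, n in zip(source_files, lengths):
    layout[offset : offset + n] = h5py.VirtualSource(path, "X/data", shape=(n,))
    offset += n

with h5py.File("merged.h5ad", "w") as f:
    f.create_virtual_dataset("X/data", layout)
```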

Other notes:

  • The indptr arrays totaled 780 MB across all the Tahoe files, so I just concatenate those in memory instead (see the sketch after this list).
  • Added a TODO for letting users specify compression args, since with the default approach the output file grew too large compared to the total size of the inputs.
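
A sketch of why indptr cannot itself be virtual (a hypothetical helper, assuming a CSR layout under `X/`): each file's indptr must be shifted by the running nnz total, so the stored values change and a plain link will not do.

```python
import h5py
import numpy as np

def concat_indptrs(paths):  # hypothetical helper, not the PR's code
    parts, nnz_offset = [], 0
    for i, path in enumerate(paths):
        with h5py.File(path, "r") as f:
            indptr = f["X/indptr"][...]
        # Drop the leading 0 of every file after the first to avoid overlap,
        # and shift by the number of nonzeros seen so far.
        parts.append(indptr[1 if i else 0:] + nnz_offset)
        nnz_offset += int(indptr[-1])
    return np.concatenate(parts)
```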

TODOs:

  • Tests added (tests already cover this case)
  • Release note added (or unnecessary)
  • Write about this feature in the docstrings

@codecov codecov bot commented Jul 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.71%. Comparing base (c6f6f54) to head (543ca00).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2032      +/-   ##
==========================================
- Coverage   86.64%   84.71%   -1.93%     
==========================================
  Files          46       46              
  Lines        7218     7276      +58     
==========================================
- Hits         6254     6164      -90     
- Misses        964     1112     +148     
Files with missing lines               Coverage Δ
src/anndata/_core/sparse_dataset.py    93.71% <100.00%> (+0.92%) ⬆️
src/anndata/experimental/merge.py      91.83% <100.00%> (+0.84%) ⬆️

... and 7 files with indirect coverage changes

@ilan-gold ilan-gold added this to the 0.12.1 milestone Jul 17, 2025
@ilan-gold (Contributor) left a comment


  1. I think this should be put behind an argument so it is opt-in (a sketch of what that could look like follows this list).
  2. A single test should be enough to ensure the results match.
  3. When the dataset is read back in and a backing file has been deleted, does HDF5 raise an error? Or will it error out oddly somewhere inside anndata once you try to access the data?
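
A hypothetical sketch of the opt-in API suggested in point 1; the flag name `use_virtual_datasets` is an assumption for illustration, not the PR's actual parameter:

```python
from anndata.experimental import concat_on_disk

concat_on_disk(
    in_files=["a.h5ad", "b.h5ad"],
    out_file="merged.h5ad",
    use_virtual_datasets=True,  # assumed name; off by default
)
```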

@ilan-gold ilan-gold modified the milestones: 0.12.1, 0.12.2 Jul 23, 2025
@selmanozleyen (Member, Author) commented Jul 29, 2025

> I think this should be put behind an argument so it is opt-in
> A single test should be enough to ensure results match

These are done.

> does h5 raise an error?

According to the requirements documentation: https://support.hdfgroup.org/releases/hdf5/documentation/rfc/HDF5-VDS-requirements-use-cases-2014-12-10.pdf

> It will be the user's responsibility to maintain the consistency of a VDS. If source files are
> unavailable, the library will report an error or use fill values when doing I/O on the VDS
> elements mapped to the elements in the missing source files.

From my experience, it seems to fall back to a cache or something similar: when I delete the original files, it doesn't throw an error and the resulting file looks as if nothing changed. This behaviour isn't discussed much in https://docs.h5py.org.
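
A minimal way to probe this behaviour yourself (a sketch, not code from the PR): build a VDS over one source file, delete the source, and see whether reads raise or return fill values.

```python
import os
import h5py
import numpy as np

with h5py.File("src.h5", "w") as f:
    f.create_dataset("x", data=np.arange(5, dtype="i4"))

layout = h5py.VirtualLayout(shape=(5,), dtype="i4")
layout[:] = h5py.VirtualSource("src.h5", "x", shape=(5,))
with h5py.File("vds.h5", "w") as f:
    # fillvalue is what HDF5 substitutes for missing/unmapped regions
    f.create_virtual_dataset("x", layout, fillvalue=-1)

os.remove("src.h5")
with h5py.File("vds.h5", "r") as f:
    print(f["x"][...])  # may print [-1 -1 -1 -1 -1] instead of raising
```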

@ilan-gold (Contributor) commented

@selmanozleyen Do you want my review on this again or are the comments still somewhat unaddressed?

@selmanozleyen (Member, Author) commented

> Do you want my review on this again or are the comments still somewhat unaddressed?

I found the behavior undefined/unpredictable when the source files are deleted. I couldn't find a way to overcome that because it's not stated clearly in the docs. Sometimes it errors, sometimes it doesn't, as seen in the CI tests.

If we are fine with just documenting this and merging, I will remove the failure assertions and then it will be ready to merge.

@ilan-gold (Contributor) left a comment

> I found the behavior undefined/unpredictable when the source files are deleted. I couldn't find a way to overcome that because it's not stated clearly in the docs.

Can we ask the developers of h5py?

> If we are fine with just documenting this and merging, I will remove the failure assertions and then it will be ready to merge.

I would like to at least open an issue with the h5py people and give it a day or two (I have some changes requested here anyway).

)


def test_anndatas_virtual_concat_missing_file(
A contributor commented:

We shouldn't have multiple different functions that do almost identical things. Please refactor.
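
A hypothetical shape for that refactor (illustrative names only, not the PR's code): collapse the near-duplicate tests into one parametrized test.

```python
import pytest

@pytest.mark.parametrize("delete_source_file", [False, True])
def test_anndatas_virtual_concat(tmp_path, delete_source_file):
    # Build inputs, run concat_on_disk with virtual datasets enabled,
    # optionally delete a source file, then assert on the result.
    ...
```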

@selmanozleyen (Member, Author) commented

done

Comment on lines +247 to +249
init_elem, # TODO: user should be able to specify dataset kwargs
dataset_kwargs=dict(indptr_dtype=indptr_dtype),
)
A contributor commented:

I see this is the TODO, but I'm not sure of its relevance to

> Added a TODO for letting users specify compression args, since with the default approach the output file grew too large compared to the total size of the inputs.

or, in general, why the resultant dataset is 12GB. Is this obs and var, or?

@selmanozleyen (Member, Author) commented

I am not sure, but should I do an analysis on this? In general, passing dataset kwargs sounds like a good idea to me, given that this function is already aimed at advanced users and larger scales.
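
For concreteness, a sketch of what forwarding user-supplied dataset kwargs could look like, using anndata's existing `write_elem`; whether and how `concat_on_disk` exposes this is exactly the open TODO.

```python
import h5py
from scipy import sparse
from anndata.io import write_elem

X = sparse.random(100, 50, format="csr")
with h5py.File("out.h5ad", "w") as f:
    # dataset_kwargs are forwarded to the underlying create_dataset calls
    write_elem(f, "X", X, dataset_kwargs={"compression": "gzip", "compression_opts": 4})
```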

@ilan-gold ilan-gold modified the milestones: 0.12.2, 0.12.3, 0.12.4 Oct 15, 2025
@ilan-gold ilan-gold modified the milestones: 0.12.4, 0.12.5 Oct 27, 2025
@ilan-gold ilan-gold modified the milestones: 0.12.5, 0.12.7 Nov 6, 2025
@flying-sheep flying-sheep modified the milestones: 0.12.7, 0.12.8 Dec 16, 2025
@selmanozleyen selmanozleyen force-pushed the feature/concat_hdf5_virtual_datasets branch from 729c0f1 to aa468c8 on January 19, 2026, 12:59
@selmanozleyen selmanozleyen changed the title Usage of H5py Virtual Datasets for concat_on_disk perf: Usage of H5py Virtual Datasets for concat_on_disk Jan 19, 2026
@selmanozleyen selmanozleyen changed the title perf: Usage of H5py Virtual Datasets for concat_on_disk feat: Usage of H5py Virtual Datasets for concat_on_disk Jan 19, 2026