Implement dataloader h5ad reading in backed mode #325
base: main
Conversation
Probably another test will be necessary. @HugoHakem Let's think about how to see if it ever actually loads the whole thing in memory.
We need to make sure of the following, in the event the test succeeds:
Hah, great catch Hugo, I forgot to set the defaults to actually use backed mode! Oops...
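For context, a minimal sketch of what defaulting to backed mode looks like with anndata's `read_h5ad` (the helper name and signature below are illustrative, not the PR's actual dataloader code):

```python
from pathlib import Path

import anndata as ad


def load_h5ad(path: Path, backed="r") -> ad.AnnData:
    # backed="r" keeps X on disk as an HDF5-backed matrix;
    # pass backed=None to read everything into memory.
    return ad.read_h5ad(path, backed=backed)
```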
Another thing for us to figure out is:
Seems it does not.
Attempted a fix.
So the problem is that this code is going to depend on a fix for this, which might be coming here. Until there is a new anndata version, I think we will have to work around this by using our own updated version of anndata.
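A related (but different) guard from the fork-pinning workaround above, shown only as a sketch: gate the affected tests on the installed anndata version until a release contains the upstream fix. The version bound here is a placeholder, not the real fix version:

```python
# Hypothetical guard; the version bound is a placeholder, not the actual fix release.
import anndata
import pytest
from packaging.version import Version

requires_fixed_anndata = pytest.mark.skipif(
    Version(anndata.__version__) < Version("0.11.0"),  # placeholder bound
    reason="needs an anndata release containing the upstream backed-mode fix",
)
```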
@HugoHakem I got some help from copilot in creating a way to make it seem like we have a massive h5ad file that is fake. The thing actually does not take up a lot of disk space, but if anndata tries to read it into memory, it will allocate 40GB of memory and crash.

```python
from pathlib import Path

import pytest


@pytest.fixture
def massive_h5ad(tmp_path: Path) -> Path:
    import h5py
    import numpy as np

    # Create a dataset that CLAIMS to be ~40GB but uses almost no disk space
    n_obs = 2_000_000  # 2 million cells
    n_vars = 5_000  # 5k genes

    h5ad_path = tmp_path / "massive_fake.h5ad"
    with h5py.File(h5ad_path, "w") as f:
        # Create X dataset with claimed huge size but minimal actual storage.
        # Using fillvalue=0.0 with chunking - chunks are only allocated when written to
        f.create_dataset(
            "X",
            shape=(n_obs, n_vars),
            dtype=np.float32,
            fillvalue=0.0,
            chunks=True,  # Enable chunking so not all data needs to be stored
            compression=None,  # No compression to keep it simple
        )
        # Create minimal obs metadata - just the index is required
        obs_group = f.create_group("obs")
        obs_index_data = np.array([f"CELL_{i:07d}".encode("utf-8") for i in range(n_obs)])
        obs_group.create_dataset("_index", data=obs_index_data, maxshape=(n_obs,), dtype="S12")
        # Create minimal var metadata - just the index is required
        var_group = f.create_group("var")
        var_index_data = np.array([f"GENE_{i:05d}".encode("utf-8") for i in range(n_vars)])
        var_group.create_dataset("_index", data=var_index_data, dtype="S10")
        # Set minimal h5ad format attributes that anndata expects
        f.attrs["encoding-type"] = "anndata"
        f.attrs["encoding-version"] = "0.1.0"

    return h5ad_path
```

(Screenshot: Activity Monitor when I run a pytest test that tries to read this thing in.)
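As a rough sketch of how the fixture might be exercised (assuming anndata accepts the minimal layout it writes), a test can open the fake file in backed mode and touch only metadata, so nothing forces X into memory:

```python
import anndata as ad


def test_massive_h5ad_opens_backed(massive_h5ad):
    # Backed mode should leave X on disk; only metadata is read eagerly.
    adata = ad.read_h5ad(massive_h5ad, backed="r")
    assert adata.isbacked
    assert adata.shape == (2_000_000, 5_000)
    adata.file.close()
```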
It's kinda cool as a check... but it might not be great as a real pytest test because you don't really see a clear failure message. It'll just crash your computer if it doesn't work.
Maybe you could turn this idea into a real test that uses some kind of memory monitoring and provides a real failure message (maybe a pytest xfail) in the case where backed mode is not used. Not sure how far we wanna go here. But I do think backed mode can work for us!
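One possible shape for such a test, sketched here with psutil and an arbitrary 1 GB threshold (both assumptions, not part of this PR): measure the process RSS around the read and fail with an explicit message if memory grows as if X had been materialized.

```python
import anndata as ad
import psutil
import pytest


def test_backed_read_does_not_materialize_x(massive_h5ad):
    process = psutil.Process()
    rss_before = process.memory_info().rss

    adata = ad.read_h5ad(massive_h5ad, backed="r")
    rss_after = process.memory_info().rss
    adata.file.close()

    grown = rss_after - rss_before
    # A full in-memory read of the fake X would need ~40 GB; 1 GB of slack
    # is an arbitrary cutoff for "clearly not materialized".
    if grown > 1024**3:
        pytest.fail(
            f"backed read grew RSS by {grown / 1e9:.1f} GB; "
            "X may have been loaded into memory"
        )
```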
Ooh super cool, good catch on that!
So we could use this strategy in the other test to see what happens for large anndata in backed mode (like whether it can be processed or not).
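A rough illustration of "processing" a backed AnnData without pulling X into memory (the chunk size and the per-gene sum are arbitrary choices for the sketch): slice row blocks out of the backed, dense X and reduce them incrementally.

```python
import anndata as ad
import numpy as np


def per_gene_sums(h5ad_path, chunk_size=10_000):
    adata = ad.read_h5ad(h5ad_path, backed="r")
    totals = np.zeros(adata.n_vars, dtype=np.float64)
    for start in range(0, adata.n_obs, chunk_size):
        stop = min(start + chunk_size, adata.n_obs)
        # Slicing a backed dense X reads only this block from the HDF5 dataset.
        totals += np.asarray(adata.X[start:stop]).sum(axis=0)
    adata.file.close()
    return totals
```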
Closes #324