
Conversation

@sjfleming (Contributor)

Closes #324 (Data loaders should read anndata files from disk in backed mode)

@sjfleming (Contributor Author)

We'll probably need another test, @HugoHakem.

Let's think about how to check whether it ever actually loads the whole thing into memory.

@HugoHakem (Collaborator)

In the event the test succeeds, we need to make sure of the following:

  1. Are they even using the new parameter?

  2. If yes, are they succeeding because, under the hood, they are loading the full object into memory?

    a. In that event, let's test with a bigger dataset that cannot possibly fit into memory (see the specs for standard GitHub-hosted runners).

    b. If that fails, we need to consider:

     i. A loader that reads the file in batches (read mode). This should not be too tricky (see the sketch after this list).

     ii. Giving up and simply avoiding files larger than memory by pre-chunking them, eventually by providing a utility function to chunk an AnnData file.
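
A minimal sketch of what (i) and (ii) could look like, assuming anndata's backed read mode; the function names `iter_h5ad_batches` and `chunk_h5ad` are hypothetical, not existing API:

```python
import anndata as ad


def iter_h5ad_batches(path, batch_size: int = 10_000):
    """Yield small in-memory AnnData chunks; the full file stays on disk."""
    adata = ad.read_h5ad(path, backed="r")
    for start in range(0, adata.n_obs, batch_size):
        yield adata[start : start + batch_size].to_memory()


def chunk_h5ad(path, out_dir, batch_size: int = 100_000):
    """Option (ii): pre-chunk one big .h5ad into smaller files."""
    for i, batch in enumerate(iter_h5ad_batches(path, batch_size)):
        batch.write_h5ad(f"{out_dir}/chunk_{i:04d}.h5ad")
```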
    

@sjfleming (Contributor Author)

Hah, great catch Hugo! I forgot to set the defaults to actually use backed mode. Oops...
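
For the record, the shape of the fix is just making backed mode the default instead of merely accepted; `read_h5ad_file` below is a stand-in name for the loader, not necessarily the real function in this repo:

```python
import anndata as ad


def read_h5ad_file(filename: str, backed: str | None = "r") -> ad.AnnData:
    # before the fix, backed effectively defaulted to None (full in-memory read)
    return ad.read_h5ad(filename, backed=backed)
```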

@sjfleming (Contributor Author)

Another thing for us to figure out:

  • we know backed mode may work for local files
  • does it actually work for remote files, like read_h5ad_url and read_h5ad_gcs?
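
For context, a backed remote read would presumably have to look something like this sketch (an fsspec file-like object handed to h5py); the bucket URL is made up, and this is the pattern in question rather than a confirmed recipe:

```python
import fsspec
import h5py

# h5py can open Python file-like objects, so in principle:
with fsspec.open("gs://example-bucket/data.h5ad", "rb") as f:
    h5 = h5py.File(f, "r")
    # anndata would then have to keep this handle open for backed access,
    # which is where things break down (see below)
```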

@sjfleming (Contributor Author)

Seems it does not

@sjfleming (Contributor Author)

Attempted a fix

@sjfleming (Contributor Author)

So the problem is that this code is going to depend on a fix for scverse/anndata#2064, which might be coming in scverse/anndata#2066.

Until there is a new anndata release, I think we will have to work around this by using our own updated version of anndata.
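
As a hypothetical stopgap, that could be a direct git reference in our requirements (the exact ref below is a placeholder until the fix lands):

```
anndata @ git+https://github.com/scverse/anndata.git@main
```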

@sjfleming (Contributor Author) commented Jul 31, 2025

@HugoHakem I got some help from Copilot in creating a way to fake a massive h5ad file. It actually takes up very little disk space, but if anndata tries to read it into memory, it will allocate 40 GB and crash.

```python
from pathlib import Path

import h5py
import numpy as np
import pytest


@pytest.fixture
def massive_h5ad(tmp_path: Path) -> Path:
    # Create a dataset that CLAIMS to be ~40GB but uses almost no disk space
    n_obs = 2_000_000  # 2 million cells
    n_vars = 5_000     # 5k genes

    h5ad_path = tmp_path / "massive_fake.h5ad"

    with h5py.File(h5ad_path, "w") as f:
        # Create X with a huge claimed shape but minimal actual storage:
        # with fillvalue=0.0 and chunking, chunks are only allocated on write
        f.create_dataset(
            "X",
            shape=(n_obs, n_vars),
            dtype=np.float32,
            fillvalue=0.0,
            chunks=True,       # enable chunking so no data needs to be stored
            compression=None,  # no compression, to keep it simple
        )

        # Minimal obs metadata - just the index is required.
        # The full 2M-entry index of short strings is still tiny (~24 MB).
        obs_group = f.create_group("obs")
        obs_index_data = np.array([f"CELL_{i:07d}".encode("utf-8") for i in range(n_obs)])
        obs_group.create_dataset("_index", data=obs_index_data, maxshape=(n_obs,), dtype="S12")

        # Minimal var metadata - just the index is required
        var_group = f.create_group("var")
        var_index_data = np.array([f"GENE_{i:05d}".encode("utf-8") for i in range(n_vars)])
        var_group.create_dataset("_index", data=var_index_data, dtype="S10")

        # Set the minimal h5ad format attributes that anndata expects
        f.attrs["encoding-type"] = "anndata"
        f.attrs["encoding-version"] = "0.1.0"

    return h5ad_path
```

Activity monitor when I run a pytest test that tries to read this thing in backed=False mode:

[screenshot: Activity Monitor showing memory usage during the read]

@sjfleming (Contributor Author)

It's kinda cool as a check... but it might not be great as a real pytest test because you don't really see a clear failure message. It'll just crash your computer if it doesn't work.

@sjfleming (Contributor Author)

Maybe you could turn this idea into a real test that uses some kind of memory monitoring and provides a real failure message (maybe a pytest xfail) in the case where backed mode is not used. Not sure how far we wanna go here. But I do think backed mode can work for us! One possible shape is sketched below.
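
One way to get a clean failure instead of a crashed machine is to cap the process's address space before the read. This is only a sketch, assuming a Linux runner (resource.RLIMIT_AS is not reliably enforced on macOS), and the direct ad.read_h5ad call stands in for whatever loader we're actually testing:

```python
import resource

import anndata as ad
import pytest


def test_read_uses_backed_mode(massive_h5ad):
    # cap address space at ~2 GB so a full in-memory read raises MemoryError
    # instead of exhausting the machine
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (2 * 1024**3, hard))
    try:
        adata = ad.read_h5ad(massive_h5ad, backed="r")  # the loader under test
        assert adata.isbacked
    except MemoryError:
        pytest.fail("the reader loaded the full matrix into memory")
    finally:
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
```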

@HugoHakem (Collaborator)

Ooh super cool, good catch on that!

> @HugoHakem I got some help from Copilot in creating a way to fake a massive h5ad file. It actually takes up very little disk space, but if anndata tries to read it into memory, it will allocate 40 GB and crash.

> ...

So we could use this strategy in the other test to see what happens for a large AnnData in backed mode (like whether it can actually be processed or not), along the lines of the sketch below.
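
Something like this, reusing the massive_h5ad fixture and assuming anndata accepts its minimal layout; a rough sketch:

```python
import anndata as ad


def test_backed_mode_can_process_large_file(massive_h5ad):
    adata = ad.read_h5ad(massive_h5ad, backed="r")
    # only the requested slice should be pulled off disk
    batch = adata[:128].to_memory()
    assert batch.n_obs == 128
```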
