Implement dataloader h5ad reading in backed mode #325
base: main
Conversation
Probably another test will be necessary. @HugoHakem Let's think about how to see if it ever actually loads the whole thing in memory.
We need to make sure of the following, in the event the test succeeds:
Hah, great catch Hugo, I forgot to set the defaults to actually use backed mode! Oops...
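For context, a minimal sketch of what defaulting to backed mode looks like with anndata's `read_h5ad` (the helper name and signature below are illustrative, not the PR's actual dataloader code):

```python
from pathlib import Path

import anndata as ad


def load_h5ad(path: Path, backed="r") -> ad.AnnData:
    # backed="r" keeps X on disk as an HDF5-backed matrix;
    # pass backed=None to read everything into memory.
    return ad.read_h5ad(path, backed=backed)
```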
Another thing for us to figure out is:
Seems it does not.
Attempted a fix.
So the problem is that this code is going to depend on a fix for this, which might be coming here. Until there is a new anndata version, I think we will have to work around this by using our own updated version of anndata.
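A related (but different) guard from the fork-pinning workaround above, shown only as a sketch: gate the affected tests on the installed anndata version until a release contains the upstream fix. The version bound here is a placeholder, not the real fix version:

```python
# Hypothetical guard; the version bound is a placeholder, not the actual fix release.
import anndata
import pytest
from packaging.version import Version

requires_fixed_anndata = pytest.mark.skipif(
    Version(anndata.__version__) < Version("0.11.0"),  # placeholder bound
    reason="needs an anndata release containing the upstream backed-mode fix",
)
```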
@HugoHakem I got some help from copilot in creating a way to make it seem like we have a massive h5ad file that is fake. The thing actually does not take up a lot of disk space, but if anndata tries to read it into memory, it will allocate 40GB of memory and crash.

```python
from pathlib import Path

import pytest


@pytest.fixture
def massive_h5ad(tmp_path: Path) -> Path:
    import h5py
    import numpy as np

    # Create a dataset that CLAIMS to be ~40GB but uses almost no disk space
    n_obs = 2_000_000  # 2 million cells
    n_vars = 5_000  # 5k genes

    h5ad_path = tmp_path / "massive_fake.h5ad"
    with h5py.File(h5ad_path, "w") as f:
        # Create X dataset with claimed huge size but minimal actual storage.
        # Using fillvalue=0.0 with chunking - chunks are only allocated when written to
        f.create_dataset(
            "X",
            shape=(n_obs, n_vars),
            dtype=np.float32,
            fillvalue=0.0,
            chunks=True,  # Enable chunking so not all data needs to be stored
            compression=None,  # No compression to keep it simple
        )
        # Create minimal obs metadata - just the index is required
        obs_group = f.create_group("obs")
        obs_index_data = np.array([f"CELL_{i:07d}".encode("utf-8") for i in range(n_obs)])
        obs_group.create_dataset("_index", data=obs_index_data, maxshape=(n_obs,), dtype="S12")
        # Create minimal var metadata - just the index is required
        var_group = f.create_group("var")
        var_index_data = np.array([f"GENE_{i:05d}".encode("utf-8") for i in range(n_vars)])
        var_group.create_dataset("_index", data=var_index_data, dtype="S10")
        # Set minimal h5ad format attributes that anndata expects
        f.attrs["encoding-type"] = "anndata"
        f.attrs["encoding-version"] = "0.1.0"

    return h5ad_path
```

(Screenshot: Activity Monitor when I run a pytest test that tries to read this thing in.)
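As a rough sketch of how the fixture might be exercised (assuming anndata accepts the minimal layout it writes), a test can open the fake file in backed mode and touch only metadata, so nothing forces X into memory:

```python
import anndata as ad


def test_massive_h5ad_opens_backed(massive_h5ad):
    # Backed mode should leave X on disk; only metadata is read eagerly.
    adata = ad.read_h5ad(massive_h5ad, backed="r")
    assert adata.isbacked
    assert adata.shape == (2_000_000, 5_000)
    adata.file.close()
```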
It's kinda cool as a check... but it might not be great as a real pytest test because you don't really see a clear failure message. It'll just crash your computer if it doesn't work.
Maybe you could turn this idea into a real test that uses some kind of memory monitoring and provides a real failure message (maybe a pytest xfail) in the case where backed mode is not used. Not sure how far we wanna go here. But I do think backed mode can work for us!
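One possible shape for such a test, sketched here with psutil and an arbitrary 1 GB threshold (both assumptions, not part of this PR): measure the process RSS around the read and fail with an explicit message if memory grows as if X had been materialized.

```python
import anndata as ad
import psutil
import pytest


def test_backed_read_does_not_materialize_x(massive_h5ad):
    process = psutil.Process()
    rss_before = process.memory_info().rss

    adata = ad.read_h5ad(massive_h5ad, backed="r")
    rss_after = process.memory_info().rss
    adata.file.close()

    grown = rss_after - rss_before
    # A full in-memory read of the fake X would need ~40 GB; 1 GB of slack
    # is an arbitrary cutoff for "clearly not materialized".
    if grown > 1024**3:
        pytest.fail(
            f"backed read grew RSS by {grown / 1e9:.1f} GB; "
            "X may have been loaded into memory"
        )
```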
Ooh super cool, good catch on that!
So we could use this strategy in the other test to see what happens for large anndata in backed mode (like whether it can be processed or not).
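A rough illustration of "processing" a backed AnnData without pulling X into memory (the chunk size and the per-gene sum are arbitrary choices for the sketch): slice row blocks out of the backed, dense X and reduce them incrementally.

```python
import anndata as ad
import numpy as np


def per_gene_sums(h5ad_path, chunk_size=10_000):
    adata = ad.read_h5ad(h5ad_path, backed="r")
    totals = np.zeros(adata.n_vars, dtype=np.float64)
    for start in range(0, adata.n_obs, chunk_size):
        stop = min(start + chunk_size, adata.n_obs)
        # Slicing a backed dense X reads only this block from the HDF5 dataset.
        totals += np.asarray(adata.X[start:stop]).sum(axis=0)
    adata.file.close()
    return totals
```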
Closes #324