diff --git a/docs/index.md b/docs/index.md
index dd61acc..aad5984 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -117,6 +117,5 @@ api.md
 changelog.md
 contributing.md
 references.md
-
-notebooks/example
+notebooks/index
 ```
diff --git a/docs/notebooks/example.ipynb b/docs/notebooks/example.ipynb
index 24eab28..6ffa0ce 100644
--- a/docs/notebooks/example.ipynb
+++ b/docs/notebooks/example.ipynb
@@ -4,13 +4,338 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Example notebook"
+    "# Quickstart `annbatch`"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This notebook will walk you through the following steps:\n",
+    "1. How to convert an existing collection of `anndata` files into a shuffled, zarr-based collection of `anndata` datasets\n",
+    "2. How to load the converted collection using `annbatch`\n",
+    "3. How to extend an existing collection with new `anndata` datasets\n",
+    "\n",
+    "To use this notebook, install the extras:\n",
+    "\n",
+    "```\n",
+    "pip install \"annbatch[zarrs,torch]\"\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "--2025-10-09 09:43:19--  https://datasets.cellxgene.cziscience.com/866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad\n",
+      "Resolving datasets.cellxgene.cziscience.com (datasets.cellxgene.cziscience.com)... 18.64.79.73, 18.64.79.80, 18.64.79.109, ...\n",
+      "Connecting to datasets.cellxgene.cziscience.com (datasets.cellxgene.cziscience.com)|18.64.79.73|:443... connected.\n",
+      "HTTP request sent, awaiting response... 200 OK\n",
+      "Length: 773247972 (737M) [binary/octet-stream]\n",
+      "Saving to: ‘866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad’\n",
+      "\n",
+      "866d7d5e-436b-4dbd- 100%[===================>] 737.43M  398MB/s    in 1.9s    \n",
+      "\n",
+      "2025-10-09 09:43:21 (398 MB/s) - ‘866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad’ saved [773247972/773247972]\n",
+      "\n",
+      "--2025-10-09 09:43:22--  https://datasets.cellxgene.cziscience.com/f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad\n",
+      "Resolving datasets.cellxgene.cziscience.com (datasets.cellxgene.cziscience.com)... 18.64.79.73, 18.64.79.80, 18.64.79.72, ...\n",
+      "Connecting to datasets.cellxgene.cziscience.com (datasets.cellxgene.cziscience.com)|18.64.79.73|:443... connected.\n",
+      "HTTP request sent, awaiting response... 200 OK\n",
+      "Length: 1631759823 (1.5G) [binary/octet-stream]\n",
+      "Saving to: ‘f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad’\n",
+      "\n",
+      "f81463b8-4986-4904- 100%[===================>]   1.52G  425MB/s    in 3.9s    \n",
+      "\n",
+      "2025-10-09 09:43:26 (403 MB/s) - ‘f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad’ saved [1631759823/1631759823]\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Download two example datasets from CELLxGENE\n",
+    "!wget https://datasets.cellxgene.cziscience.com/866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad\n",
+    "!wget https://datasets.cellxgene.cziscience.com/f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**IMPORTANT**: Configure `zarrs`\n",
+    "\n",
+    "This step is required both for converting existing `anndata` files into a performant, shuffled collection of datasets and for mini-batch loading."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       ""
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import zarr\n",
+    "import zarrs  # noqa\n",
+    "\n",
+    "zarr.config.set({\"codec_pipeline.path\": \"zarrs.ZarrsCodecPipeline\"})"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import warnings\n",
+    "\n",
+    "# Suppress zarr vlen-utf8 codec warnings\n",
+    "warnings.filterwarnings(\n",
+    "    \"ignore\",\n",
+    "    message=\"The codec `vlen-utf8` is currently not part in the Zarr format 3 specification.*\",\n",
+    "    category=UserWarning,\n",
+    "    module=\"zarr.codecs.vlen_utf8\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Converting existing `anndata` files into a shuffled collection"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The conversion code will take care of the following things:\n",
+    "* Align (outer join) the gene spaces across all datasets listed in `adata_paths`.\n",
+    "  * The gene spaces are outer-joined based on the gene names provided in the `var_names` field of the individual `AnnData` objects.\n",
+    "  * If you want to subset to a specific gene space, you can provide a list of gene names via the `var_subset` parameter (an illustrative call is sketched below the conversion example).\n",
+    "* Shuffle the cells across all datasets (this works on larger-than-memory datasets as well).\n",
+    "  * This is important for block-wise shuffling during data loading.\n",
+    "* Split the shuffled data across multiple output datasets:\n",
+    "  * The size of each individual output dataset can be controlled via the `n_obs_per_dataset` parameter.\n",
+    "  * We recommend choosing a dataset size that comfortably fits into system memory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/mnt/volume/arrayloaders/src/annbatch/io.py:228: UserWarning: Some anndatas have layers keys not present in others' layers, consider stopping and using the `transform_input_adata` argument to alter layers accordingly.\n",
+      "  adata_concat = _lazy_load_with_obs_var_in_memory(adata_paths)\n",
+      "  0%|          | 0/1 [00:00<?, ?it/s]"
+     ]
+    }
+   ],
+   "source": [
+    "from anndata import AnnData\n",
+    "\n",
+    "from annbatch import create_anndata_collection\n",
+    "\n",
+    "\n",
+    "def del_layers(adata: AnnData) -> AnnData:\n",
+    "    del adata.layers  # soupX is present in one of the datasets' layers but not the other\n",
+    "    return adata\n",
+    "\n",
+    "\n",
+    "create_anndata_collection(\n",
+    "    # List all the h5ad files you want to include in the collection\n",
+    "    adata_paths=[\"866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad\", \"f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad\"],\n",
+    "    # Path to store the output collection\n",
+    "    output_path=\"annbatch_collection\",\n",
+    "    shuffle=True,  # Whether to pre-shuffle the cells of the collection\n",
+    "    n_obs_per_dataset=2_097_152,  # Number of cells per dataset shard\n",
+    "    var_subset=None,  # Optionally subset the collection to a specific gene space\n",
+    "    should_denseify=False,\n",
+    "    transform_input_adata=del_layers,\n",
+    ")"
+   ]
+  },
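+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The call above keeps the full outer-joined gene space (`var_subset=None`). As an illustration of `var_subset`, a conversion restricted to a small gene panel could look like the following sketch; the gene identifiers and the output path are hypothetical placeholders, not values from the datasets above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Hypothetical sketch of `var_subset` usage - the gene IDs below are\n",
+    "# placeholders; use identifiers matching the `var_names` of your datasets.\n",
+    "gene_panel = [\"ENSG00000121410\", \"ENSG00000148584\", \"ENSG00000175899\"]\n",
+    "\n",
+    "create_anndata_collection(\n",
+    "    adata_paths=[\"866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad\", \"f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad\"],\n",
+    "    output_path=\"annbatch_collection_panel\",  # hypothetical separate output path\n",
+    "    shuffle=True,\n",
+    "    var_subset=gene_panel,  # keep only these genes in the output collection\n",
+    "    transform_input_adata=del_layers,\n",
+    ")"
+   ]
+  },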
200 OK\n", + "Length: 1631759823 (1.5G) [binary/octet-stream]\n", + "Saving to: ‘f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad’\n", + "\n", + "f81463b8-4986-4904- 100%[===================>] 1.52G 425MB/s in 3.9s \n", + "\n", + "2025-10-09 09:43:26 (403 MB/s) - ‘f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad’ saved [1631759823/1631759823]\n", + "\n" + ] + } + ], + "source": [ + "# Download two example datasets from CELLxGENE\n", + "!wget https://datasets.cellxgene.cziscience.com/866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad\n", + "!wget https://datasets.cellxgene.cziscience.com/f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**IMPORTANT**: Configure zarrs\n", + "\n", + "This step is both required for converting existing `anndata` files into a performant, shuffled collection of datasets for mini batch loading" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import zarr\n", + "import zarrs # noqa\n", + "\n", + "zarr.config.set({\"codec_pipeline.path\": \"zarrs.ZarrsCodecPipeline\"})" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import warnings\n", + "\n", + "# Suppress zarr vlen-utf8 codec warnings\n", + "warnings.filterwarnings(\n", + " \"ignore\",\n", + " message=\"The codec `vlen-utf8` is currently not part in the Zarr format 3 specification.*\",\n", + " category=UserWarning,\n", + " module=\"zarr.codecs.vlen_utf8\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Converting existing `anndata` files into a shuffled collection" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The conversion code will take care of the following things:\n", + "* Align (outer join) the gene spaces across all datasets listed in `adata_paths`\n", + " * The gene spaces are outer-joined based on the gene names provided in the `var_names` field of the individual `AnnData` objects.\n", + " * If you want to subset to specific gene space, you can provide a list of gene names via the `var_subset` parameter.\n", + "* Shuffle the cells across all datasets (this works on larger than memory datasets as well).\n", + " * This is important for block-wise shuffling during data loading.\n", + "* Shuffle the input files across multiple output datasets:\n", + " * The size of each individual output dataset can be controlled via the `n_obs_per_dataset` parameter.\n", + " * We recommend to choose a dataset size that comfortably fits into system memory." 
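+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a minimal sketch of such a consumption loop (an illustration, not the original training cell): it assumes that iterating `ds` yields `(x, cell_type)` tuples, since `obs_keys=\"cell_type\"` and `to_torch=True` were set above, and that a CUDA device is available."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Minimal consumption sketch - assumes `ds` yields (x, cell_type) tuples\n",
+    "# (because `obs_keys=\"cell_type\"` was set) and that a GPU is available.\n",
+    "from tqdm import tqdm\n",
+    "\n",
+    "for x, cell_type in tqdm(ds):\n",
+    "    x = x.cuda().to_dense()  # densify on the GPU, as recommended above\n",
+    "    # ... feed the dense batch `x` into your model here ...\n"
+   ]
+  },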
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "  0%|          | 0/171792 [00:00