AllenInstitute · zrlewis · May 22, 2025
diff --git a/notebooks/10x_snRNASeq_tutorial_part_2c.ipynb b/notebooks/10x_snRNASeq_tutorial_part_2c.ipynb
@@ -0,0 +1,363 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "1a3a7f8f",
+   "metadata": {},
+   "source": [
+    "#10x RNA-seq gene expression data (part 2c)\n",
+    "\n",
+    "We perform subsetting of the Whole Mouse Brain Atlas at a particular taxonomic level. This allows us to create manageable `h5ad` files of certain branches of the taxonomy. \n",
+    "\n",
+    "You need to be connected to the internet to run this notebook and have run through the [getting started notebook](https://alleninstitute.github.io/abc_atlas_access/notebooks/getting_started.html)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "eb01de16",
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "from pathlib import Path\n",
+    "import numpy as np\n",
+    "import anndata\n",
+    "import time\n",
+    "\n",
+    "from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a72a3512",
+   "metadata": {},
+   "source": [
+    "We will interact with the data using the **AbcProjectCache**. This cache object tracks which data has been downloaded and serves the path to the requsted data on disk. For metadata, the cache can also directly serve a up a Pandas Dataframe. See the getting_started notebook for more details on using the cache including installing it if it has not already been.\n",
+    "\n",
+    "**Change the download_base variable to where you have downloaded the data in your system.**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4f24a0ff",
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "download_base = Path('../../data/abc_atlas')\n",
+    "abc_cache = AbcProjectCache.from_cache_dir(download_base)\n",
+    "abc_cache.list_manifest_file_names"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cb0dc2e3",
+   "metadata": {},
+   "source": [
+    "Read in the expanded cell metadata table from the cache. Examples of creating this table are presented in part 1."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "35186179",
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# load cell metadata\n",
+    "cell_metadata = abc_cache.get_metadata_dataframe(directory='WMB-10X', file_name='cell_metadata_with_cluster_annotation')\n",
+    "cell_metadata.set_index('cell_label',inplace=True)\n",
+    "cell_metadata.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7471b41f",
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# list columns of the metadata\n",
+    "list(cell_metadata)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "78a6149f",
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# generate a table of matrices\n",
+    "matrices = cell.groupby(['dataset_label', 'feature_matrix_label'])[['library_label']].count()\n",
+    "matrices.columns  = ['cell_count']\n",
+    "matrices"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "47cb01b2",
+   "metadata": {},
+   "source": [
+    "Below we define a function to extract whole transcriptomic data from selected cells across all matrices in the database. We supply a taxonomic level (`target_grouping_level`) to perform the subsetting on and a `target_grouping_label` which represents the name of the group that we hope to capture."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a9ae45f3",
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "def extract_grouping_to_h5ad(\n",
+    "    cell_metadata: pd.DataFrame,\n",
+    "    feature_matrices_info: pd.DataFrame,\n",
+    "    target_grouping_label: str,\n",
+    "    target_grouping_level: str,\n",
+    "    output_filename: str,\n",
+    "    data_directory: str = None, # Optional: if not using abc_atlas_cache directly\n",
+    "    limit_matrices: int = None, # Optional: for testing, limit number of matrices processed\n",
+    "    additional_metadata_cols: list = None # New: List of column names from cell_metadata to add to .obs\n",
+    ") -> str:\n",
+    "    \"\"\"\n",
+    "    Extracts all cells belonging to a specified subclass from multiple feature matrices\n",
+    "    and saves them into a new AnnData (.h5ad) file, keeping all genes.\n",
+    "    Optionally merges additional metadata columns from the cell_metadata DataFrame\n",
+    "    into the .obs DataFrame of the resulting AnnData object.\n",
+    "\n",
+    "    Args:\n",
+    "        cell_metadata (pd.DataFrame): DataFrame containing cell metadata,\n",
+    "            must include 'feature_matrix_label' and 'subclass_label' columns,\n",
+    "            with cell IDs as index.\n",
+    "        feature_matrices_info (pd.DataFrame): DataFrame containing information about\n",
+    "            feature matrices, with (dataset_id, matrix_path) as index.\n",
+    "        target_grouping_level (str): The level of the taxonomy that you want to select \n",
+    "            from (e.g., 'subclass', 'supertype')\n",
+    "        target_grouping_label (str): The specific subclass label to filter for\n",
+    "            (e.g., \"022 L5 ET CTX Glut\").\n",
+    "        output_filename (str): The name of the .h5ad file to save the extracted data to.\n",
+    "        data_directory (str, optional): The base directory where the .h5ad files are located.\n",
+    "            If None, abc_atlas_cache.get_data_path will be used. Defaults to None.\n",
+    "        limit_matrices (int, optional): If provided, limits the number of feature matrices\n",
+    "            to process. Useful for testing. Defaults to None.\n",
+    "        additional_metadata_cols (list, optional): A list of column names from `cell_metadata`\n",
+    "            to be added to the `.obs` DataFrame of the output AnnData object.\n",
+    "            Defaults to None (no additional columns).\n",
+    "\n",
+    "    Returns:\n",
+    "        str: The path to the newly created .h5ad file.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    extracted_adata_list = []\n",
+    "    processed_count = 0\n",
+    "    total_start_time = time.process_time()\n",
+    "\n",
+    "    print(f\"Starting extraction for grouping: '{target_grouping_label}'...\")\n",
+    "\n",
+    "    for mat_index in feature_matrices_info.index:\n",
+    "        ds = mat_index[0] # dataset_id\n",
+    "        mp = mat_index[1] # matrix_path (e.g., \"WMB-10x_nuclei-001/log2\")\n",
+    "\n",
+    "        print(f\"Processing matrix: {mp}\")\n",
+    "\n",
+    "        # Construct the full path to the h5ad file\n",
+    "        if data_directory:\n",
+    "            file_path = os.path.join(data_directory, mp, 'log2') # Assuming 'log2' is part of the filename\n",
+    "        else:\n",
+    "            # Use abc_atlas_cache if data_directory is not provided\n",
+    "            file_path = abc_cache.get_data_path(directory=ds, file_name=os.path.join(mp, 'log2'))\n",
+    "\n",
+    "        # Filter cell_metadata for cells belonging to the current matrix file AND the target subclass\n",
+    "        # This is crucial because 'cell_metadata' contains info for all cells,\n",
+    "        # but 'ad' only contains data for cells within the current matrix file.\n",
+    "        cells_in_current_matrix = cell_metadata[cell_metadata['feature_matrix_label'] == mp]\n",
+    "        target_cells_in_matrix = cells_in_current_matrix[\n",
+    "            cells_in_current_matrix[target_grouping_level] == target_grouping_label\n",
+    "        ]\n",
+    "\n",
+    "        # Get the cell IDs (index) that match both criteria\n",
+    "        cell_ids_to_extract = target_cells_in_matrix.index\n",
+    "\n",
+    "        if not cell_ids_to_extract.empty:\n",
+    "            start_time = time.process_time()\n",
+    "            try:\n",
+    "                # Read the AnnData object in backed mode for efficiency\n",
+    "                current_adata = ad.read_h5ad(file_path, backed='r')\n",
+    "\n",
+    "                # Subset the AnnData object to include only the target cells and ALL genes\n",
+    "                # Use .to_memory() to load the subset into RAM before appending\n",
+    "                subset_adata = current_adata[cell_ids_to_extract, :].to_memory()\n",
+    "\n",
+    "                # Merge additional metadata columns into subset_adata.obs\n",
+    "                if additional_metadata_cols:\n",
+    "                    for col in additional_metadata_cols:\n",
+    "                        if col in target_cells_in_matrix.columns:\n",
+    "                            # Ensure indices align for merging\n",
+    "                            subset_adata.obs[col] = target_cells_in_matrix.loc[subset_adata.obs_names, col]\n",
+    "                        else:\n",
+    "                            print(f\" - Warning: Column '{col}' not found in cell_metadata. Skipping.\")\n",
+    "\n",
+    "                extracted_adata_list.append(subset_adata)\n",
+    "                print(f\" - Extracted {subset_adata.n_obs} cells from {mp}. Time taken: {time.process_time() - start_time:.2f} seconds\")\n",
+    "\n",
+    "            except FileNotFoundError:\n",
+    "                print(f\" - Warning: File not found for {mp} at {file_path}. Skipping.\")\n",
+    "            except Exception as e:\n",
+    "                print(f\" - Error processing {mp}: {e}. Skipping.\")\n",
+    "            finally:\n",
+    "                # Ensure the file handle is closed if opened in backed mode\n",
+    "                if 'current_adata' in locals() and current_adata.file is not None:\n",
+    "                    current_adata.file.close()\n",
+    "                del current_adata # Free up memory\n",
+    "        else:\n",
+    "            print(f\" - No cells found for grouping '{target_grouping_label}' in matrix {mp}. Skipping.\")\n",
+    "\n",
+    "        processed_count += 1\n",
+    "        if limit_matrices is not None and processed_count >= limit_matrices:\n",
+    "            print(f\"Stopping after processing {limit_matrices} matrices as requested.\")\n",
+    "            break\n",
+    "\n",
+    "    print(\"\\nConcatenating extracted AnnData objects...\")\n",
+    "    if extracted_adata_list:\n",
+    "        # Concatenate all collected AnnData objects\n",
+    "        # 'join='outer'' and 'merge='unique'' ensure all genes and unique obs columns are kept\n",
+    "        final_adata = ad.concat(extracted_adata_list, axis=0, join='outer', merge='unique')\n",
+    "        print(f\"Total cells extracted: {final_adata.n_obs}\")\n",
+    "        print(f\"Total genes retained: {final_adata.n_vars}\")\n",
+    "\n",
+    "        # Save the final AnnData object to a new h5ad file\n",
+    "        final_adata_path = output_filename\n",
+    "        final_adata.write(final_adata_path)\n",
+    "        print(f\"Extracted data saved to: {final_adata_path}\")\n",
+    "    else:\n",
+    "        print(\"No cells were extracted for the specified grouping level.\")\n",
+    "        final_adata_path = None\n",
+    "\n",
+    "    print(f\"Total time taken for extraction: {time.process_time() - total_start_time:.2f} seconds\")\n",
+    "    return final_adata_path"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c104aa26",
+   "metadata": {},
+   "source": [
+    "Now define the target taxonomic level with `target_grouping_level` and the label of the group within that level with `target_grouping_label`.\n",
+    "\n",
+    "If you want to test the function on a subset of the input matrices, uncomment the `limit_matrices` line and it will break after the specified number of matrices. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "09767c03",
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Define your target subclass and output filename\n",
+    "target_subclass = \"022 L5 ET CTX Glut\"\n",
+    "output_h5ad_file = \"l5_et_ctx_cells_all_genes.h5ad\"\n",
+    "\n",
+    "# Call the function\n",
+    "extracted_file = extract_subclass_to_h5ad(\n",
+    "    cell_metadata=cell_metadata,\n",
+    "    feature_matrices_info=matrices,\n",
+    "    target_grouping_level=\"subclass\",\n",
+    "    target_grouping_label=target_subclass,\n",
+    "    output_filename=output_h5ad_file,\n",
+    "    additional_metadata_cols=[ 'cell_barcode', # this can be any of the columns found in the `cell_metadata`\n",
+    "                                 'feature_matrix_label',\n",
+    "                                 'entity',\n",
+    "                                 'region_of_interest_acronym',\n",
+    "                                 'donor_sex',\n",
+    "                                 'dataset_label',\n",
+    "                                 'x',\n",
+    "                                 'y',\n",
+    "                                 'cluster_alias',\n",
+    "                                 'neurotransmitter',\n",
+    "                                 'class',\n",
+    "                                 'subclass',\n",
+    "                                 'supertype',\n",
+    "                                 'cluster',\n",
+    "                                 'neurotransmitter_color',\n",
+    "                                 'class_color',\n",
+    "                                 'subclass_color',\n",
+    "                                 'supertype_color',\n",
+    "                                 'cluster_color',\n",
+    "                                 'region_of_interest_order',\n",
+    "                                 'region_of_interest_color'], # Add columns you want to merge\n",
+    "    # data_directory=\"/path/to/your/abc_atlas_data\" # Uncomment if not using abc_atlas_cache for file paths\n",
+    "    #limit_matrices=5 # Uncomment for quick testing with a few matrices\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c703ceb1",
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# You can then load and work with the new h5ad file:\n",
+    "if extracted_file:\n",
+    "    l5_et_adata = ad.read_h5ad(extracted_file)\n",
+    "    print(l5_et_adata)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f637ec7a",
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# save the new h5ad\n",
+    "l5_et_adata.write_h5ad('../results/l5_et_adata.h5ad')"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}