diff --git a/notebooks/10x_snRNASeq_tutorial_part_2c.ipynb b/notebooks/10x_snRNASeq_tutorial_part_2c.ipynb new file mode 100644 index 00000000..47a89490 --- /dev/null +++ b/notebooks/10x_snRNASeq_tutorial_part_2c.ipynb @@ -0,0 +1,363 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "1a3a7f8f", + "metadata": {}, + "source": [ + "#10x RNA-seq gene expression data (part 2c)\n", + "\n", + "We perform subsetting of the Whole Mouse Brain Atlas at a particular taxonomic level. This allows us to create manageable `h5ad` files of certain branches of the taxonomy. \n", + "\n", + "You need to be connected to the internet to run this notebook and have run through the [getting started notebook](https://alleninstitute.github.io/abc_atlas_access/notebooks/getting_started.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eb01de16", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "from pathlib import Path\n", + "import numpy as np\n", + "import anndata\n", + "import time\n", + "\n", + "from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache" + ] + }, + { + "cell_type": "markdown", + "id": "a72a3512", + "metadata": {}, + "source": [ + "We will interact with the data using the **AbcProjectCache**. This cache object tracks which data has been downloaded and serves the path to the requsted data on disk. For metadata, the cache can also directly serve a up a Pandas Dataframe. See the getting_started notebook for more details on using the cache including installing it if it has not already been.\n", + "\n", + "**Change the download_base variable to where you have downloaded the data in your system.**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4f24a0ff", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "download_base = Path('../../data/abc_atlas')\n", + "abc_cache = AbcProjectCache.from_cache_dir(download_base)\n", + "abc_cache.list_manifest_file_names" + ] + }, + { + "cell_type": "markdown", + "id": "cb0dc2e3", + "metadata": {}, + "source": [ + "Read in the expanded cell metadata table from the cache. Examples of creating this table are presented in part 1." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "35186179", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "# load cell metadata\n", + "cell_metadata = abc_cache.get_metadata_dataframe(directory='WMB-10X', file_name='cell_metadata_with_cluster_annotation')\n", + "cell_metadata.set_index('cell_label',inplace=True)\n", + "cell_metadata.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7471b41f", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "# list columns of the metadata\n", + "list(cell_metadata)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "78a6149f", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "# generate a table of matrices\n", + "matrices = cell.groupby(['dataset_label', 'feature_matrix_label'])[['library_label']].count()\n", + "matrices.columns = ['cell_count']\n", + "matrices" + ] + }, + { + "cell_type": "markdown", + "id": "47cb01b2", + "metadata": {}, + "source": [ + "Below we define a function to extract whole transcriptomic data from selected cells across all matrices in the database. We supply a taxonomic level (`target_grouping_level`) to perform the subsetting on and a `target_grouping_label` which represents the name of the group that we hope to capture." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a9ae45f3", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "def extract_grouping_to_h5ad(\n", + " cell_metadata: pd.DataFrame,\n", + " feature_matrices_info: pd.DataFrame,\n", + " target_grouping_label: str,\n", + " target_grouping_level: str,\n", + " output_filename: str,\n", + " data_directory: str = None, # Optional: if not using abc_atlas_cache directly\n", + " limit_matrices: int = None, # Optional: for testing, limit number of matrices processed\n", + " additional_metadata_cols: list = None # New: List of column names from cell_metadata to add to .obs\n", + ") -> str:\n", + " \"\"\"\n", + " Extracts all cells belonging to a specified subclass from multiple feature matrices\n", + " and saves them into a new AnnData (.h5ad) file, keeping all genes.\n", + " Optionally merges additional metadata columns from the cell_metadata DataFrame\n", + " into the .obs DataFrame of the resulting AnnData object.\n", + "\n", + " Args:\n", + " cell_metadata (pd.DataFrame): DataFrame containing cell metadata,\n", + " must include 'feature_matrix_label' and 'subclass_label' columns,\n", + " with cell IDs as index.\n", + " feature_matrices_info (pd.DataFrame): DataFrame containing information about\n", + " feature matrices, with (dataset_id, matrix_path) as index.\n", + " target_grouping_level (str): The level of the taxonomy that you want to select \n", + " from (e.g., 'subclass', 'supertype')\n", + " target_grouping_label (str): The specific subclass label to filter for\n", + " (e.g., \"022 L5 ET CTX Glut\").\n", + " output_filename (str): The name of the .h5ad file to save the extracted data to.\n", + " data_directory (str, optional): The base directory where the .h5ad files are located.\n", + " If None, abc_atlas_cache.get_data_path will be used. Defaults to None.\n", + " limit_matrices (int, optional): If provided, limits the number of feature matrices\n", + " to process. Useful for testing. Defaults to None.\n", + " additional_metadata_cols (list, optional): A list of column names from `cell_metadata`\n", + " to be added to the `.obs` DataFrame of the output AnnData object.\n", + " Defaults to None (no additional columns).\n", + "\n", + " Returns:\n", + " str: The path to the newly created .h5ad file.\n", + " \"\"\"\n", + "\n", + " extracted_adata_list = []\n", + " processed_count = 0\n", + " total_start_time = time.process_time()\n", + "\n", + " print(f\"Starting extraction for grouping: '{target_grouping_label}'...\")\n", + "\n", + " for mat_index in feature_matrices_info.index:\n", + " ds = mat_index[0] # dataset_id\n", + " mp = mat_index[1] # matrix_path (e.g., \"WMB-10x_nuclei-001/log2\")\n", + "\n", + " print(f\"Processing matrix: {mp}\")\n", + "\n", + " # Construct the full path to the h5ad file\n", + " if data_directory:\n", + " file_path = os.path.join(data_directory, mp, 'log2') # Assuming 'log2' is part of the filename\n", + " else:\n", + " # Use abc_atlas_cache if data_directory is not provided\n", + " file_path = abc_cache.get_data_path(directory=ds, file_name=os.path.join(mp, 'log2'))\n", + "\n", + " # Filter cell_metadata for cells belonging to the current matrix file AND the target subclass\n", + " # This is crucial because 'cell_metadata' contains info for all cells,\n", + " # but 'ad' only contains data for cells within the current matrix file.\n", + " cells_in_current_matrix = cell_metadata[cell_metadata['feature_matrix_label'] == mp]\n", + " target_cells_in_matrix = cells_in_current_matrix[\n", + " cells_in_current_matrix[target_grouping_level] == target_grouping_label\n", + " ]\n", + "\n", + " # Get the cell IDs (index) that match both criteria\n", + " cell_ids_to_extract = target_cells_in_matrix.index\n", + "\n", + " if not cell_ids_to_extract.empty:\n", + " start_time = time.process_time()\n", + " try:\n", + " # Read the AnnData object in backed mode for efficiency\n", + " current_adata = ad.read_h5ad(file_path, backed='r')\n", + "\n", + " # Subset the AnnData object to include only the target cells and ALL genes\n", + " # Use .to_memory() to load the subset into RAM before appending\n", + " subset_adata = current_adata[cell_ids_to_extract, :].to_memory()\n", + "\n", + " # Merge additional metadata columns into subset_adata.obs\n", + " if additional_metadata_cols:\n", + " for col in additional_metadata_cols:\n", + " if col in target_cells_in_matrix.columns:\n", + " # Ensure indices align for merging\n", + " subset_adata.obs[col] = target_cells_in_matrix.loc[subset_adata.obs_names, col]\n", + " else:\n", + " print(f\" - Warning: Column '{col}' not found in cell_metadata. Skipping.\")\n", + "\n", + " extracted_adata_list.append(subset_adata)\n", + " print(f\" - Extracted {subset_adata.n_obs} cells from {mp}. Time taken: {time.process_time() - start_time:.2f} seconds\")\n", + "\n", + " except FileNotFoundError:\n", + " print(f\" - Warning: File not found for {mp} at {file_path}. Skipping.\")\n", + " except Exception as e:\n", + " print(f\" - Error processing {mp}: {e}. Skipping.\")\n", + " finally:\n", + " # Ensure the file handle is closed if opened in backed mode\n", + " if 'current_adata' in locals() and current_adata.file is not None:\n", + " current_adata.file.close()\n", + " del current_adata # Free up memory\n", + " else:\n", + " print(f\" - No cells found for grouping '{target_grouping_label}' in matrix {mp}. Skipping.\")\n", + "\n", + " processed_count += 1\n", + " if limit_matrices is not None and processed_count >= limit_matrices:\n", + " print(f\"Stopping after processing {limit_matrices} matrices as requested.\")\n", + " break\n", + "\n", + " print(\"\\nConcatenating extracted AnnData objects...\")\n", + " if extracted_adata_list:\n", + " # Concatenate all collected AnnData objects\n", + " # 'join='outer'' and 'merge='unique'' ensure all genes and unique obs columns are kept\n", + " final_adata = ad.concat(extracted_adata_list, axis=0, join='outer', merge='unique')\n", + " print(f\"Total cells extracted: {final_adata.n_obs}\")\n", + " print(f\"Total genes retained: {final_adata.n_vars}\")\n", + "\n", + " # Save the final AnnData object to a new h5ad file\n", + " final_adata_path = output_filename\n", + " final_adata.write(final_adata_path)\n", + " print(f\"Extracted data saved to: {final_adata_path}\")\n", + " else:\n", + " print(\"No cells were extracted for the specified grouping level.\")\n", + " final_adata_path = None\n", + "\n", + " print(f\"Total time taken for extraction: {time.process_time() - total_start_time:.2f} seconds\")\n", + " return final_adata_path" + ] + }, + { + "cell_type": "markdown", + "id": "c104aa26", + "metadata": {}, + "source": [ + "Now define the target taxonomic level with `target_grouping_level` and the label of the group within that level with `target_grouping_label`.\n", + "\n", + "If you want to test the function on a subset of the input matrices, uncomment the `limit_matrices` line and it will break after the specified number of matrices. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "09767c03", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "# Define your target subclass and output filename\n", + "target_subclass = \"022 L5 ET CTX Glut\"\n", + "output_h5ad_file = \"l5_et_ctx_cells_all_genes.h5ad\"\n", + "\n", + "# Call the function\n", + "extracted_file = extract_subclass_to_h5ad(\n", + " cell_metadata=cell_metadata,\n", + " feature_matrices_info=matrices,\n", + " target_grouping_level=\"subclass\",\n", + " target_grouping_label=target_subclass,\n", + " output_filename=output_h5ad_file,\n", + " additional_metadata_cols=[ 'cell_barcode', # this can be any of the columns found in the `cell_metadata`\n", + " 'feature_matrix_label',\n", + " 'entity',\n", + " 'region_of_interest_acronym',\n", + " 'donor_sex',\n", + " 'dataset_label',\n", + " 'x',\n", + " 'y',\n", + " 'cluster_alias',\n", + " 'neurotransmitter',\n", + " 'class',\n", + " 'subclass',\n", + " 'supertype',\n", + " 'cluster',\n", + " 'neurotransmitter_color',\n", + " 'class_color',\n", + " 'subclass_color',\n", + " 'supertype_color',\n", + " 'cluster_color',\n", + " 'region_of_interest_order',\n", + " 'region_of_interest_color'], # Add columns you want to merge\n", + " # data_directory=\"/path/to/your/abc_atlas_data\" # Uncomment if not using abc_atlas_cache for file paths\n", + " #limit_matrices=5 # Uncomment for quick testing with a few matrices\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c703ceb1", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "# You can then load and work with the new h5ad file:\n", + "if extracted_file:\n", + " l5_et_adata = ad.read_h5ad(extracted_file)\n", + " print(l5_et_adata)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f637ec7a", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "# save the new h5ad\n", + "l5_et_adata.write_h5ad('../results/l5_et_adata.h5ad')" + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}