Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
363 changes: 363 additions & 0 deletions notebooks/10x_snRNASeq_tutorial_part_2c.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,363 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1a3a7f8f",
"metadata": {},
"source": [
"#10x RNA-seq gene expression data (part 2c)\n",
"\n",
"We perform subsetting of the Whole Mouse Brain Atlas at a particular taxonomic level. This allows us to create manageable `h5ad` files of certain branches of the taxonomy. \n",
"\n",
"You need to be connected to the internet to run this notebook and have run through the [getting started notebook](https://alleninstitute.github.io/abc_atlas_access/notebooks/getting_started.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eb01de16",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from pathlib import Path\n",
"import numpy as np\n",
"import anndata\n",
"import time\n",
"\n",
"from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache"
]
},
{
"cell_type": "markdown",
"id": "a72a3512",
"metadata": {},
"source": [
"We will interact with the data using the **AbcProjectCache**. This cache object tracks which data has been downloaded and serves the path to the requsted data on disk. For metadata, the cache can also directly serve a up a Pandas Dataframe. See the getting_started notebook for more details on using the cache including installing it if it has not already been.\n",
"\n",
"**Change the download_base variable to where you have downloaded the data in your system.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4f24a0ff",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"download_base = Path('../../data/abc_atlas')\n",
"abc_cache = AbcProjectCache.from_cache_dir(download_base)\n",
"abc_cache.list_manifest_file_names"
]
},
{
"cell_type": "markdown",
"id": "cb0dc2e3",
"metadata": {},
"source": [
"Read in the expanded cell metadata table from the cache. Examples of creating this table are presented in part 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "35186179",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"# load cell metadata\n",
"cell_metadata = abc_cache.get_metadata_dataframe(directory='WMB-10X', file_name='cell_metadata_with_cluster_annotation')\n",
"cell_metadata.set_index('cell_label',inplace=True)\n",
"cell_metadata.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7471b41f",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"# list columns of the metadata\n",
"list(cell_metadata)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "78a6149f",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"# generate a table of matrices\n",
"matrices = cell.groupby(['dataset_label', 'feature_matrix_label'])[['library_label']].count()\n",
"matrices.columns = ['cell_count']\n",
"matrices"
]
},
{
"cell_type": "markdown",
"id": "47cb01b2",
"metadata": {},
"source": [
"Below we define a function to extract whole transcriptomic data from selected cells across all matrices in the database. We supply a taxonomic level (`target_grouping_level`) to perform the subsetting on and a `target_grouping_label` which represents the name of the group that we hope to capture."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a9ae45f3",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"def extract_grouping_to_h5ad(\n",
" cell_metadata: pd.DataFrame,\n",
" feature_matrices_info: pd.DataFrame,\n",
" target_grouping_label: str,\n",
" target_grouping_level: str,\n",
" output_filename: str,\n",
" data_directory: str = None, # Optional: if not using abc_atlas_cache directly\n",
" limit_matrices: int = None, # Optional: for testing, limit number of matrices processed\n",
" additional_metadata_cols: list = None # New: List of column names from cell_metadata to add to .obs\n",
") -> str:\n",
" \"\"\"\n",
" Extracts all cells belonging to a specified subclass from multiple feature matrices\n",
" and saves them into a new AnnData (.h5ad) file, keeping all genes.\n",
" Optionally merges additional metadata columns from the cell_metadata DataFrame\n",
" into the .obs DataFrame of the resulting AnnData object.\n",
"\n",
" Args:\n",
" cell_metadata (pd.DataFrame): DataFrame containing cell metadata,\n",
" must include 'feature_matrix_label' and 'subclass_label' columns,\n",
" with cell IDs as index.\n",
" feature_matrices_info (pd.DataFrame): DataFrame containing information about\n",
" feature matrices, with (dataset_id, matrix_path) as index.\n",
" target_grouping_level (str): The level of the taxonomy that you want to select \n",
" from (e.g., 'subclass', 'supertype')\n",
" target_grouping_label (str): The specific subclass label to filter for\n",
" (e.g., \"022 L5 ET CTX Glut\").\n",
" output_filename (str): The name of the .h5ad file to save the extracted data to.\n",
" data_directory (str, optional): The base directory where the .h5ad files are located.\n",
" If None, abc_atlas_cache.get_data_path will be used. Defaults to None.\n",
" limit_matrices (int, optional): If provided, limits the number of feature matrices\n",
" to process. Useful for testing. Defaults to None.\n",
" additional_metadata_cols (list, optional): A list of column names from `cell_metadata`\n",
" to be added to the `.obs` DataFrame of the output AnnData object.\n",
" Defaults to None (no additional columns).\n",
"\n",
" Returns:\n",
" str: The path to the newly created .h5ad file.\n",
" \"\"\"\n",
"\n",
" extracted_adata_list = []\n",
" processed_count = 0\n",
" total_start_time = time.process_time()\n",
"\n",
" print(f\"Starting extraction for grouping: '{target_grouping_label}'...\")\n",
"\n",
" for mat_index in feature_matrices_info.index:\n",
" ds = mat_index[0] # dataset_id\n",
" mp = mat_index[1] # matrix_path (e.g., \"WMB-10x_nuclei-001/log2\")\n",
"\n",
" print(f\"Processing matrix: {mp}\")\n",
"\n",
" # Construct the full path to the h5ad file\n",
" if data_directory:\n",
" file_path = os.path.join(data_directory, mp, 'log2') # Assuming 'log2' is part of the filename\n",
" else:\n",
" # Use abc_atlas_cache if data_directory is not provided\n",
" file_path = abc_cache.get_data_path(directory=ds, file_name=os.path.join(mp, 'log2'))\n",
"\n",
" # Filter cell_metadata for cells belonging to the current matrix file AND the target subclass\n",
" # This is crucial because 'cell_metadata' contains info for all cells,\n",
" # but 'ad' only contains data for cells within the current matrix file.\n",
" cells_in_current_matrix = cell_metadata[cell_metadata['feature_matrix_label'] == mp]\n",
" target_cells_in_matrix = cells_in_current_matrix[\n",
" cells_in_current_matrix[target_grouping_level] == target_grouping_label\n",
" ]\n",
"\n",
" # Get the cell IDs (index) that match both criteria\n",
" cell_ids_to_extract = target_cells_in_matrix.index\n",
"\n",
" if not cell_ids_to_extract.empty:\n",
" start_time = time.process_time()\n",
" try:\n",
" # Read the AnnData object in backed mode for efficiency\n",
" current_adata = ad.read_h5ad(file_path, backed='r')\n",
"\n",
" # Subset the AnnData object to include only the target cells and ALL genes\n",
" # Use .to_memory() to load the subset into RAM before appending\n",
" subset_adata = current_adata[cell_ids_to_extract, :].to_memory()\n",
"\n",
" # Merge additional metadata columns into subset_adata.obs\n",
" if additional_metadata_cols:\n",
" for col in additional_metadata_cols:\n",
" if col in target_cells_in_matrix.columns:\n",
" # Ensure indices align for merging\n",
" subset_adata.obs[col] = target_cells_in_matrix.loc[subset_adata.obs_names, col]\n",
" else:\n",
" print(f\" - Warning: Column '{col}' not found in cell_metadata. Skipping.\")\n",
"\n",
" extracted_adata_list.append(subset_adata)\n",
" print(f\" - Extracted {subset_adata.n_obs} cells from {mp}. Time taken: {time.process_time() - start_time:.2f} seconds\")\n",
"\n",
" except FileNotFoundError:\n",
" print(f\" - Warning: File not found for {mp} at {file_path}. Skipping.\")\n",
" except Exception as e:\n",
" print(f\" - Error processing {mp}: {e}. Skipping.\")\n",
" finally:\n",
" # Ensure the file handle is closed if opened in backed mode\n",
" if 'current_adata' in locals() and current_adata.file is not None:\n",
" current_adata.file.close()\n",
" del current_adata # Free up memory\n",
" else:\n",
" print(f\" - No cells found for grouping '{target_grouping_label}' in matrix {mp}. Skipping.\")\n",
"\n",
" processed_count += 1\n",
" if limit_matrices is not None and processed_count >= limit_matrices:\n",
" print(f\"Stopping after processing {limit_matrices} matrices as requested.\")\n",
" break\n",
"\n",
" print(\"\\nConcatenating extracted AnnData objects...\")\n",
" if extracted_adata_list:\n",
" # Concatenate all collected AnnData objects\n",
" # 'join='outer'' and 'merge='unique'' ensure all genes and unique obs columns are kept\n",
" final_adata = ad.concat(extracted_adata_list, axis=0, join='outer', merge='unique')\n",
" print(f\"Total cells extracted: {final_adata.n_obs}\")\n",
" print(f\"Total genes retained: {final_adata.n_vars}\")\n",
"\n",
" # Save the final AnnData object to a new h5ad file\n",
" final_adata_path = output_filename\n",
" final_adata.write(final_adata_path)\n",
" print(f\"Extracted data saved to: {final_adata_path}\")\n",
" else:\n",
" print(\"No cells were extracted for the specified grouping level.\")\n",
" final_adata_path = None\n",
"\n",
" print(f\"Total time taken for extraction: {time.process_time() - total_start_time:.2f} seconds\")\n",
" return final_adata_path"
]
},
{
"cell_type": "markdown",
"id": "c104aa26",
"metadata": {},
"source": [
"Now define the target taxonomic level with `target_grouping_level` and the label of the group within that level with `target_grouping_label`.\n",
"\n",
"If you want to test the function on a subset of the input matrices, uncomment the `limit_matrices` line and it will break after the specified number of matrices. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "09767c03",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"# Define your target subclass and output filename\n",
"target_subclass = \"022 L5 ET CTX Glut\"\n",
"output_h5ad_file = \"l5_et_ctx_cells_all_genes.h5ad\"\n",
"\n",
"# Call the function\n",
"extracted_file = extract_subclass_to_h5ad(\n",
" cell_metadata=cell_metadata,\n",
" feature_matrices_info=matrices,\n",
" target_grouping_level=\"subclass\",\n",
" target_grouping_label=target_subclass,\n",
" output_filename=output_h5ad_file,\n",
" additional_metadata_cols=[ 'cell_barcode', # this can be any of the columns found in the `cell_metadata`\n",
" 'feature_matrix_label',\n",
" 'entity',\n",
" 'region_of_interest_acronym',\n",
" 'donor_sex',\n",
" 'dataset_label',\n",
" 'x',\n",
" 'y',\n",
" 'cluster_alias',\n",
" 'neurotransmitter',\n",
" 'class',\n",
" 'subclass',\n",
" 'supertype',\n",
" 'cluster',\n",
" 'neurotransmitter_color',\n",
" 'class_color',\n",
" 'subclass_color',\n",
" 'supertype_color',\n",
" 'cluster_color',\n",
" 'region_of_interest_order',\n",
" 'region_of_interest_color'], # Add columns you want to merge\n",
" # data_directory=\"/path/to/your/abc_atlas_data\" # Uncomment if not using abc_atlas_cache for file paths\n",
" #limit_matrices=5 # Uncomment for quick testing with a few matrices\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c703ceb1",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"# You can then load and work with the new h5ad file:\n",
"if extracted_file:\n",
" l5_et_adata = ad.read_h5ad(extracted_file)\n",
" print(l5_et_adata)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f637ec7a",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"# save the new h5ad\n",
"l5_et_adata.write_h5ad('../results/l5_et_adata.h5ad')"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Loading