Skip to content

Conversation

@beatrizsaldana
Copy link
Member

@beatrizsaldana beatrizsaldana commented Sep 24, 2025

MG-43 - Implement RNA DE Aggregate ETL Transform

Problem

We need to generate a dataset to support the Model AD Explorer’s Gene Expression CT interface. This dataset needs to aggregate RNA differential expression data from multiple source files, with specific requirements for:

  • Filtering human genes (ENSG*) and keeping only mouse genes (ENSMUSG*)
  • Grouping data by gene, model, tissue, and sex
  • Creating age-based entries with log2_fc and adj_p_val values
  • Mapping gene metadata, biodomains, and model information
  • Special handling for JAX models (tissue name mapping)
  • Scientific notation formatting (5 significant digits)

Solution

New Transform Module

  • src/agoradatatools/etl/transform/rna_de_aggregate.py - Complete implementation of the RNA differential expression aggregation transform
  • tests/transform/test_rna_de_aggregate.py - Comprehensive test suite with 528 lines of test coverage
  • Test assets - Complete set of test data files in tests/test_assets/rna_de_aggregate/

New File on Synapse to Track Datafiles

I created rna_de_aggregate_data_files.csv to make it easier for us to track exactly which data files are being used for this transform and prevent us from having to make any repo changes to add or remove data files.

Key Features Implemented

Data Processing:

  • Gene filtering: Automatically filters out human genes (ENSG*) and keeps only mouse genes (ENSMUSG*)
  • Data aggregation: Groups by gene, model, tissue, and sex with age-based entries containing log2_fc and adj_p_val
  • Memory optimization: Processes files one at a time to reduce memory usage with garbage collection
  • Scientific notation handling: Rounds log2_fc and adj_p_val to 5 significant digits (no scientific notation)

Data Mapping & Transformations:

  • Gene metadata lookup: Maps ensembl_gene_id to gene_symbol using mouse_gene_metadata
  • Biodomain mapping: Associates genes with their biodomains using biodom_genes_mm data
  • Model information: Resolves model names, matched controls, and model types
  • JAX tissue mapping: Special handling for JAX models - maps "Right Cerebral Hemisphere" to "Hemibrain"
  • Age sorting: Numerically sorts age entries for consistent output

Data Validation:

  • Input validation: Validates required datasets and columns
  • File validation: Checks for empty files and missing required columns
  • Error handling: Comprehensive error messages for debugging

Output Structure

The transform generates JSON output with the following structure per gene/model/tissue/sex combination:

[
  {
    "ensembl_gene_id": "ENSMUSG00000000001",
    "gene_symbol": "Gnai3",
    "biodomains": [
      "Apoptosis",
      "Autophagy",
      "Cell Cycle",
      "Metal Binding and Homeostasis",
      "Oxidative Stress",
      "Proteostasis",
      "Proteostasis",
      "Proteostasis",
      "Structural Stabilization",
      "Synapse",
      "Vasculature"
    ],
    "name": "5xFAD (Jax/IU/Pitt)",
    "matched_control": "C57BL/6J",
    "model_group": null,
    "model_type": "Familial AD",
    "tissue": "Hemibrain",
    "sex": "Females",
    "4 months": {
      "log2_fc": 0.01167,
      "adj_p_val": 0.7812
    },
    "12 months": {
      "log2_fc": 0.0055394,
      "adj_p_val": 0.94876
    }
  },
  {
    "ensembl_gene_id": "ENSMUSG00000000001",
    "gene_symbol": "Gnai3",
    "biodomains": [
      "Apoptosis",
      "Autophagy",
      "Cell Cycle",
      "Metal Binding and Homeostasis",
      "Oxidative Stress",
      "Proteostasis",
      "Proteostasis",
      "Proteostasis",
      "Structural Stabilization",
      "Synapse",
      "Vasculature"
    ],
    "name": "5xFAD (Jax/IU/Pitt)",
    "matched_control": "C57BL/6J",
    "model_group": null,
    "model_type": "Familial AD",
    "tissue": "Hemibrain",
    "sex": "Females & Males",
    "4 months": {
      "log2_fc": 0.0012218,
      "adj_p_val": 0.97786
    },
    "12 months": {
      "log2_fc": 0.0071723,
      "adj_p_val": 0.89033
    }
  },
...

Performance Optimizations

  • Memory efficient: Processes one file at a time instead of loading all files into memory
  • Dictionary lookups: Pre-computed lookup dictionaries for O(1) data access
  • Garbage collection: Explicit memory cleanup after processing each file

Testing

Comprehensive Test Suite

Test Coverage Includes:

  • Happy path testing: Valid data transformation with expected output
  • Error handling: Missing datasets, empty files, missing columns
  • Gene filtering: Human vs mouse gene filtering
  • JAX tissue mapping: Special tissue name conversion
  • Age sorting: Numerical age ordering
  • Data validation: Input validation and error scenarios

Test Files Created:

  • Input test data files (CSV format)
  • Expected output JSON files
  • Edge case test scenarios

Test Classes:

  • TestTransformRnaDeAggregate - Main transform functionality tests
  • TestQuickValidateDataFile - Data validation utility tests

Test Scenarios

  1. Valid Data Transformation

    • Tests complete data flow with expected output
    • Validates all mapping and aggregation logic
  2. Error Handling

    • Missing required datasets
    • Empty data files
    • Missing required columns in data files
  3. Data Filtering

    • Human gene filtering (ENSG* → filtered out)
    • Mouse gene retention (ENSMUSG* → kept)
  4. Special Cases

    • JAX model tissue mapping ("Right Cerebral Hemisphere" → "Hemibrain")
    • Age entry numerical sorting
    • Biodomain association
    • Model information resolution
  5. Data Validation

    • Input dataset validation
    • File structure validation
    • Column requirement validation

@beatrizsaldana beatrizsaldana requested a review from a team as a code owner September 24, 2025 18:52
return obj


def input_validation_model_info(df: pd.DataFrame) -> None:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just moved this out of disease_correlation. No need to review this, it has already been reviewed and approved.

def test_remove_duplicates_preserves_order(self):
input_list = ["a", "b", "a", "c", "b", "d"]
assert utils.remove_duplicates_keep_order(input_list) == ["a", "b", "c", "d"]

Copy link
Member Author

@beatrizsaldana beatrizsaldana Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just moved this from the disease_correlation testing file to the utilities testing file. No need to review this, it has already been approved.


# Should take first element from the list
assert result["matched_control"] == "C57BL6J"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this to the utilities testing file. No need to review.

return lookup


def input_validation_model_info(df: pd.DataFrame) -> None:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this to utils.py. No need to review this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to review changes to this file. Just moving a function from disease correlation to utils.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to review changes to this file. Just moving a function from disease correlation to utils.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to review changes to this file. Just moving a function from disease correlation to utils.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to review changes to this file. Just moving a function from disease correlation to utils.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a new ETL transform for RNA differential expression (DE) aggregate data to support the Model AD Explorer's Gene Expression CT interface. The solution processes multiple RNA DE data files, aggregates them by gene/model/tissue/sex combinations, and creates age-based entries with statistical values.

  • Comprehensive new transform module with robust data validation and memory optimization
  • Complete test suite with 635 lines covering happy path, error handling, and edge cases
  • Utility function refactoring to improve code organization and reusability

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/agoradatatools/etl/transform/rna_de_aggregate.py New transform module implementing RNA DE aggregation logic with gene filtering, tissue mapping, and scientific notation handling
tests/transform/test_rna_de_aggregate.py Comprehensive test suite covering transform functionality, validation, and edge cases
src/agoradatatools/etl/utils.py Added input_validation_model_info utility function for model data consistency validation
src/agoradatatools/etl/transform/disease_correlation.py Removed duplicate validation function, now imports from utils
tests/transform/test_disease_correlation.py Removed tests for moved validation function
tests/test_utils.py Added tests for the relocated input_validation_model_info function
Multiple test asset files New test data files supporting various test scenarios
src/agoradatatools/etl/transform/__init__.py Added new transform to module exports
modelad_test_config.yaml Configuration for the new RNA DE aggregate dataset

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link
Contributor

@jaclynbeck-sage jaclynbeck-sage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a few issues surrounding data sets that have multiple different case/control pairs, plus some cleanup/readability suggestions. Excellent job on the optimizations though, I tested speed of lookup dicts vs just using pandas data frame functions/merges, and you definitely sped things up a lot.

Comment on lines 138 to 139
case = group.iloc[0]["case"]
control = group.iloc[0]["control"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines aren't making a valid assumption, which also messes up the rest of your code below. For example, see the Jax.IU.Pitt_APOE4.Trem2.R47H file: There are 3 different "case" genotypes per gene/model/tissue/sex, with 2 different ages each. This causes your age_entries code to loop through 6 entries instead of 2, and it keeps overwriting the 4 month and 12 month entries until it gets to the end. You end up with a final entry with logfc/padj that belongs to the genotype "APOE4-KI_WT; Trem2-R47H_homozygous" because that was the last genotype in the group, but name is assigned to "APOE4-KI_homozygous; Trem2-R47H_homozygous" because that was the first genotype in the group.

This same issue exists for the "control" variable for studies like UCI_ABCA7, which have 2 different controls.

You will need to extend the grouping to include case and control as well, as suggested above.

Suggested change
case = group.iloc[0]["case"]
control = group.iloc[0]["control"]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this has been fixed. I am not going to mark it as resolved because I don't want us to lose track of this. I'll mark it as resolved when we all agree it is resolved.

beatrizsaldana and others added 20 commits October 8, 2025 10:02
@sonarqubecloud
Copy link

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 31 out of 31 changed files in this pull request and generated no new comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@@ -0,0 +1,138 @@
# Synthetic Datasets for transform_rna_de_aggregate()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing!! Thank you for writing this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants