MG-43: Implement RNA DE Aggregate ETL Transform #229

beatrizsaldana · 2025-09-24T18:52:43Z

MG-43 - Implement RNA DE Aggregate ETL Transform

Problem

We need to generate a dataset to support the Model AD Explorer’s Gene Expression CT interface. This dataset needs to aggregate RNA differential expression data from multiple source files, with specific requirements for:

Filtering out human genes (ENSG*) and keeping only mouse genes (ENSMUSG*)
Grouping data by gene, model, tissue, and sex
Creating age-based entries with log2_fc and adj_p_val values
Mapping gene metadata, biodomains, and model information
Special handling for JAX models (tissue name mapping)
Numeric formatting: rounding to 5 significant digits, without scientific notation
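
The gene filtering requirement above can be sketched as a simple prefix check. This is only an illustration: the helper names `is_mouse_gene` and `keep_mouse_genes` are hypothetical, and the actual transform may filter differently (for example, on a DataFrame column).

```python
def is_mouse_gene(ensembl_gene_id: str) -> bool:
    """Return True for mouse Ensembl gene IDs (ENSMUSG prefix)."""
    # Human IDs start with "ENSG", so they never match the "ENSMUSG" prefix.
    return ensembl_gene_id.startswith("ENSMUSG")


def keep_mouse_genes(rows: list[dict]) -> list[dict]:
    """Drop every row whose gene ID is not a mouse gene (hypothetical helper)."""
    return [row for row in rows if is_mouse_gene(row["ensembl_gene_id"])]


rows = [
    {"ensembl_gene_id": "ENSMUSG00000000001"},  # mouse gene, kept
    {"ensembl_gene_id": "ENSG00000000003"},     # human gene, filtered out
]
mouse_rows = keep_mouse_genes(rows)
```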

Solution

New Transform Module

src/agoradatatools/etl/transform/rna_de_aggregate.py - Complete implementation of the RNA differential expression aggregation transform
tests/transform/test_rna_de_aggregate.py - Comprehensive test suite with 528 lines of test coverage
Test assets - Complete set of test data files in tests/test_assets/rna_de_aggregate/

New File on Synapse to Track Datafiles

I created rna_de_aggregate_data_files.csv on Synapse so that we can track exactly which data files this transform uses, and add or remove data files without making any repo changes.
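
As a rough illustration of the manifest idea, the sketch below reads a small CSV that lists the data files to process. The column names (`synapse_id`, `file_name`) and the placeholder values are assumptions made for this example; the real layout of rna_de_aggregate_data_files.csv may differ.

```python
import csv
import io

# Hypothetical manifest contents. The column names and IDs are placeholders,
# not the actual contents of rna_de_aggregate_data_files.csv.
MANIFEST_CSV = """synapse_id,file_name
syn00000001,example_model_a_de.csv
syn00000002,example_model_b_de.csv
"""


def read_manifest(handle) -> list[dict]:
    """Parse the manifest into one dict per listed data file."""
    return list(csv.DictReader(handle))


entries = read_manifest(io.StringIO(MANIFEST_CSV))
for entry in entries:
    # Adding or removing a data file only requires editing the manifest rows.
    print(entry["synapse_id"], entry["file_name"])
```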

Key Features Implemented

Data Processing:

Gene filtering: Automatically filters out human genes (ENSG*) and keeps only mouse genes (ENSMUSG*)
Data aggregation: Groups by gene, model, tissue, and sex with age-based entries containing log2_fc and adj_p_val
Memory optimization: Processes files one at a time and runs garbage collection after each file to reduce memory usage
Scientific notation handling: Rounds log2_fc and adj_p_val to 5 significant digits (no scientific notation)
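
One common way to round to a fixed number of significant digits is shown below. This is a minimal sketch, not necessarily how the transform does it; `round_sig` is a hypothetical name, and how the values are serialized without scientific notation is a separate concern that is not covered here.

```python
import math


def round_sig(value: float, digits: int = 5) -> float:
    """Round value to the given number of significant digits (hypothetical helper)."""
    if value == 0 or not math.isfinite(value):
        # Zero, NaN, and infinities have no meaningful leading digit.
        return value
    # Position of the leading digit, e.g. -3 for 0.0055 and 5 for 123456.
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)


small = round_sig(0.00553941234)
large = round_sig(123456.789)
```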

Data Mapping & Transformations:

Gene metadata lookup: Maps ensembl_gene_id to gene_symbol using mouse_gene_metadata
Biodomain mapping: Associates genes with their biodomains using biodom_genes_mm data
Model information: Resolves model names, matched controls, and model types
JAX tissue mapping: Special handling for JAX models - maps "Right Cerebral Hemisphere" to "Hemibrain"
Age sorting: Numerically sorts age entries for consistent output
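
The tissue mapping and age sorting described above might look roughly like this. How a model is identified as a JAX model is not stated here, so the `is_jax_model` flag is an assumption, as are the helper names.

```python
import re

# The only mapping named in this PR description.
JAX_TISSUE_MAP = {"Right Cerebral Hemisphere": "Hemibrain"}


def map_tissue(tissue: str, is_jax_model: bool) -> str:
    """Rename tissues for JAX models only; other models pass through unchanged."""
    if is_jax_model:  # assumption: the caller decides what counts as a JAX model
        return JAX_TISSUE_MAP.get(tissue, tissue)
    return tissue


def age_sort_key(age_label: str) -> int:
    """Extract the leading number from labels such as '12 months'."""
    match = re.match(r"\s*(\d+)", age_label)
    if match is None:
        raise ValueError(f"Cannot parse age label: {age_label!r}")
    return int(match.group(1))


ages = ["12 months", "4 months", "18 months"]
# A plain string sort would put "12 months" before "4 months".
sorted_ages = sorted(ages, key=age_sort_key)
```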

Data Validation:

Input validation: Validates required datasets and columns
File validation: Checks for empty files and missing required columns
Error handling: Comprehensive error messages for debugging
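
A minimal sketch of the file checks listed above, using only the standard library. The required column set is an assumed subset chosen for illustration, and `validate_csv_text` is a hypothetical name rather than the function in this PR.

```python
import csv
import io

# Assumed subset of required columns; the real list may be longer.
REQUIRED_COLUMNS = {"ensembl_gene_id", "log2_fc", "adj_p_val"}


def validate_csv_text(text: str, file_name: str) -> None:
    """Raise ValueError if the file has no data rows or lacks required columns."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    if not rows:
        raise ValueError(f"{file_name}: file contains no data rows")
    missing = sorted(REQUIRED_COLUMNS - set(reader.fieldnames or []))
    if missing:
        # Name the file and the columns so the failure is easy to debug.
        raise ValueError(
            f"{file_name}: missing required columns: {', '.join(missing)}"
        )


validate_csv_text(
    "ensembl_gene_id,log2_fc,adj_p_val\nENSMUSG00000000001,0.01167,0.7812\n",
    "ok.csv",
)
```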

Output Structure

The transform generates JSON output with the following structure per gene/model/tissue/sex combination:

[
  {
    "ensembl_gene_id": "ENSMUSG00000000001",
    "gene_symbol": "Gnai3",
    "biodomains": [
      "Apoptosis",
      "Autophagy",
      "Cell Cycle",
      "Metal Binding and Homeostasis",
      "Oxidative Stress",
      "Proteostasis",
      "Proteostasis",
      "Proteostasis",
      "Structural Stabilization",
      "Synapse",
      "Vasculature"
    ],
    "name": "5xFAD (Jax/IU/Pitt)",
    "matched_control": "C57BL/6J",
    "model_group": null,
    "model_type": "Familial AD",
    "tissue": "Hemibrain",
    "sex": "Females",
    "4 months": {
      "log2_fc": 0.01167,
      "adj_p_val": 0.7812
    },
    "12 months": {
      "log2_fc": 0.0055394,
      "adj_p_val": 0.94876
    }
  },
  {
    "ensembl_gene_id": "ENSMUSG00000000001",
    "gene_symbol": "Gnai3",
    "biodomains": [
      "Apoptosis",
      "Autophagy",
      "Cell Cycle",
      "Metal Binding and Homeostasis",
      "Oxidative Stress",
      "Proteostasis",
      "Proteostasis",
      "Proteostasis",
      "Structural Stabilization",
      "Synapse",
      "Vasculature"
    ],
    "name": "5xFAD (Jax/IU/Pitt)",
    "matched_control": "C57BL/6J",
    "model_group": null,
    "model_type": "Familial AD",
    "tissue": "Hemibrain",
    "sex": "Females & Males",
    "4 months": {
      "log2_fc": 0.0012218,
      "adj_p_val": 0.97786
    },
    "12 months": {
      "log2_fc": 0.0071723,
      "adj_p_val": 0.89033
    }
  },
...
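
To illustrate how flat rows could be folded into the structure above, here is a simplified sketch that groups by gene, model, tissue, and sex and keys each entry by age. The input column names (`name`, `age`, and so on) are assumptions, and the metadata fields (gene symbol, biodomains, model info) are omitted for brevity.

```python
def aggregate(rows: list[dict]) -> list[dict]:
    """Group flat rows into one record per gene/model/tissue/sex combination."""
    grouped: dict[tuple, dict] = {}
    for row in rows:
        key = (row["ensembl_gene_id"], row["name"], row["tissue"], row["sex"])
        record = grouped.setdefault(
            key,
            {
                "ensembl_gene_id": row["ensembl_gene_id"],
                "name": row["name"],
                "tissue": row["tissue"],
                "sex": row["sex"],
            },
        )
        # Each age becomes its own key holding the two statistics.
        record[row["age"]] = {
            "log2_fc": row["log2_fc"],
            "adj_p_val": row["adj_p_val"],
        }
    return list(grouped.values())


base = {
    "ensembl_gene_id": "ENSMUSG00000000001",
    "name": "5xFAD (Jax/IU/Pitt)",
    "tissue": "Hemibrain",
    "sex": "Females",
}
records = aggregate([
    {**base, "age": "4 months", "log2_fc": 0.01167, "adj_p_val": 0.7812},
    {**base, "age": "12 months", "log2_fc": 0.0055394, "adj_p_val": 0.94876},
])
```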

Performance Optimizations

Memory efficient: Processes one file at a time instead of loading all files into memory
Dictionary lookups: Pre-computed lookup dictionaries for O(1) data access
Garbage collection: Explicit memory cleanup after processing each file
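
The three points above can be combined in a sketch like the following. The loader callables and helper names are hypothetical stand-ins for however the transform actually reads each file; the point is only the shape of the loop.

```python
import gc
from typing import Callable, Iterable


def build_symbol_lookup(gene_metadata: Iterable[dict]) -> dict[str, str]:
    """Pre-compute a dict so each gene symbol lookup is O(1) on average."""
    return {row["ensembl_gene_id"]: row["gene_symbol"] for row in gene_metadata}


def process_files(
    loaders: Iterable[Callable[[], list[dict]]],
    symbol_lookup: dict[str, str],
) -> list[dict]:
    """Load, annotate, and release one file at a time (hypothetical helper)."""
    results = []
    for load in loaders:
        rows = load()  # only one file's rows are held in memory at once
        for row in rows:
            symbol = symbol_lookup.get(row["ensembl_gene_id"])
            results.append({**row, "gene_symbol": symbol})
        del rows
        gc.collect()  # explicit cleanup before loading the next file
    return results


lookup = build_symbol_lookup(
    [{"ensembl_gene_id": "ENSMUSG00000000001", "gene_symbol": "Gnai3"}]
)
annotated = process_files(
    [lambda: [{"ensembl_gene_id": "ENSMUSG00000000001", "log2_fc": 0.01167}]],
    lookup,
)
```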

Testing

Comprehensive Test Suite

Test Coverage Includes:

Happy path testing: Valid data transformation with expected output
Error handling: Missing datasets, empty files, missing columns
Gene filtering: Human vs mouse gene filtering
JAX tissue mapping: Special tissue name conversion
Age sorting: Numerical age ordering
Data validation: Input validation and error scenarios

Test Files Created:

Input test data files (CSV format)
Expected output JSON files
Edge case test scenarios

Test Classes:

TestTransformRnaDeAggregate - Main transform functionality tests
TestQuickValidateDataFile - Data validation utility tests

Test Scenarios

Valid Data Transformation
- Tests complete data flow with expected output
- Validates all mapping and aggregation logic
Error Handling
- Missing required datasets
- Empty data files
- Missing required columns in data files
Data Filtering
- Human gene filtering (ENSG* → filtered out)
- Mouse gene retention (ENSMUSG* → kept)
Special Cases
- JAX model tissue mapping ("Right Cerebral Hemisphere" → "Hemibrain")
- Age entry numerical sorting
- Biodomain association
- Model information resolution
Data Validation
- Input dataset validation
- File structure validation
- Column requirement validation


beatrizsaldana · 2025-09-24T18:55:59Z

src/agoradatatools/etl/utils.py

    return obj
+
+
+def input_validation_model_info(df: pd.DataFrame) -> None:


I just moved this out of disease_correlation. No need to review this, it has already been reviewed and approved.

beatrizsaldana · 2025-09-24T18:56:51Z

tests/test_utils.py

    def test_remove_duplicates_preserves_order(self):
        input_list = ["a", "b", "a", "c", "b", "d"]
        assert utils.remove_duplicates_keep_order(input_list) == ["a", "b", "c", "d"]
+


I just moved this from the disease_correlation testing file to the utilities testing file. No need to review this, it has already been approved.

beatrizsaldana · 2025-09-24T18:57:27Z

tests/transform/test_disease_correlation.py


        # Should take first element from the list
        assert result["matched_control"] == "C57BL6J"
-


Moved this to the utilities testing file. No need to review.

beatrizsaldana · 2025-09-24T18:57:50Z

src/agoradatatools/etl/transform/disease_correlation.py

    return lookup


-def input_validation_model_info(df: pd.DataFrame) -> None:


Moved this to utils.py. No need to review this.

beatrizsaldana · 2025-09-24T18:58:16Z

src/agoradatatools/etl/transform/disease_correlation.py

No need to review changes to this file. Just moving a function from disease correlation to utils.

beatrizsaldana · 2025-09-24T18:58:45Z

src/agoradatatools/etl/utils.py

No need to review changes to this file. Just moving a function from disease correlation to utils.

beatrizsaldana · 2025-09-24T18:58:58Z

tests/test_utils.py

No need to review changes to this file. Just moving a function from disease correlation to utils.

beatrizsaldana · 2025-09-24T18:59:08Z

tests/transform/test_disease_correlation.py

No need to review changes to this file. Just moving a function from disease correlation to utils.

Copilot

Pull Request Overview

This PR implements a new ETL transform for RNA differential expression (DE) aggregate data to support the Model AD Explorer's Gene Expression CT interface. The solution processes multiple RNA DE data files, aggregates them by gene/model/tissue/sex combinations, and creates age-based entries with statistical values.

Comprehensive new transform module with robust data validation and memory optimization
Complete test suite with 635 lines covering happy path, error handling, and edge cases
Utility function refactoring to improve code organization and reusability

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.


| File | Description |
| --- | --- |
| `src/agoradatatools/etl/transform/rna_de_aggregate.py` | New transform module implementing RNA DE aggregation logic with gene filtering, tissue mapping, and scientific notation handling |
| `tests/transform/test_rna_de_aggregate.py` | Comprehensive test suite covering transform functionality, validation, and edge cases |
| `src/agoradatatools/etl/utils.py` | Added `input_validation_model_info` utility function for model data consistency validation |
| `src/agoradatatools/etl/transform/disease_correlation.py` | Removed duplicate validation function, now imports from utils |
| `tests/transform/test_disease_correlation.py` | Removed tests for moved validation function |
| `tests/test_utils.py` | Added tests for the relocated `input_validation_model_info` function |
| Multiple test asset files | New test data files supporting various test scenarios |
| `src/agoradatatools/etl/transform/__init__.py` | Added new transform to module exports |
| `modelad_test_config.yaml` | Configuration for the new RNA DE aggregate dataset |
