-
Couldn't load subscription status.
- Fork 3
MG-43: Implement RNA DE Aggregate ETL Transform #229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
| return obj | ||
|
|
||
|
|
||
| def input_validation_model_info(df: pd.DataFrame) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just moved this out of disease_correlation. No need to review this, it has already been reviewed and approved.
| def test_remove_duplicates_preserves_order(self): | ||
| input_list = ["a", "b", "a", "c", "b", "d"] | ||
| assert utils.remove_duplicates_keep_order(input_list) == ["a", "b", "c", "d"] | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just moved this from the disease_correlation testing file to the utilities testing file. No need to review this, it has already been approved.
|
|
||
| # Should take first element from the list | ||
| assert result["matched_control"] == "C57BL6J" | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this to the utilities testing file. No need to review.
| return lookup | ||
|
|
||
|
|
||
| def input_validation_model_info(df: pd.DataFrame) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this to utils.py. No need to review this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to review changes to this file. Just moving a function from disease correlation to utils.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to review changes to this file. Just moving a function from disease correlation to utils.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to review changes to this file. Just moving a function from disease correlation to utils.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to review changes to this file. Just moving a function from disease correlation to utils.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements a new ETL transform for RNA differential expression (DE) aggregate data to support the Model AD Explorer's Gene Expression CT interface. The solution processes multiple RNA DE data files, aggregates them by gene/model/tissue/sex combinations, and creates age-based entries with statistical values.
- Comprehensive new transform module with robust data validation and memory optimization
- Complete test suite with 635 lines covering happy path, error handling, and edge cases
- Utility function refactoring to improve code organization and reusability
Reviewed Changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/agoradatatools/etl/transform/rna_de_aggregate.py |
New transform module implementing RNA DE aggregation logic with gene filtering, tissue mapping, and scientific notation handling |
tests/transform/test_rna_de_aggregate.py |
Comprehensive test suite covering transform functionality, validation, and edge cases |
src/agoradatatools/etl/utils.py |
Added input_validation_model_info utility function for model data consistency validation |
src/agoradatatools/etl/transform/disease_correlation.py |
Removed duplicate validation function, now imports from utils |
tests/transform/test_disease_correlation.py |
Removed tests for moved validation function |
tests/test_utils.py |
Added tests for the relocated input_validation_model_info function |
| Multiple test asset files | New test data files supporting various test scenarios |
src/agoradatatools/etl/transform/__init__.py |
Added new transform to module exports |
modelad_test_config.yaml |
Configuration for the new RNA DE aggregate dataset |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a few issues surrounding data sets that have multiple different case/control pairs, plus some cleanup/readability suggestions. Excellent job on the optimizations though, I tested speed of lookup dicts vs just using pandas data frame functions/merges, and you definitely sped things up a lot.
| case = group.iloc[0]["case"] | ||
| control = group.iloc[0]["control"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two lines aren't making a valid assumption, which also messes up the rest of your code below. For example, see the Jax.IU.Pitt_APOE4.Trem2.R47H file: There are 3 different "case" genotypes per gene/model/tissue/sex, with 2 different ages each. This causes your age_entries code to loop through 6 entries instead of 2, and it keeps overwriting the 4 month and 12 month entries until it gets to the end. You end up with a final entry with logfc/padj that belongs to the genotype "APOE4-KI_WT; Trem2-R47H_homozygous" because that was the last genotype in the group, but name is assigned to "APOE4-KI_homozygous; Trem2-R47H_homozygous" because that was the first genotype in the group.
This same issue exists for the "control" variable for studies like UCI_ABCA7, which have 2 different controls.
You will need to extend the grouping to include case and control as well, as suggested above.
| case = group.iloc[0]["case"] | |
| control = group.iloc[0]["control"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this has been fixed. I am not going to mark it as resolved because I don't want us to lose track of this. I'll mark it as resolved when we all agree it is resolved.
…directly put them in the model ad yaml
Co-authored-by: jaclynbeck-sage <[email protected]>
Co-authored-by: jaclynbeck-sage <[email protected]>
Co-authored-by: jaclynbeck-sage <[email protected]>
Co-authored-by: jaclynbeck-sage <[email protected]>
Co-authored-by: jaclynbeck-sage <[email protected]>
Co-authored-by: jaclynbeck-sage <[email protected]>
…works/agora-data-tools into beatrizsaldana/MG-43
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 31 out of 31 changed files in this pull request and generated no new comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| @@ -0,0 +1,138 @@ | |||
| # Synthetic Datasets for transform_rna_de_aggregate() | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing!! Thank you for writing this!



MG-43 - Implement RNA DE Aggregate ETL Transform
Problem
We need to generate a dataset to support the Model AD Explorer’s Gene Expression CT interface. This dataset needs to aggregate RNA differential expression data from multiple source files, with specific requirements for:
Solution
New Transform Module
src/agoradatatools/etl/transform/rna_de_aggregate.py- Complete implementation of the RNA differential expression aggregation transformtests/transform/test_rna_de_aggregate.py- Comprehensive test suite with 528 lines of test coveragetests/test_assets/rna_de_aggregate/New File on Synapse to Track Datafiles
I created rna_de_aggregate_data_files.csv to make it easier for us to track exactly which data files are being used for this transform and prevent us from having to make any repo changes to add or remove data files.
Key Features Implemented
Data Processing:
Data Mapping & Transformations:
Data Validation:
Output Structure
The transform generates JSON output with the following structure per gene/model/tissue/sex combination:
[ { "ensembl_gene_id": "ENSMUSG00000000001", "gene_symbol": "Gnai3", "biodomains": [ "Apoptosis", "Autophagy", "Cell Cycle", "Metal Binding and Homeostasis", "Oxidative Stress", "Proteostasis", "Proteostasis", "Proteostasis", "Structural Stabilization", "Synapse", "Vasculature" ], "name": "5xFAD (Jax/IU/Pitt)", "matched_control": "C57BL/6J", "model_group": null, "model_type": "Familial AD", "tissue": "Hemibrain", "sex": "Females", "4 months": { "log2_fc": 0.01167, "adj_p_val": 0.7812 }, "12 months": { "log2_fc": 0.0055394, "adj_p_val": 0.94876 } }, { "ensembl_gene_id": "ENSMUSG00000000001", "gene_symbol": "Gnai3", "biodomains": [ "Apoptosis", "Autophagy", "Cell Cycle", "Metal Binding and Homeostasis", "Oxidative Stress", "Proteostasis", "Proteostasis", "Proteostasis", "Structural Stabilization", "Synapse", "Vasculature" ], "name": "5xFAD (Jax/IU/Pitt)", "matched_control": "C57BL/6J", "model_group": null, "model_type": "Familial AD", "tissue": "Hemibrain", "sex": "Females & Males", "4 months": { "log2_fc": 0.0012218, "adj_p_val": 0.97786 }, "12 months": { "log2_fc": 0.0071723, "adj_p_val": 0.89033 } }, ...Performance Optimizations
Testing
Comprehensive Test Suite
Test Coverage Includes:
Test Files Created:
Test Classes:
TestTransformRnaDeAggregate- Main transform functionality testsTestQuickValidateDataFile- Data validation utility testsTest Scenarios
Valid Data Transformation
Error Handling
Data Filtering
Special Cases
Data Validation