Conversation
|
🎉 All dependencies have been resolved ! |
| """ | ||
| # TODO: Make into argument | ||
| dataset_dir = os.path.join(datahub_tools_path, "add-clinical-header", dataset_name) | ||
| # see if dataset_folder exists |
There was a problem hiding this comment.
Remove code as we just need to input the output folder synapse id directly instead of trying to create one inside the project
There was a problem hiding this comment.
Pull Request Overview
This PR adjusts MAF (Mutation Annotation Format) validation by moving validation logic to a dedicated validate.py module and adding support for new neoantigen-related columns. The changes centralize validation functionality and extend the required MAF column list to include neoantigen variables.
- Moved MAF column validation from maf.py to validate.py module for better organization
- Added four new neoantigen-related columns to the required MAF columns list
- Removed redundant Synapse folder creation logic in the load module
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/iatlascbioportalexport/validate.py | Added REQUIRED_MAF_COLS constant with neoantigen columns and moved validation function from maf.py |
| src/iatlascbioportalexport/maf.py | Removed REQUIRED_MAF_COLS constant and validate_that_required_columns_are_present function |
| src/iatlascbioportalexport/load.py | Simplified Synapse storage by removing dataset folder creation logic |
| tests/test_validate.py | Added tests for the moved validation function and updated existing test parameters |
| tests/test_maf.py | Removed tests for the function that was moved to validate.py |
| src/iatlascbioportalexport/utils.py | Removed cbioportal_validator_output.txt from required output files |
| pyproject.toml | Added Python version constraint and pyyaml dependency |
| README.md | Fixed command line argument name in documentation |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
pyproject.toml
Outdated
| dependencies = [ | ||
| "synapseclient[pandas]>=4,<5", | ||
| "pandas>=2.2", | ||
| "pyyaml=6.0" |
There was a problem hiding this comment.
The dependency specification uses '=' instead of the correct '==' operator for version pinning. This should be 'pyyaml==6.0' to properly specify the exact version.
| "pyyaml=6.0" | |
| "pyyaml==6.0" |
| ) | ||
|
|
||
| def validate_that_required_columns_are_present( | ||
| input_df: pd.DataFrame, dataset_file_name : str, required_cols : list, **kwargs |
There was a problem hiding this comment.
There's inconsistent spacing around the colon in the parameter 'dataset_file_name : str'. It should be 'dataset_file_name: str' to follow Python PEP 8 style guidelines.
| input_df: pd.DataFrame, dataset_file_name : str, required_cols : list, **kwargs | |
| input_df: pd.DataFrame, dataset_file_name: str, required_cols: list, **kwargs |
| input_df = all_files["data_mutations.txt"], | ||
| dataset_file_name="data_mutations.txt", | ||
| required_cols = REQUIRED_MAF_COLS, |
There was a problem hiding this comment.
Inconsistent spacing around assignment operators in function call arguments. Remove spaces around '=' for 'input_df' and 'required_cols' parameters to follow Python conventions.
| input_df = all_files["data_mutations.txt"], | |
| dataset_file_name="data_mutations.txt", | |
| required_cols = REQUIRED_MAF_COLS, | |
| input_df=all_files["data_mutations.txt"], | |
| dataset_file_name="data_mutations.txt", | |
| required_cols=REQUIRED_MAF_COLS, |
| f"{dataset_dir}/{file}", | ||
| name=file, | ||
| parent=dataset_folder_id, | ||
| parent=output_folder_synid, |
There was a problem hiding this comment.
Do you want to save all files in the same level of repo?
There was a problem hiding this comment.
Yes, that is the expected folder structure. Case list files get their own folder within the output folder
| "meta_gene_signatures.txt", | ||
| "data_rna_seq_mrna.txt", | ||
| "meta_rna_seq_mrna.txt", | ||
| "cbioportal_validator_output.txt", |
There was a problem hiding this comment.
I think my comment got lost in the old orca-recipes repo but this is removed as it's not a required file we need to run validation on haha
|
Linking old closed PR in |
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| if set(required_cols) != set(list(input_df.columns)): | ||
| missing_cols = set(required_cols) - set(list(input_df.columns)) |
There was a problem hiding this comment.
The validation logic is incorrect. This will fail if the input DataFrame has additional columns beyond the required ones. It should check if required columns are a subset of the DataFrame columns instead of exact equality.
| if set(required_cols) != set(list(input_df.columns)): | |
| missing_cols = set(required_cols) - set(list(input_df.columns)) | |
| if not set(required_cols).issubset(set(input_df.columns)): | |
| missing_cols = set(required_cols) - set(input_df.columns) |
| df, | ||
| required_cols = validate.REQUIRED_MAF_COLS, | ||
| dataset_file_name = "data_mutations.txt") |
There was a problem hiding this comment.
Remove trailing whitespace after 'df,' on line 86 and use consistent spacing around '=' in function arguments.
| df, | |
| required_cols = validate.REQUIRED_MAF_COLS, | |
| dataset_file_name = "data_mutations.txt") | |
| df, | |
| required_cols=validate.REQUIRED_MAF_COLS, | |
| dataset_file_name="data_mutations.txt") |
Problem:
We have new neoantigen variables:
Peptide, HLA_Allele, MHCflurry_2.1.1_affinity_nm, MHCflurry_2.1.1_presentation_scorethat are being added to every maf dataset. Since we don't have a neoantigen data format (generic assay) anymore for cbioportal validator to validate, we will need to add in our own validation for it.Depends on #2
Solution:
Add validation for the new columns + move around the code so it's in the new
validate.pyExtras:
Testing: