
[DPE-1448] Adjust MAF validation#1

Merged
rxu17 merged 4 commits into main from
dpe-1448-adjust-maf-validation
Oct 1, 2025

@rxu17 rxu17 commented Sep 29, 2025

Problem:

Four new neoantigen variables (Peptide, HLA_Allele, MHCflurry_2.1.1_affinity_nm, MHCflurry_2.1.1_presentation_score) are being added to every MAF dataset. Because we no longer have a neoantigen data format (generic assay) for the cBioPortal validator to check, we need to add our own validation for these columns.

Depends on #2

Solution:

Add validation for the new columns and move the validation code into the new validate.py.

Extras:

  • Removes the redundant code that created another folder inside the Synapse output folder to store outputs

Testing:

  • Unit tests
  • Tested on a dataset; the expected issues were logged
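As a rough illustration of the validation described above, the sketch below checks that the four neoantigen columns named in the Problem section appear in the header of a tab-delimited MAF file. The helper name `find_missing_neoantigen_columns` and the sample data are hypothetical and are not taken from validate.py. It also assumes the header is the first line of the file, which may not hold for a MAF that starts with comment lines.

```python
import csv
import io

# The four neoantigen columns listed in the PR description.
NEOANTIGEN_COLS = [
    "Peptide",
    "HLA_Allele",
    "MHCflurry_2.1.1_affinity_nm",
    "MHCflurry_2.1.1_presentation_score",
]


def find_missing_neoantigen_columns(maf_text: str) -> list[str]:
    """Return the neoantigen columns absent from the MAF header row.

    Hypothetical helper for illustration; assumes the first line of
    maf_text is the tab-delimited header.
    """
    reader = csv.reader(io.StringIO(maf_text), delimiter="\t")
    header = next(reader, [])  # empty input yields an empty header
    return [col for col in NEOANTIGEN_COLS if col not in header]


# Made-up sample whose header lacks the two MHCflurry columns.
sample = "Hugo_Symbol\tPeptide\tHLA_Allele\nTP53\tSIINFEKL\tHLA-A*02:01\n"
missing = find_missing_neoantigen_columns(sample)
```

Returning the missing names, rather than raising, lets the caller decide whether to log the issue or stop the export.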

@rxu17 rxu17 requested a review from a team as a code owner September 29, 2025 03:42

dpulls bot commented Sep 30, 2025

🎉 All dependencies have been resolved !

"""
# TODO: Make into argument
dataset_dir = os.path.join(datahub_tools_path, "add-clinical-header", dataset_name)
# see if dataset_folder exists
Collaborator Author

Removed this code: we only need to pass the output folder's Synapse ID directly instead of trying to create a folder inside the project.

@rxu17 rxu17 requested a review from danlu1 October 1, 2025 20:43
Copilot AI review requested due to automatic review settings October 1, 2025 21:40
Copilot AI left a comment

Pull Request Overview

This PR adjusts MAF (Mutation Annotation Format) validation by moving validation logic to a dedicated validate.py module and adding support for new neoantigen-related columns. The changes centralize validation functionality and extend the required MAF column list to include neoantigen variables.

  • Moved MAF column validation from maf.py to validate.py module for better organization
  • Added four new neoantigen-related columns to the required MAF columns list
  • Removed redundant Synapse folder creation logic in the load module

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/iatlascbioportalexport/validate.py | Added REQUIRED_MAF_COLS constant with neoantigen columns and moved validation function from maf.py |
| src/iatlascbioportalexport/maf.py | Removed REQUIRED_MAF_COLS constant and validate_that_required_columns_are_present function |
| src/iatlascbioportalexport/load.py | Simplified Synapse storage by removing dataset folder creation logic |
| tests/test_validate.py | Added tests for the moved validation function and updated existing test parameters |
| tests/test_maf.py | Removed tests for the function that was moved to validate.py |
| src/iatlascbioportalexport/utils.py | Removed cbioportal_validator_output.txt from required output files |
| pyproject.toml | Added Python version constraint and pyyaml dependency |
| README.md | Fixed command line argument name in documentation |


pyproject.toml Outdated
dependencies = [
"synapseclient[pandas]>=4,<5",
"pandas>=2.2",
"pyyaml=6.0"
Copilot AI Oct 1, 2025

The dependency specification uses '=' instead of the correct '==' operator for version pinning. This should be 'pyyaml==6.0' to properly specify the exact version.

Suggested change
- "pyyaml=6.0"
+ "pyyaml==6.0"

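To make the point about '=' versus '==' concrete, here is a simplified sketch that checks whether each version clause in a dependency string starts with one of the comparison operators defined by PEP 440. The function name `has_valid_specifiers` and both regular expressions are illustrative assumptions. The sketch ignores environment markers and URL requirements, so it is not a substitute for a real requirement parser such as the one in the `packaging` library.

```python
import re

# Version comparison operators defined by PEP 440, followed by a version.
_CLAUSE = re.compile(r"^(===|==|!=|<=|>=|~=|<|>)\s*[A-Za-z0-9.*+!_-]+$")
# Simplified shape: project name, optional [extras], then version clauses.
_REQUIREMENT = re.compile(r"^([A-Za-z0-9._-]+)\s*(\[[^\]]*\])?\s*(.*)$")


def has_valid_specifiers(requirement: str) -> bool:
    """Return True if every comma-separated version clause is well formed.

    Hypothetical helper for illustration only.
    """
    match = _REQUIREMENT.match(requirement.strip())
    if match is None:
        return False
    spec = match.group(3).strip()
    if not spec:
        return True  # bare project name with no version constraint
    return all(_CLAUSE.match(clause.strip()) for clause in spec.split(","))


ok_pinned = has_valid_specifiers("pyyaml==6.0")
bad_pinned = has_valid_specifiers("pyyaml=6.0")
```

With the strings from the diff above, the single '=' form is rejected while the '==' form and the two range constraints are accepted.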
)

def validate_that_required_columns_are_present(
input_df: pd.DataFrame, dataset_file_name : str, required_cols : list, **kwargs
Copilot AI Oct 1, 2025

There's inconsistent spacing around the colon in the parameter 'dataset_file_name : str'. It should be 'dataset_file_name: str' to follow Python PEP 8 style guidelines.

Suggested change
- input_df: pd.DataFrame, dataset_file_name : str, required_cols : list, **kwargs
+ input_df: pd.DataFrame, dataset_file_name: str, required_cols: list, **kwargs

Comment on lines +269 to +271
input_df = all_files["data_mutations.txt"],
dataset_file_name="data_mutations.txt",
required_cols = REQUIRED_MAF_COLS,
Copilot AI Oct 1, 2025

Inconsistent spacing around assignment operators in function call arguments. Remove spaces around '=' for 'input_df' and 'required_cols' parameters to follow Python conventions.

Suggested change
- input_df = all_files["data_mutations.txt"],
- dataset_file_name="data_mutations.txt",
- required_cols = REQUIRED_MAF_COLS,
+ input_df=all_files["data_mutations.txt"],
+ dataset_file_name="data_mutations.txt",
+ required_cols=REQUIRED_MAF_COLS,

  f"{dataset_dir}/{file}",
  name=file,
- parent=dataset_folder_id,
+ parent=output_folder_synid,

Do you want to save all files at the same level of the repo?

Collaborator Author

Yes, that is the expected folder structure. Case list files get their own folder within the output folder.

@danlu1 danlu1 left a comment

LGTM just a few minor comments

  "meta_gene_signatures.txt",
  "data_rna_seq_mrna.txt",
  "meta_rna_seq_mrna.txt",
- "cbioportal_validator_output.txt",

Did you track why this file is removed?

Collaborator Author

I think my comment got lost in the old orca-recipes repo, but this is removed because it's not a required file that we need to run validation on haha

rxu17 commented Oct 1, 2025

Linking old closed PR in orca-recipes that I migrated here after I migrated the code: Sage-Bionetworks-Workflows/orca-recipes#125

Copilot AI review requested due to automatic review settings October 1, 2025 23:50
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.



Comment on lines +170 to +171
if set(required_cols) != set(list(input_df.columns)):
missing_cols = set(required_cols) - set(list(input_df.columns))
Copilot AI Oct 1, 2025

The validation logic is incorrect. This will fail if the input DataFrame has additional columns beyond the required ones. It should check if required columns are a subset of the DataFrame columns instead of exact equality.

Suggested change
- if set(required_cols) != set(list(input_df.columns)):
- missing_cols = set(required_cols) - set(list(input_df.columns))
+ if not set(required_cols).issubset(set(input_df.columns)):
+ missing_cols = set(required_cols) - set(input_df.columns)

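The difference between the two checks can be seen in a small sketch that works on plain lists of column names, so it runs without pandas. With a DataFrame, `input_df.columns` could be passed in place of the `columns` list. The helper name `missing_required_columns` and the example column names are made up for illustration and do not come from validate.py.

```python
def missing_required_columns(columns, required_cols):
    """Return required columns absent from columns, sorted for stable output."""
    return sorted(set(required_cols) - set(columns))


required = ["Hugo_Symbol", "Peptide", "HLA_Allele"]
# Made-up header with one extra, non-required column.
columns = ["Hugo_Symbol", "Peptide", "HLA_Allele", "Extra_Annotation"]

# Exact equality flags the input even though no required column is missing.
equality_flags_problem = set(required) != set(columns)
# The subset test flags the input only when a required column is absent.
subset_flags_problem = not set(required).issubset(columns)

missing = missing_required_columns(columns, required)
```

Under the equality check, an input with extra columns enters the error branch with an empty set of missing columns; the subset check avoids that case.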
Comment on lines +86 to +88
df,
required_cols = validate.REQUIRED_MAF_COLS,
dataset_file_name = "data_mutations.txt")
Copilot AI Oct 1, 2025

Remove trailing whitespace after 'df,' on line 86 and use consistent spacing around '=' in function arguments.

Suggested change
- df,
- required_cols = validate.REQUIRED_MAF_COLS,
- dataset_file_name = "data_mutations.txt")
+ df,
+ required_cols=validate.REQUIRED_MAF_COLS,
+ dataset_file_name="data_mutations.txt")

@rxu17 rxu17 merged commit 75a6eb1 into main Oct 1, 2025
4 checks passed
@rxu17 rxu17 deleted the dpe-1448-adjust-maf-validation branch October 1, 2025 23:52