[DPE-1448] Adjust MAF validation by rxu17 · Pull Request #1 · Sage-Bionetworks/iatlas-cbioportal-export

rxu17 · 2025-09-29T03:42:37Z

Problem:

We are adding four new neoantigen variables to every MAF dataset: Peptide, HLA_Allele, MHCflurry_2.1.1_affinity_nm, and MHCflurry_2.1.1_presentation_score. Since there is no longer a neoantigen data format (generic assay) for the cBioPortal validator to check, we need to add our own validation for these columns.

Depends on #2

Solution:

Add validation for the new columns and move the validation code into the new validate.py.

Extras:

Removes the code that created another folder inside the output folder on Synapse to store outputs (redundant).

Testing:

Unit tests
Tested on dataset with expected issues logged
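
As a rough illustration of the check described under Solution, the sketch below reads only the header row of a tab-delimited MAF file and reports which of the four neoantigen columns are absent. The function name `find_missing_neoantigen_cols`, the constant `NEOANTIGEN_COLS`, and the logging call are assumptions made for this example; they are not taken from validate.py.

```python
import csv
import io
import logging

logger = logging.getLogger(__name__)

# The four neoantigen columns named in the PR description.
NEOANTIGEN_COLS = [
    "Peptide",
    "HLA_Allele",
    "MHCflurry_2.1.1_affinity_nm",
    "MHCflurry_2.1.1_presentation_score",
]


def find_missing_neoantigen_cols(maf_handle, dataset_file_name):
    """Return the neoantigen columns absent from a tab-delimited MAF header.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Skip leading comment lines (e.g. "#version 2.4") before the header row.
    header = []
    for row in csv.reader(maf_handle, delimiter="\t"):
        if row and not row[0].startswith("#"):
            header = row
            break
    missing = [col for col in NEOANTIGEN_COLS if col not in header]
    if missing:
        logger.error("%s is missing required columns: %s", dataset_file_name, missing)
    return missing


# Small in-memory example with two of the four columns present.
example = io.StringIO(
    "Hugo_Symbol\tPeptide\tHLA_Allele\n"
    "TP53\tSIINFEKL\tHLA-A*02:01\n"
)
missing_cols = find_missing_neoantigen_cols(example, "data_mutations.txt")
```

Here `missing_cols` holds the two MHCflurry columns, and the same message is written to the log so the issue shows up alongside other validation output.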

dpulls · 2025-09-30T19:27:29Z

🎉 All dependencies have been resolved !

rxu17 · 2025-09-30T21:22:42Z

src/iatlascbioportalexport/load.py

    """
    # TODO: Make into argument
    dataset_dir = os.path.join(datahub_tools_path, "add-clinical-header", dataset_name)
-    # see if dataset_folder exists


Removed this code because we only need to pass the output folder's Synapse ID directly instead of trying to create a folder inside the project.

Copilot

Pull Request Overview

This PR adjusts MAF (Mutation Annotation Format) validation by moving validation logic to a dedicated validate.py module and adding support for new neoantigen-related columns. The changes centralize validation functionality and extend the required MAF column list to include neoantigen variables.

Moved MAF column validation from maf.py to validate.py module for better organization
Added four new neoantigen-related columns to the required MAF columns list
Removed redundant Synapse folder creation logic in the load module

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

File	Description
src/iatlascbioportalexport/validate.py	Added REQUIRED_MAF_COLS constant with neoantigen columns and moved validation function from maf.py
src/iatlascbioportalexport/maf.py	Removed REQUIRED_MAF_COLS constant and validate_that_required_columns_are_present function
src/iatlascbioportalexport/load.py	Simplified Synapse storage by removing dataset folder creation logic
tests/test_validate.py	Added tests for the moved validation function and updated existing test parameters
tests/test_maf.py	Removed tests for the function that was moved to validate.py
src/iatlascbioportalexport/utils.py	Removed cbioportal_validator_output.txt from required output files
pyproject.toml	Added Python version constraint and pyyaml dependency
README.md	Fixed command line argument name in documentation

Copilot · 2025-10-01T21:41:04Z

pyproject.toml

 dependencies = [
  "synapseclient[pandas]>=4,<5",
  "pandas>=2.2",
+  "pyyaml=6.0"


The dependency specification uses '=' instead of the correct '==' operator for version pinning. This should be 'pyyaml==6.0' to properly specify the exact version.

Suggested change

-  "pyyaml=6.0"
+  "pyyaml==6.0"

Copilot · 2025-10-01T21:41:04Z

src/iatlascbioportalexport/validate.py

        )

+def validate_that_required_columns_are_present(
+    input_df: pd.DataFrame, dataset_file_name : str, required_cols : list, **kwargs


There's inconsistent spacing around the colon in the parameter 'dataset_file_name : str'. It should be 'dataset_file_name: str' to follow Python PEP 8 style guidelines.

Suggested change

-    input_df: pd.DataFrame, dataset_file_name : str, required_cols : list, **kwargs
+    input_df: pd.DataFrame, dataset_file_name: str, required_cols: list, **kwargs

Copilot · 2025-10-01T21:41:05Z

src/iatlascbioportalexport/validate.py

+        input_df = all_files["data_mutations.txt"], 
+        dataset_file_name="data_mutations.txt",
+        required_cols = REQUIRED_MAF_COLS,


Inconsistent spacing around assignment operators in function call arguments. Remove spaces around '=' for 'input_df' and 'required_cols' parameters to follow Python conventions.

Suggested change

-        input_df = all_files["data_mutations.txt"],
-        dataset_file_name="data_mutations.txt",
-        required_cols = REQUIRED_MAF_COLS,
+        input_df=all_files["data_mutations.txt"],
+        dataset_file_name="data_mutations.txt",
+        required_cols=REQUIRED_MAF_COLS,

danlu1 · 2025-10-01T22:51:42Z

src/iatlascbioportalexport/load.py

                f"{dataset_dir}/{file}",
                name=file,
-                parent=dataset_folder_id,
+                parent=output_folder_synid,


Do you want to save all files at the same level of the repo?

Yes, that is the expected folder structure. Case list files get their own folder within the output folder
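
The layout described in this reply can be sketched as a small routing function: every exported file goes directly under the output folder, except case list files, which go to their own subfolder. The helper name `choose_parent`, the `case_lists` directory name, and the placeholder IDs `syn_output` and `syn_case_lists` are assumptions for illustration and do not come from load.py.

```python
from pathlib import PurePosixPath


def choose_parent(relative_path, output_folder_synid, case_lists_folder_synid):
    """Pick the Synapse parent for one exported file (hypothetical helper)."""
    parts = PurePosixPath(relative_path).parts
    # Files under case_lists/ go to the dedicated subfolder; everything
    # else is stored directly in the output folder.
    if len(parts) > 1 and parts[0] == "case_lists":
        return case_lists_folder_synid
    return output_folder_synid


# Placeholder IDs, not real Synapse entities.
parents = {
    path: choose_parent(path, "syn_output", "syn_case_lists")
    for path in ["data_mutations.txt", "case_lists/cases_all.txt"]
}
```

Keeping the routing decision in one pure function makes the folder structure easy to unit test without touching Synapse.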

danlu1

LGTM just a few minor comments

danlu1 · 2025-10-01T23:23:27Z

src/iatlascbioportalexport/utils.py

    "meta_gene_signatures.txt",
    "data_rna_seq_mrna.txt",
    "meta_rna_seq_mrna.txt",
-    "cbioportal_validator_output.txt",


Did you track why this file is removed?

I think my comment got lost in the old orca-recipes repo, but this was removed because it's not a file we need to run validation on haha

rxu17 · 2025-10-01T23:42:57Z

Linking the old, closed PR in orca-recipes, which I moved here after migrating the code: Sage-Bionetworks-Workflows/orca-recipes#125

Copilot

Pull Request Overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Copilot · 2025-10-01T23:51:00Z

src/iatlascbioportalexport/validate.py

+    if set(required_cols) != set(list(input_df.columns)):
+        missing_cols = set(required_cols) - set(list(input_df.columns))


The validation logic is incorrect. This will fail if the input DataFrame has additional columns beyond the required ones. It should check if required columns are a subset of the DataFrame columns instead of exact equality.

Suggested change

-    if set(required_cols) != set(list(input_df.columns)):
-        missing_cols = set(required_cols) - set(list(input_df.columns))
+    if not set(required_cols).issubset(set(input_df.columns)):
+        missing_cols = set(required_cols) - set(input_df.columns)
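
To make the difference between the two conditions concrete, here is a minimal sketch that uses plain lists of column names in place of `input_df.columns`. The helper `missing_required_cols` and the example column `Extra_Annotation` are made up for this illustration.

```python
def missing_required_cols(required_cols, actual_cols):
    """Return required columns absent from actual_cols, sorted for stable output."""
    return sorted(set(required_cols) - set(actual_cols))


required = ["Hugo_Symbol", "Peptide"]
actual = ["Hugo_Symbol", "Peptide", "Extra_Annotation"]  # one extra column

# Exact set equality flags the data even though nothing is missing.
equality_flags_problem = set(required) != set(actual)
# The subset test only flags the data when a required column is absent.
subset_flags_problem = not set(required).issubset(actual)
missing = missing_required_cols(required, actual)
```

With the equality check, the extra column triggers the error branch while `missing` is empty, so the resulting message would report no missing columns at all. The subset check avoids that false alarm.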

Copilot · 2025-10-01T23:51:00Z

tests/test_validate.py

+            df, 
+            required_cols = validate.REQUIRED_MAF_COLS,
+            dataset_file_name = "data_mutations.txt")


Remove trailing whitespace after 'df,' on line 86 and use consistent spacing around '=' in function arguments.

Suggested change

-            df, 
-            required_cols = validate.REQUIRED_MAF_COLS,
-            dataset_file_name = "data_mutations.txt")
+            df,
+            required_cols=validate.REQUIRED_MAF_COLS,
+            dataset_file_name="data_mutations.txt")

remove unused code and refactor maf validation

3817213

rxu17 requested a review from a team as a code owner September 29, 2025 03:42

rxu17 commented Sep 30, 2025

rxu17 requested a review from danlu1 October 1, 2025 20:43

correct readme, add python restriction

1fb194f

Copilot AI review requested due to automatic review settings October 1, 2025 21:40

Copilot AI reviewed Oct 1, 2025

Merge branch 'main' into dpe-1448-adjust-maf-validation

59cea84

danlu1 reviewed Oct 1, 2025

danlu1 approved these changes Oct 1, 2025

rxu17 mentioned this pull request Oct 1, 2025

[DPE-1470] Convert days to months #3

Merged

correct pyyaml version

06dea06

Copilot AI review requested due to automatic review settings October 1, 2025 23:50

Copilot AI reviewed Oct 1, 2025

rxu17 merged commit 75a6eb1 into main Oct 1, 2025
4 checks passed

rxu17 deleted the dpe-1448-adjust-maf-validation branch October 1, 2025 23:52
