fixes to ingest all HGNC genes for uniprot mappings #290

bsantan · 2025-02-25T17:12:01Z

HGNC gene IDs were only being mapped for those that exist in the MONDO ontology
Now ingest the HGNC OWL file, and map all human Uniprot IDs to HGNC IDs

Copilot

Pull Request Overview

This PR transitions from using MONDO ontology for HGNC gene mappings to directly using Node Normalizer API to map all human UniProt proteins to their corresponding HGNC gene IDs. The change addresses the limitation where only genes present in MONDO were being mapped, now enabling complete HGNC gene coverage for human UniProt entries.

Key changes:

Replaced MONDO-based gene mapping with Node Normalizer API calls
Added HGNC ontology processing to support comprehensive gene mappings
Updated Node Normalizer API endpoint to version 1.5

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
kg_microbe/utils/uniprot_utils.py	Implements Node Normalizer API function and updates gene parsing logic
kg_microbe/transform_utils/wallen_etal/wallen_etal.py	Adds debugging statement
kg_microbe/transform_utils/uniprot_human/uniprot_human.py	Removes mondo_gene_dict parameter from function calls
kg_microbe/transform_utils/ontologies/ontologies_transform.py	Adds HGNC ontology processing and filters out gene property nodes
kg_microbe/transform_utils/constants.py	Adds HGNC-related prefixes and updates Node Normalizer URL

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-08-13T17:09:10Z

kg_microbe/transform_utils/wallen_etal/wallen_etal.py

        wallen_etal_df = pd.read_excel(input_file, skiprows=3, sheet_name=WALLEN_ETAL_TAB_NAME)
        wallen_etal_df[FDR_COLUMN] = pd.to_numeric(wallen_etal_df[FDR_COLUMN], errors="coerce")
-
+        import pdb;pdb.set_trace()


Debug statement with pdb.set_trace() should not be committed to production code. This will cause the program to halt execution and wait for debugger input.

Suggested change

import pdb;pdb.set_trace()

Copilot · 2025-08-13T17:09:11Z

kg_microbe/utils/uniprot_utils.py

 CHEBI_REGEX = re.compile(r'/ligand_id="ChEBI:(.*?)";')
 GO_REGEX = re.compile(r"\[(.*?)\]")

+# Takes cure in the form PREFIX:ID


Typo in comment: 'cure' should be 'curie'.

Suggested change

# Takes cure in the form PREFIX:ID

# Takes curie in the form PREFIX:ID

Copilot · 2025-08-13T17:09:11Z

kg_microbe/utils/uniprot_utils.py

+    except TypeError:
+        return None
+
+    else:


The function lacks proper error handling for HTTP requests. If the API is unavailable or returns an error status, the function will raise an exception without providing a meaningful error message to help with debugging.

Suggested change

else:

url = NODE_NORMALIZER_URL + node_curie

try:

# Make the HTTP request to NodeNormalizer

response = requests.get(url, timeout=30)

response.raise_for_status()

# Write response to file if it contains data

entries = response.json()[node_curie]

if len(entries) > 1: # .strip().split("\n")

for iden in entries["equivalent_identifiers"]:

if iden["identifier"].split(":")[0] + ":" == HGNC_NEW_PREFIX:

norm_node = iden["identifier"]

return norm_node

else:

return None

except requests.exceptions.RequestException as e:

logging.error(f"HTTP request failed for node_curie '{node_curie}' at URL '{url}': {e}")

return None

except (ValueError, KeyError, TypeError) as e:

logging.error(f"Error processing response for node_curie '{node_curie}' at URL '{url}': {e}")

Copilot · 2025-08-13T17:09:11Z

kg_microbe/utils/uniprot_utils.py

+    response.raise_for_status()
+
+    # Write response to file if it contains data
+    entries = response.json()[node_curie]


This line assumes the response JSON contains the node_curie key, but this could fail if the API response structure is different or if the key doesn't exist, potentially causing a KeyError.

Suggested change

entries = response.json()[node_curie]

entries = response.json().get(node_curie)

if entries is None:

return None

Copilot · 2025-08-13T17:09:12Z

kg_microbe/utils/uniprot_utils.py

+    # Write response to file if it contains data
+    entries = response.json()[node_curie]
+    try:
+        if len(entries) > 1:  # .strip().split("\n")


The condition len(entries) > 1 doesn't make logical sense here. The code is checking if entries has more than one element, but then accesses entries["equivalent_identifiers"], suggesting entries is a dictionary. The logic should verify entries is not None and contains the expected structure.

Suggested change

if len(entries) > 1: # .strip().split("\n")

if entries and "equivalent_identifiers" in entries and entries["equivalent_identifiers"]:

Copilot · 2025-08-13T17:09:12Z

kg_microbe/transform_utils/ontologies/ontologies_transform.py

            # Rewrite edges file
            with open(edges_file, "w") as new_ef:
-                new_ef.write("\t".join(self.edge_header) + "\n")
+                # new_ef.write("\t".join(self.edge_header) + "\n")


Commented-out code should be removed rather than left in the codebase. If this header writing logic is needed, it should be uncommented and used, otherwise it should be deleted.

Suggested change

# new_ef.write("\t".join(self.edge_header) + "\n")

fixes to ingest all HGNC genes for uniprot mappings

620287c

turbomam requested a review from Copilot August 13, 2025 17:07

Copilot AI reviewed Aug 13, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fixes to ingest all HGNC genes for uniprot mappings #290

fixes to ingest all HGNC genes for uniprot mappings #290

Uh oh!

bsantan commented Feb 25, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Aug 13, 2025

Uh oh!

Copilot AI Aug 13, 2025

Uh oh!

Copilot AI Aug 13, 2025

Uh oh!

Copilot AI Aug 13, 2025

Uh oh!

Copilot AI Aug 13, 2025

Uh oh!

Copilot AI Aug 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	# Takes cure in the form PREFIX:ID
	# Takes curie in the form PREFIX:ID

-    else:
+    url = NODE_NORMALIZER_URL + node_curie
+    try:
+        # Make the HTTP request to NodeNormalizer
+        response = requests.get(url, timeout=30)
+        response.raise_for_status()
+        # Write response to file if it contains data
+        entries = response.json()[node_curie]
+        if len(entries) > 1:  # .strip().split("\n")
+            for iden in entries["equivalent_identifiers"]:
+                if iden["identifier"].split(":")[0] + ":" == HGNC_NEW_PREFIX:
+                    norm_node = iden["identifier"]
+                    return norm_node
+        else:
+            return None
+    except requests.exceptions.RequestException as e:
+        logging.error(f"HTTP request failed for node_curie '{node_curie}' at URL '{url}': {e}")
+        return None
+    except (ValueError, KeyError, TypeError) as e:
+        logging.error(f"Error processing response for node_curie '{node_curie}' at URL '{url}': {e}")

	if len(entries) > 1: # .strip().split("\n")
	if entries and "equivalent_identifiers" in entries and entries["equivalent_identifiers"]:

fixes to ingest all HGNC genes for uniprot mappings #290

Are you sure you want to change the base?

fixes to ingest all HGNC genes for uniprot mappings #290

Uh oh!

Conversation

bsantan commented Feb 25, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants