-
Notifications
You must be signed in to change notification settings - Fork 4
fixes to ingest all HGNC genes for uniprot mappings #290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
bsantan
commented
Feb 25, 2025
- HGNC gene IDs were only being mapped for those that exist in the MONDO ontology
- Now ingest the HGNC OWL file, and map all human Uniprot IDs to HGNC IDs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR transitions from using MONDO ontology for HGNC gene mappings to directly using Node Normalizer API to map all human UniProt proteins to their corresponding HGNC gene IDs. The change addresses the limitation where only genes present in MONDO were being mapped, now enabling complete HGNC gene coverage for human UniProt entries.
Key changes:
- Replaced MONDO-based gene mapping with Node Normalizer API calls
- Added HGNC ontology processing to support comprehensive gene mappings
- Updated Node Normalizer API endpoint to version 1.5
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| kg_microbe/utils/uniprot_utils.py | Implements Node Normalizer API function and updates gene parsing logic |
| kg_microbe/transform_utils/wallen_etal/wallen_etal.py | Adds debugging statement |
| kg_microbe/transform_utils/uniprot_human/uniprot_human.py | Removes mondo_gene_dict parameter from function calls |
| kg_microbe/transform_utils/ontologies/ontologies_transform.py | Adds HGNC ontology processing and filters out gene property nodes |
| kg_microbe/transform_utils/constants.py | Adds HGNC-related prefixes and updates Node Normalizer URL |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| wallen_etal_df = pd.read_excel(input_file, skiprows=3, sheet_name=WALLEN_ETAL_TAB_NAME) | ||
| wallen_etal_df[FDR_COLUMN] = pd.to_numeric(wallen_etal_df[FDR_COLUMN], errors="coerce") | ||
|
|
||
| import pdb;pdb.set_trace() |
Copilot
AI
Aug 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debug statement with pdb.set_trace() should not be committed to production code. This will cause the program to halt execution and wait for debugger input.
| import pdb;pdb.set_trace() |
| CHEBI_REGEX = re.compile(r'/ligand_id="ChEBI:(.*?)";') | ||
| GO_REGEX = re.compile(r"\[(.*?)\]") | ||
|
|
||
| # Takes cure in the form PREFIX:ID |
Copilot
AI
Aug 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in comment: 'cure' should be 'curie'.
| # Takes cure in the form PREFIX:ID | |
| # Takes curie in the form PREFIX:ID |
| except TypeError: | ||
| return None | ||
|
|
||
| else: |
Copilot
AI
Aug 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function lacks proper error handling for HTTP requests. If the API is unavailable or returns an error status, the function will raise an exception without providing a meaningful error message to help with debugging.
| else: | |
| url = NODE_NORMALIZER_URL + node_curie | |
| try: | |
| # Make the HTTP request to NodeNormalizer | |
| response = requests.get(url, timeout=30) | |
| response.raise_for_status() | |
| # Write response to file if it contains data | |
| entries = response.json()[node_curie] | |
| if len(entries) > 1: # .strip().split("\n") | |
| for iden in entries["equivalent_identifiers"]: | |
| if iden["identifier"].split(":")[0] + ":" == HGNC_NEW_PREFIX: | |
| norm_node = iden["identifier"] | |
| return norm_node | |
| else: | |
| return None | |
| except requests.exceptions.RequestException as e: | |
| logging.error(f"HTTP request failed for node_curie '{node_curie}' at URL '{url}': {e}") | |
| return None | |
| except (ValueError, KeyError, TypeError) as e: | |
| logging.error(f"Error processing response for node_curie '{node_curie}' at URL '{url}': {e}") |
| response.raise_for_status() | ||
|
|
||
| # Write response to file if it contains data | ||
| entries = response.json()[node_curie] |
Copilot
AI
Aug 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line assumes the response JSON contains the node_curie key, but this could fail if the API response structure is different or if the key doesn't exist, potentially causing a KeyError.
| entries = response.json()[node_curie] | |
| entries = response.json().get(node_curie) | |
| if entries is None: | |
| return None |
| # Write response to file if it contains data | ||
| entries = response.json()[node_curie] | ||
| try: | ||
| if len(entries) > 1: # .strip().split("\n") |
Copilot
AI
Aug 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition len(entries) > 1 doesn't make logical sense here. The code is checking if entries has more than one element, but then accesses entries["equivalent_identifiers"], suggesting entries is a dictionary. The logic should verify entries is not None and contains the expected structure.
| if len(entries) > 1: # .strip().split("\n") | |
| if entries and "equivalent_identifiers" in entries and entries["equivalent_identifiers"]: |
| # Rewrite edges file | ||
| with open(edges_file, "w") as new_ef: | ||
| new_ef.write("\t".join(self.edge_header) + "\n") | ||
| # new_ef.write("\t".join(self.edge_header) + "\n") |
Copilot
AI
Aug 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented-out code should be removed rather than left in the codebase. If this header writing logic is needed, it should be uncommented and used, otherwise it should be deleted.
| # new_ef.write("\t".join(self.edge_header) + "\n") |