UniRef90 transform #170
Conversation
This transform works on NERSC and produces correct edges and nodes. However, there is a critical data gap: we did not find a way to obtain the membership of taxon-specific UniProt proteins in the UniRef90 clusters. We can therefore merge the UniRef90 transform, but for now it will be unlinked from taxa and proteins.
Pull Request Overview
This PR adds a new UniRef90 transformation module to process UniRef protein cluster data. The changes integrate UniRef data into the knowledge graph by creating nodes for NCBI taxonomy entities and protein clusters, along with edges representing taxonomic occurrence relationships.
- Implements UnirefTransform class to process UniRef90 API subset data
- Adds necessary constants and configurations for UniRef data integration
- Updates merge configuration to include UniRef data sources
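As a rough illustration of the node and edge output described above (the file names, CURIE prefixes, and Biolink categories here are hypothetical; the transform's actual columns and predicates may differ), a KGX-style TSV write might look like:

```python
import csv

# Hypothetical sketch: KGX-style node/edge TSVs linking a UniRef90
# cluster to an NCBI taxon. Identifiers and categories are examples only.
nodes = [
    ["NCBITaxon:562", "biolink:OrganismTaxon", "Escherichia coli"],
    ["UniRef:UniRef90_P0A7G6", "biolink:ProteinFamily", "Cluster: RecA"],
]
edges = [
    ["UniRef:UniRef90_P0A7G6", "biolink:occurs_in", "NCBITaxon:562"],
]

with open("nodes.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["id", "category", "name"])
    writer.writerows(nodes)

with open("edges.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["subject", "predicate", "object"])
    writer.writerows(edges)
```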
Reviewed Changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| merge.yaml | Adds UniRef data source configuration to the merge pipeline |
| kg_microbe/transform_utils/uniref/uniref.py | Core transformation logic for processing UniRef90 data |
| kg_microbe/transform_utils/uniref/\_\_init\_\_.py | Package initialization file for UniRef transform module |
| kg_microbe/transform_utils/constants.py | Adds UniRef-specific constants and prefixes |
| kg_microbe/transform.py | Registers UnirefTransform in the available data sources |
```python
        ncbitaxon_ids, ncbi_labels, strict=False
    )
]
# nodes_data_to_write.append([cluster_id, CLUSTER_CATEGORY, cluster_name])
```
Copilot AI · Aug 13, 2025
Remove commented-out code. The cluster node creation is handled in lines 92-94, making this commented line redundant.
Suggested change:
```diff
- # nodes_data_to_write.append([cluster_id, CLUSTER_CATEGORY, cluster_name])
```
```python
progress.set_description(f"Processing Cluster: {cluster_id}")
# After each iteration, call the update method to advance the progress bar.
progress.update(2000)
```
Copilot AI · Aug 13, 2025
The magic number 2000 for progress updates is unclear. Consider making this a named constant or adding a comment explaining why this specific value is used.
Suggested change:
```diff
- progress.update(2000)
+ progress.update(PROGRESS_UPDATE_INTERVAL)
```
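A minimal sketch of the suggestion, assuming the loop advances in fixed-size chunks. `PROGRESS_UPDATE_INTERVAL` and `process_clusters` are hypothetical names; the `update` callback stands in for `tqdm`'s `progress.update`:

```python
# Hypothetical constant: one progress tick per this many clusters.
# Name it so the chunk size is documented in one place.
PROGRESS_UPDATE_INTERVAL = 2000

def process_clusters(cluster_ids, update=lambda n: None):
    """Process clusters, advancing the progress callback per interval."""
    processed = 0
    for i, cluster_id in enumerate(cluster_ids, start=1):
        processed += 1  # real transform work would happen here
        if i % PROGRESS_UPDATE_INTERVAL == 0:
            update(PROGRESS_UPDATE_INTERVAL)
    # Flush the remainder so the progress bar reaches 100%.
    remainder = processed % PROGRESS_UPDATE_INTERVAL
    if remainder:
        update(remainder)
    return processed
```

With `tqdm`, the same pattern would pass `progress.update` as the callback so the bar advances once per chunk instead of once per item.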
```python
    for sublist in nodes_data_to_write
]
node_writer.writerows(nodes_data_to_write)
gc.collect()
```
Copilot AI · Aug 13, 2025
Manual garbage collection after each row may hurt performance more than it helps. Consider removing this call or invoking it less frequently (e.g., every 1000 rows).
Suggested change:
```diff
- gc.collect()
+ # Call gc.collect() every 1000 rows
+ if row_count % 1000 == 0:
+     gc.collect()
```
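A self-contained sketch of the batched-collection pattern the suggestion describes. `GC_INTERVAL` and `write_rows` are hypothetical names; `writer` stands in for the transform's CSV writer:

```python
import gc

# Hypothetical constant: trigger garbage collection every N rows
# instead of after every row, amortizing the cost of gc.collect().
GC_INTERVAL = 1000

def write_rows(rows, writer):
    """Write rows, collecting garbage only every GC_INTERVAL rows."""
    collections = 0
    for row_count, row in enumerate(rows, start=1):
        writer(row)
        if row_count % GC_INTERVAL == 0:
            gc.collect()
            collections += 1
    return collections
```

The modulo check keeps the hot loop cheap: `gc.collect()` walks all generations and is expensive, so calling it once per batch rather than once per row preserves the memory benefit while removing most of the overhead.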
```python
    for ncbitaxon_id in ncbitaxon_ids
]
edge_writer.writerows(edges_data_to_write)
gc.collect()
```
Copilot AI · Aug 13, 2025
Manual garbage collection after each row may hurt performance more than it helps. Consider removing this call or invoking it less frequently (e.g., every 1000 rows).
Suggested change:
```diff
- gc.collect()
```