Conversation

@hrshdhgd
Contributor

No description provided.

@realmarcin
Collaborator

This transform works on NERSC and produces correct nodes and edges. However, there is a critical data gap: we did not find a way to obtain the membership of taxon-specific UniProt proteins in the UniRef90 clusters. We can therefore merge the UniRef90 transform, but for now it will be unlinked from taxa and proteins.
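One plausible way to close the cluster-membership gap described above is the UniProt REST service, which can list the members of a UniRef cluster. The sketch below is not kg-microbe code: the endpoint path and the JSON field names (`memberId`, `organismTaxId`) are assumptions that should be verified against a live response before relying on them.

```python
# Hedged sketch: recovering (UniProt accession, NCBI taxon) pairs for a
# UniRef90 cluster via the UniProt REST API. Endpoint path and field names
# are assumptions to verify, not confirmed kg-microbe behavior.
import json
from urllib.request import urlopen


def parse_uniref_members(payload: dict) -> list[dict]:
    """Pull (accession, taxon id) pairs out of an assumed member payload."""
    return [
        {"member_id": m.get("memberId"), "ncbitaxon_id": m.get("organismTaxId")}
        for m in payload.get("results", [])
    ]


def fetch_uniref90_members(cluster_id: str, size: int = 500) -> list[dict]:
    """Fetch member records for one UniRef90 cluster (network call)."""
    url = f"https://rest.uniprot.org/uniref/{cluster_id}/members?size={size}"
    with urlopen(url) as resp:
        return parse_uniref_members(json.load(resp))
```

If the payload shape holds, the parsed pairs would be exactly the taxon-to-cluster and protein-to-cluster edges the comment says are missing.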

@turbomam requested a review from Copilot on August 13, 2025 at 16:51.

Copilot AI left a comment


Pull Request Overview

This PR adds a new UniRef90 transformation module to process UniRef protein cluster data. The changes integrate UniRef data into the knowledge graph by creating nodes for NCBI taxonomy entities and protein clusters, along with edges representing taxonomic occurrence relationships.

  • Implements UnirefTransform class to process UniRef90 API subset data
  • Adds necessary constants and configurations for UniRef data integration
  • Updates merge configuration to include UniRef data sources
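The registration step in the last bullet can be sketched as a simple registry pattern. All names below (`Transform`, `DATA_SOURCES`, `run_transform`) are illustrative placeholders, not kg_microbe's actual API:

```python
# Illustrative sketch of registering a transform class in a source registry.
# Class and registry names are placeholders; kg_microbe's real API differs.

class Transform:
    """Minimal base class: subclasses implement run()."""

    def __init__(self, input_dir: str, output_dir: str):
        self.input_dir = input_dir
        self.output_dir = output_dir

    def run(self) -> None:
        raise NotImplementedError


class UnirefTransform(Transform):
    """Would parse the UniRef90 API subset and emit node/edge files."""

    def run(self) -> None:
        pass  # transformation logic would go here


# transform.py-style registry mapping source names to transform classes
DATA_SOURCES = {"uniref": UnirefTransform}


def run_transform(source: str, input_dir: str, output_dir: str) -> None:
    DATA_SOURCES[source](input_dir, output_dir).run()
```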

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.

Summary per file:

  • merge.yaml — Adds the UniRef data source configuration to the merge pipeline
  • kg_microbe/transform_utils/uniref/uniref.py — Core transformation logic for processing UniRef90 data
  • kg_microbe/transform_utils/uniref/__init__.py — Package initialization file for the UniRef transform module
  • kg_microbe/transform_utils/constants.py — Adds UniRef-specific constants and prefixes
  • kg_microbe/transform.py — Registers UnirefTransform in the available data sources


ncbitaxon_ids, ncbi_labels, strict=False
)
]
# nodes_data_to_write.append([cluster_id, CLUSTER_CATEGORY, cluster_name])

Copilot AI Aug 13, 2025


Remove commented-out code. The cluster node creation is handled in lines 92-94, making this commented line redundant.

Suggested change
# nodes_data_to_write.append([cluster_id, CLUSTER_CATEGORY, cluster_name])


progress.set_description(f"Processing Cluster: {cluster_id}")
# After each iteration, call the update method to advance the progress bar.
progress.update(2000)

Copilot AI Aug 13, 2025


The magic number 2000 for progress updates is unclear. Consider making this a named constant or comment explaining why this specific value is used.

Suggested change
progress.update(2000)
progress.update(PROGRESS_UPDATE_INTERVAL)

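The named-constant suggestion can be sketched standalone. The names here (`PROGRESS_UPDATE_INTERVAL`, `process_rows`, the callback) are illustrative, not the module's real code; `2000` is kept as the interval value the reviewed loop uses:

```python
# Sketch of the suggestion: give the magic number 2000 a name and advance
# the progress callback once per interval instead of once per row.
PROGRESS_UPDATE_INTERVAL = 2000  # rows processed between progress-bar updates


def process_rows(rows, progress_update=lambda n: None) -> int:
    """Iterate rows, calling progress_update(step) every
    PROGRESS_UPDATE_INTERVAL rows. Returns the number of rows seen."""
    count = 0
    for count, _row in enumerate(rows, start=1):
        if count % PROGRESS_UPDATE_INTERVAL == 0:
            progress_update(PROGRESS_UPDATE_INTERVAL)
    return count
```

With a tqdm bar, `progress_update` would simply be `progress.update`, matching the loop under review.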
for sublist in nodes_data_to_write
]
node_writer.writerows(nodes_data_to_write)
gc.collect()

Copilot AI Aug 13, 2025


Manual garbage collection after each row is processed may hurt performance more than it helps. Consider removing this call or running it less often (e.g., every 1000 rows).

Suggested change
gc.collect()
# Call gc.collect() every 1000 rows
if row_count % 1000 == 0:
    gc.collect()

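The batched-collection idea can be shown in isolation. `GC_INTERVAL` and `write_rows` are illustrative names (the interval value comes from the review comment); the point is that `gc.collect()` runs once per batch, not once per row:

```python
# Sketch of batched garbage collection: collect every GC_INTERVAL rows
# rather than after every write. Names are illustrative, not the module's.
import gc

GC_INTERVAL = 1000  # rows between explicit collections (value from the review)


def write_rows(rows, write_row) -> int:
    """Write rows one at a time, collecting garbage every GC_INTERVAL rows.
    Returns how many explicit collections ran."""
    collections = 0
    for row_count, row in enumerate(rows, start=1):
        write_row(row)
        if row_count % GC_INTERVAL == 0:
            gc.collect()
            collections += 1
    return collections
```

Note that CPython's collector runs automatically anyway, so dropping the explicit call entirely (as the suggestion below does) is often the simpler fix.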
for ncbitaxon_id in ncbitaxon_ids
]
edge_writer.writerows(edges_data_to_write)
gc.collect()

Copilot AI Aug 13, 2025


Manual garbage collection after each row is processed may hurt performance more than it helps. Consider removing this call or running it less often (e.g., every 1000 rows).

Suggested change
gc.collect()
