UniRef90 transform #170
Conversation
This transform works on NERSC and produces correct edges and nodes. However, there is a critical data gap: we did not find a way to obtain the membership of taxon-specific UniProt proteins in the UniRef90 clusters. We can therefore merge the UniRef90 transform, but for now it will be unlinked from taxa and proteins.
Pull Request Overview
This PR adds a new UniRef90 transformation module to process UniRef protein cluster data. The changes integrate UniRef data into the knowledge graph by creating nodes for NCBI taxonomy entities and protein clusters, along with edges representing taxonomic occurrence relationships.
- Implements UnirefTransform class to process UniRef90 API subset data
- Adds necessary constants and configurations for UniRef data integration
- Updates merge configuration to include UniRef data sources
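As a rough illustration of the node and edge output described above (the file names, CURIE prefixes, and Biolink categories here are hypothetical; the transform's actual columns and predicates may differ), a KGX-style TSV write might look like:

```python
import csv

# Hypothetical sketch: KGX-style node/edge TSVs linking a UniRef90
# cluster to an NCBI taxon. Identifiers and categories are examples only.
nodes = [
    ["NCBITaxon:562", "biolink:OrganismTaxon", "Escherichia coli"],
    ["UniRef:UniRef90_P0A7G6", "biolink:ProteinFamily", "Cluster: RecA"],
]
edges = [
    ["UniRef:UniRef90_P0A7G6", "biolink:occurs_in", "NCBITaxon:562"],
]

with open("nodes.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["id", "category", "name"])
    writer.writerows(nodes)

with open("edges.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["subject", "predicate", "object"])
    writer.writerows(edges)
```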
Reviewed Changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| merge.yaml | Adds UniRef data source configuration to the merge pipeline |
| kg_microbe/transform_utils/uniref/uniref.py | Core transformation logic for processing UniRef90 data |
| kg_microbe/transform_utils/uniref/\_\_init\_\_.py | Package initialization file for UniRef transform module |
| kg_microbe/transform_utils/constants.py | Adds UniRef-specific constants and prefixes |
| kg_microbe/transform.py | Registers UnirefTransform in the available data sources |
```python
        ncbitaxon_ids, ncbi_labels, strict=False
    )
]
# nodes_data_to_write.append([cluster_id, CLUSTER_CATEGORY, cluster_name])
```
Copilot AI · Aug 13, 2025
Remove commented-out code. The cluster node creation is handled in lines 92-94, making this commented line redundant.
Suggested change:
```diff
- # nodes_data_to_write.append([cluster_id, CLUSTER_CATEGORY, cluster_name])
```
```python
progress.set_description(f"Processing Cluster: {cluster_id}")
# After each iteration, call the update method to advance the progress bar.
progress.update(2000)
```
Copilot AI · Aug 13, 2025
The magic number 2000 for progress updates is unclear. Consider making this a named constant or adding a comment explaining why this specific value is used.
Suggested change:
```diff
- progress.update(2000)
+ progress.update(PROGRESS_UPDATE_INTERVAL)
```
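A minimal sketch of the suggestion, assuming the loop advances in fixed-size chunks. `PROGRESS_UPDATE_INTERVAL` and `process_clusters` are hypothetical names; the `update` callback stands in for `tqdm`'s `progress.update`:

```python
# Hypothetical constant: one progress tick per this many clusters.
# Name it so the chunk size is documented in one place.
PROGRESS_UPDATE_INTERVAL = 2000

def process_clusters(cluster_ids, update=lambda n: None):
    """Process clusters, advancing the progress callback per interval."""
    processed = 0
    for i, cluster_id in enumerate(cluster_ids, start=1):
        processed += 1  # real transform work would happen here
        if i % PROGRESS_UPDATE_INTERVAL == 0:
            update(PROGRESS_UPDATE_INTERVAL)
    # Flush the remainder so the progress bar reaches 100%.
    remainder = processed % PROGRESS_UPDATE_INTERVAL
    if remainder:
        update(remainder)
    return processed
```

With `tqdm`, the same pattern would pass `progress.update` as the callback so the bar advances once per chunk instead of once per item.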
```python
    for sublist in nodes_data_to_write
]
node_writer.writerows(nodes_data_to_write)
gc.collect()
```
Copilot AI · Aug 13, 2025
Manual garbage collection after each row may hurt performance more than it helps. Consider removing this call or invoking it less frequently (e.g., every 1000 rows).
Suggested change:
```diff
- gc.collect()
+ # Call gc.collect() every 1000 rows
+ if row_count % 1000 == 0:
+     gc.collect()
```
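A self-contained sketch of the batched-collection pattern the suggestion describes. `GC_INTERVAL` and `write_rows` are hypothetical names; `writer` stands in for the transform's CSV writer:

```python
import gc

# Hypothetical constant: trigger garbage collection every N rows
# instead of after every row, amortizing the cost of gc.collect().
GC_INTERVAL = 1000

def write_rows(rows, writer):
    """Write rows, collecting garbage only every GC_INTERVAL rows."""
    collections = 0
    for row_count, row in enumerate(rows, start=1):
        writer(row)
        if row_count % GC_INTERVAL == 0:
            gc.collect()
            collections += 1
    return collections
```

The modulo check keeps the hot loop cheap: `gc.collect()` walks all generations and is expensive, so calling it once per batch rather than once per row preserves the memory benefit while removing most of the overhead.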
```python
    for ncbitaxon_id in ncbitaxon_ids
]
edge_writer.writerows(edges_data_to_write)
gc.collect()
```
Copilot AI · Aug 13, 2025
Manual garbage collection after each row may hurt performance more than it helps. Consider removing this call or invoking it less frequently (e.g., every 1000 rows).
Suggested change:
```diff
- gc.collect()
```