Skip to content

feat: generate taxa tree from assemblies and display accurate assembly count in sunburst (#522) #550

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

hunterckx
Copy link
Collaborator

Closes #522

Updates the taxa tree to be built from the assemblies dataframe rather than the NCBI tree API; this makes it straightforward for the tree data to accurately reflect the catalog data in general (notably, there was previously some disconnect between which taxa appear in the tree and which appear in the catalog), as well as making assembly counts in particular straightforward to associate with tree nodes. Additionally, dataframe grouping makes the tree structure simple to derive.

@github-actions github-actions bot added the feat label May 23, 2025
@hunterckx hunterckx marked this pull request as draft May 23, 2025 22:07
@NoopDog NoopDog requested a review from Copilot May 24, 2025 15:12
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR switches the taxa tree source to the assemblies DataFrame for consistent catalog alignment and surface accurate assembly counts throughout the UI.

  • Build scripts: generate the tree from assemblies_df, include per-node assembly_count, and add taxonomic ID columns.
  • Front-end: update TaxonomyNode interface and pluralized link text to show counts.
  • Documentation/data: refresh QC report ordering and outbreak-taxonomy mapping TSV.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
catalog/output/qc-report.md Reordered species/strain taxonomy ID lists for consistency
catalog/build/py/package/catalog_build/build.py Replaced NCBI API calls with DataFrame-based subtree builder; added get_taxonomic_level_id_key, ID fields, and assembly_count
catalog/build/intermediate/outbreak-taxonomy-mapping.tsv Updated and re-sorted outbreak taxonomy ID entries
app/components/Home/components/Section/components/SectionViz/data.ts Added assembly_count to TaxonomyNode interface
app/components/Home/components/Section/components/SectionViz/NodeDetails.tsx Pluralized link text to show the number of assemblies per node
Comments suppressed due to low confidence (2)

catalog/build/py/package/catalog_build/build.py:294

  • Add unit tests for get_species_subtree to verify it builds the correct nested structure, respects ordering, and sets assembly_count properly.
def get_species_subtree(

app/components/Home/components/Section/components/SectionViz/NodeDetails.tsx:126

  • Add tests (unit or snapshot) to confirm the link text and pluralization display correctly for 0, 1, and multiple assemblies.
filterLinkText = `View ${assemblyCount} Assembl${assemblySuffix} for ${nodeName}`;

@@ -121,7 +121,9 @@ export const NodeDetails: React.FC<NodeDetailsProps> = ({
// Create the appropriate display text based on node type
let filterLinkText = "Browse All Assemblies";
if (!isRoot) {
filterLinkText = `View assemblies for ${nodeName}`;
const assemblyCount = node.data.assembly_count;
const assemblySuffix = assemblyCount > 1 ? "ies" : "y";
Copy link
Preview

Copilot AI May 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pluralization logic treats 0 as singular ("Assembly"). Change the condition to assemblyCount !== 1 so that 0, 2+ use the plural form.

Suggested change
const assemblySuffix = assemblyCount > 1 ? "ies" : "y";
const assemblySuffix = assemblyCount !== 1 ? "ies" : "y";

Copilot uses AI. Check for mistakes.

@@ -159,26 +159,40 @@ def get_taxonomic_level_key(level):
return f"taxonomicLevel{level[0].upper()}{level[1:]}"


def get_taxonomic_level_id_key(level):
Copy link
Preview

Copilot AI May 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Add a docstring explaining that this helper constructs the field name for a taxonomic level's ID, improving function discoverability and readability.

Suggested change
def get_taxonomic_level_id_key(level):
def get_taxonomic_level_id_key(level):
"""
Constructs the field name for a taxonomic level's ID.
This helper function appends "Id" to the formatted taxonomic level key
(e.g., "taxonomicLevelGenus" becomes "taxonomicLevelGenusId").
Args:
level (str): The taxonomic level (e.g., "genus", "species").
Returns:
str: The constructed field name for the taxonomic level's ID.
"""

Copilot uses AI. Check for mistakes.

species_tree, taxon_name_map, taxon_rank_map
# Generate the tree, filling NA in the assemblies dataframe to enable grouping for absent values
return get_species_subtree(
"root", "1", "NA", assemblies_df.fillna(""), taxonomic_levels
Copy link
Preview

Copilot AI May 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment refers to 'NA' but the code uses fillna(""). Either update the comment to match the actual placeholder or switch to fillna("NA") for clarity.

Suggested change
"root", "1", "NA", assemblies_df.fillna(""), taxonomic_levels
"root", "1", "NA", assemblies_df.fillna("NA"), taxonomic_levels

Copilot uses AI. Check for mistakes.

Comment on lines +1235 to +1236
# Use the assemblies info from genomes_df to build the species tree
species_tree = get_species_tree(genomes_df, taxonomic_levels_for_tree)
Copy link
Preview

Copilot AI May 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The variable is named genomes_df but passed as assemblies_df into get_species_tree. Consider renaming one or the other for consistency.

Suggested change
# Use the assemblies info from genomes_df to build the species tree
species_tree = get_species_tree(genomes_df, taxonomic_levels_for_tree)
# Use the assemblies info from assemblies_df to build the species tree
species_tree = get_species_tree(assemblies_df, taxonomic_levels_for_tree)

Copilot uses AI. Check for mistakes.

@hunterckx hunterckx marked this pull request as ready for review May 25, 2025 19:47
@hunterckx hunterckx requested a review from d-callan May 25, 2025 19:47
@hunterckx
Copy link
Collaborator Author

Could you give this a review @d-callan? I did end up removing the NCBI call for building the tree; hopefully it's clear why I think this approach is more practical, but do let me know if there's some shortcoming I may have overlooked! Thanks

@hunterckx hunterckx force-pushed the hunter/522-fix-sunburst-counts branch from 727af5b to f7f4171 Compare May 25, 2025 19:49
Copy link
Collaborator

@d-callan d-callan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so ill admit i like this better than i thought i would. one thing that does make me nervous.. unless i misunderstood something, it seems like weve tied ourselves to the order of taxonomic levels being biologically relevant? i have mixed feelings about this: on the one hand that makes the logic here simple enough i trust it to build a valid tree, but i am not sure i trust anyone who might need to modify these levels later to know the order is important. at absolute min id say we need clear docs for this in the place the levels are actually defined.

genomes_output_path: Path to save output assemblies TSV at
ucsc_assemblies_url: URL to fetch UCSC's assembly information from
tree_output_path: Path to save tree of cataloged taxa to
taxonomic_levels_for_tree: List of lowercase taxonomic ranks, as identified by NCBI, to build tree from; listed from highest to lowest level
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we need a better description of this argument/ or probably a very obvious note where these levels are defined that says passing these in the wrong order will produce nonsense trees. also, to say highest to lowest level is a bit misleading, since things like domain and realm dont have that type of relationship. that feels a bit nitpicky, but given how easily i think someone could mess this up and how important itd be if they did, id favor more words over less.

@d-callan
Copy link
Collaborator

actually ive realized i have one further thought on this. now that the assembly counts are accurate, they are more confusing. on the top of the child taxa list to the right of the sunburst we show counts of assemblies, but within the list we show counts of leaves (which are close but not quite the same as the organisms count). probably we should remove the count of leaf nodes. it wont be reliably guaranteed to match any other number we show on the site i dont think, isnt showing a preview of how many things well get if we click some link, and generally doesnt seem like its adding a lot of value to me.

@d-callan
Copy link
Collaborator

sry. yet another 'one more thought'.. does it make sense to try to add something about the tree to our qc report? i was trying to think of a way to make sure we actually build the thing we say we will. unit tests dont really help unless they use the actual taxonomic levels that we will for realz, feel like theyd serve a bit of a different purpose. but maybe a bit in the qc report that checks for number of tiers in the tree, number of leaf nodes, etc or something?

@hunterckx
Copy link
Collaborator Author

@d-callan Alright, I've made some updates!

  • Added additional info to the docstring for taxonomic_levels_for_tree, and added a comment to the taxonomic levels list in build-files-from-ncbi.py

  • Opened a ticket (Resolve conflict between counts in sunburst side panel #564) to resolve the confusing counts, since I think that's outside the scope of this PR

  • Added QC checks for:

    • Whether the combined assembly counts of all species nodes match the total assembly count (I considered checking assembly counts of leaf nodes but realized that wouldn't work cleanly because even if a species node has children, its count might include assemblies that don't have a strain and so aren't represented among those children)
    • Lineages that don't have a species node (since the previous check assumes they do)
    • Nodes marked as having 0 assemblies

    (As far as your specific suggestions, I think checking number of tiers wouldn't necessarily work since some tiers might be skipped in a given lineage, and I'm not sure number of leaf nodes corresponds to anything convenient)

@d-callan
Copy link
Collaborator

Thanks. I said number tiers to try to catch cases where the levels were passed out of order, figuring whole levels would be missing say. But I guess what I really meant was maybe to check the actual levels/ranks claimed in the nodes and make sure all the levels desired actually exist somewhere in the total tree across all lineages.

@hunterckx
Copy link
Collaborator Author

@d-callan Hmm, makes sense -- added a check for whether all the specified ranks actually appear in the tree!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

Fix sunburst counts
2 participants