-
Notifications
You must be signed in to change notification settings - Fork 7
feat: generate taxa tree from assemblies and display accurate assembly count in sunburst (#522) #550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR switches the taxa tree source to the assemblies DataFrame for consistent catalog alignment and surface accurate assembly counts throughout the UI.
- Build scripts: generate the tree from
assemblies_df
, include per-nodeassembly_count
, and add taxonomic ID columns. - Front-end: update
TaxonomyNode
interface and pluralized link text to show counts. - Documentation/data: refresh QC report ordering and outbreak-taxonomy mapping TSV.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
File | Description |
---|---|
catalog/output/qc-report.md | Reordered species/strain taxonomy ID lists for consistency |
catalog/build/py/package/catalog_build/build.py | Replaced NCBI API calls with DataFrame-based subtree builder; added get_taxonomic_level_id_key , ID fields, and assembly_count |
catalog/build/intermediate/outbreak-taxonomy-mapping.tsv | Updated and re-sorted outbreak taxonomy ID entries |
app/components/Home/components/Section/components/SectionViz/data.ts | Added assembly_count to TaxonomyNode interface |
app/components/Home/components/Section/components/SectionViz/NodeDetails.tsx | Pluralized link text to show the number of assemblies per node |
Comments suppressed due to low confidence (2)
catalog/build/py/package/catalog_build/build.py:294
- Add unit tests for
get_species_subtree
to verify it builds the correct nested structure, respects ordering, and setsassembly_count
properly.
def get_species_subtree(
app/components/Home/components/Section/components/SectionViz/NodeDetails.tsx:126
- Add tests (unit or snapshot) to confirm the link text and pluralization display correctly for 0, 1, and multiple assemblies.
filterLinkText = `View ${assemblyCount} Assembl${assemblySuffix} for ${nodeName}`;
@@ -121,7 +121,9 @@ export const NodeDetails: React.FC<NodeDetailsProps> = ({ | |||
// Create the appropriate display text based on node type | |||
let filterLinkText = "Browse All Assemblies"; | |||
if (!isRoot) { | |||
filterLinkText = `View assemblies for ${nodeName}`; | |||
const assemblyCount = node.data.assembly_count; | |||
const assemblySuffix = assemblyCount > 1 ? "ies" : "y"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pluralization logic treats 0 as singular ("Assembly"). Change the condition to assemblyCount !== 1
so that 0, 2+ use the plural form.
const assemblySuffix = assemblyCount > 1 ? "ies" : "y"; | |
const assemblySuffix = assemblyCount !== 1 ? "ies" : "y"; |
Copilot uses AI. Check for mistakes.
@@ -159,26 +159,40 @@ def get_taxonomic_level_key(level): | |||
return f"taxonomicLevel{level[0].upper()}{level[1:]}" | |||
|
|||
|
|||
def get_taxonomic_level_id_key(level): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Add a docstring explaining that this helper constructs the field name for a taxonomic level's ID, improving function discoverability and readability.
def get_taxonomic_level_id_key(level): | |
def get_taxonomic_level_id_key(level): | |
""" | |
Constructs the field name for a taxonomic level's ID. | |
This helper function appends "Id" to the formatted taxonomic level key | |
(e.g., "taxonomicLevelGenus" becomes "taxonomicLevelGenusId"). | |
Args: | |
level (str): The taxonomic level (e.g., "genus", "species"). | |
Returns: | |
str: The constructed field name for the taxonomic level's ID. | |
""" |
Copilot uses AI. Check for mistakes.
species_tree, taxon_name_map, taxon_rank_map | ||
# Generate the tree, filling NA in the assemblies dataframe to enable grouping for absent values | ||
return get_species_subtree( | ||
"root", "1", "NA", assemblies_df.fillna(""), taxonomic_levels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment refers to 'NA' but the code uses fillna("")
. Either update the comment to match the actual placeholder or switch to fillna("NA")
for clarity.
"root", "1", "NA", assemblies_df.fillna(""), taxonomic_levels | |
"root", "1", "NA", assemblies_df.fillna("NA"), taxonomic_levels |
Copilot uses AI. Check for mistakes.
# Use the assemblies info from genomes_df to build the species tree | ||
species_tree = get_species_tree(genomes_df, taxonomic_levels_for_tree) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The variable is named genomes_df
but passed as assemblies_df
into get_species_tree
. Consider renaming one or the other for consistency.
# Use the assemblies info from genomes_df to build the species tree | |
species_tree = get_species_tree(genomes_df, taxonomic_levels_for_tree) | |
# Use the assemblies info from assemblies_df to build the species tree | |
species_tree = get_species_tree(assemblies_df, taxonomic_levels_for_tree) |
Copilot uses AI. Check for mistakes.
Could you give this a review @d-callan? I did end up removing the NCBI call for building the tree; hopefully it's clear why I think this approach is more practical, but do let me know if there's some shortcoming I may have overlooked! Thanks |
…nomic_levels_for_tree` (#522)
727af5b
to
f7f4171
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so ill admit i like this better than i thought i would. one thing that does make me nervous.. unless i misunderstood something, it seems like weve tied ourselves to the order of taxonomic levels being biologically relevant? i have mixed feelings about this: on the one hand that makes the logic here simple enough i trust it to build a valid tree, but i am not sure i trust anyone who might need to modify these levels later to know the order is important. at absolute min id say we need clear docs for this in the place the levels are actually defined.
genomes_output_path: Path to save output assemblies TSV at | ||
ucsc_assemblies_url: URL to fetch UCSC's assembly information from | ||
tree_output_path: Path to save tree of cataloged taxa to | ||
taxonomic_levels_for_tree: List of lowercase taxonomic ranks, as identified by NCBI, to build tree from; listed from highest to lowest level |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we need a better description of this argument/ or probably a very obvious note where these levels are defined that says passing these in the wrong order will produce nonsense trees. also, to say highest to lowest level is a bit misleading, since things like domain and realm dont have that type of relationship. that feels a bit nitpicky, but given how easily i think someone could mess this up and how important itd be if they did, id favor more words over less.
actually ive realized i have one further thought on this. now that the assembly counts are accurate, they are more confusing. on the top of the child taxa list to the right of the sunburst we show counts of assemblies, but within the list we show counts of leaves (which are close but not quite the same as the organisms count). probably we should remove the count of leaf nodes. it wont be reliably guaranteed to match any other number we show on the site i dont think, isnt showing a preview of how many things well get if we click some link, and generally doesnt seem like its adding a lot of value to me. |
sry. yet another 'one more thought'.. does it make sense to try to add something about the tree to our qc report? i was trying to think of a way to make sure we actually build the thing we say we will. unit tests dont really help unless they use the actual taxonomic levels that we will for realz, feel like theyd serve a bit of a different purpose. but maybe a bit in the qc report that checks for number of tiers in the tree, number of leaf nodes, etc or something? |
@d-callan Alright, I've made some updates!
|
Thanks. I said number tiers to try to catch cases where the levels were passed out of order, figuring whole levels would be missing say. But I guess what I really meant was maybe to check the actual levels/ranks claimed in the nodes and make sure all the levels desired actually exist somewhere in the total tree across all lineages. |
@d-callan Hmm, makes sense -- added a check for whether all the specified ranks actually appear in the tree! |
Closes #522
Updates the taxa tree to be built from the assemblies dataframe rather than the NCBI tree API; this makes it straightforward for the tree data to accurately reflect the catalog data in general (notably, there was previously some disconnect between which taxa appear in the tree and which appear in the catalog), as well as making assembly counts in particular straightforward to associate with tree nodes. Additionally, dataframe grouping makes the tree structure simple to derive.