Ingest with nextclade #62

Open
joverlee521 wants to merge 5 commits into master from ingest-with-nextclade

@joverlee521

Contributor

Description of proposed changes

Runs Nextclade as part of the ingest workflow so that we get Nextclade clade annotations for all H5 HA sequences.
Uses the community/moncla-lab/iav-h5/ha/all-clades Nextclade dataset.
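For readers unfamiliar with the Nextclade CLI, the step described above roughly amounts to fetching the dataset and then running Nextclade on the HA sequences. The following is a minimal sketch, not the workflow's actual rules: the helper name `nextclade_commands` and the file paths (`dataset.zip`, `ha.fasta`, `nextclade.tsv`) are hypothetical, and only the dataset name comes from this PR.

```python
DATASET_NAME = "community/moncla-lab/iav-h5/ha/all-clades"  # dataset named in this PR


def nextclade_commands(sequences, dataset_zip="dataset.zip", output_tsv="nextclade.tsv"):
    """Return the argument lists for fetching the dataset and running Nextclade.

    All paths are hypothetical placeholders; the real workflow defines its own.
    """
    get_dataset = [
        "nextclade", "dataset", "get",
        "--name", DATASET_NAME,
        "--output-zip", dataset_zip,
    ]
    run = [
        "nextclade", "run",
        "--input-dataset", dataset_zip,
        "--output-tsv", output_tsv,
        sequences,
    ]
    return [get_dataset, run]


if __name__ == "__main__":
    import shlex

    for command in nextclade_commands("ha.fasta"):
        # Print rather than execute, so the sketch runs without Nextclade installed.
        print(shlex.join(command))
```

Each list could be passed to `subprocess.run(command, check=True)` on a machine where Nextclade is installed.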

Related issue(s)

Resolves #44

Checklist

  • Checks pass
  • Trial fauna ingest -> results at s3://nextstrain-data-private/files/workflows/avian-flu/trial/ingest-with-nextclade/metadata.tsv.zst
  • Trial NCBI ingest -> results at s3://nextstrain-data/files/workflows/avian-flu/trial/ingest-with-nextclade/h5n1/metadata.tsv.zst

joverlee521 added a commit that referenced this pull request Jun 24, 2024
Motivated by my own need to test the ingest workflows for the latest
addition of Nextclade outputs in #62.

joverlee521 force-pushed the ingest-with-nextclade branch from 58df478 to f33157f on June 24, 2024 20:20

Using `community/moncla-lab/iav-h5/ha/all-clades` as the default
Nextclade dataset since it works across fauna and NCBI data.

Subsequent commits will join these rules with the full ingest
workflows.

Using the nextclade_field_map that's currently used in the measles
ingest workflow.¹ We can cut down on the columns used if they are not
useful for avian flu.

¹ <https://github.com/nextstrain/measles/blob/957fc744c64b8f5a722b5c525687d0746755add6/ingest/defaults/nextclade_field_map.tsv>

We are not using the alignment.fasta anywhere and I don't think
it makes sense to only upload alignment for the HA segment.

Keep a copy of the full Nextclade TSV output from ingest on S3
since we won't necessarily join all columns with the metadata output.
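
To illustrate what "not joining all columns" could look like, here is a minimal sketch of a left join that copies only selected Nextclade columns onto the metadata. It is an assumption-laden illustration, not the workflow's implementation: the helper `join_nextclade`, the metadata key `strain`, and the toy rows are all hypothetical, and the PR does not state which tool or key the real workflow joins with.

```python
import csv
import io


def join_nextclade(metadata_rows, nextclade_rows, metadata_key, columns):
    """Left-join selected Nextclade columns onto metadata rows.

    Metadata rows without a Nextclade result get empty strings.
    """
    by_name = {row["seqName"]: row for row in nextclade_rows}
    joined = []
    for row in metadata_rows:
        match = by_name.get(row[metadata_key], {})
        joined.append({**row, **{col: match.get(col, "") for col in columns}})
    return joined


# Toy inputs (made up for illustration only).
metadata = list(csv.DictReader(
    io.StringIO("strain\tdate\nsample1\t2024-01-01\nsample2\t2024-02-01\n"),
    delimiter="\t",
))
nextclade = list(csv.DictReader(
    io.StringIO("seqName\tclade\tcoverage\nsample1\t2.3.4.4b\t0.99\n"),
    delimiter="\t",
))

# Only the clade column is carried over; coverage stays in the full TSV.
joined = join_nextclade(metadata, nextclade, metadata_key="strain", columns=["clade"])
```

Because the join drops every unlisted column, the full Nextclade TSV remains the only place to recover them, which is the motivation given in the commit message above.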

joverlee521 force-pushed the ingest-with-nextclade branch from f33157f to 8adda40 on June 24, 2024 20:22

Comment on lines +8 to +28
coverage coverage
totalMissing missing_data
totalSubstitutions divergence
totalNonACGTNs nonACGTN
qc.overallStatus QC_overall
qc.missingData.status QC_missing_data
qc.mixedSites.status QC_mixed_sites
qc.privateMutations.status QC_rare_mutations
qc.snpClusters.status QC_snp_clusters
qc.frameShifts.status QC_frame_shifts
qc.stopCodons.status QC_stop_codons
frameShifts frame_shifts
privateNucMutations.reversionSubstitutions private_reversion_substitutions
privateNucMutations.labeledSubstitutions private_labeled_substitutions
privateNucMutations.unlabeledSubstitutions private_unlabeled_substitutions
privateNucMutations.totalReversionSubstitutions private_total_reversion_substitutions
privateNucMutations.totalLabeledSubstitutions private_total_labeled_substitutions
privateNucMutations.totalUnlabeledSubstitutions private_total_unlabeled_substitutions
privateNucMutations.totalPrivateSubstitutions private_total_private_substitutions
qc.snpClusters.clusteredSNPs private_snp_clusters
qc.snpClusters.totalSNPs private_total_snp_clusters

Contributor Author

Should we only keep the clade assignment and drop all of these other columns? These QC outputs are specific to the HA segment, so it might not make sense to keep them as part of the overall metadata.
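
As a rough illustration of the trade-off in this question, the sketch below shows how a trimmed field map would decide which Nextclade columns survive and what they are renamed to. It assumes the field map is a two-column, tab-separated file where `#` lines are comments; the function names and the toy Nextclade row are hypothetical and not taken from the workflow.

```python
import csv
import io


def read_field_map(handle):
    """Parse a two-column, tab-separated field map, skipping blanks and '#' comments."""
    field_map = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        nextclade_name, metadata_name = line.split("\t")
        field_map[nextclade_name] = metadata_name
    return field_map


def select_and_rename(nextclade_tsv, field_map):
    """Keep only the mapped Nextclade columns and rename them for the metadata."""
    reader = csv.DictReader(nextclade_tsv, delimiter="\t")
    return [
        {new: row[old] for old, new in field_map.items()}
        for row in reader
    ]


# Toy inputs: a trimmed field map and a made-up Nextclade row.
field_map = read_field_map(io.StringIO("# trimmed map\nseqName\tseqName\nclade\tclade\n"))
nextclade_tsv = io.StringIO("seqName\tclade\tcoverage\nsample1\t2.3.4.4b\t0.99\n")
rows = select_and_rename(nextclade_tsv, field_map)
```

With only `seqName` and `clade` in the map, the `coverage` column is dropped from `rows`; adding a line back to the map is all it takes to restore a column.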

joverlee521 requested a review from a team on June 24, 2024 20:25
joverlee521 requested a review from lmoncla on June 24, 2024 20:53

# Nextclade can have pathogen specific output columns so make sure to check which
# columns would be useful for your downstream phylogenetic analysis.
seqName seqName
clade clade

Contributor Author

Would it make sense to include the polybasic_cleavage_site output column or should the phylogenetic builds continue to rely on scripts/annotate-ha-cleavage-site.py?

Comment on lines +68 to +72
{
  "key": "clade",
  "title": "Nextclade Clade",
  "type": "categorical"
},

Contributor Author

I've added the Nextclade Clade as a separate coloring so we can do comparisons across clade labels, but maybe we'll remove h5_label_clade eventually? Would love to hear your thoughts here @lmoncla 🙏

Development

Successfully merging this pull request may close these issues.

ingest: Run Nextclade as part of ingest
