Commit dc3cd4b

Merge pull request #36 from nextstrain/use-remote-nextclade-dataset
Assign genotypes using Nextclade dataset and visualize on tree
2 parents 8d5e4ae + 31a3fdf commit dc3cd4b

12 files changed

Lines changed: 3964 additions & 3807 deletions

CHANGES.md

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 # CHANGELOG
+* 7 June 2024: Assign genotypes using Nextclade dataset and visualize on tree [PR #36](https://github.com/nextstrain/measles/pull/36)
 * 9 May 2024: Create a N450 tree that can be used as part of a Nextclade dataset to assign genotypes to measles samples based on criteria outlined by the WHO [PR #28](https://github.com/nextstrain/measles/pull/28)
 * 25 April 2024: Add specific sequences and metadata to the measles trees, including WHO reference sequences, vaccine strains, and genotypes reported on NCBI [PR #26](https://github.com/nextstrain/measles/pull/26)
 * 10 April 2024: Add a single GH Action workflow to automate the ingest and phylogenetic workflows [PR #22](https://github.com/nextstrain/measles/pull/22)

ingest/Snakefile

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ rule all:
 # by build specific rules.
 include: "rules/fetch_from_ncbi.smk"
 include: "rules/curate.smk"
+include: "rules/nextclade.smk"
 
 
 # Allow users to import custom rules provided via the config.

ingest/build-configs/nextstrain-automation/config.yaml

Lines changed: 3 additions & 0 deletions
@@ -19,3 +19,6 @@ files_to_upload:
   ncbi.ndjson.zst: data/ncbi.ndjson
   metadata.tsv.zst: results/metadata.tsv
   sequences.fasta.zst: results/sequences.fasta
+  nextclade.tsv.zst: results/nextclade.tsv
+  alignment.fasta.zst: results/alignment.fasta
+  translations.zip: results/translations.zip

ingest/defaults/config.yaml

Lines changed: 4 additions & 0 deletions
@@ -122,3 +122,7 @@ curate:
     'is_reference'
   ]
   genotype_field: "virus_name"
+nextclade:
+  dataset_name: "nextstrain/measles/N450/WHO-2012"
+  field_map: "defaults/nextclade_field_map.tsv"
+  id_field: "seqName"
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+# TSV file that is a mapping of column names for Nextclade output TSV
+# The first column should be the original column name of the Nextclade TSV
+# The second column should be the new column name to use in the final metadata TSV
+# Nextclade can have pathogen specific output columns so make sure to check which
+# columns would be useful for your downstream phylogenetic analysis.
+seqName seqName
+clade clade
+coverage coverage
+totalMissing missing_data
+totalSubstitutions divergence
+totalNonACGTNs nonACGTN
+qc.overallStatus QC_overall
+qc.missingData.status QC_missing_data
+qc.mixedSites.status QC_mixed_sites
+qc.privateMutations.status QC_rare_mutations
+qc.snpClusters.status QC_snp_clusters
+qc.frameShifts.status QC_frame_shifts
+qc.stopCodons.status QC_stop_codons
+frameShifts frame_shifts
+privateNucMutations.reversionSubstitutions private_reversion_substitutions
+privateNucMutations.labeledSubstitutions private_labeled_substitutions
+privateNucMutations.unlabeledSubstitutions private_unlabeled_substitutions
+privateNucMutations.totalReversionSubstitutions private_total_reversion_substitutions
+privateNucMutations.totalLabeledSubstitutions private_total_labeled_substitutions
+privateNucMutations.totalUnlabeledSubstitutions private_total_unlabeled_substitutions
+privateNucMutations.totalPrivateSubstitutions private_total_private_substitutions
+qc.snpClusters.clusteredSNPs private_snp_clusters
+qc.snpClusters.totalSNPs private_total_snp_clusters
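
The field map is read by the `join_metadata_and_nextclade` rule in `ingest/rules/nextclade.smk`, which skips the `#` comment lines and takes the first column as the list of Nextclade columns to keep. A minimal Python sketch of that parsing step, assuming whitespace-delimited columns; `read_field_map` and the sample rows are hypothetical and not part of the workflow:

```python
import io


def read_field_map(handle):
    """Return {nextclade_column: metadata_column}, skipping '#' comment lines."""
    field_map = {}
    for line in handle:
        # Ignore comments and blank lines, as `grep -v '^#'` does for comments
        if line.startswith("#") or not line.strip():
            continue
        old_name, new_name = line.split()[:2]
        field_map[old_name] = new_name
    return field_map


# Hypothetical three-row excerpt of the field map
example = io.StringIO(
    "# comment line\n"
    "seqName\tseqName\n"
    "clade\tclade\n"
    "totalMissing\tmissing_data\n"
)
field_map = read_field_map(example)

# Comma-separated list of original column names, in file order
subset_fields = ",".join(field_map)
```

Here `subset_fields` plays the role of the `SUBSET_FIELDS` shell variable, and the dictionary values are the new column names used in the final metadata TSV.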

ingest/rules/curate.smk

Lines changed: 1 addition & 1 deletion
@@ -122,7 +122,7 @@ rule subset_metadata:
     input:
         metadata="data/all_metadata.tsv",
     output:
-        subset_metadata="results/metadata.tsv",
+        subset_metadata="data/subset_metadata.tsv",
     params:
         metadata_fields=",".join(config["curate"]["metadata_columns"]),
     shell:

ingest/rules/nextclade.smk

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+"""
+This part of the workflow handles running Nextclade on the curated metadata
+and sequences.
+
+See Nextclade docs for more details on usage, inputs, and outputs if you would
+like to customize the rules
+"""
+DATASET_NAME = config["nextclade"]["dataset_name"]
+
+
+rule get_nextclade_dataset:
+    """Download Nextclade dataset"""
+    output:
+        dataset=f"data/nextclade_data/{DATASET_NAME}.zip",
+    params:
+        dataset_name=DATASET_NAME
+    shell:
+        """
+        nextclade3 dataset get \
+            --name={params.dataset_name:q} \
+            --output-zip={output.dataset} \
+            --verbose
+        """
+
+
+rule run_nextclade:
+    input:
+        dataset=f"data/nextclade_data/{DATASET_NAME}.zip",
+        sequences="results/sequences.fasta",
+    output:
+        nextclade="results/nextclade.tsv",
+        alignment="results/alignment.fasta",
+        translations="results/translations.zip",
+    params:
+        translations=lambda w: "results/translations/{cds}.fasta",
+    shell:
+        """
+        nextclade3 run \
+            {input.sequences} \
+            --input-dataset {input.dataset} \
+            --output-tsv {output.nextclade} \
+            --output-fasta {output.alignment} \
+            --output-translations {params.translations}
+
+        zip -rj {output.translations} results/translations
+        """
+
+
+rule join_metadata_and_nextclade:
+    input:
+        nextclade="results/nextclade.tsv",
+        metadata="data/subset_metadata.tsv",
+        nextclade_field_map=config["nextclade"]["field_map"],
+    output:
+        metadata="results/metadata.tsv",
+    params:
+        metadata_id_field=config["curate"]["output_id_field"],
+        nextclade_id_field=config["nextclade"]["id_field"],
+    shell:
+        """
+        export SUBSET_FIELDS=`grep -v '^#' {input.nextclade_field_map} | awk '{{print $1}}' | tr '\n' ',' | sed 's/,$//g'`
+
+        csvtk -tl cut -f $SUBSET_FIELDS \
+            {input.nextclade} \
+        | csvtk -tl rename2 \
+            -F \
+            -f '*' \
+            -p '(.+)' \
+            -r '{{kv}}' \
+            -k {input.nextclade_field_map} \
+        | tsv-join -H \
+            --filter-file - \
+            --key-fields {params.nextclade_id_field} \
+            --data-fields {params.metadata_id_field} \
+            --append-fields '*' \
+            --write-all ? \
+            {input.metadata} \
+        | tsv-select -H --exclude {params.nextclade_id_field} \
+            > {output.metadata}
+        """
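
The shell pipeline in `join_metadata_and_nextclade` selects the mapped Nextclade columns, renames them, and appends them to every metadata record, writing `?` for records that have no Nextclade result. The Python sketch below illustrates that join logic only; it is not the workflow's implementation. The function name, the `accession` ID column, and the sample records are assumptions made for illustration (the actual metadata ID field comes from `config["curate"]["output_id_field"]`, whose value is not shown in this diff).

```python
def join_metadata_and_nextclade(metadata_rows, nextclade_rows, field_map,
                                metadata_id, nextclade_id, fill="?"):
    """Left-join renamed Nextclade columns onto metadata rows."""
    # Renamed columns to append; the Nextclade ID column itself is dropped
    new_columns = [new for old, new in field_map.items() if old != nextclade_id]

    # Index the renamed Nextclade records by sequence ID
    by_id = {
        row[nextclade_id]: {
            new: row[old] for old, new in field_map.items() if old != nextclade_id
        }
        for row in nextclade_rows
    }

    # Metadata rows without a Nextclade match get the fill value in every new column
    unmatched = dict.fromkeys(new_columns, fill)
    return [{**row, **by_id.get(row[metadata_id], unmatched)} for row in metadata_rows]


# Hypothetical records for illustration
metadata = [
    {"accession": "A1", "country": "USA"},
    {"accession": "B2", "country": "Japan"},
]
nextclade = [{"seqName": "A1", "clade": "D8", "totalMissing": "0"}]
field_map = {"seqName": "seqName", "clade": "clade", "totalMissing": "missing_data"}

joined = join_metadata_and_nextclade(
    metadata, nextclade, field_map,
    metadata_id="accession", nextclade_id="seqName",
)
```

Every metadata record is kept in `joined`, so sequences that Nextclade did not report on still reach `results/metadata.tsv` with placeholder values rather than being dropped.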

phylogenetic/defaults/auspice_config.json

Lines changed: 12 additions & 5 deletions
@@ -17,18 +17,23 @@
       "type": "continuous"
     },
     {
-      "key": "country",
-      "title": "Country",
+      "key": "clade",
+      "title": "MeV Genotype (Nextstrain)",
       "type": "categorical"
     },
     {
       "key": "region",
       "title": "Region",
       "type": "categorical"
     },
+    {
+      "key": "country",
+      "title": "Country",
+      "type": "categorical"
+    },
     {
       "key": "genotype_ncbi",
-      "title": "Genotype (NCBI)",
+      "title": "MeV Genotype (GenBank metadata)",
       "type": "categorical"
     }
   ],
@@ -37,11 +42,13 @@
     "region"
   ],
   "display_defaults": {
-    "map_triplicate": true
+    "map_triplicate": true,
+    "color_by": "clade"
   },
   "filters": [
-    "country",
+    "clade",
     "region",
+    "country",
     "author"
   ],
   "metadata_columns": [
"metadata_columns": [

phylogenetic/defaults/auspice_config_N450.json

Lines changed: 12 additions & 5 deletions
@@ -17,18 +17,23 @@
       "type": "continuous"
     },
     {
-      "key": "country",
-      "title": "Country",
+      "key": "clade",
+      "title": "MeV Genotype (Nextstrain)",
       "type": "categorical"
     },
     {
       "key": "region",
       "title": "Region",
       "type": "categorical"
     },
+    {
+      "key": "country",
+      "title": "Country",
+      "type": "categorical"
+    },
     {
       "key": "genotype_ncbi",
-      "title": "Genotype (NCBI)",
+      "title": "MeV Genotype (GenBank metadata)",
       "type": "categorical"
     },
     {
@@ -42,11 +47,13 @@
     "region"
   ],
   "display_defaults": {
-    "map_triplicate": true
+    "map_triplicate": true,
+    "color_by": "clade"
   },
   "filters": [
-    "country",
+    "clade",
     "region",
+    "country",
     "author"
   ],
   "metadata_columns": [

phylogenetic/defaults/colors.tsv

Lines changed: 26 additions & 0 deletions
@@ -31,3 +31,29 @@ genotype_ncbi G2 #E67832
 genotype_ncbi G3 #E35F2D
 genotype_ncbi H1 #DF4328
 genotype_ncbi H2 #DB2823
+#
+# MeV Genotypes assigned by Nextclade
+clade A #5E1D9D
+clade B1 #4B26B1
+clade B2 #4138C3
+clade B3 #3F4FCC
+clade C1 #4065CF
+clade C2 #447ACD
+clade D1 #4A8BC3
+clade D2 #529AB6
+clade D3 #5BA6A6
+clade D4 #66AE95
+clade D5 #73B583
+clade D6 #81B973
+clade D7 #91BC64
+clade D8 #A1BE58
+clade D9 #B1BD4E
+clade D10 #C0BA47
+clade D11 #CEB541
+clade E #DAAD3D
+clade F #E19F3A
+clade G1 #E68E36
+clade G2 #E67832
+clade G3 #E35F2D
+clade H1 #DF4328
+clade H2 #DB2823
