| 1 | +# Genomic epidemiology with mixed samples: a tutorial |
| 2 | +This tutorial contains instructions on how to reproduce the results |
| 3 | +of the three main synthetic experiments from *Genomic epidemiology |
| 4 | +with mixed samples*, Mäklin et al. 2020, in preparation. |
| 5 | + |
| 6 | +The tutorial will focus on reproducing the *Escherichia coli* |
| 7 | +experiment but contains instructions on how to adapt the scripts to |
| 8 | +the *Enterococcus faecalis* and *Staphylococcus aureus* experiments. |
| 9 | + |
| 10 | +For quick instructions on how to run the pipeline in a general |
| 11 | +setting, please refer to the README.md file in the root of this |
| 12 | +repository. |
| 13 | + |
| 14 | +## Requirements |
| 15 | +### mGEMS pipeline |
| 16 | +- [Themisto](https://github.com/algbio/Themisto) |
| 17 | +- [mSWEEP](https://github.com/probic/mSWEEP) |
| 18 | +- [mGEMS](https://github.com/probic/mGEMS) |
| 19 | +- [shovill](https://github.com/tseemann/shovill/) |
| 20 | + |
| 21 | +### Phylogenetic analysis |
| 22 | +- [snippy](https://github.com/tseemann/snippy/) |
| 23 | +- [RAxML-NG](https://github.com/amkozlov/raxml-ng) |
| 24 | + |
| 25 | +### Extra (macOS only) |
| 26 | +#### GNU coreutils |
| 27 | +On a macOS system, you'll also need to install GNU coreutils with |
| 28 | +Homebrew and alias the macOS zcat command to the GNU zcat command for |
| 29 | +the duration of the session: |
| 30 | +``` |
| 31 | +brew install coreutils |
| 32 | +alias zcat=gzcat |
| 34 | +``` |
| 35 | +#### Concurrent file connections limit |
| 36 | +macOS also limits the number of concurrently open files, and this |
| 37 | +limit has to be raised to run Themisto and shovill: |
| 38 | +``` |
| 39 | +ulimit -n 2048 |
| 40 | +``` |
| 41 | + |
| 42 | +## Tutorial |
| 43 | +### Table of Contents |
| 44 | + |
| 45 | +- [Select a species](#selectspecies) |
| 46 | +- [Reference data](#referencedata) |
| 47 | +- [Indexing](#indexing) |
| 48 | +- [Synthetic mixed samples](#mixedsamples) |
| 49 | +- [Pseudoalignment](#pseudoalignment) |
| 50 | +- [Abundance estimation](#estimation) |
| 51 | +- [Binning](#binning) |
| 52 | +- [Assembly](#assembly) |
| 53 | +- [SNP calling](#snpcalling) |
| 54 | +- [Phylogenetic inference](#phylogenetics) |
| 55 | + |
| 56 | + |
| 57 | +### <a name="selectspecies"></a>Select a species |
| 58 | +Download the supplementary table from the mGEMS manuscript, which |
| 59 | +contains the relevant information: |
| 60 | +``` |
| 61 | +wget https://zenodo.org/record/3724144/files/mGEMS_Supplementary_Table_mixed_samples.tsv |
| 62 | +``` |
| 63 | +Filter the table to contain only the *E. coli* (ecoli) experiments |
| 64 | +``` |
| 65 | +grep "ecoli" mGEMS_Supplementary_Table_mixed_samples.tsv > mixed_samples.tsv |
| 66 | +``` |
| 67 | +If you want to reproduce the *E. faecalis* experiments, change 'ecoli' |
| 68 | +to 'efaec'. For *S. aureus*, change 'ecoli' to 'saur'. Running these |
| 69 | +other two experiments may require resources beyond the typical laptop or |
| 70 | +desktop computer. |
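
The species choice above can be wrapped in a small helper so that the tag is set in one place. This is only a sketch: the function name `filter_species` is made up for this example, and the only thing taken from the tutorial is that the tags 'ecoli', 'efaec' and 'saur' identify the three experiments in the table.

```shell
# Sketch: filter the supplementary table by species tag.
# filter_species is a hypothetical helper, not part of the mGEMS repository.
filter_species() {
    species=$1
    table=$2
    case "$species" in
        # The three tags named in the tutorial text.
        ecoli|efaec|saur) grep "$species" "$table" ;;
        *) echo "unknown species tag: $species" >&2; return 1 ;;
    esac
}

# Usage (assumes the supplementary table has already been downloaded):
# filter_species ecoli mGEMS_Supplementary_Table_mixed_samples.tsv > mixed_samples.tsv
```

Rejecting unknown tags avoids silently producing an empty mixed_samples.tsv after a typo.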
| 71 | + |
| 72 | +### <a name="referencedata"></a>Reference data |
| 73 | +The reference data from Mäklin et al. is available from Zenodo: |
| 74 | +- [*E. coli*](https://zenodo.org/record/3724112) |
| 75 | +- [*E. faecalis*](https://zenodo.org/record/3724101) |
| 76 | +- [*S. aureus*](https://zenodo.org/record/3724135) |
| 77 | + |
| 78 | +Construction of the reference dataset(s) is described in more detail in |
| 79 | +Mäklin et al. 2020. |
| 80 | + |
| 81 | +Download and extract the *E. coli* dataset by running |
| 82 | +``` |
| 83 | +wget https://zenodo.org/record/3724112/files/mGEMS-ecoli-reference-v1.0.0.tar.gz |
| 84 | +tar -zxvf mGEMS-ecoli-reference-v1.0.0.tar.gz |
| 85 | +``` |
| 86 | + |
| 87 | +### <a name="indexing"></a>Indexing |
| 88 | +Create a *31*-mer pseudoalignment index with Themisto using two |
| 89 | +threads and a maximum of 8192 megabytes of RAM: |
| 90 | +``` |
| 91 | +mkdir mGEMS-ecoli-reference |
| 92 | +mkdir mGEMS-ecoli-reference/tmp |
| 93 | +build_index --k 31 --input-file mGEMS-ecoli-reference-sequences-v1.0.0.fasta.gz --auto-colors --index-dir mGEMS-ecoli-reference --temp-dir mGEMS-ecoli-reference/tmp --mem-megas 8192 --n-threads 2 |
| 94 | +``` |
| 95 | +Change 'ecoli' to 'efaec' or 'saur' if you are trying to reproduce the |
| 96 | +other experiments. |
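
Because several later commands repeat the species tag in file and directory names, it may help to derive those names from a single variable. The sketch below assumes that the *E. faecalis* and *S. aureus* archives follow the same naming scheme and version number as the *E. coli* files used in this tutorial, which should be checked against the Zenodo records. `reference_paths` is a hypothetical helper.

```shell
# Sketch: print the index directory, reference FASTA and grouping file
# names for a species tag. Assumes the other datasets mirror the E. coli
# file names (same layout and v1.0.0 suffix), which is not verified here.
reference_paths() {
    species=$1
    printf '%s\n' \
        "mGEMS-${species}-reference" \
        "mGEMS-${species}-reference-sequences-v1.0.0.fasta.gz" \
        "mGEMS-${species}-reference-grouping-v1.0.0.txt"
}

# Usage: reference_paths ecoli
```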
| 97 | + |
| 98 | +### <a name="mixedsamples"></a>Synthetic mixed samples |
| 99 | +Download the isolate sequencing data and create the synthetic mixed |
| 100 | +samples by concatenating the isolate files |
| 101 | +``` |
| 102 | +## Download the sequencing data and create the samples |
| 103 | +oldid="" |
| 104 | +while read line; do |
| 105 | + id=$(echo $line | cut -f3 -d' ') |
| 106 | + sample=$(echo $line | cut -f1 -d' ') |
| 107 | + scripts/get_forward.sh $sample | gunzip -c >> $id""_1.fastq |
| 108 | + scripts/get_reverse.sh $sample | gunzip -c >> $id""_2.fastq |
| 109 | + if [[ "$id" != "$oldid" ]]; then |
| 110 | + if [ -n "$oldid" ]; then |
| 111 | + gzip $oldid""_1.fastq |
| 112 | + gzip $oldid""_2.fastq |
| 113 | + fi |
| 114 | + fi |
| 115 | + oldid=$id |
| 116 | +done < mixed_samples.tsv |
| 117 | +gzip $oldid""_1.fastq |
| 118 | +gzip $oldid""_2.fastq |
| 119 | +``` |
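
The loop above pulls each column out with `echo $line | cut`. As an alternative sketch, `read` can split the columns directly. The column order (1: isolate accession, 2: cluster, 3: mixed sample id) is taken from the `cut` calls in this tutorial; the function name `list_rows` is hypothetical.

```shell
# Sketch: print "id sample cluster" for every row of the table.
# Fields are split on whitespace, as in the original echo | cut approach;
# any columns after the third are collected in "rest" and ignored.
list_rows() {
    while read -r sample cluster id rest; do
        printf '%s %s %s\n' "$id" "$sample" "$cluster"
    done < "$1"
}

# Usage: list_rows mixed_samples.tsv
```

Reading the fields in one step avoids spawning three subshells per row and keeps the column order visible in a single place.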
| 120 | + |
| 121 | +### <a name="pseudoalignment"></a>Pseudoalignment |
| 122 | +Pseudoalign the mixed sample files against the index using two threads: |
| 123 | +``` |
| 124 | +for f1 in *_1.fastq.gz; do |
| 125 | + f=${f1%_1.fastq.gz} |
| 126 | + f2=$f""_2.fastq.gz |
| 127 | + pseudoalign --query-file $f1 --outfile $f""_1.aln --index-dir mGEMS-ecoli-reference --temp-dir mGEMS-ecoli-reference/tmp --n-threads 2 --rc --sort-output --gzip-output |
| 128 | + pseudoalign --query-file $f2 --outfile $f""_2.aln --index-dir mGEMS-ecoli-reference --temp-dir mGEMS-ecoli-reference/tmp --n-threads 2 --rc --sort-output --gzip-output |
| 129 | +done |
| 130 | +``` |
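
The loop above relies on the shell's suffix-stripping expansion `${f1%_1.fastq.gz}` to derive the sample prefix and the name of the reverse-read file. A minimal sketch of that step in isolation, with hypothetical helper names `sample_prefix` and `mate_file`:

```shell
# Sketch: derive the sample prefix and the reverse-read file name from
# a forward-read file name ending in _1.fastq.gz.
sample_prefix() {
    # ${1%_1.fastq.gz} removes the shortest matching suffix from $1.
    printf '%s\n' "${1%_1.fastq.gz}"
}

mate_file() {
    printf '%s_2.fastq.gz\n' "$(sample_prefix "$1")"
}

# Usage: mate_file ecoli-1_1.fastq.gz
```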
| 131 | + |
| 132 | +### <a name="estimation"></a>Abundance estimation |
| 133 | +Estimate the relative abundances with mSWEEP and write the results and posterior |
| 134 | +probabilities using two threads |
| 135 | +``` |
| 136 | +for f1 in *_1.fastq.gz; do |
| 137 | + f=${f1%_1.fastq.gz} |
| 138 | + mkdir $f |
| 139 | + mSWEEP --themisto-1 $f""_1.aln.gz --themisto-2 $f""_2.aln.gz --themisto-index mGEMS-ecoli-reference -i mGEMS-ecoli-reference-grouping-v1.0.0.txt -o $f/$f --write-probs --gzip-probs -t 2 |
| 140 | +done |
| 141 | +``` |
| 142 | + |
| 143 | +### <a name="binning"></a>Binning |
| 144 | +Bin the reads with mGEMS and write the binned samples to a folder |
| 145 | +named after each mixed sample (for example 'ecoli-1'). |
| 146 | +``` |
| 147 | +while read line; do |
| 148 | + id=$(echo $line | cut -f3 -d' ') |
| 149 | + cluster=$(echo $line | cut -f2 -d' ') |
| 150 | + mGEMS --groups $cluster -r $id""_1.fastq.gz,$id""_2.fastq.gz --themisto-alns $id""_1.aln.gz,$id""_2.aln.gz -o $id --probs $id/$id""_probs.csv.gz -a $id/$id""_abundances.txt --index mGEMS-ecoli-reference |
| 151 | +done < mixed_samples.tsv |
| 152 | +``` |
| 153 | +Note that by default mGEMS creates bins for **all** reference lineages. If you |
| 154 | +know which lineages the samples originate from (in our case these are |
| 155 | +supplied in the second column of mixed_samples.tsv), the |
| 156 | +'--groups' option enables you to create only those bins. Multiple |
| 157 | +groups can be binned in a single run by supplying them as a |
| 158 | +comma-separated list. |
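
To use the comma-separated form, the clusters belonging to each mixed sample first have to be collected from the table. The sketch below does this with awk, assuming the column order used throughout this tutorial (column 2: cluster, column 3: mixed sample id). `groups_per_sample` is a hypothetical helper, and repeated clusters are not deduplicated.

```shell
# Sketch: print one line per mixed sample id, followed by a tab and the
# comma-separated list of its clusters, in order of first appearance.
groups_per_sample() {
    awk '{
        if ($3 in groups) {
            groups[$3] = groups[$3] "," $2
        } else {
            groups[$3] = $2
            order[++n] = $3
        }
    }
    END {
        for (i = 1; i <= n; i++) print order[i] "\t" groups[order[i]]
    }' "$1"
}

# Usage: groups_per_sample mixed_samples.tsv
```

Each output line then holds a sample id and a list that could be passed to '--groups' in one mGEMS call per sample instead of one call per row.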
| 159 | + |
| 160 | +### <a name="assembly"></a>Assembly |
| 161 | +Assemble the binned reads with shovill using 2 threads and a maximum of |
| 162 | +8192 megabytes (8 gigabytes) of RAM: |
| 163 | +``` |
| 164 | +while read line; do |
| 165 | + id=$(echo $line | cut -f3 -d' ') |
| 166 | + cluster=$(echo $line | cut -f2 -d' ') |
| 167 | + shovill --outdir $id/$cluster --R1 $id/$cluster""_1.fastq.gz --R2 $id/$cluster""_2.fastq.gz --cpus 2 --ram 8 |
| 168 | + mv $id/$cluster/contigs.fa $id/ |
| 169 | + rm -rf $id/$cluster |
| 170 | + mkdir $id/$cluster |
| 171 | + mv $id/contigs.fa $id/$cluster/ |
| 172 | +done < mixed_samples.tsv |
| 173 | +``` |
| 174 | + |
| 175 | +### <a name="snpcalling"></a>SNP calling |
| 176 | +Download the reference sequence 'NCTC13441' from the ENA |
| 177 | +``` |
| 178 | +wget -O NCTC13441.fasta.gz http://ftp.ebi.ac.uk/pub/databases/ena/wgs/public/uf/UFZF01.fasta.gz |
| 179 | +``` |
| 180 | +Call SNPs in the genome with snippy |
| 181 | +``` |
| 182 | +mkdir snippy-tmp |
| 183 | +gunzip NCTC13441.fasta.gz |
| 184 | +while read line; do |
| 185 | + id=$(echo $line | cut -f3 -d' ') |
| 186 | + sample=$(echo $line | cut -f1 -d' ') |
| 187 | + cluster=$(echo $line | cut -f2 -d' ') |
| 188 | + snippy --outdir $id/$cluster/$sample --ctgs $id/$cluster/contigs.fa --ref NCTC13441.fasta --cpus 2 --ram 8 --tmpdir snippy-tmp |
| 189 | +done < mixed_samples.tsv |
| 190 | +gzip NCTC13441.fasta |
| 191 | +``` |
| 192 | +Build the core SNP alignment with snippy |
| 193 | +``` |
| 194 | +snippys="" |
| 195 | +while read line; do |
| 196 | + id=$(echo $line | cut -f3 -d' ') |
| 197 | + sample=$(echo $line | cut -f1 -d' ') |
| 198 | + cluster=$(echo $line | cut -f2 -d' ') |
| 199 | + snippys=$snippys""$id/$cluster/$sample" " |
| 200 | + last=$id/$cluster/$sample |
| 201 | +done < mixed_samples.tsv |
| 202 | +snippy-core --ref $last/ref.fa $snippys |
| 203 | +``` |
| 204 | +The alignment will be stored in the 'core.full.aln' file. |
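
Before moving on to tree inference, it can be worth checking how many sequences ended up in the alignment. The sketch below counts header lines and assumes that 'core.full.aln' is in FASTA format; `count_fasta_records` is a hypothetical helper.

```shell
# Sketch: count the records in a FASTA file by counting lines that
# start with '>'. Assumes the alignment is written in FASTA format.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Usage: count_fasta_records core.full.aln
```

The count can be compared with the number of rows in mixed_samples.tsv, likely plus one entry for the reference (an assumption worth checking against the snippy documentation).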
| 205 | + |
| 206 | +### <a name="phylogenetics"></a>Phylogenetic inference |
| 207 | +Use RAxML-NG to infer a maximum likelihood phylogeny from 10 random |
| 208 | +and 10 parsimony starting trees under the GTR+G4 model: |
| 209 | +``` |
| 210 | +raxml-ng --search --msa core.full.aln --prefix CT --threads 2 --tree rand{10},pars{10} --model GTR+G4 |
| 211 | +``` |
| 212 | +Calculate bootstrap support values with 100 replicates |
| 213 | +``` |
| 214 | +raxml-ng --bootstrap --msa core.full.aln --bs-trees 100 --prefix CB --threads 2 --model GTR+G4 |
| 215 | +``` |
| 216 | +Perform a bootstrap convergence check |
| 217 | +``` |
| 218 | +raxml-ng --bsconverge --bs-trees CB.raxml.bootstraps --prefix CS --seed 2 --threads 2 |
| 219 | +``` |
| 220 | +Add the bootstrap support values to the maximum likelihood tree with |
| 221 | +the best likelihood |
| 222 | +``` |
| 223 | +raxml-ng --support --tree CT.raxml.bestTree --bs-trees CB.raxml.bootstraps --prefix CS --threads 2 |
| 224 | +``` |
| 225 | +The best tree with the bootstrap values will be written to the |
| 226 | +'CS.raxml.support' file. |