Commit 30f573e

Merge pull request #8 from PROBIC/docs
Documentation & input filter
2 parents b83450b + 3aec6d9 commit 30f573e

5 files changed: +305 -11 lines changed

README.md

Lines changed: 16 additions & 11 deletions
@@ -40,7 +40,12 @@ mGEMS bin and mGEMS extract, which bin the reads in the input
4040
pseudoalignment (mGEMS bin) and extract the binned reads from the
4141
original mixed samples (mGEMS extract).
4242

43-
### (Pseudo)tutorial — Full pipeline with Themisto and mSWEEP
43+
### Tutorial — E. coli ST131 sublineages
44+
A tutorial for reproducing the *E. coli* ST131 sublineage phylogenetic
45+
tree presented in Mäklin et al. 2020 using mGEMS is available in the
46+
[docs folder of this repository](docs/TUTORIAL.md).
47+
48+
### Quickstart — full pipeline
4449
Build a [Themisto](https://github.com/algbio/themisto) index to
4550
align against.
4651
```
@@ -49,23 +54,23 @@ mkdir themisto_index/tmp
4954
build_index --k 31 --input-file example.fasta --auto-colors --index-dir themisto_index --temp-dir themisto_index/tmp
5055
```
5156

52-
Align paired-end reads 'reads_1.fastq.gz' and 'reads_2.fastq.gz' with Themisto
57+
Align paired-end reads 'reads_1.fastq.gz' and 'reads_2.fastq.gz' with Themisto (note the **--sort-output** flag must be used!)
5358
```
54-
pseudoalign --index-dir themisto_index --query-file reads_1.fastq.gz --outfile pseudoalignments_1.aln --rc --temp-dir themisto_index/tmp --n-threads 16 --mem-megas 8192 --sort-output
55-
pseudoalign --index-dir themisto_index --query-file reads_2.fastq.gz --outfile pseudoalignments_2.aln --rc --temp-dir themisto_index/tmp --n-threads 16 --mem-megas 8192 --sort-output
59+
pseudoalign --index-dir themisto_index --query-file reads_1.fastq.gz --outfile pseudoalignments_1.aln --rc --temp-dir themisto_index/tmp --n-threads 16 --mem-megas 8192 --sort-output --gzip-output
60+
pseudoalign --index-dir themisto_index --query-file reads_2.fastq.gz --outfile pseudoalignments_2.aln --rc --temp-dir themisto_index/tmp --n-threads 16 --mem-megas 8192 --sort-output --gzip-output
5661
```
5762

5863
Estimate the relative abundances with mSWEEP (reference_grouping.txt
5964
should contain the groups the sequences in 'example.fasta' are
6065
assigned to. See the [mSWEEP](https://github.com/probic/msweep-assembly) usage instructions for details).
6166
```
62-
mSWEEP --themisto-1 pseudoalignments_1.aln --themisto-2 pseudoalignments_2.aln -o mSWEEP -i reference_grouping.txt --write-probs
67+
mSWEEP --themisto-1 pseudoalignments_1.aln.gz --themisto-2 pseudoalignments_2.aln.gz -o mSWEEP -i reference_grouping.txt --write-probs
6368
```
6469

6570
Bin the reads and write all bins to the 'mGEMS-out' folder
6671
```
6772
mkdir mGEMS-out
68-
mGEMS -r reads_1.fastq.gz,reads_2.fastq.gz --themisto-alns pseudoalignments_1.txt,pseudoalignments_2.txt -o mGEMS-out --probs mSWEEP_probs.csv -a mSWEEP_abundances.txt --index themisto_index
73+
mGEMS -r reads_1.fastq.gz,reads_2.fastq.gz --themisto-alns pseudoalignments_1.aln.gz,pseudoalignments_2.aln.gz -o mGEMS-out --probs mSWEEP_probs.csv -a mSWEEP_abundances.txt --index themisto_index
6974
```
7075
This will write the binned paired-end reads for *all groups* in the
7176
mSWEEP_abundances.txt file in the mGEMS-out folder (compressed with
@@ -75,13 +80,13 @@ zlib).
7580
... or bin and write only the reads that are assigned to "group-3" or
7681
"group-4" by adding the '--groups group-3,group-4' flag
7782
```
78-
mGEMS --groups group-3,group-4 -r reads_1.fastq.gz,reads_2.fastq.gz --themisto-alns pseudoalignments_1.txt,pseudoalignments_2.txt -o mGEMS-out --probs mSWEEP_probs.csv -a mSWEEP_abundances.txt --index themisto_index
83+
mGEMS --groups group-3,group-4 -r reads_1.fastq.gz,reads_2.fastq.gz --themisto-alns pseudoalignments_1.aln.gz,pseudoalignments_2.aln.gz -o mGEMS-out --probs mSWEEP_probs.csv -a mSWEEP_abundances.txt --index themisto_index
7984
```
8085

8186
Alternatively, find and write only the read bins for "group-3" and
8287
"group-4", skipping extracting the reads
8388
```
84-
mGEMS bin --groups group-3,group-4 --themisto-alns pseudoalignments_1.txt,pseudoalignments_2.txt -o mGEMS-out --probs mSWEEP_probs.csv -a mSWEEP_abundances.txt --index themisto_index
89+
mGEMS bin --groups group-3,group-4 --themisto-alns pseudoalignments_1.aln.gz,pseudoalignments_2.aln.gz -o mGEMS-out --probs mSWEEP_probs.csv -a mSWEEP_abundances.txt --index themisto_index
8590
```
8691

8792
... and extract the reads when feeling like it
@@ -102,9 +107,9 @@ mGEMS accepts the following input flags
102107
-a Relative abundance estimates from mSWEEP (tab-separated, 1st
103108
column has the group names and 2nd column the estimates).
104109
--index Themisto pseudoalignment index directory.
105-
--groups (Optional) which groups to extract from the input reads.
106-
--compress (Optional) Toggle compressing the output files
107-
(default: compress)
110+
--groups (Optional) Which groups to extract from the input reads.
111+
--min-abundance (Optional) Extract only groups that have a relative abundance higher than this value.
112+
--compress (Optional) Toggle compressing the output files (default: compress)
108113
```
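The effect of `--min-abundance` can be previewed before running mGEMS with a small shell sketch like the one below. It is not part of mGEMS: `groups_above` is a hypothetical helper, and it assumes the abundances file is tab-separated as described above, with any header lines starting with `#`. It keeps groups whose abundance is at or above the threshold, which matches the `FilterTargetGroups` function added in this commit (groups strictly below the value are dropped).

```shell
# Hypothetical helper, not an mGEMS command.
# $1: abundances file (column 1: group name, column 2: relative abundance)
# $2: minimum relative abundance
groups_above() {
    # Skip lines starting with '#' (assumed to be header lines) and
    # print the names of groups at or above the threshold.
    awk -F '\t' -v min="$2" '!/^#/ && ($2 + 0) >= (min + 0) { print $1 }' "$1"
}
```

For example, `groups_above mSWEEP_abundances.txt 0.05` would list the groups that a run with `--min-abundance 0.05` is expected to bin.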
109114

110115

docs/TUTORIAL.md

Lines changed: 226 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,226 @@
1+
# Genomic epidemiology with mixed samples: a tutorial
2+
This tutorial contains instructions on how to reproduce the results
3+
of the three main synthetic experiments from *Genomic epidemiology
4+
with mixed samples*, Mäklin et al. 2020, in preparation.
5+
6+
The tutorial will focus on reproducing the *Escherichia coli*
7+
experiment but contains instructions on how to adapt the scripts to
8+
the *Enterococcus faecalis* and *Staphylococcus aureus* experiments.
9+
10+
For quick instructions on how to run the pipeline in a general
11+
setting, please refer to the README.md file in the root of this
12+
repository.
13+
14+
## Requirements
15+
### mGEMS pipeline
16+
- [Themisto](https://github.com/algbio/Themisto)
17+
- [mSWEEP](https://github.com/probic/mSWEEP)
18+
- [mGEMS](https://github.com/probic/mGEMS)
19+
- [shovill](https://github.com/tseemann/shovill/)
20+
21+
### Phylogenetic analysis
22+
- [snippy](https://github.com/tseemann/snippy/)
23+
- [RAxML-NG](https://github.com/amkozlov/raxml-ng)
24+
25+
### Extra (macOS only)
26+
#### GNU coreutils
27+
On a macOS system, you'll also need to install GNU coreutils from
28+
homebrew and alias the macOS zcat command to the GNU zcat command for
29+
the duration of the session
30+
```
31+
brew install coreutils
32+
alias zcat=gzcat
33+
ulimit -n 2048
34+
```
35+
#### Concurrent file connections limit
36+
macOS also limits the number of concurrent file connections, which
37+
will have to be increased to run Themisto and shovill
38+
```
39+
ulimit -n 2048
40+
```
41+
42+
## Tutorial
43+
### Table of Contents
44+
45+
- [Select a species](#selectspecies)
46+
- [Reference data](#referencedata)
47+
- [Synthetic mixed samples](#mixedsamples)
48+
- [Indexing](#indexing)
49+
- [Pseudoalignment](#pseudoalignment)
50+
- [Abundance estimation](#estimation)
51+
- [Binning](#binning)
52+
- [Assembly](#assembly)
53+
- [SNP calling](#snpcalling)
54+
- [Phylogenetic inference](#phylogenetics)
55+
56+
57+
### <a name="selectspecies"></a>Select a species
58+
Download the supplementary table from the mGEMS manuscript which
59+
contains the relevant information
60+
```
61+
wget https://zenodo.org/record/3724144/files/mGEMS_Supplementary_Table_mixed_samples.tsv
62+
```
63+
Filter the table to contain only the *E. coli* (ecoli) experiments
64+
```
65+
grep "ecoli" mGEMS_Supplementary_Table_mixed_samples.tsv > mixed_samples.tsv
66+
```
67+
If you want to reproduce the *E. faecalis* experiments, change 'ecoli'
68+
to 'efaec'. For *S. aureus*, change 'ecoli' to 'saur'. Running these
69+
other two experiments may require resources beyond the typical laptop or
70+
desktop computer.
71+
72+
### <a name="referencedata"></a>Reference data
73+
The reference data from Mäklin et al. is available from Zenodo
74+
- [*E. coli*](https://zenodo.org/record/3724112)
75+
- [*E. faecalis*](https://zenodo.org/record/3724101)
76+
- [*S. aureus*](https://zenodo.org/record/3724135)
77+
78+
Construction of the reference dataset(s) is described in more detail in
79+
Mäklin et al. 2020.
80+
81+
Download and extract the *E. coli* dataset by running
82+
```
83+
wget https://zenodo.org/record/3724112/files/mGEMS-ecoli-reference-v1.0.0.tar.gz
84+
tar -zxvf mGEMS-ecoli-reference-v1.0.0.tar.gz
85+
```
86+
87+
### <a name="indexing"></a>Indexing
88+
Create a *31*-mer pseudoalignment index with Themisto using two
89+
threads and maximum 8192 megabytes of RAM.
90+
```
91+
mkdir mGEMS-ecoli-reference
92+
mkdir mGEMS-ecoli-reference/tmp
93+
build_index --k 31 --input-file mGEMS-ecoli-reference-sequences-v1.0.0.fasta.gz --auto-colors --index-dir mGEMS-ecoli-reference --temp-dir mGEMS-ecoli-reference/tmp --mem-megas 8192 --n-threads 2
94+
```
95+
Change 'ecoli' to 'efaec' or 'saur' if you are trying to reproduce the
96+
other experiments.
97+
98+
### <a name="mixedsamples"></a>Synthetic mixed samples
99+
Download the isolate sequencing data and create the synthetic mixed
100+
samples by concatenating the isolate files
101+
```
102+
## Download the sequencing data and create the samples
103+
oldid=""
104+
while read line; do
105+
id=$(echo $line | cut -f3 -d' ')
106+
sample=$(echo $line | cut -f1 -d' ')
107+
scripts/get_forward.sh $sample | gunzip -c >> $id""_1.fastq
108+
scripts/get_reverse.sh $sample | gunzip -c >> $id""_2.fastq
109+
if [[ "$id" != "$oldid" ]]; then
110+
if [ ! -z "$oldid" -a "$oldid" != "" ]; then
111+
gzip $oldid""_1.fastq
112+
gzip $oldid""_2.fastq
113+
fi
114+
fi
115+
oldid=$id
116+
done < mixed_samples.tsv
117+
gzip $oldid""_1.fastq
118+
gzip $oldid""_2.fastq
119+
```
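The loops in this tutorial rely on the column order of mixed_samples.tsv: the first column is used as the isolate accession, the second as the reference cluster, and the third as the mixed sample id. As a minimal sketch of that parsing step, the hypothetical helper `parse_row` below splits one row with the shell's `read` builtin instead of `echo $line | cut`. Both approaches assume the fields contain no spaces.

```shell
# Hypothetical helper: split one row of mixed_samples.tsv into its
# first three columns (accession, cluster, mixed sample id).
parse_row() {
    # With the default IFS, read splits on tabs and spaces alike, which
    # mirrors the effect of `echo $line | cut -f3 -d' '`.
    read -r sample cluster id _rest <<EOF
$1
EOF
    printf '%s %s %s\n' "$sample" "$cluster" "$id"
}
```

Calling `parse_row "$line"` inside the `while read line` loop would print the three fields separated by single spaces.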
120+
121+
### <a name="pseudoalignment"></a>Pseudoalignment
122+
Align the mixed sample files against the index using two threads
123+
```
124+
for f1 in *_1.fastq.gz; do
125+
f=${f1%_1.fastq.gz}
126+
f2=$f""_2.fastq.gz
127+
pseudoalign --query-file $f1 --outfile $f""_1.aln --index-dir mGEMS-ecoli-reference --temp-dir mGEMS-ecoli-reference/tmp --n-threads 2 --rc --sort-output --gzip-output
128+
pseudoalign --query-file $f2 --outfile $f""_2.aln --index-dir mGEMS-ecoli-reference --temp-dir mGEMS-ecoli-reference/tmp --n-threads 2 --rc --sort-output --gzip-output
129+
done
130+
```
131+
132+
### <a name="estimation"></a>Abundance estimation
133+
Estimate the relative abundances with mSWEEP and write the results and posterior
134+
probabilities using two threads
135+
```
136+
for f1 in *_1.fastq.gz; do
137+
f=${f1%_1.fastq.gz}
138+
mkdir $f
139+
mSWEEP --themisto-1 $f""_1.aln.gz --themisto-2 $f""_2.aln.gz --themisto-index mGEMS-ecoli-reference -i mGEMS-ecoli-reference-grouping-v1.0.0.txt -o $f/$f --write-probs --gzip-probs -t 2
140+
done
141+
```
142+
143+
### <a name="binning"></a>Binning
144+
Bin the reads with mGEMS and write the binned samples to the
145+
'ecoli-1' folder.
146+
```
147+
while read line; do
148+
id=$(echo $line | cut -f3 -d' ')
149+
cluster=$(echo $line | cut -f2 -d' ')
150+
mGEMS --groups $cluster -r $id""_1.fastq.gz,$id""_2.fastq.gz --themisto-alns $id""_1.aln.gz,$id""_2.aln.gz -o $id --probs $id/$id""_probs.csv.gz -a $id/$id""_abundances.txt --index mGEMS-ecoli-reference
151+
done < mixed_samples.tsv
152+
```
153+
Note that by default mGEMS creates bins for **all** reference lineages. If you know
154+
which lineages the samples originate from (in our case these are
155+
supplied in the mixed_samples.tsv in the second column), the
156+
'--groups' option enables you to only create those bins. Multiple
157+
groups can be binned in a single run by supplying them as a
158+
comma-separated list.
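As a sketch of how such a comma-separated list could be assembled, the hypothetical helper `groups_for_id` below collects the distinct clusters (second column) listed for one mixed sample id (third column) in mixed_samples.tsv. It is an illustration under the column layout used in this tutorial, not a script shipped with mGEMS.

```shell
# Hypothetical helper, not part of mGEMS.
# $1: path to mixed_samples.tsv, $2: mixed sample id (third column)
groups_for_id() {
    # Print each cluster once, in order of first appearance, joined by commas.
    awk -v id="$2" '
        $3 == id && !seen[$2]++ { printf "%s%s", sep, $2; sep = "," }
        END { print "" }
    ' "$1"
}
```

The output could then be passed on as `--groups "$(groups_for_id mixed_samples.tsv $id)"` so that every lineage of a sample is binned in one mGEMS run.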
159+
160+
### <a name="assembly"></a>Assembly
161+
Assemble the sequences with shovill using 2 threads and a maximum of
162+
8192 megabytes of RAM
163+
```
164+
while read line; do
165+
id=$(echo $line | cut -f3 -d' ')
166+
cluster=$(echo $line | cut -f2 -d' ')
167+
shovill --outdir $id/$cluster --R1 $id/$cluster""_1.fastq.gz --R2 $id/$cluster""_2.fastq.gz --cpus 2 --ram 8
168+
mv $id/$cluster/contigs.fa $id/
169+
rm -rf $id/$cluster
170+
mkdir $id/$cluster
171+
mv $id/contigs.fa $id/$cluster/
172+
done < mixed_samples.tsv
173+
```
174+
175+
### <a name="snpcalling"></a>SNP calling
176+
Download the reference sequence 'NCTC13441' from the ENA
177+
```
178+
wget -O NCTC13441.fasta.gz http://ftp.ebi.ac.uk/pub/databases/ena/wgs/public/uf/UFZF01.fasta.gz
179+
```
180+
Call SNPs in the genome with snippy
181+
```
182+
mkdir snippy-tmp
183+
gunzip NCTC13441.fasta.gz
184+
while read line; do
185+
id=$(echo $line | cut -f3 -d' ')
186+
sample=$(echo $line | cut -f1 -d' ')
187+
cluster=$(echo $line | cut -f2 -d' ')
188+
snippy --outdir $id/$cluster/$sample --ctgs $id/$cluster/contigs.fa --ref NCTC13441.fasta --cpus 2 --ram 8 --tmpdir snippy-tmp
189+
done < mixed_samples.tsv
190+
gzip NCTC13441.fasta
191+
```
192+
Build the core SNP alignment with snippy
193+
```
194+
snippys=""
195+
while read line; do
196+
id=$(echo $line | cut -f3 -d' ')
197+
sample=$(echo $line | cut -f1 -d' ')
198+
cluster=$(echo $line | cut -f2 -d' ')
199+
snippys=$snippys""$id/$cluster/$sample" "
200+
last=$id/$cluster/$sample
201+
done < mixed_samples.tsv
202+
snippy-core --ref $last/ref.fa $snippys
203+
```
204+
The alignment will be stored in the 'core.full.aln' file.
205+
206+
### <a name="phylogenetics"></a>Phylogenetic inference
207+
Use RAxML-NG to infer a maximum likelihood phylogeny from 10 random
208+
and 10 parsimony starting trees under the GTR+G4 model
209+
```
210+
raxml-ng --search --msa core.full.aln --prefix CT --threads 2 --tree rand{10},pars{10} --model GTR+G4
211+
```
212+
Calculate bootstrap support values with 100 replicates
213+
```
214+
raxml-ng --bootstrap --msa core.full.aln --bs-trees 100 --prefix CB --threads 2 --model GTR+G4
215+
```
216+
Perform a bootstrap convergence check
217+
```
218+
raxml-ng --bsconverge --bs-trees CB.raxml.bootstraps --prefix CS --seed 2 --threads 2
219+
```
220+
Add the bootstrap support values to the maximum likelihood tree with
221+
the best likelihood
222+
```
223+
raxml-ng --support --tree CT.raxml.bestTree --bs-trees CB.raxml.bootstraps --prefix CS --threads 2
224+
```
225+
The best tree with the bootstrap values will be written in the
226+
'CS.raxml.support' file.

docs/scripts/get_forward.sh

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,24 @@
1+
ftppath="ftp://ftp.sra.ebi.ac.uk/vol1/fastq"
2+
3+
strlen=$((${#1}))
4+
dir1="/"${1:0:6}
5+
6+
if [ ${strlen} -lt 10 ]
7+
then
8+
dir2="/"
9+
elif [ ${strlen} -lt 11 ]
10+
then
11+
dir2="/00"${1: -1}"/"
12+
elif [ ${strlen} -lt 12 ]
13+
then
14+
dir2="/0"${1: -2}"/"
15+
elif [ ${strlen} -lt 13 ]
16+
then
17+
dir2="/"${1: -3}"/"
18+
else
19+
echo "check accession number"
20+
fi
21+
22+
dlpath=$ftppath$dir1$dir2$1"/"
23+
24+
wget -qO- $dlpath""$1_1.fastq.gz
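The directory logic in this script can be summarised as a function. The sketch below is a hypothetical POSIX rewrite (`ena_fastq_dir` is not in the repository) that builds the same directory URL from an accession. One deliberate difference: it accepts only accessions of 9 to 12 characters and returns a non-zero status otherwise, whereas the script above treats anything shorter than 10 characters as the short form and continues to `wget` after printing the warning.

```shell
# Hypothetical helper: print the ENA FTP directory for a run accession.
ena_fastq_dir() {
    acc=$1
    prefix=$(printf '%s' "$acc" | cut -c1-6)
    case ${#acc} in
        9)  sub="" ;;                                             # no subdirectory
        10) sub="00$(printf '%s' "$acc" | cut -c10)/" ;;          # last digit, zero-padded
        11) sub="0$(printf '%s' "$acc" | cut -c10-11)/" ;;        # last two digits, zero-padded
        12) sub="$(printf '%s' "$acc" | cut -c10-12)/" ;;         # last three digits
        *)  echo "check accession number" >&2; return 1 ;;
    esac
    printf 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/%s/%s%s/\n' "$prefix" "$sub" "$acc"
}
```

The forward reads could then be fetched with `wget -qO- "$(ena_fastq_dir "$1")$1_1.fastq.gz"`, and the reverse reads by changing `_1` to `_2`.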

docs/scripts/get_reverse.sh

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,24 @@
1+
ftppath="ftp://ftp.sra.ebi.ac.uk/vol1/fastq"
2+
3+
strlen=$((${#1}))
4+
dir1="/"${1:0:6}
5+
6+
if [ ${strlen} -lt 10 ]
7+
then
8+
dir2="/"
9+
elif [ ${strlen} -lt 11 ]
10+
then
11+
dir2="/00"${1: -1}"/"
12+
elif [ ${strlen} -lt 12 ]
13+
then
14+
dir2="/0"${1: -2}"/"
15+
elif [ ${strlen} -lt 13 ]
16+
then
17+
dir2="/"${1: -3}"/"
18+
else
19+
echo "check accession number"
20+
fi
21+
22+
dlpath=$ftppath$dir1$dir2$1"/"
23+
24+
wget -qO- $dlpath""$1_2.fastq.gz

src/main.cpp

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,9 +39,11 @@ void ParseBin(int argc, char* argv[], cxxargs::Arguments &args) {
3939
args.add_long_argument<std::string>("probs", "Posterior probabilities from mSWEEP.");
4040
args.add_long_argument<std::string>("merge-mode", "How to merge paired-end alignments from Themisto (default: intersection).", "intersection");
4141
args.add_long_argument<std::vector<std::string>>("groups", "Which reference groups to bin reads to (default: all).");
42+
args.add_long_argument<long double>("min-abundance", "Bin only the groups that have a relative abundance higher than this value (optional).");
4243
args.add_short_argument<long double>('q', "Tuning parameter for the binning thresholds (default: 1.0).", (long double)1);
4344
args.add_long_argument<std::string>("index", "Themisto pseudoalignment index directory.");
4445
args.set_not_required("groups");
46+
args.set_not_required("min-abundance");
4547

4648
args.parse(argc, argv);
4749
}
@@ -91,6 +93,15 @@ void ReadAndExtract(cxxargs::Arguments &args) {
9193
Extract(bins, target_groups, args);
9294
}
9395

96+
void FilterTargetGroups(const std::vector<std::string> &group_names, const std::vector<long double> &abundances, const long double min_abundance, std::vector<std::string> *target_groups) {
97+
uint32_t n_groups = group_names.size();
98+
for (uint32_t i = 0; i < n_groups; ++i) {
99+
if (abundances[i] < min_abundance && std::find(target_groups->begin(), target_groups->end(), group_names[i]) != target_groups->end()) {
100+
target_groups->erase(std::find(target_groups->begin(), target_groups->end(), group_names[i]));
101+
}
102+
}
103+
}
104+
94105
void Bin(const cxxargs::Arguments &args, bool extract_bins) {
95106
DIR* dir = opendir(args.value<std::string>('o').c_str());
96107
if (dir) {
@@ -128,6 +139,10 @@ void Bin(const cxxargs::Arguments &args, bool extract_bins) {
128139
} else {
129140
target_groups = groups;
130141
}
142+
if (args.is_initialized("min-abundance")) {
143+
FilterTargetGroups(groups, abundances, args.value<long double>("min-abundance"), &target_groups);
144+
}
145+
131146
const std::vector<std::vector<uint32_t>> &bins = mGEMS::Bin(aln, args.value<long double>('q'), abundances, groups, probs_file.stream(), &target_groups);
132147
if (!extract_bins) {
133148
uint32_t n_bins = bins.size();
