11# msweep-assembly
22
3- mSWEEP genome assembly plugin code.
3+ mSWEEP binning + assembly plugin code.
44
55# Installation
6+ ## Dependencies
7+ To run the binning + assembly pipeline, you will need a program that
8+ does pseudoalignment and another program that estimates an assignment
9+ probability matrix for the reads to the alignment targets.
10+
11+ We recommend using [Themisto](https://github.com/jnalanko/themisto)
12+ (v0.1.1 or newer) for pseudoalignment and
13+ [mSWEEP](https://github.com/probic/msweep-assembly) (v1.3.2 or newer)
14+ for estimating the probability matrix.
15+
616## Compiling from source
717### Requirements
818- C++11 compliant compiler.
919- cmake
1020
1121### Compilation
12- Clone the repository (note the --recursive option in git clone)
22+ Clone the repository (note the *--recursive* option in git clone)
1323```
1424git clone --recursive https://github.com/PROBIC/msweep-assembly.git
1525```
@@ -20,42 +30,62 @@ enter the directory and run
2030> cmake ..
2131> make
2232```
23- This will compile the read_alignment, assign_reads, and build_sample executables in the build/bin/ directory.
24-
33+ This will compile the read_alignment, assign_reads, build_sample, and telescope executables in the build/bin/ directory.
2534
2635# Usage
27- Align paired-end reads 'reads_1.fastq.gz' and 'reads_2.fastq.gz' with [Themisto]()
36+ ## Indexing
37+ Build a [Themisto](https://github.com/jnalanko/themisto) index to
38+ align against.
2839```
29- pseudoalign --index-dir themisto_index --query-file reads_1.fastq.gz --outfile pseudoalignments_1.txt --rc --temp-dir tmp --n-threads 16 --mem-megas 8192
30- pseudoalign --index-dir themisto_index --query-file reads_2.fastq.gz --outfile pseudoalignments_2.txt --rc --temp-dir tmp --n-threads 16 --mem-megas 8192
40+ mkdir themisto_index
41+ mkdir themisto_index/tmp
42+ build_index --k 31 --input-file example.fasta --auto-colors --index-dir themisto_index --temp-dir themisto_index/tmp
3143```
3244
33- Convert the pseudoalignment to [kallisto]() format using [telescope]()
45+ Align paired-end reads 'reads_1.fastq.gz' and 'reads_2.fastq.gz' with Themisto
46+ ```
47+ pseudoalign --index-dir themisto_index --query-file reads_1.fastq.gz --outfile pseudoalignments_1.txt --rc --temp-dir themisto_index/tmp --n-threads 16 --mem-megas 8192
48+ pseudoalign --index-dir themisto_index --query-file reads_2.fastq.gz --outfile pseudoalignments_2.txt --rc --temp-dir themisto_index/tmp --n-threads 16 --mem-megas 8192
3449```
50+
51+ Convert the pseudoalignment to
52+ [kallisto](https://github.com/pachterlab/kallisto) format using
53+ [telescope](https://github.com/tmaklin/telescope) (supplied with the msweep-assembly installation).
54+ ```
55+ mkdir outfolder
56+
3557ntargets=$(sort themisto_index/coloring-names.txt | uniq | wc -l)
3658telescope --n-refs $ntargets -r pseudoalignments_1.txt,pseudoalignments_2.txt -o outfolder --mode intersection
3759```
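The `ntargets` line above counts the distinct lines in the coloring names file. As a minimal sketch of what that pipeline does, the snippet below runs it on a toy file; the file contents and the `workdir` temporary directory are hypothetical stand-ins, not output from a real Themisto index.

```shell
# Toy stand-in for themisto_index/coloring-names.txt (hypothetical contents).
workdir=$(mktemp -d)
printf 'ref_1\nref_2\nref_2\nref_3\n' > "$workdir/coloring-names.txt"

# Same pipeline as in the README: sort, drop duplicate lines, count what is left.
ntargets=$(sort "$workdir/coloring-names.txt" | uniq | wc -l)

# Some wc implementations pad the count with spaces; arithmetic expansion strips them.
ntargets=$((ntargets))
echo "$ntargets"
```

Here the duplicated `ref_2` line is counted once, so the value passed to `--n-refs` would be the number of distinct names rather than the number of lines.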
3860
39- Create a fake kallisto-style run_info.json file
61+ Create a fake kallisto-style run_info.json file using the
62+ Themisto_run_info.sh script in the root directory of this project
4063```
41- Themisto_run_info.sh $(wc -l outfolder_1.txt) $ntargets > outfolder/run_info.json
64+ Themisto_run_info.sh $(wc -l < pseudoalignments_1.txt) $ntargets > outfolder/run_info.json
4265```
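The changed line reads the file through `<` rather than passing it as an argument. A small sketch of the difference, using a toy pseudoalignment file with hypothetical contents (it assumes one line per read, which is not confirmed here):

```shell
# Toy stand-in for pseudoalignments_1.txt (hypothetical contents, one line per read).
workdir=$(mktemp -d)
printf '0 1 2\n1 0\n2\n' > "$workdir/pseudoalignments_1.txt"

# With a file argument, wc prints the count followed by the file name,
# so an unquoted $(...) would expand to two words.
with_name=$(wc -l "$workdir/pseudoalignments_1.txt")

# Reading from standard input yields only the count.
count_only=$(wc -l < "$workdir/pseudoalignments_1.txt")

echo "with file argument: $with_name"
echo "from stdin:         $count_only"
```

With the redirect, the script receives exactly two arguments: the line count and `$ntargets`.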
4366
4467Determine read assignments to equivalence classes from the kallisto
4568format files
4669```
47- read_alignment -e outfolder/outfolder.ec -s outfolder/read-to-ref.txt -o outfolder --write-ecs --themisto --n-refs $ntargets --gzip-output
70+ read_alignment -e outfolder/pseudoalignments.ec -s outfolder/read-to-ref.txt -o outfolder --write-ecs --themisto --n-refs $ntargets --gzip-output
4871```
4972
50- Estimate the relative abundances with mSWEEP
73+ Estimate the relative abundances with mSWEEP (reference_grouping.txt
74+ should contain the groups the sequences in 'example.fasta' are
75+ assigned to. See the [mSWEEP](https://github.com/probic/msweep-assembly) usage instructions for details).
5176```
5277mSWEEP -f outfolder -i reference_grouping.txt -o msweep-out --write-probs --gzip-probs
5378```
5479
55- Extract the names of the 3 most abundant reference groups
80+ (Optional) Extract the names of the 3 most abundant reference
81+ groups.
5682```
5783grep -v "^[#]" msweep-out_abundances.txt | sort -rgk2 | cut -f1 | head -n3 > most_abundant_groups.txt
5884```
85+ If you use a more refined method or already know which reference groups (as
86+ specified in the reference_grouping.txt file) you want to assemble,
87+ list their names in a .txt file instead, one group name
88+ per line.
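To illustrate both routes, here is a minimal sketch that runs the selection pipeline on a toy abundances file and also writes a hand-picked list. The group names, the `chosen_groups.txt` file name, and the two-column tab-separated layout with `#` header lines are assumptions made for the example; `LC_ALL=C` is added only so the decimal point is parsed the same way everywhere.

```shell
workdir=$(mktemp -d)

# Toy abundances file (hypothetical values): '#' header lines, then name<TAB>abundance.
printf '#toy header line\n#c_id\tmean_theta\ngroupA\t0.05\ngroupB\t0.60\ngroupC\t0.25\ngroupD\t0.10\n' \
    > "$workdir/msweep-out_abundances.txt"

# Drop header lines, sort by column 2 in descending numeric order,
# keep the group names, and take the first three.
grep -v "^[#]" "$workdir/msweep-out_abundances.txt" \
    | LC_ALL=C sort -rgk2 | cut -f1 | head -n3 \
    > "$workdir/most_abundant_groups.txt"

# Alternative: write the groups you already know you want, one name per line.
printf 'groupB\ngroupC\n' > "$workdir/chosen_groups.txt"

cat "$workdir/most_abundant_groups.txt"
```

On this toy input the selected groups are groupB, groupC, and groupD, in that order. Either file can then be used wherever the later steps read a list of group names.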
5989
6090Assign reads to the 3 most abundant reference groups based on the estimated probabilities
6191```
@@ -66,6 +96,9 @@ Construct the binned samples from the original files
6696
6797```
6898while read -r sample; do
69- build_sample -a outfolder/$sample\"\"_reads.txt.gz -o outfolder/$sample -1 reads_1.fastq.gz -2 reads_2.fastq.gz --gzip-output
99+ build_sample -a outfolder/$sample""_reads.txt.gz -o outfolder/$sample -1 reads_1.fastq.gz -2 reads_2.fastq.gz --gzip-output
70100done < most_abundant_groups.txt
71101```
102+ This will create the <group name>_1.fastq.gz and <group
103+ name>_2.fastq.gz files in the outfolder, which you can assemble with
104+ your assembler of choice.
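Before handing the binned reads to an assembler, it may help to confirm that both mate files exist for every group. The following is a rough sketch of such a check, not part of the plugin: the toy directory, the group names, and the `missing` counter are all hypothetical, and `touch` stands in for a finished build_sample run.

```shell
# Toy setup (hypothetical group names) so the check can run on its own.
workdir=$(mktemp -d)
mkdir "$workdir/outfolder"
printf 'groupB\ngroupC\n' > "$workdir/most_abundant_groups.txt"

# Pretend build_sample has finished for groupB only.
touch "$workdir/outfolder/groupB_1.fastq.gz" "$workdir/outfolder/groupB_2.fastq.gz"

# For each group, look for <group name>_1.fastq.gz and <group name>_2.fastq.gz.
missing=0
while read -r sample; do
    for mate in 1 2; do
        f="$workdir/outfolder/${sample}_${mate}.fastq.gz"
        if [ ! -e "$f" ]; then
            echo "missing: outfolder/${sample}_${mate}.fastq.gz"
            missing=$((missing + 1))
        fi
    done
done < "$workdir/most_abundant_groups.txt"

echo "$missing expected file(s) missing"
```

Writing the name as `${sample}_${mate}` keeps the shell from reading `$sample_` as a different variable, which is the same problem the empty quotes in the build_sample command work around.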