update cxxargs and check & fix usage instructions

tmaklin · tmaklin · commit b488d1b223a9 · 2020-02-06T10:21:03.000+02:00
diff --git a/README.md b/README.md
@@ -1,15 +1,25 @@
 # msweep-assembly
 
-mSWEEP genome assembly plugin code.
+mSWEEP binning + assembly plugin code.
 
 # Installation
+## Dependencies
+To run the binning + assembly pipeline, you will need a program that
+does pseudoalignment and another program that estimates an assignment
+probability matrix for the reads to the alignment targets.
+
+We recommend using [Themisto](https://github.com/jnalanko/themisto)
+(v0.1.1 or newer) for pseudoalignment and
+[mSWEEP](https://github.com/probic/mSWEEP) (v1.3.2 or newer)
+for estimating the probability matrix.
+
 ## Compiling from source
 ### Requirements
 - C++11 compliant compiler.
 - cmake
 
 ### Compilation
-Clone the repository (note the --recursive option in git clone)
+Clone the repository (note the *--recursive* option in git clone)
 ```
 git clone --recursive https://github.com/PROBIC/msweep-assembly.git
 ```
@@ -20,42 +30,62 @@ enter the directory and run
 > cmake ..
 > make
 ```
-This will compile the read_alignment, assign_reads, and build_sample executables in the build/bin/ directory.
-
+This will compile the read_alignment, assign_reads, build_sample, and telescope executables in the build/bin/ directory.
 
 # Usage
-Align paired-end reads 'reads_1.fastq.gz' and 'reads_2.fastq.gz' with [Themisto]()
+## Indexing
+Build a [Themisto](https://github.com/jnalanko/themisto) index to
+align against.
 ```
-pseudoalign --index-dir themisto_index --query-file reads_1.fastq.gz --outfile pseudoalignments_1.txt --rc --temp-dir tmp --n-threads 16 --mem-megas 8192
-pseudoalign --index-dir themisto_index --query-file reads_2.fastq.gz --outfile pseudoalignments_2.txt --rc --temp-dir tmp --n-threads 16 --mem-megas 8192
+mkdir themisto_index
+mkdir themisto_index/tmp
+build_index --k 31 --input-file example.fasta --auto-colors --index-dir themisto_index --temp-dir themisto_index/tmp
 ```
 
-Convert the pseudoalignment to [kallisto]() format using [telescope]()
+Align paired-end reads 'reads_1.fastq.gz' and 'reads_2.fastq.gz' with Themisto
+```
+pseudoalign --index-dir themisto_index --query-file reads_1.fastq.gz --outfile pseudoalignments_1.txt --rc --temp-dir themisto_index/tmp --n-threads 16 --mem-megas 8192
+pseudoalign --index-dir themisto_index --query-file reads_2.fastq.gz --outfile pseudoalignments_2.txt --rc --temp-dir themisto_index/tmp --n-threads 16 --mem-megas 8192
 ```
+
+Convert the pseudoalignment to
+[kallisto](https://github.com/pachterlab/kallisto) format using
+[telescope](https://github.com/tmaklin/telescope) (supplied with the msweep-assembly installation).
+```
+mkdir outfolder
+
 ntargets=$(sort themisto_index/coloring-names.txt | uniq | wc -l)
 telescope --n-refs $ntargets -r pseudoalignments_1.txt,pseudoalignments_2.txt -o outfolder --mode intersection
 ```
 
-Create a fake kallisto-style run_info.json file
+Create a fake kallisto-style run_info.json file using the
+Themisto_run_info.sh script in the root directory of this project
 ```
-Themisto_run_info.sh $(wc -l outfolder_1.txt) $ntargets > outfolder/run_info.json
+Themisto_run_info.sh $(wc -l < pseudoalignments_1.txt) $ntargets > outfolder/run_info.json
 ```
 
 Determine read assignments to equivalence classes from the kallisto
 format files
 ```
-read_alignment -e outfolder/outfolder.ec -s outfolder/read-to-ref.txt -o outfolder --write-ecs --themisto --n-refs $ntargets --gzip-output
+read_alignment -e outfolder/pseudoalignments.ec -s outfolder/read-to-ref.txt -o outfolder --write-ecs --themisto --n-refs $ntargets --gzip-output
 ```
 
-Estimate the relative abundances with mSWEEP
+Estimate the relative abundances with mSWEEP (reference_grouping.txt
+should contain the groups the sequences in 'example.fasta' are
+assigned to. See the [mSWEEP](https://github.com/probic/mSWEEP) usage instructions for details).
 ```
 mSWEEP -f outfolder -i reference_grouping.txt -o msweep-out --write-probs --gzip-probs
 ```
 
-Extract the names of the 3 most abundant reference groups
+(Optional) Extract the names of the 3 most abundant reference
+groups.
 ```
 grep -v "^[#]" msweep-out_abundances.txt | sort -rgk2 | cut -f1 | head -n3 > most_abundant_groups.txt
 ```
+If you already know which reference groups (as named in the
+reference_grouping.txt file) you want to assemble, or you select
+them with a more refined method, list their names in a .txt file
+with one group name per line instead.
 
 Assign reads to the 3 most abundant reference groups based on the estimated probabilities
 ```
@@ -66,6 +96,9 @@ Construct the binned samples from the original files
 
 ```
 while read -r sample; do
-	build_sample -a outfolder/$sample\"\"_reads.txt.gz -o outfolder/$sample -1 reads_1.fastq.gz -2 reads_2.fastq.gz --gzip-output
+	build_sample -a outfolder/$sample""_reads.txt.gz -o outfolder/$sample -1 reads_1.fastq.gz -2 reads_2.fastq.gz --gzip-output
 done < most_abundant_groups.txt
 ```
+This will create the `<group name>_1.fastq.gz` and
+`<group name>_2.fastq.gz` files in the outfolder, which you can
+assemble with your assembler of choice.
diff --git a/Themisto_run_info.sh b/Themisto_run_info.sh
@@ -0,0 +1,9 @@
+echo "{
+	\"n_targets\": $2,
+	\"n_bootstraps\": 0,
+	\"n_processed\": $1,
+	\"kallisto_version\": \"0.43.1\",
+	\"index_version\": 10,
+	\"start_time\": \"Tue Nov  5 16:19:25 2019\",
+	\"call\": \"/proj/temaklin/kallisto/kallisto pseudo -i /wrk/users/temaklin/reference_msweep_preprint_all_removed -o /wrk/users/temaklin/splits/ERR434699 /wrk/users/temaklin/msweep_reads/reads/ERR434699_1.fastq.gz /wrk/users/temaklin/msweep_reads/reads/ERR434699_2.fastq.gz\"
+}"
diff --git a/external/cxxargs b/external/cxxargs
@@ -1 +1 @@
-Subproject commit a8f2b14a0d9e275152a4cd2621f9a9361b2b143d
+Subproject commit ef6a4f2eee07d3389baa59027052fb6c0b89270a
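
A note on the run_info.json step in the README hunk: the command changes from `$(wc -l outfolder_1.txt)` to `$(wc -l < pseudoalignments_1.txt)`. The sketch below illustrates why the redirect matters when the count is passed as a script argument. It assumes the pseudoalignment file holds one line per read; the three-line file and the temporary directory are made up for illustration and are not part of the project.

```shell
# Hypothetical stand-in for a pseudoalignment file (assumed: one line per read).
tmpdir=$(mktemp -d)
printf '0 1 2\n1 2\n2\n' > "$tmpdir/pseudoalignments_1.txt"

# With a file argument, wc prints the line count followed by the file name,
# so an unquoted $(...) would expand to two words instead of one number.
with_name=$(wc -l "$tmpdir/pseudoalignments_1.txt")

# Reading from stdin prints only the count. tr removes the padding that
# some wc implementations put in front of the number.
n_reads=$(wc -l < "$tmpdir/pseudoalignments_1.txt" | tr -d ' ')

echo "with file argument: $with_name"
echo "from stdin:         $n_reads"

rm -r "$tmpdir"
```

With the stdin form, the first argument received by Themisto_run_info.sh is a bare number, which is what the `n_processed` field expects.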