yhoogstrate · yhoogstrate · Feb 16, 2021 · Feb 16, 2021 · Feb 16, 2021 · Feb 16, 2021
diff --git a/.gitignore b/.gitignore
@@ -21,7 +21,6 @@ test.out
 truncated.*
 
 # Other / analysis
-/data
 /samples
 .Rhistory
 *.png
@@ -37,5 +36,6 @@ truncated.*
 .~lock.*
 /classification
 ignore
-/*.bam
+*.bam
 /*.sam
+*.ods
diff --git a/README.md b/README.md
@@ -1,11 +1,31 @@
 # Dr. Disco
 
-### Detecting exon-to-exon and genomic breakpoints in RNA-Seq data
+### Detection of fusion transcripts and their genomic breakpoints from RNA sequencing data
 
 [![Build Status](https://travis-ci.org/yhoogstrate/dr-disco.svg?branch=master)](https://travis-ci.org/yhoogstrate/dr-disco)
 
  - Free software: GNU General Public License v3 (GPLv3)
 
+
+----------
+
+## IMPORTANT NOTE
+
+### Dr. Disco was developed with STAR 2.4. STAR did not assign SA tags back then, which has been resolved over time. The arguments of later versions of STAR have also changed over time. We therefore provide the old instructions (STAR 2.4) and the latest instructions (STAR 2.7). The versions tested with are:
+
+ - STAR 2.4.2 [V:020201]
+ - STAR 2.7.7.a [Mon Dec 28 13:38:40 EST 2020 compiled]
+
+
+### I HAVE A SAM FILE AND FORGOT ABOUT THE STAR VERSION!?!?!?!
+
+```
+samtools view -H Aligned.out.sorted.bam | grep "PN:STAR" | grep -P "VN:[^\\s]+"
+```
+
+----------
+
+
 ## Introduction
 
 Fusion genes are often driver events in several types of cancer, generally detected with DNA-sequencing (DNA-seq) and more recently with RNA-sequencing (RNA-seq).
@@ -52,13 +72,14 @@ sudo pip install -r requirements.txt ;
 sudo python3 setup.py install --user
 ```
 
-### Pre-compiled
+### Docker
 
-Dr. Disco is also available at BioConda but this does not automatically ship with the blacklist files. Also, because of the changes in the built-system of bioconda, this version may be out of sync with the git repo.
+Dr. Disco combined with RNA-STAR 2.7.7a is available for docker at Docker hub:
 
-[![Bio-Conda installer](https://cdn.rawgit.com/yhoogstrate/dr-disco/master/share/bioconda-badge.svg)](share/bioconda-badge.svg)
+https://hub.docker.com/repository/docker/yhoogstrate/dr-disco
 
-### Usage
+
+## Usage
 
 A `dr-disco` pipeline typically consists of the following steps:
 
@@ -69,6 +90,7 @@ A `dr-disco` pipeline typically consists of the following steps:
 5. *Integrate* integrates results from the same genomic event and annotates gene names
 
 #### Check mate pair orentation first
+
 STAR does not seem to be able to appropriately determine Chimeric reads if the orientation of the mates is not default. Hence, before running STAR, make sure whether the strand orientation is `FR` (_R1: forward, _R2: _reverse). If this is not the case, you can use **fastx_reverse_complement** (http://hannonlab.cshl.edu/fastx_toolkit/) to convert to the appropriate strands.
 
  - For *forward*, *reverse* (default) you can proceed with the original _R1 and _R2
@@ -81,11 +103,36 @@ STAR does not seem to be able to appropriately determine Chimeric reads if the o
 The illumina NextSeq is claimed to be responsible for measuring poly-G sequences (https://www.biostars.org/p/294612/) which may result in vast amounts of discordant reads aligned to poly-G regions in the genome.
 It is recommended to clean such datasets and erase poly-G suffixes, for instance with the tool fastp.
 
-### Step 1: STAR
+### Step 0: Build reference index [2.4 + 2.7]
 
-Running RNA-STAR with fusion settings produces a file: ``<...>.Chimeric.out.sam``. This file contains discordant reads (split and spanning). It is recommended to run STAR with corresponding settings:
+```
+./STAR \
+    --runThreadN 7 \
+    --runMode genomeGenerate \
+    --genomeDir index/ \
+    --genomeFastaFiles ../GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
+    --sjdbGTFfile ../gencode.v34.primary_assembly.annotation.gtf
+```
+
+### Step 1: STAR [2.7]
+
+```
+nice ./STAR \
+    --runThreadN 8 \
+    --runMode genomeGenerate \
+    --genomeDir index/ \
+    --genomeFastaFiles ../GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
+    --sjdbGTFfile ../gencode.v34.primary_assembly.annotation.gtf
+```
+
+### Step 1: STAR [2.4]
+
+Running RNA-STAR with fusion settings produces a file: ``<...>.Chimeric.out.sam``.
+This file contains discordant reads (both split and spanning).
+It is recommended to run STAR with corresponding settings:
 
 ```
+mkdir -p '{$prefix}/'
 STAR --genomeDir ${star_index_dir} \  
      --readFilesIn ${left_fq_filenames} ${right_fq_filenames} \  
      --outFileNamePrefix {$prefix} \
@@ -105,11 +152,48 @@ STAR --genomeDir ${star_index_dir} \
      --twopassMode Basic
 ```
 
-Due to the `chimSegmentMin` and `chimJunctionOverhangMin` settings, STAR shall produce the additional output file(s). This file may need to be converted to BAM format, e.g. by running `samtools view -bhS Chimeric.out.sam > Chimeric.out.bam`. This generated `ChimericSorted.bam` can then be used as input file for *Dr. Disco*. Note that samtools has changed its commandline interface over time and that the stated command might be slighly different for different versions of samtools.
+Due to the `chimSegmentMin` and `chimJunctionOverhangMin` settings, STAR shall produce the additional output file(s).
+This file may need to be converted to BAM format, e.g. by running `samtools view -bhS Chimeric.out.sam > Chimeric.out.bam`.
+This generated `ChimericSorted.bam` can then be used as input file for *Dr. Disco*.
+Note that samtools has changed its commandline interface over time and that the stated command might be slighly different for different versions of samtools.
 
+### Step 1: STAR [2.7]
+
+```
+mkdir -p '{$prefix}/'
+STAR --genomeDir ${star_index_dir} \  
+	--outFileNamePrefix {$prefix} \
+	--runThreadN 7 \
+	--outSAMtype BAM SortedByCoordinate \
+	--outSAMstrandField intronMotif \
+	--outFilterIntronMotifs None \
+	--alignIntronMin 20 \
+	--alignIntronMax 200000 \
+	--alignMatesGapMax 200000 \
+	--alignSJoverhangMin 10 \
+	--alignSJDBoverhangMin 1 \
+	--chimSegmentMin 12 \
+	--chimJunctionOverhangMin 12 \
+	--chimOutType WithinBAM SeparateSAMold
+```
+
+Notice: `--chimOutType WithinBAM SeparateSAMold` **needed** in combination with `chimSegmentMin` and `chimJunctionOverhangMin`.
+
+This will output the chimeric reads to:
+
+ - Chimeric.out.sam [Similar to STAR 2.4]
+ - Aligned.out.sorted.bam [This file does not mark flag spanning reads as chimeric and can thus not be used]
+
+
 ### Step 2: dr-disco fix
 
-The first step of Dr. Disco is fixing the BAM file. Altohugh fixing implies it is broken, the alignment provided by STAR is incompatible with the IGV split view and misses the 'SA:' sam flag. In `dr-disco fix` this is solved and allows a user to view discordant reads in IGV in more detail and by using the spit-view. For the split view it may be convenient to color the reads by subtype (split or spanning, and how the direction of the break is), as the mate and strand are related and discordant mates are usually shifted a bit with respect to the breakpoint. In the fixing steps such annotations are added as read groups. Also, this step ensures proper indexing and sorting. If you have as input file `<...>.Chimeric.out.bam`, you can generate the fixed bam file with *Dr. Disco* as follows:
+The first step of Dr. Disco is fixing the BAM file.
+This step annotates `SA:i:` and `FI:i` tags and adds read groups used by Dr. Disco.
+Adding these small fixes allows a user to view discordant reads in IGV in more detail and by using the spit-view.
+**Tip:** For the split view it may be convenient to color the reads by subtype (split or spanning, and how the direction of the break is),
+as the mate and strand are related and discordant mates are usually shifted a bit with respect to the breakpoint.
+In the fixing steps, such annotations are added as read groups. Also, this step ensures proper indexing and sorting.
+If you have as input file `Chimeric.out.sam`, you can generate the fixed bam file with *Dr. Disco* as follows:
 
 ```
 Usage: dr-disco fix [OPTIONS] INPUT_ALIGNMENT_FILE OUTPUT_ALIGNMENT_FILE
@@ -119,11 +203,11 @@ Options:
   --help               Show this message and exit.
 ```
 
-Here the INPUT_ALIGNMENT_FILE will be `<...>.Chimeric.out.bam` and OUTPUT_ALIGNMENT_FILE will be `<...>.Chimeric.fixed.bam`.
+Here the INPUT_ALIGNMENT_FILE will be `Chimeric.out.sam` and OUTPUT_ALIGNMENT_FILE will be `My-Sample.Chimeric.fixed.bam`.
 Hence, you can generate the fixed bam-file with:
 
 ```
-dr-disco fix '<...>.Chimeric.out.bam' 'sample.fixed.bam'
+dr-disco fix 'Chimeric.out.sam' 'My-Sample.Chimeric.fixed.bam'
 ```
 
 ### Step 3: dr-disco detect
@@ -137,6 +221,7 @@ Options:
   -m, --min-e-score INTEGER  Minimal score to initiate pulling sub-graphs
                              (larger numbers boost performance but result in
                              suboptimal results) [default=8]
+  --help                     Show this message and exit.
 ```
 
 Here the `-m` argument controls the level of merging sub-graphs. If datasets become very large there shall be many subgraphs. However, because this is such a time consuming process, it can be desired to skip some of them. Rule of thumb: for every 10.000.000 reads, increase this value with 1. So for 10.000.000-20.000.000 mate pairs choose 1. So for a dataset of 75.000.000 mate pairs, proceed with:

diff --git a/bin/dr-disco b/bin/dr-disco
@@ -73,7 +73,6 @@ def CLI_subtract_chimeric_reads(temp_dir, input_alignment_file, output_alignment
     alignment_handle.convert(output_alignment_file, temp_dir)
 
 
-
 @CLI.command(name="unfix", short_help="Tries to restore the BAM file")
 @click.option("--temp-dir", "-t", type=click.Path(exists=True), default="/tmp", help="Path in which temp files are stored (default: /tmp)")
 @click.argument('input_alignment_file', type=click.Path(exists=True))

diff --git a/data/.gitignore b/data/.gitignore
@@ -0,0 +1,3 @@
+*.out
+*.tab
+*.junction