Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,6 @@ test.out
truncated.*

# Other / analysis
/data
/samples
.Rhistory
*.png
Expand All @@ -37,5 +36,6 @@ truncated.*
.~lock.*
/classification
ignore
/*.bam
*.bam
/*.sam
*.ods
107 changes: 96 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,31 @@
# Dr. Disco

### Detecting exon-to-exon and genomic breakpoints in RNA-Seq data
### Detection of fusion transcripts and their genomic breakpoints from RNA sequencing data

[![Build Status](https://travis-ci.org/yhoogstrate/dr-disco.svg?branch=master)](https://travis-ci.org/yhoogstrate/dr-disco)

- Free software: GNU General Public License v3 (GPLv3)


----------

## IMPORTANT NOTE

### Dr. Disco was developed with STAR 2.4. STAR did not assign SA tags back then, which has been resolved over time. The arguments of later versions of STAR have also changed over time. We therefore provide the old instructions (STAR 2.4) and the latest instructions (STAR 2.7). The versions tested with are:

- STAR 2.4.2 [V:020201]
- STAR 2.7.7.a [Mon Dec 28 13:38:40 EST 2020 compiled]


### I HAVE A SAM FILE AND FORGOT ABOUT THE STAR VERSION!?!?!?!

```
samtools view -H Aligned.out.sorted.bam | grep "PN:STAR" | grep -P "VN:[^\\s]+"
```

----------


## Introduction

Fusion genes are often driver events in several types of cancer, generally detected with DNA-sequencing (DNA-seq) and more recently with RNA-sequencing (RNA-seq).
Expand Down Expand Up @@ -52,13 +72,14 @@ sudo pip install -r requirements.txt ;
sudo python3 setup.py install --user
```

### Pre-compiled
### Docker

Dr. Disco is also available at BioConda but this does not automatically ship with the blacklist files. Also, because of the changes in the built-system of bioconda, this version may be out of sync with the git repo.
Dr. Disco combined with RNA-STAR 2.7.7a is available for docker at Docker hub:

[![Bio-Conda installer](https://cdn.rawgit.com/yhoogstrate/dr-disco/master/share/bioconda-badge.svg)](share/bioconda-badge.svg)
https://hub.docker.com/repository/docker/yhoogstrate/dr-disco

### Usage

## Usage

A `dr-disco` pipeline typically consists of the following steps:

Expand All @@ -69,6 +90,7 @@ A `dr-disco` pipeline typically consists of the following steps:
5. *Integrate* integrates results from the same genomic event and annotates gene names

#### Check mate pair orentation first

STAR does not seem to be able to appropriately determine Chimeric reads if the orientation of the mates is not default. Hence, before running STAR, make sure whether the strand orientation is `FR` (_R1: forward, _R2: _reverse). If this is not the case, you can use **fastx_reverse_complement** (http://hannonlab.cshl.edu/fastx_toolkit/) to convert to the appropriate strands.

- For *forward*, *reverse* (default) you can proceed with the original _R1 and _R2
Expand All @@ -81,11 +103,36 @@ STAR does not seem to be able to appropriately determine Chimeric reads if the o
The illumina NextSeq is claimed to be responsible for measuring poly-G sequences (https://www.biostars.org/p/294612/) which may result in vast amounts of discordant reads aligned to poly-G regions in the genome.
It is recommended to clean such datasets and erase poly-G suffixes, for instance with the tool fastp.

### Step 1: STAR
### Step 0: Build reference index [2.4 + 2.7]

Running RNA-STAR with fusion settings produces a file: ``<...>.Chimeric.out.sam``. This file contains discordant reads (split and spanning). It is recommended to run STAR with corresponding settings:
```
./STAR \
--runThreadN 7 \
--runMode genomeGenerate \
--genomeDir index/ \
--genomeFastaFiles ../GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
--sjdbGTFfile ../gencode.v34.primary_assembly.annotation.gtf
```

### Step 1: STAR [2.7]

```
nice ./STAR \
--runThreadN 8 \
--runMode genomeGenerate \
--genomeDir index/ \
--genomeFastaFiles ../GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
--sjdbGTFfile ../gencode.v34.primary_assembly.annotation.gtf
```

### Step 1: STAR [2.4]

Running RNA-STAR with fusion settings produces a file: ``<...>.Chimeric.out.sam``.
This file contains discordant reads (both split and spanning).
It is recommended to run STAR with corresponding settings:

```
mkdir -p '{$prefix}/'
STAR --genomeDir ${star_index_dir} \
--readFilesIn ${left_fq_filenames} ${right_fq_filenames} \
--outFileNamePrefix {$prefix} \
Expand All @@ -105,11 +152,48 @@ STAR --genomeDir ${star_index_dir} \
--twopassMode Basic
```

Due to the `chimSegmentMin` and `chimJunctionOverhangMin` settings, STAR shall produce the additional output file(s). This file may need to be converted to BAM format, e.g. by running `samtools view -bhS Chimeric.out.sam > Chimeric.out.bam`. This generated `ChimericSorted.bam` can then be used as input file for *Dr. Disco*. Note that samtools has changed its commandline interface over time and that the stated command might be slighly different for different versions of samtools.
Due to the `chimSegmentMin` and `chimJunctionOverhangMin` settings, STAR shall produce the additional output file(s).
This file may need to be converted to BAM format, e.g. by running `samtools view -bhS Chimeric.out.sam > Chimeric.out.bam`.
This generated `ChimericSorted.bam` can then be used as input file for *Dr. Disco*.
Note that samtools has changed its commandline interface over time and that the stated command might be slighly different for different versions of samtools.

### Step 1: STAR [2.7]

```
mkdir -p '{$prefix}/'
STAR --genomeDir ${star_index_dir} \
--outFileNamePrefix {$prefix} \
--runThreadN 7 \
--outSAMtype BAM SortedByCoordinate \
--outSAMstrandField intronMotif \
--outFilterIntronMotifs None \
--alignIntronMin 20 \
--alignIntronMax 200000 \
--alignMatesGapMax 200000 \
--alignSJoverhangMin 10 \
--alignSJDBoverhangMin 1 \
--chimSegmentMin 12 \
--chimJunctionOverhangMin 12 \
--chimOutType WithinBAM SeparateSAMold
```

Notice: `--chimOutType WithinBAM SeparateSAMold` **needed** in combination with `chimSegmentMin` and `chimJunctionOverhangMin`.

This will output the chimeric reads to:

- Chimeric.out.sam [Similar to STAR 2.4]
- Aligned.out.sorted.bam [This file does not mark flag spanning reads as chimeric and can thus not be used]


### Step 2: dr-disco fix

The first step of Dr. Disco is fixing the BAM file. Altohugh fixing implies it is broken, the alignment provided by STAR is incompatible with the IGV split view and misses the 'SA:' sam flag. In `dr-disco fix` this is solved and allows a user to view discordant reads in IGV in more detail and by using the spit-view. For the split view it may be convenient to color the reads by subtype (split or spanning, and how the direction of the break is), as the mate and strand are related and discordant mates are usually shifted a bit with respect to the breakpoint. In the fixing steps such annotations are added as read groups. Also, this step ensures proper indexing and sorting. If you have as input file `<...>.Chimeric.out.bam`, you can generate the fixed bam file with *Dr. Disco* as follows:
The first step of Dr. Disco is fixing the BAM file.
This step annotates `SA:i:` and `FI:i` tags and adds read groups used by Dr. Disco.
Adding these small fixes allows a user to view discordant reads in IGV in more detail and by using the spit-view.
**Tip:** For the split view it may be convenient to color the reads by subtype (split or spanning, and how the direction of the break is),
as the mate and strand are related and discordant mates are usually shifted a bit with respect to the breakpoint.
In the fixing steps, such annotations are added as read groups. Also, this step ensures proper indexing and sorting.
If you have as input file `Chimeric.out.sam`, you can generate the fixed bam file with *Dr. Disco* as follows:

```
Usage: dr-disco fix [OPTIONS] INPUT_ALIGNMENT_FILE OUTPUT_ALIGNMENT_FILE
Expand All @@ -119,11 +203,11 @@ Options:
--help Show this message and exit.
```

Here the INPUT_ALIGNMENT_FILE will be `<...>.Chimeric.out.bam` and OUTPUT_ALIGNMENT_FILE will be `<...>.Chimeric.fixed.bam`.
Here the INPUT_ALIGNMENT_FILE will be `Chimeric.out.sam` and OUTPUT_ALIGNMENT_FILE will be `My-Sample.Chimeric.fixed.bam`.
Hence, you can generate the fixed bam-file with:

```
dr-disco fix '<...>.Chimeric.out.bam' 'sample.fixed.bam'
dr-disco fix 'Chimeric.out.sam' 'My-Sample.Chimeric.fixed.bam'
```

### Step 3: dr-disco detect
Expand All @@ -137,6 +221,7 @@ Options:
-m, --min-e-score INTEGER Minimal score to initiate pulling sub-graphs
(larger numbers boost performance but result in
suboptimal results) [default=8]
--help Show this message and exit.
```

Here the `-m` argument controls the level of merging sub-graphs. If datasets become very large there shall be many subgraphs. However, because this is such a time consuming process, it can be desired to skip some of them. Rule of thumb: for every 10.000.000 reads, increase this value with 1. So for 10.000.000-20.000.000 mate pairs choose 1. So for a dataset of 75.000.000 mate pairs, proceed with:
Expand Down
1 change: 0 additions & 1 deletion bin/dr-disco
Original file line number Diff line number Diff line change
Expand Up @@ -73,7 +73,6 @@ def CLI_subtract_chimeric_reads(temp_dir, input_alignment_file, output_alignment
alignment_handle.convert(output_alignment_file, temp_dir)



@CLI.command(name="unfix", short_help="Tries to restore the BAM file")
@click.option("--temp-dir", "-t", type=click.Path(exists=True), default="/tmp", help="Path in which temp files are stored (default: /tmp)")
@click.argument('input_alignment_file', type=click.Path(exists=True))
Expand Down
3 changes: 3 additions & 0 deletions data/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
*.out
*.tab
*.junction
Loading