Bulk-RNA-Snakeline

Motivation

Rapid advances in sequencing technology have made sequencing more affordable and allow researchers to generate massive amounts of biological data from ever larger numbers of samples. This has created a growing demand for simple, efficient methods to process and analyze large datasets and turn them into meaningful, reproducible information. Workflow engines address this challenge by streamlining and automating processing tasks, which reduces the risk of user bias and of errors that arise from manual procedures.

To keep pace with the growing volume of input data and the constant evolution of processing software, the Bioinformatics Core team (BiCore) at the Allen Institute has begun transitioning to automated workflows. In particular, the team uses Snakemake, a workflow engine, for the quality assessment, trimming, and mapping of bulk RNA sequencing (RNA-Seq) data. With automated workflows, BiCore aims to improve the efficiency, consistency, and reproducibility of its analyses and, ultimately, the quality of its findings.

Quickstart Guide

Follow these steps to use the Bulk-RNA-Snakeline:

Download the repository (.zip), move it to your working directory, and unzip it

Create the Conda environment with all dependencies:
```
conda env create --name snakeline_env -f envs/Bulk-RNA-Snakeline.yml
```

Activate the Conda environment:
```
conda activate snakeline_env
```
Move the raw FASTQ files into the Bulk-RNA-Snakeline folder.
Prepare the pipeline by creating the directory structure:
```
python3 setup.py
```
Or, if a sample list such as sample_list.txt is supplied:
```
python3 setup.py -s <name_of_sample_file>
```
Adjust parameters in config.yml:
```
nano config/config.yml
```

Execute Snakemake to run the workflow:
```
snakemake --cores 160 -s <snakefile>
```
Or, on a Slurm cluster (optional), either launch it with srun:
```
srun --partition=celltypes --mem=60g --time=24:00:00 snakemake --cores 160 -s main.smk
```
or submit the provided batch script:
```
sbatch run.sh
```

Troubleshooting common errors:

A raised LockException:
```
rm .snakemake/locks/*
```
Directory cannot be locked:
```
snakemake -s main.smk --unlock
```

Incomplete run:
```
srun --partition=celltypes --mem=60g --time=24:00:00 snakemake --cores 160 -s main.smk --latency-wait 60 --rerun-incomplete
```
or submit the provided rerun script:
```
sbatch rerun.sh
```

Note: Runtime depends on the size of the data and the number of cores available; large datasets can take a long time.

Required Tools

FastQC 0.11.9 (A quality control tool for high-throughput sequence data)
Cutadapt 4.1 (Removes adapter sequences and low-quality bases from FASTQ reads)
STAR 2.7.1a (A splice-aware, ultrafast aligner of RNA-Seq reads to a reference genome)
StringTie 2.2.1 (A fast and highly efficient assembler of RNA-Seq alignments into potential transcripts)
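
Because the pipeline depends on all of these tools being available in the active environment, a quick check before a long run can save time. The following is a minimal sketch, not part of the repository: the helper name `check_tools` is hypothetical, and the executable names (`fastqc`, `cutadapt`, `STAR`, `stringtie`, `snakemake`) are assumed to be the ones the Conda packages install.

```shell
#!/usr/bin/env bash
# Report which command-line tools are visible on PATH.
# Returns 0 if every tool is found, 1 if at least one is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names are assumptions based on the usual Conda packages.
check_tools fastqc cutadapt STAR stringtie snakemake \
    || echo "Activate snakeline_env or recreate it from envs/Bulk-RNA-Snakeline.yml"
```

Run it after `conda activate snakeline_env`; any line starting with `missing:` points to a tool that the environment did not provide.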

About-Bulk-RNA-Snakeline

The Allen Institute's Bioinformatics Core team currently employs a pipeline to process raw Bulk RNA-Seq data. This existing pipeline, however, relies on users executing a series of custom bash scripts for each workflow step. This approach is not only time-consuming but also demands extra effort from users to ensure proper script execution, correct parameter adjustments, and accurate file paths. It is crucial to recognize that user errors can negatively impact downstream analyses and compromise result accuracy.

Furthermore, users often face compatibility issues when chaining multiple scripts: the output of one script may not match the input expected by the next because of differences in file formats or software versions. Additionally, the virtual environment must be checked to confirm that all necessary software tools and dependencies are installed.

To minimize manual intervention and make the processing of Bulk RNA-Seq data more efficient, the BiCore team is transitioning from a basic Unix shell pipeline to Snakemake. As a user-friendly workflow engine, Snakemake processes data through well-defined rules, each consisting of input and output files, parameters, the computational task to run, and, optionally, an environment path. Expressing the workflow as rules reduces code complexity and improves readability. Designed specifically for bioinformatics analyses, Snakemake is a domain-specific language (DSL) that offers portability, readability, reproducibility, scalability, and reusability, which makes it a good fit for the BiCore team's needs.
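
To make the anatomy of a rule concrete, the sketch below writes a small illustrative Snakefile containing one rule with each of the parts named above. It is not taken from the pipeline's own `rules` folder: the file name `example_rule.smk` and the `raw/` and `qc/` paths are hypothetical, and only the environment file path comes from the Quickstart Guide.

```shell
#!/usr/bin/env bash
# Write a minimal, illustrative Snakefile with a single FastQC rule.
# File name and raw/ and qc/ paths are hypothetical placeholders.
cat > example_rule.smk <<'EOF'
rule fastqc:
    # Input file; {sample} is a wildcard resolved from the requested output.
    input:
        "raw/{sample}.fastq.gz"
    # Output files that FastQC writes for that input.
    output:
        html="qc/{sample}_fastqc.html",
        zip="qc/{sample}_fastqc.zip"
    # Extra parameters passed to the command.
    params:
        outdir="qc"
    threads: 2
    # Optional environment definition, relative to this Snakefile.
    conda:
        "envs/Bulk-RNA-Snakeline.yml"
    # The computational task itself.
    shell:
        "fastqc --threads {threads} --outdir {params.outdir} {input}"
EOF

echo "wrote example_rule.smk"
```

Assuming a file `raw/sampleA.fastq.gz` exists, `snakemake -s example_rule.smk --cores 2 -n qc/sampleA_fastqc.html` would dry-run the rule and print the planned job without executing it. The `conda:` directive only takes effect when Snakemake is started with `--use-conda`.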

Pipeline Overview

Directory Structure

Authors and History

Beagan Nguy - Algorithm Design
Anish Chakka - Project Manager

Acknowledgments

Allen Institute Bioinformatics Core Team

References

Johannes Köster, Sven Rahmann, Snakemake—a scalable bioinformatics workflow engine, Bioinformatics, Volume 28, Issue 19, 1 October 2012, Pages 2520–2522, https://doi.org/10.1093/bioinformatics/bts480
