Little workflow which can download and map multiple RNA sequencing files from the NCBI SRA as well as any local FASTQ files to a common reference using kallisto. Because it is written in Nextflow, it can automatically parallelize steps across CPUs or nodes, if you are running it on a cluster (see this page for more details). It is also built to be economical with disk space by removing large intermediary files when they are no longer needed. The output is a combined table containing abundance quantifications as well as FastQC reports for each of sequence files.
rnaseq-mapper will try to load the following modules: sratoolkit, kallisto, R, fastqc.
If your system doesn't use modules, make sure the execs are available in your PATH.
If you have not yet used the sratoolkit before, you will also need to configure it.
To do this, run vdb-config --interactive.
I recommend turning off enable local file caching in the Cache settings because this may keep downloaded sequence files on your disk even after the rnaseq-mapper deletes them which may cause your disk to run out of space if you process a lot of sequences.
- Set up nextflow (if not installed already):
curl -s https://get.nextflow.io | bash
- Create a file called
nextflow.config(exactly this name) by using theexample_nextflow.configfrom this directory as a template and adapting it to your use case. - Create an input file with the sequences you want to map in the format of
example_input.csvand make sure it is referred to in your config file. - If desired, place any FASTQ files in the directories referenced in your
nextflow.config(if you don't have any, make sure the folders still exist and just leave them empty). - Run the pipeline:
./nextflow run NAMlab/rnaseq-mapper
If you prefer, you can also make use of the Singularity container that packages all the required software (sratoolkit, kallisto, R, fastqc).
This requires Singularity or Apptainer to be installed in your system.
You can then simply execute the pipeline (step 5 above, the other steps stay the same) via
./nextflow run NAMlab/rnaseq-mapper -with-singularity library://merlin/default/rnaseq-mapper:latest
or
./nextflow run NAMlab/rnaseq-mapper -with-apptainer library://merlin/default/rnaseq-mapper:latest
respectively.
You will get out a TSV file with the combined kallisto outputs for all your sequence files like this one (by default in the work/out folder):
| target_id | length | SRR1805735_eff_length | SRR1805735_est_counts | SRR1805735_tpm | SRR1805737_eff_length | SRR1805737_est_counts | SRR1805737_tpm | SRR6512869_eff_length | SRR6512869_est_counts | SRR6512869_tpm |
|---|---|---|---|---|---|---|---|---|---|---|
| Solyc00g005280.1.1 | 411 | 252.224 | 0 | 0 | 241.253 | 0 | 0 | 212 | 0 | 0 |
| Solyc00g005285.2.1 | 216 | 68.6464 | 0 | 0 | 63.7937 | 0 | 0 | 31.5146 | 0 | 0 |
| Solyc00g006483.2.1 | 390 | 231.296 | 0 | 0 | 220.691 | 0 | 0 | 191 | 0 | 0 |
| Solyc00g006487.2.1 | 276 | 120.525 | 0 | 0 | 114.108 | 0 | 0 | 77.4659 | 2 | 22.2662 |
| Solyc00g006560.2.1 | 1317 | 1158 | 0 | 0 | 1145.76 | 0 | 0 | 1118 | 0 | 0 |
| Solyc00g006890.2.1 | 300 | 143.123 | 0 | 0 | 135.795 | 0 | 0 | 101.044 | 0 | 0 |
| Solyc00g006900.2.1 | 576 | 416.999 | 0 | 0 | 404.931 | 0 | 0 | 377 | 0 | 0 |
| Solyc00g007225.2.1 | 1275 | 1116 | 0 | 0 | 1103.76 | 0 | 0 | 1076 | 0 | 0 |
| Solyc00g007330.1.1 | 516 | 356.999 | 0 | 0 | 345.082 | 0 | 0 | 317 | 0 | 0 |
You will also get FastQC reports for each of sequence files in the same folder.
- rnaseq-mapper will retry downloading and mapping sequences from NCBI up to 4 times, if that still fails (e.g. because the sequence entry is not available or something is wrong with it), it will skip that sequence and your output will not contain any abundance information for it. So I recommend checking which sequences were actually mapped when working with the combined abundance file instead of assuming it will contain all accessions from your input.
- if you map a lot of sequences (> 2000), the combineAll step which merges all the produced abundance files into one
combined_abundance.tsvcan get very slow. You can make sure thedata.tablepackage is available in the R environment so rnaseq-mapper can use the fasterfreadandReducefunctions.