BIOF501A Project: Taxonomic Classification of the Microbiome in Surviving Colorectal Cancer Patients using the DADA2 Pipeline

By: Michael Yoon

Disclaimer

Sequence files were obtained from the NCBI Sequence Read Archive (SRA) as part of the study "The local tumor microbiome is associated with survival in late-stage colorectal cancer patients" [1]. The experimental samples chosen were obtained from late-stage (III & IV) colorectal cancer patients. Metadata and files were downloaded from: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=ERP142569&o=acc_s%3Aa

Primers used in this workflow were derived from the study and are as follows:

Forward: CCTACGGGNGGCWGCAG

Reverse: GGACTACHVGGGTATCTAAT

The references folder contains simplified metadata for the chosen samples as well as a sheet containing file paths for nf-core/ampliseq processing.

Introduction:

In 2018, colorectal cancer was among the most commonly diagnosed cancers, ranking 3rd in incidence and 2nd in mortality worldwide [2]. It can cause a range of symptoms, including fatigue, blood in the stool, and abdominal pain, all of which reduce a patient's quality of life [3]. The disease is generally diagnosed by colonoscopy, which locates the tumour. However, colonoscopy is fairly invasive, and less invasive detection methods could improve patient compliance. Recent research has linked the gut microbiome to colorectal cancer development: certain bacteria were identified as procarcinogenic because they convert dietary metabolites into harmful microbial products [4]. Given this relationship between gut microbiome metabolism and colorectal cancer, some of these bacteria could serve as biomarkers for the disease. Candidates include Prevotella, Porphyromonas, Peptostreptococcus, Fusobacterium nucleatum, Parvimonas, Bacteroides fragilis, Streptococcus gallolyticus, Enterococcus faecalis and Escherichia coli, among others [4][5]. For these reasons, investigating the presence and abundance of these bacteria is relevant to understanding colorectal cancer development.

Workflow overview:

The objective of this project was to create a workflow that takes paired-end sequence files and processes them through the DADA2 pipeline as part of nf-core/ampliseq. The output should include sequence quality reports and DADA2 output files showing the ASV counts for each sample, listed by ASV ID. Further analysis can be conducted in R to visualize abundance.

https://nf-co.re/ampliseq

The main steps of the workflow include:

Unzipping the fastq.gz files
Splitting the fastq files into respective forward (R1) and reverse (R2) reads
Running fastp to check read quality and determine truncation lengths
Zipping the fastq files
Running the DADA2 pipeline in nf-core environment with parameters obtained from the fastp summary
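
The splitting step can be sketched in shell as follows. This is a minimal illustration, not the code in workflow.nf: it assumes the unzipped FASTQ is interleaved (each read pair stored as 8 consecutive lines, R1 followed by R2), and the function name split_interleaved and the demo file names are hypothetical.

```shell
# Assumption: the input FASTQ is interleaved, i.e. every read pair occupies
# 8 consecutive lines (4 lines for R1, then 4 lines for R2).
split_interleaved() {
  input="$1"; out_r1="$2"; out_r2="$3"
  # Start from empty output files so repeated runs do not append duplicates.
  : > "$out_r1"; : > "$out_r2"
  awk -v r1="$out_r1" -v r2="$out_r2" '
    { if ((NR - 1) % 8 < 4) print >> r1; else print >> r2 }
  ' "$input"
}

# Demo on a tiny made-up read pair (not real data).
demo_dir=$(mktemp -d)
printf '%s\n' '@read1/1' 'ACGT' '+' 'IIII' '@read1/2' 'TGCA' '+' 'IIII' \
  > "$demo_dir/demo.fastq"
split_interleaved "$demo_dir/demo.fastq" \
  "$demo_dir/demo_R1.fastq" "$demo_dir/demo_R2.fastq"
cat "$demo_dir/demo_R1.fastq"
```

If the downloaded files are not interleaved, this sketch does not apply and the splitting logic in workflow.nf should be consulted instead.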

Setting up the environment

Deactivate current environment

conda deactivate

Download this repository and move into it

git clone https://github.com/michaelhojungyoon/BIOF501A_Project.git
cd BIOF501A_Project

Create the following directories:

mkdir results
mkdir sequences/sequences_split
mkdir "nf-core results"

Create the conda environment for processing (myenv) using the myenv_environment.yml file

conda env create -f myenv_environment.yml -n myenv

Create the conda environment for nf-core (nf-core) using the nf-core_environment.yml file

conda env create -f nf-core_environment.yml -n nf-core

Activate the nf-core environment and download ampliseq. At the interactive prompts, choose version 2.4.0, then singularity, then none.

conda activate nf-core
nf-core list
nf-core download ampliseq
conda deactivate
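
Before running the workflow, it can help to confirm that the required tools are on the PATH of the environment you are about to use. The sketch below is an optional check; the function name check_tools is hypothetical, and the tool list is an assumption based on the commands in this README (nextflow, fastp, and singularity for -profile singularity).

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 when all are present, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Run this inside the activated conda environment you want to check.
check_tools nextflow fastp singularity \
  || echo "Install the missing tools before continuing."
```

Not every tool needs to be present in both environments, so a "missing" line is only a problem for the step that actually uses that tool.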

Running the workflow:

Ensure the sequence files are located in the sequences folder and that you are in the working directory containing the workflow.nf file. Output files will be written to sequences/sequences_split with the suffix .noext.fastq.gz

conda activate myenv
nextflow run workflow.nf -c nextflow.config

Deactivate the environment

conda deactivate

Switch to the nf-core environment and run the following command. The pipeline will take approximately 10-20 minutes. Note: the --max_memory limit can be removed if memory is not an issue.

conda activate nf-core
nextflow run nf-core/ampliseq --input "references/samplesheet.tsv" --FW_primer "CCTACGGGNGGCWGCAG" --RV_primer "GGACTACHVGGGTATCTAAT" --trunclenf 280 --trunclenr 240 --outdir "nf-core results" -profile singularity --max_memory '110.GB'
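
The repository already provides references/samplesheet.tsv. For adapting the run to other samples, a similar sheet could be generated as sketched below. This assumes the tab-separated layout with sampleID, forwardReads and reverseReads columns described in the nf-core/ampliseq usage documentation (check this against your pipeline version) and a hypothetical file naming scheme of <sample>_R1.noext.fastq.gz and <sample>_R2.noext.fastq.gz; make_samplesheet is an illustrative name.

```shell
# Assumptions (hypothetical, adjust to your files): split reads are named
# <sample>_R1.noext.fastq.gz and <sample>_R2.noext.fastq.gz in one directory.
make_samplesheet() {
  reads_dir="$1"
  # Header row; column names taken from the ampliseq usage docs (assumed).
  printf 'sampleID\tforwardReads\treverseReads\n'
  for r1 in "$reads_dir"/*_R1.noext.fastq.gz; do
    [ -e "$r1" ] || continue            # skip the unexpanded glob if no files match
    r2="${r1%_R1.noext.fastq.gz}_R2.noext.fastq.gz"
    sample=$(basename "$r1" _R1.noext.fastq.gz)
    if [ -e "$r2" ]; then
      printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2"
    else
      echo "warning: no reverse read found for $sample" >&2
    fi
  done
}

# Example (writes to a new file rather than overwriting the provided sheet):
# make_samplesheet sequences/sequences_split > my_samplesheet.tsv
```

Samples without a matching reverse file are skipped with a warning rather than written as incomplete rows.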

Expected results

In the results folder, the fastp quality reports are stored as HTML files and should look similar to the following:

Read 1:

Read 2:

In nf-core results/dada2, there will be a DADA2_stats.tsv file to show filtered reads and a DADA2_table.tsv file to show ASV IDs associated with each sample.

DADA2_stats.tsv:

DADA2_table.tsv:

In addition to these output files, there are other files showing error logs, qiime2 results, multiqc results, and more.
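
To get a quick view of how many reads survive filtering, the stats file can be summarized from the command line. The sketch below is one possible approach: retention_summary is a hypothetical helper, the column names DADA2_input and nonchim are assumptions about the header of DADA2_stats.tsv (pass whatever names your file actually uses), and the demo values are made up.

```shell
# Print per-sample percentage of reads retained, given the names of the
# input-count and final-count columns in a tab-separated stats file.
retention_summary() {
  stats_file="$1"; in_col="$2"; out_col="$3"
  awk -F '\t' -v in_name="$in_col" -v out_name="$out_col" '
    NR == 1 {
      # Locate the two columns by header name instead of fixed position.
      for (i = 1; i <= NF; i++) {
        if ($i == in_name) in_idx = i
        if ($i == out_name) out_idx = i
      }
      if (!in_idx || !out_idx) {
        print "column not found" > "/dev/stderr"
        exit 1
      }
      next
    }
    $in_idx > 0 { printf "%s\t%.1f\n", $1, 100 * $out_idx / $in_idx }
  ' "$stats_file"
}

# Demo with made-up numbers (not results from this project).
demo_stats=$(mktemp)
printf 'sample\tDADA2_input\tnonchim\n' > "$demo_stats"
printf 'demoA\t1000\t800\n' >> "$demo_stats"
retention_summary "$demo_stats" DADA2_input nonchim
```

For the real output, the call would look like retention_summary "nf-core results/dada2/DADA2_stats.tsv" followed by the two column names from that file's header.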

Troubleshooting

Running workflow.nf may stop after completing each step. If so, run the workflow again until all of the steps have completed.
If the nf-core/ampliseq command gives the warning "At least one input file for the following sample(s) was too small (<1KB)", either remove all generated files with the commands listed below and re-run the workflow, or add --ignore_empty_input_files to the nf-core command.

Files that need to be deleted before re-running (run these from the repository root):

rm sequences/*noext
rm sequences/sequences_split/*
rm results/*
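
To see which input files might trigger the "too small" warning before deleting anything, the split files can be checked by size. This is a minimal sketch: list_small_fastq is a hypothetical helper, and treating "<1KB" as fewer than 1024 bytes is an assumption about how the pipeline applies its threshold.

```shell
# List .fastq.gz files under a directory that are smaller than 1024 bytes.
list_small_fastq() {
  find "$1" -type f -name '*.fastq.gz' -size -1024c
}

# Demo on a temporary directory with one tiny and one larger placeholder file.
demo_dir=$(mktemp -d)
printf 'tiny' > "$demo_dir/tiny.fastq.gz"
awk 'BEGIN { for (i = 0; i < 2048; i++) printf "a" }' > "$demo_dir/ok.fastq.gz"
list_small_fastq "$demo_dir"
```

For this project the call would be list_small_fastq sequences/sequences_split, run from the repository root.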

If the myenv environment can't be created from the .yml file, install the fastp dependency with:

conda install -c bioconda fastp

References:

[1] Debelius, J. W. et al. The local tumor microbiome is associated with survival in late-stage colorectal cancer patients. 2022.09.16.22279353 Preprint at https://doi.org/10.1101/2022.09.16.22279353 (2022).
[2] Bray, F. et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA. Cancer J. Clin. 68, 394–424 (2018).
[3] Kuipers, E. J. et al. Colorectal cancer. Nat. Rev. Dis. Primers 1, 15065 (2015).
[4] Rebersek, M. Gut microbiome and its role in colorectal cancer. BMC Cancer 21, 1325 (2021).
[5] Veziant, J., Villéger, R., Barnich, N. & Bonnet, M. Gut Microbiota as Potential Biomarker and/or Therapeutic Target to Improve the Management of Cancer: Focus on Colibactin-Producing Escherichia coli in Colorectal Cancer. Cancers 13, 2215 (2021).
