BIOF501A Project: Taxonomic Classification of the Microbiome in Surviving Colorectal Cancer Patients using the DADA2 Pipeline
Sequence files were obtained from the NCBI Sequence Read Archive (SRA) as part of the study "The local tumor microbiome is associated with survival in late-stage colorectal cancer patients" [1]. The experimental samples chosen were obtained from late-stage (III & IV) colorectal cancer patients. Metadata and files were downloaded from: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=ERP142569&o=acc_s%3Aa
Primers used in this workflow were derived from the study and are as follows:
Forward: CCTACGGGNGGCWGCAG
Reverse: GGACTACHVGGGTATCTAAT
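Both primers contain IUPAC degenerate bases (N, W, H, V), so they cannot be matched against reads as literal strings. As a minimal sketch, the helper functions below expand the degenerate codes into a regular expression and use it to spot-check how many reads in an uncompressed FASTQ file begin with a primer. The function names `iupac_to_regex` and `count_primer_hits` are hypothetical and are not part of this repository.

```bash
# Hypothetical helper (not part of this repository): expand IUPAC degenerate
# base codes in a primer into an extended regular expression.
iupac_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/S/[CG]/g' \
    -e 's/W/[AT]/g'  -e 's/K/[GT]/g'  -e 's/M/[AC]/g' \
    -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' -e 's/H/[ACT]/g' \
    -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Count reads whose sequence starts with the given primer.
# $1 = primer, $2 = uncompressed FASTQ file (four lines per record).
count_primer_hits() {
  # Sequence lines are the 2nd line of every 4-line record.
  awk 'NR % 4 == 2' "$2" | grep -cE "^$(iupac_to_regex "$1")"
}
```

For example, `iupac_to_regex CCTACGGGNGGCWGCAG` prints `CCTACGGG[ACGT]GGC[AT]GCAG`.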
The references folder contains simplified metadata for the chosen samples, as well as a sample sheet listing the file paths for nf-core/ampliseq processing.
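For orientation, a sample sheet of this kind can be generated with a short shell function. This is a sketch under stated assumptions, not the script used to build references/samplesheet.tsv: the function name `make_samplesheet` and the `_R1.fastq.gz` / `_R2.fastq.gz` suffixes are hypothetical, and the column names (sampleID, forwardReads, reverseReads) follow the tab-separated layout described in the nf-core/ampliseq documentation, which should be checked against the installed version.

```bash
# Hypothetical helper: print a tab-separated sample sheet for every complete
# R1/R2 pair found in a directory. File suffixes are assumptions.
make_samplesheet() {
  ss_dir=$1
  printf 'sampleID\tforwardReads\treverseReads\n'
  for ss_r1 in "$ss_dir"/*_R1.fastq.gz; do
    [ -e "$ss_r1" ] || continue              # glob matched nothing
    ss_r2=${ss_r1%_R1.fastq.gz}_R2.fastq.gz  # expected mate file
    [ -e "$ss_r2" ] || continue              # skip samples without a mate
    ss_id=$(basename "$ss_r1" _R1.fastq.gz)
    printf '%s\t%s\t%s\n' "$ss_id" "$ss_r1" "$ss_r2"
  done
}
```

A call such as `make_samplesheet sequences/sequences_split > my_samplesheet.tsv` would write the sheet to a new file without touching the one shipped in the references folder.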
In 2018, colorectal cancer was reported as one of the most commonly diagnosed cancers, ranking 3rd in incidence and 2nd in mortality worldwide [2]. Colorectal cancer can cause an array of symptoms, including fatigue, blood in the stool, and abdominal pain, all of which can negatively impact a patient's quality of life [3]. The disease is generally diagnosed by colonoscopy, which determines the location of the tumour. However, the technique is fairly invasive, and other methods of detection could improve patient compliance. Recent research has shown that the gut microbiome is associated with colorectal cancer development: certain bacterial strains were identified as procarcinogenic through the conversion of dietary metabolites into harmful microbial products [4]. Based on this relationship between gut microbiome metabolism and colorectal cancer, certain strains have the potential to act as biomarkers for the disease. These include Prevotella, Porphyromonas, Peptostreptococcus, Fusobacterium nucleatum, Parvimonas, Bacteroides fragilis, Streptococcus gallolyticus, Enterococcus faecalis and Escherichia coli, to name a few [4][5]. For these reasons, investigating the presence and abundance of these strains is relevant to determining colorectal cancer development.
The objective of this project was to create a workflow that takes paired-end sequence files and processes them through the DADA2 pipeline as part of nf-core/ampliseq. The resulting output files should include sequence quality reports and DADA2 output files showing the number of species assigned to each sample, categorized by ASV ID. Further analysis can be conducted in R to visualize abundance.
The main steps of the workflow include:
- Unzipping the fastq.gz files
- Splitting the fastq files into respective forward (R1) and reverse (R2) reads
- Running fastp to check read quality and determine a trimming length
- Zipping the fastq files
- Running the DADA2 pipeline in the nf-core environment with parameters obtained from the fastp summary
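The splitting step can be sketched with awk, assuming the decompressed input is an interleaved FASTQ file in which forward and reverse records alternate and each record occupies exactly four lines. Whether workflow.nf splits the files this way is an assumption, and the function name `split_interleaved` is hypothetical.

```bash
# Hypothetical helper: split an interleaved FASTQ file into R1 and R2 files.
# $1 = interleaved FASTQ, $2 = output file for R1, $3 = output file for R2
split_interleaved() {
  : > "$2"   # start from empty output files
  : > "$3"
  awk -v r1="$2" -v r2="$3" '
    {
      record = int((NR - 1) / 4)   # zero-based index of the 4-line record
      if (record % 2 == 0) {
        print > r1                 # even records are forward reads
      } else {
        print > r2                 # odd records are reverse reads
      }
    }
  ' "$1"
}
```

After splitting, both output files should contain the same number of lines; a mismatch suggests the input was not strictly interleaved.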
Deactivate the current environment
conda deactivate
Download this repository
git clone https://github.com/michaelhojungyoon/BIOF501A_Project.git
Add the following directories:
mkdir results
mkdir sequences/sequences_split
mkdir "nf-core results"
Create the conda environment (processing) using the myenv_environment.yml file
conda env create -f myenv_environment.yml -n myenv
Create the conda environment (nf-core) using the nf-core_environment.yml file
conda env create -f nf-core_environment.yml -n nf-core
Activate the nf-core environment and download ampliseq (v2.4.0), choosing singularity and then none at the prompts
conda activate nf-core
nf-core list
nf-core download ampliseq
conda deactivate
Ensure the sequence files are located in the sequences folder. Then make sure you are in the working directory that contains the workflow.nf file. Output files should appear in sequences/sequences_split with the suffix .noext.fastq.gz
conda activate myenv
nextflow run workflow.nf -c nextflow.config
Deactivate the environment
conda deactivate
Switch to the nf-core environment and run the following command. The pipeline takes approximately 10-20 minutes. Note: the --max_memory limit can be removed if memory is not an issue.
conda activate nf-core
nextflow run nf-core/ampliseq --input "references/samplesheet.tsv" --FW_primer "CCTACGGGNGGCWGCAG" --RV_primer "GGACTACHVGGGTATCTAAT" --trunclenf 280 --trunclenr 240 --outdir "nf-core results" -profile singularity --max_memory '110.GB'
In the results folder, the fastp quality-check reports are stored as HTML files.
In nf-core results/dada2, there will be a DADA2_stats.tsv file to show filtered reads and a DADA2_table.tsv file to show ASV IDs associated with each sample.
In addition to these output files, there are other files showing error logs, qiime2 results, multiqc results, and more.
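As a quick check before moving to R, the number of ASVs detected in each sample can be tallied from DADA2_table.tsv with awk. This is a minimal sketch that assumes the table is tab-separated with an ASV identifier in the first column, one count column per sample, and a trailing column named sequence; that layout is an assumption and should be confirmed against the actual file. The function name `count_asvs_per_sample` is hypothetical.

```bash
# Hypothetical helper: for each sample column, count the ASVs with a
# non-zero read count. $1 = path to the ASV table (layout assumed, see above).
count_asvs_per_sample() {
  awk -F'\t' '
    NR == 1 {
      # Remember every column after the first, except the "sequence" column.
      for (i = 2; i <= NF; i++) {
        if ($i != "sequence") { name[i] = $i; cols[++n] = i }
      }
      next
    }
    {
      for (j = 1; j <= n; j++) {
        if ($(cols[j]) + 0 > 0) count[cols[j]]++
      }
    }
    END {
      for (j = 1; j <= n; j++) printf "%s\t%d\n", name[cols[j]], count[cols[j]]
    }
  ' "$1"
}
```

Running `count_asvs_per_sample "nf-core results/dada2/DADA2_table.tsv"` prints one line per sample with the sample name and its ASV count.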
- Running workflow.nf may appear to stop after each step completes. You may have to run the workflow multiple times to complete all of the steps.
- If the nf-core/ampliseq command gives the warning "At least one input file for the following sample(s) was too small (<1KB)", either re-run the workflow after removing all generated files with the commands listed below, or add --ignore_empty_input_files to the nf-core command.
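To see which files triggered the warning before choosing between the two options, files under 1 KB can be listed with find. This is a minimal sketch: the function name `list_small_fastq` is hypothetical, 1 KB is taken to mean 1024 bytes (the pipeline's exact threshold may differ), and it assumes the split reads are stored as .fastq.gz files.

```bash
# Hypothetical helper: list gzipped FASTQ files smaller than 1024 bytes.
# $1 = directory to search, e.g. sequences/sequences_split
list_small_fastq() {
  # -size -1024c matches files of fewer than 1024 bytes
  find "$1" -type f -name '*.fastq.gz' -size -1024c
}
```

An empty result from `list_small_fastq sequences/sequences_split` suggests the split files themselves are not the cause of the warning.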
Files that need to be deleted before re-running:
cd sequences
rm *noext
cd sequences_split
rm *
cd ../../results
rm *
- If the myenv environment can't be created from the .yml file, install the fastp dependency with:
conda install -c bioconda fastp
- Debelius, J. W. et al. The local tumor microbiome is associated with survival in late-stage colorectal cancer patients. Preprint at https://doi.org/10.1101/2022.09.16.22279353 (2022).
- Bray, F. et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA. Cancer J. Clin. 68, 394–424 (2018).
- Kuipers, E. J. et al. Colorectal cancer. Nat. Rev. Dis. Primer 1, 15065 (2015).
- Rebersek, M. Gut microbiome and its role in colorectal cancer. BMC Cancer 21, 1325 (2021).
- Veziant, J., Villéger, R., Barnich, N. & Bonnet, M. Gut Microbiota as Potential Biomarker and/or Therapeutic Target to Improve the Management of Cancer: Focus on Colibactin-Producing Escherichia coli in Colorectal Cancer. Cancers 13, 2215 (2021).



