twistcgp

A bioinformatics pipeline for processing data from Twist Bioscience's TwistCGP product for targeted enrichment of cancer-associated genes.

Visit us at Fulcrum Genomics to learn more about how we can power your Bioinformatics with twistcgp and beyond.

Pipeline Steps

Index Genome (bwa-mem2, samtools)
Read QC (FastQC)
Trim Adapters (fastp)
Fastq to BAM (fgbio FastqToBam)
Align (bwa-mem2)
Mark Duplicates (picard MarkDuplicates)
Variant Calling via local Assembly of Haplotypes (gatk4/mutect2)
Annotate Variants (SnpEff, Ensembl VEP, CIViCpy)
Calculate Tumor Mutational Burden (pyTMB)
Call CNVs (CNVkit)
Identify MSI (MSIsensor2 or MSIsensor-pro)
Collect Metrics (picard CollectHsMetrics, picard CollectMultipleMetrics, perbase)
Present QC (MultiQC)

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set-up Nextflow. Make sure to test your setup with nextflow run twistcpg/main.nf -profile "test,[docker|singularity|conda]" --outdir ./results before running the workflow on actual data.

For a full list of available options run nextflow run twistcpg/main.nf --help --show_hidden.

Prepare a Samplesheet

First, prepare a samplesheet with your input data that looks as follows:

sample,fastq_1,fastq_2
ILLUMINA_PAIRED_END,assets/test-data/fastq/Illumina_TestReads_R1_001.fastq.gz,assets/test-data/fastq/Illumina_TestReads_R2_001.fastq.gz
MGI_SINGLE_END,assets/test-data/fastq/MGI_TestReads_1.fq.gz

Each row represents a fastq file (single-end) or a pair of fastq files (paired end). The sample column provides a unique identifier for the given sample.

Obtain a Genome

The TwistCGP panel was designed using the hg38 Genome in a Bottle (GIAB) reference genome FASTA file which can be obtained from GIAB.

Obtain list of Baits & Targets

You will need a BED or Interval List file for (1) the panel baits and (2) the panel targets. The TwistCGP baits and targets files are available from Twist after purchasing the TwistCGP product. BED files should follow the UCSC BED format specifications; interval list files should adhere to GATK interval list conventions.

Targets will be padded prior to variant calling; the padding size can be adjusted using the --target_padding parameter (default: 100, which adds 100 bp on each side of the interval).

Note

If you lack the baits file, you can provide the panel targets for both arguments. Providing the targets as the baits will invalidate the bait specific metrics in the picard HsMetrics. Additionally, CNV calls from CNVkit may be noiser due to inaccurate modeling of bait locations.

(Optionally) Provide Adapter Sequences

If sequencing data is likely to include adapter sequences, providing these sequences in FASTA format will allow fastp to trim those sequences prior to alignment. The adapter sequences can be supplied to the pipeline using the --adapters_fasta parameter.

Optional Time and Resource Saving Setup

Pre-Generate a Genome Index

Because this pipeline uses bwa-mem2 for alignment, 87GB of memory are required to generate the human genome index. Alternatively, this index can be built without the pipeline and the directory supplied using the --bwa parameter. See docs/bwamem2_index.md for details.

Additionally, the genome index can be saved to the output directory for future use by supplying the --save_reference parameter. Subsequently, you may pass the index using --bwa results/reference/bwamem2.

Pre-Generate a MSI Genome Scan List

Generation of the MSIsensor2 or MSIsensor-pro microsatellite scan list requires a space intensive, uncompressed reference genome. To save time and space you can supply the scan list generated by MSIsensor2 (or MSIsensor-pro) using the --msisensor_scan parameter. See docs/msisensor_scan.md for details.

Additionally, the MSI scan list can be saved to the output directory for future use by supplying the --save_reference parameter. Subsequently, you may pass the MSIsensor scan file using --msisensor_scan results/reference/reference.msisensor_scan.list.

Pre-Generate Variant Annotation Caches

SnpEff and Ensembl VEP require many large files known as a cache with which to annotate variants. To use pre-downloaded caches for variant annotation, supply the parameters --snpeff_cache and/or --ensemblvep_cache with the path to the root of the annotation cache folder. If a cache is not provided, the pipeline will automatically download it (which will add computation time). For details on how to generate each cache see docs/variant_annotation.md.

Additionally, the caches can be saved to the output directory for future use by supplying the --save_reference parameter. Subsequently, you may pass the caches using --snpeff_cache results/reference/snpeff_cache/GRCh38.105 and --ensemblvep_cache results/reference/ensemblvep_cache/vep_cache.

Optional Variant Calling Resource Files

Population Germline Resource VCF

This pipeline uses Mutect2 to perform somatic variant calling on local haplotypes. While Mutect2 does not require a germline resource or a panel of normals (PoN) to run, both are recommended. The germline resource VCF encapsulates population allele frequencies of known germline variants (typically from healthy individuals). These frequencies are used by Mutect2 to model the likelihood that a specific variant is somatic or inherited. The provided VCF file must contain allele frequencies.

The germline resource VCF can be supplied to the pipeline using the --population_germline_vcf parameter. The corresponding TBI file can be supplied using the --population_germline_tbi parameter.

See docs/germline_resource_vcf.md for more details on how to generate this input.

Example Germline Resource VCF Records

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
*      1       10067   .       T       TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC      30.35   PASS    AC=3;AF=7.384E-5
*      1       10108   .       CAACCCT C       46514.32        PASS    AC=6;AF=1.525E-4
*      1       10109   .       AACCCTAACCCT    AAACCCT,*       89837.27        PASS    AC=48,5;AF=0.001223,1.273E-4
*      1       10114   .       TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTA  *,CAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTA,T      36728.97        PASS    AC=55,9,1;AF=0.001373,2.246E-4,2.496E-5
*      1       10119   .       CT      C,*     251.23  PASS    AC=5,1;AF=1.249E-4,2.498E-5
*      1       10120   .       TA      CA,*    14928.74        PASS    AC=10,6;AF=2.5E-4,1.5E-4
*      1       10128   .       ACCCTAACCCTAACCCTAAC    A,*     285.71  PASS    AC=3,1;AF=7.58E-5,2.527E-5
*      1       10131   .       CT      C,*     378.93  PASS    AC=7,5;AF=1.765E-4,1.261E-4
*      1       10132   .       TAACCC  *,T     18025.11        PASS    AC=12,2;AF=3.03E-4,5.049E-5

Panel of Normals VCF

While a panel of normals (PoN) VCF is not required for Mutect2 to run, it is recommended. A PoN is a VCF that contains sites found across multiple "normal" samples (e.g., derived from healthy tissue that is believed to not have somatic alterations), ideally from the same sequencing preparation, pipeline, platform, etc. as the tumor samples. While the germline resource helps model population variants, the PoN VCF filters out technical artifacts to improve the quality of the variant calling analyses.

The panel of normals VCF can be supplied to the pipeline using the --pon_vcf parameter. Its corresponding TBI file can be supplied using the --pon_tbi parameter.

See docs/panel_of_normals_vcf.md for more details on how to generate a panel of normals VCF.

Panel of Normals Reference for CNV Calling

You may supply a Panel of Normal (PON) reference .cnn file for use with CNVkit using the --pon_cnn parameter. If you do not supply a PON reference, a "flat" reference will be used which assumes equal coverage across the panel regions.

For details on how to generate this file see docs/cnvkit_pon.md.

Generate a gnomAD VCF for TMB Calculation

Tumor mutational burden (TMB) is measure of the total number of somatic mutations present within the cancer genome. It is crucial to exclude germline variants for the calculation of TMB. This pipeline expects a VCF derived from gnomAD.

The gnomAD VCF can be supplied to the pipeline using the --gnomad_vcf parameter. Its corresponding TBI file can be supplied using the --gnomad_tbi parameter.

See docs/gnomad_vcf.md for details on how to generate a gnomAD VCF.

Run the Pipeline

Now, you can run the pipeline (including optional inputs) using a command like:

nextflow run twistcgp/main.nf \
   -profile <docker/singularity/conda> \
   --fasta hg38_giab.fa \
   --input samplesheet.csv \
   --baits baits.bed \
   --targets targets.bed \
   --outdir results \
   --bwa resources/hg38_giab/bwamem2 \
   --msisensor_scan resources/hg38_giab.msisensor_scan.list \
   --ensemblvep_cache resources/ensemblevep_cache/vep_cache \
   --snpeff_cache resources/snpeff_cache/GRCh38.105 \
   --population_germline_vcf resources/af-only-gnomad.hg38.vcf.gz \
   --population_germline_tbi resources/af-only-gnomad.hg38.vcf.gz.tbi \
   --pon_vcf resources/1000g_pon.hg38.vcf.gz \
   --pon_tbi resources/1000g_pon.hg38.vcf.gz.tbi \
   --pon_cnn resources/pon.cnn \
   --gnomad_vcf resources/all_chromosomes.intersect.vcf.bgz \
   --gnomad_tbi resources/all_chromosomes.intersect.vcf.bgz.tbi

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

Credits

twistcgp was originally written by Erin McAuley and Zach Norgaard of Fulcrum Genomics.

We thank the following people for their extensive assistance in the development of this pipeline:

Nils Homer
Tim Dunn

Citations

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
.github		.github
assets		assets
conf		conf
docs		docs
modules		modules
subworkflows		subworkflows
tests		tests
workflows		workflows
.gitattributes		.gitattributes
.gitignore		.gitignore
.nf-core.yml		.nf-core.yml
.pre-commit-config.yaml		.pre-commit-config.yaml
.prettierignore		.prettierignore
.prettierrc.yml		.prettierrc.yml
CHANGELOG.md		CHANGELOG.md
README.md		README.md
main.nf		main.nf
modules.json		modules.json
nextflow.config		nextflow.config
nextflow_schema.json		nextflow_schema.json
nf-test.config		nf-test.config
tower.yml		tower.yml
twistcgp-template.yaml		twistcgp-template.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

twistcgp

Pipeline Steps

Usage

Prepare a Samplesheet

Obtain a Genome

Obtain list of Baits & Targets

(Optionally) Provide Adapter Sequences

Optional Time and Resource Saving Setup

Optional Variant Calling Resource Files

Run the Pipeline

Credits

Sponsors

Citations

About

Uh oh!

Releases 1

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

twistcgp

Pipeline Steps

Usage

Prepare a Samplesheet

Obtain a Genome

Obtain list of Baits & Targets

(Optionally) Provide Adapter Sequences

Optional Time and Resource Saving Setup

Optional Variant Calling Resource Files

Run the Pipeline

Credits

Sponsors

Citations

About

Topics

Resources

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases 1

Contributors

Uh oh!

Languages