A bioinformatics pipeline for processing data from Twist Bioscience's TwistCGP product for targeted enrichment of cancer-associated genes.
Visit us at Fulcrum Genomics to learn more about how we can power your Bioinformatics with twistcgp and beyond.
- Index Genome (
bwa-mem2,samtools) - Read QC (
FastQC) - Trim Adapters (
fastp) - Fastq to BAM (
fgbio FastqToBam) - Align (
bwa-mem2) - Mark Duplicates (
picard MarkDuplicates) - Variant Calling via local Assembly of Haplotypes (
gatk4/mutect2) - Annotate Variants (
SnpEff,Ensembl VEP,CIViCpy) - Calculate Tumor Mutational Burden (
pyTMB) - Call CNVs (
CNVkit) - Identify MSI (
MSIsensor2orMSIsensor-pro) - Collect Metrics (
picard CollectHsMetrics,picard CollectMultipleMetrics,perbase) - Present QC (
MultiQC)
Note
If you are new to Nextflow and nf-core, please refer to this page on how to set-up Nextflow.
Make sure to test your setup with nextflow run twistcpg/main.nf -profile "test,[docker|singularity|conda]" --outdir ./results before running the workflow on actual data.
For a full list of available options run nextflow run twistcpg/main.nf --help --show_hidden.
First, prepare a samplesheet with your input data that looks as follows:
sample,fastq_1,fastq_2
ILLUMINA_PAIRED_END,assets/test-data/fastq/Illumina_TestReads_R1_001.fastq.gz,assets/test-data/fastq/Illumina_TestReads_R2_001.fastq.gz
MGI_SINGLE_END,assets/test-data/fastq/MGI_TestReads_1.fq.gz
Each row represents a fastq file (single-end) or a pair of fastq files (paired end). The sample column provides a unique identifier for the given sample.
The TwistCGP panel was designed using the hg38 Genome in a Bottle (GIAB) reference genome FASTA file which can be obtained from GIAB.
You will need a BED or Interval List file for (1) the panel baits and (2) the panel targets. The TwistCGP baits and targets files are available from Twist after purchasing the TwistCGP product. BED files should follow the UCSC BED format specifications; interval list files should adhere to GATK interval list conventions.
Targets will be padded prior to variant calling; the padding size can be adjusted using the --target_padding parameter (default: 100, which adds 100 bp on each side of the interval).
Note
If you lack the baits file, you can provide the panel targets for both arguments.
Providing the targets as the baits will invalidate the bait specific metrics in the picard HsMetrics.
Additionally, CNV calls from CNVkit may be noiser due to inaccurate modeling of bait locations.
If sequencing data is likely to include adapter sequences, providing these sequences in FASTA format will allow fastp to trim those sequences prior to alignment.
The adapter sequences can be supplied to the pipeline using the --adapters_fasta parameter.
Pre-Generate a Genome Index
Because this pipeline uses bwa-mem2 for alignment, 87GB of memory are required to generate the human genome index.
Alternatively, this index can be built without the pipeline and the directory supplied using the --bwa parameter.
See docs/bwamem2_index.md for details.
Additionally, the genome index can be saved to the output directory for future use by supplying the --save_reference parameter.
Subsequently, you may pass the index using --bwa results/reference/bwamem2.
Pre-Generate a MSI Genome Scan List
Generation of the MSIsensor2 or MSIsensor-pro microsatellite scan list requires a space intensive, uncompressed reference genome.
To save time and space you can supply the scan list generated by MSIsensor2 (or MSIsensor-pro) using the --msisensor_scan parameter.
See docs/msisensor_scan.md for details.
Additionally, the MSI scan list can be saved to the output directory for future use by supplying the --save_reference parameter.
Subsequently, you may pass the MSIsensor scan file using --msisensor_scan results/reference/reference.msisensor_scan.list.
Pre-Generate Variant Annotation Caches
SnpEff and Ensembl VEP require many large files known as a cache with which to annotate variants. To use pre-downloaded caches for variant annotation, supply the parameters --snpeff_cache and/or --ensemblvep_cache with the path to the root of the annotation cache folder. If a cache is not provided, the pipeline will automatically download it (which will add computation time). For details on how to generate each cache see docs/variant_annotation.md.
Additionally, the caches can be saved to the output directory for future use by supplying the --save_reference parameter.
Subsequently, you may pass the caches using --snpeff_cache results/reference/snpeff_cache/GRCh38.105 and --ensemblvep_cache results/reference/ensemblvep_cache/vep_cache.
Population Germline Resource VCF
This pipeline uses Mutect2 to perform somatic variant calling on local haplotypes. While Mutect2 does not require a germline resource or a panel of normals (PoN) to run, both are recommended. The germline resource VCF encapsulates population allele frequencies of known germline variants (typically from healthy individuals). These frequencies are used by Mutect2 to model the likelihood that a specific variant is somatic or inherited. The provided VCF file must contain allele frequencies.
The germline resource VCF can be supplied to the pipeline using the --population_germline_vcf parameter.
The corresponding TBI file can be supplied using the --population_germline_tbi parameter.
See docs/germline_resource_vcf.md for more details on how to generate this input.
Example Germline Resource VCF Records
#CHROM POS ID REF ALT QUAL FILTER INFO
* 1 10067 . T TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC 30.35 PASS AC=3;AF=7.384E-5
* 1 10108 . CAACCCT C 46514.32 PASS AC=6;AF=1.525E-4
* 1 10109 . AACCCTAACCCT AAACCCT,* 89837.27 PASS AC=48,5;AF=0.001223,1.273E-4
* 1 10114 . TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTA *,CAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTA,T 36728.97 PASS AC=55,9,1;AF=0.001373,2.246E-4,2.496E-5
* 1 10119 . CT C,* 251.23 PASS AC=5,1;AF=1.249E-4,2.498E-5
* 1 10120 . TA CA,* 14928.74 PASS AC=10,6;AF=2.5E-4,1.5E-4
* 1 10128 . ACCCTAACCCTAACCCTAAC A,* 285.71 PASS AC=3,1;AF=7.58E-5,2.527E-5
* 1 10131 . CT C,* 378.93 PASS AC=7,5;AF=1.765E-4,1.261E-4
* 1 10132 . TAACCC *,T 18025.11 PASS AC=12,2;AF=3.03E-4,5.049E-5
Panel of Normals VCF
While a panel of normals (PoN) VCF is not required for Mutect2 to run, it is recommended. A PoN is a VCF that contains sites found across multiple "normal" samples (e.g., derived from healthy tissue that is believed to not have somatic alterations), ideally from the same sequencing preparation, pipeline, platform, etc. as the tumor samples. While the germline resource helps model population variants, the PoN VCF filters out technical artifacts to improve the quality of the variant calling analyses.
The panel of normals VCF can be supplied to the pipeline using the --pon_vcf parameter.
Its corresponding TBI file can be supplied using the --pon_tbi parameter.
See docs/panel_of_normals_vcf.md for more details on how to generate a panel of normals VCF.
Panel of Normals Reference for CNV Calling
You may supply a Panel of Normal (PON) reference .cnn file for use with CNVkit using the --pon_cnn parameter. If you do not supply a PON reference, a "flat" reference will be used which assumes equal coverage across the panel regions.
For details on how to generate this file see docs/cnvkit_pon.md.
Generate a gnomAD VCF for TMB Calculation
Tumor mutational burden (TMB) is measure of the total number of somatic mutations present within the cancer genome. It is crucial to exclude germline variants for the calculation of TMB. This pipeline expects a VCF derived from gnomAD.
The gnomAD VCF can be supplied to the pipeline using the --gnomad_vcf parameter.
Its corresponding TBI file can be supplied using the --gnomad_tbi parameter.
See docs/gnomad_vcf.md for details on how to generate a gnomAD VCF.
Now, you can run the pipeline (including optional inputs) using a command like:
nextflow run twistcgp/main.nf \
-profile <docker/singularity/conda> \
--fasta hg38_giab.fa \
--input samplesheet.csv \
--baits baits.bed \
--targets targets.bed \
--outdir results \
--bwa resources/hg38_giab/bwamem2 \
--msisensor_scan resources/hg38_giab.msisensor_scan.list \
--ensemblvep_cache resources/ensemblevep_cache/vep_cache \
--snpeff_cache resources/snpeff_cache/GRCh38.105 \
--population_germline_vcf resources/af-only-gnomad.hg38.vcf.gz \
--population_germline_tbi resources/af-only-gnomad.hg38.vcf.gz.tbi \
--pon_vcf resources/1000g_pon.hg38.vcf.gz \
--pon_tbi resources/1000g_pon.hg38.vcf.gz.tbi \
--pon_cnn resources/pon.cnn \
--gnomad_vcf resources/all_chromosomes.intersect.vcf.bgz \
--gnomad_tbi resources/all_chromosomes.intersect.vcf.bgz.tbiWarning
Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
twistcgp was originally written by Erin McAuley and Zach Norgaard of Fulcrum Genomics.
We thank the following people for their extensive assistance in the development of this pipeline:
- Nils Homer
- Tim Dunn
Sponsors provide support for twistcgp through direct funding or employing contributors.
Public sponsors include:
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
