mandhri/WGS-variant-callling---HPC-friendly-workflow

Somatic Tumour/Normal Variant Calling (nf-core/sarek + Nextflow)

Matched tumour/normal (T/N) somatic variant calling workflow using nf-core/sarek, Nextflow, and Apptainer.


Repository structure

.
├── apptainer_cache/         # Apptainer images/cache
├── bin/                     # User-space shims (e.g., singularity -> apptainer)
├── conda_cache/             # Nextflow conda cache (if used)
├── metadata/                # Runinfo + samplesheets (CSVs)
├── nxf_work/                # Nextflow work directory
├── results/                 # Pipeline outputs (MultiQC, VCFs, pipeline_info)
├── scripts/                 # Execution & analysis scripts
└── README.md
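The `bin/` shim mentioned in the tree can be sketched roughly as follows. This is an illustration, not the repository's actual shim: the function name `make_shim` is hypothetical, and it assumes `apptainer` is already on `PATH` while no `singularity` command is.

```shell
#!/usr/bin/env bash
# Sketch: write a user-space `singularity` wrapper that forwards to `apptainer`.
# make_shim is a hypothetical helper name, not part of this repository.
set -eu

make_shim() {
  local bin_dir="$1"
  mkdir -p "$bin_dir"
  cat > "$bin_dir/singularity" <<'EOF'
#!/usr/bin/env bash
# Forward every argument unchanged to apptainer.
exec apptainer "$@"
EOF
  chmod +x "$bin_dir/singularity"
}

# Usage (from the project root):
#   make_shim bin
#   export PATH="$PWD/bin:$PATH"
```

With the shim first on `PATH`, tools that look for a `singularity` executable end up calling Apptainer instead.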

Data

Study: SRP162370
Type: Paired-end FASTQs (matched T/N pair)

| Sample | Accession | Role |
|--------|------------|----------------|
| Tumour | SRR8955957 | Matched tumour |
| Normal | SRR8955958 | Matched normal |

Local FASTQs:

data/fastq_raw/
├── SRR8955957_1.fastq.gz   # tumour R1
├── SRR8955957_2.fastq.gz   # tumour R2
├── SRR8955958_1.fastq.gz   # normal R1
└── SRR8955958_2.fastq.gz   # normal R2

Download data (SRA → FASTQ)

This repo includes a helper script: scripts/Download_SRA.sh.

It:

  • pulls RunInfo metadata
  • downloads .sra files via prefetch
  • converts to paired FASTQs via fasterq-dump
  • compresses with pigz
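The four steps above might look roughly like the sketch below. It is not the contents of `scripts/Download_SRA.sh`: the `run` helper, the `DRY_RUN` switch, and the `THREADS` default are assumptions added for illustration, and the `.sra` path assumes the layout recent `prefetch` versions produce (`<outdir>/<run>/<run>.sra`). By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of the download steps; DRY_RUN, THREADS and run() are illustrative assumptions.
set -eu

STUDY="SRP162370"
RUNS="SRR8955957 SRR8955958"   # tumour, normal
THREADS="${THREADS:-4}"        # assumed default
DRY_RUN="${DRY_RUN:-1}"        # 1 = print commands only, 0 = execute them

# Print a command, and execute it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# esearch finds the study in the SRA database; efetch writes its RunInfo table as CSV.
fetch_runinfo() {
  esearch -db sra -query "$STUDY" | efetch -format runinfo > "metadata/${STUDY}_runinfo.csv"
}

download_all() {
  local srr
  run mkdir -p metadata data/sra_raw data/fastq_raw
  run fetch_runinfo
  for srr in $RUNS; do
    run prefetch "$srr" --output-directory data/sra_raw
    run fasterq-dump --split-files --threads "$THREADS" \
        --outdir data/fastq_raw "data/sra_raw/$srr/$srr.sra"
    run pigz -p "$THREADS" "data/fastq_raw/${srr}_1.fastq" "data/fastq_raw/${srr}_2.fastq"
  done
}

download_all
```

Setting `DRY_RUN=0` runs the commands for real, which requires SRA Toolkit, Entrez Direct, and pigz on `PATH`.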

Methodology

  • Orchestration: Nextflow (local executor)
  • Container execution: Apptainer, run through -profile singularity (a singularity → apptainer shim in bin/ covers systems without a singularity command)
  • Pipeline: nf-core/sarek
  • Genome: GATK.GRCh38
  • Callers: Mutect2, Strelka
  • QC: MultiQC (FastQC, fastp, MarkDuplicates, etc.)

Key Sarek parameters:

--genome GATK.GRCh38
--tools mutect2,strelka
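For context, a full invocation using these parameters could look like the sketch below. It is an assumption about how the pieces fit together, not a copy of `scripts/02_run_sarek_somatic.sh`: the cache variables, `-work-dir`, and `-resume` are standard Nextflow options chosen here to match the repository layout, and `launch_sarek`, `run`, and `DRY_RUN` are hypothetical helpers. No pipeline version is pinned because the README does not state one.

```shell
#!/usr/bin/env bash
# Sketch: assemble the Sarek somatic run; prints the command unless DRY_RUN=0.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # assumed switch: 1 = print only, 0 = launch

# Keep container and conda caches inside the project tree.
export NXF_SINGULARITY_CACHEDIR="$PWD/apptainer_cache"
export NXF_CONDA_CACHEDIR="$PWD/conda_cache"

run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

launch_sarek() {
  run nextflow run nf-core/sarek \
    -profile singularity \
    -work-dir nxf_work \
    -resume \
    --input metadata/samplesheet_tn_demo.csv \
    --outdir results/tn_demo_somatic \
    --genome GATK.GRCh38 \
    --tools mutect2,strelka
}

launch_sarek
```

Single-dash options (`-profile`, `-work-dir`, `-resume`) belong to Nextflow itself, while double-dash options are pipeline parameters passed to Sarek.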

Usage

All scripts are in scripts/.

Start from the project root (adjust the path to your own checkout):

cd /mnt/vol1/WGS_variant_callling

1) Download data (SRA → FASTQ)

bash scripts/Download_SRA.sh

Outputs:

  • metadata/SRP162370_runinfo.csv
  • data/sra_raw/
  • data/fastq_raw/

2) Test environment

Validates Nextflow + container execution end-to-end using a test profile.

bash scripts/run_sarek_test.sh

Outputs:

  • results/sarek_test/

3) Generate samplesheet

Build Sarek-format CSV samplesheets that point to the local FASTQs.

bash scripts/02_make_samplesheets.sh

Outputs:

  • metadata/samplesheet_tn_demo.csv
  • metadata/samplesheet_normal_only.csv
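As an illustration of what these files contain, the sketch below writes both samplesheets using the FASTQ-input columns that Sarek 3.x documents (`patient,status,sample,lane,fastq_1,fastq_2`, with `status` 0 for normal and 1 for tumour). The sample names `TUMOUR` and `NORMAL` are inferred from the output paths in this README; the patient ID `P1`, the lane label `L1`, and the function name `write_samplesheets` are hypothetical, and the real script may differ.

```shell
#!/usr/bin/env bash
# Sketch: write tumour/normal and normal-only samplesheets for Sarek.
# P1 (patient) and L1 (lane) are placeholder values, not taken from the repository.
set -eu

# write_samplesheets FASTQ_DIR OUT_DIR
write_samplesheets() {
  local fq_dir="$1" out_dir="$2"
  local header="patient,status,sample,lane,fastq_1,fastq_2"
  local normal="P1,0,NORMAL,L1,${fq_dir}/SRR8955958_1.fastq.gz,${fq_dir}/SRR8955958_2.fastq.gz"
  local tumour="P1,1,TUMOUR,L1,${fq_dir}/SRR8955957_1.fastq.gz,${fq_dir}/SRR8955957_2.fastq.gz"
  mkdir -p "$out_dir"
  printf '%s\n' "$header" "$normal" "$tumour" > "$out_dir/samplesheet_tn_demo.csv"
  printf '%s\n' "$header" "$normal" > "$out_dir/samplesheet_normal_only.csv"
}

# Usage (from the project root):
#   write_samplesheets "$PWD/data/fastq_raw" metadata
```

Tumour and normal rows share the same `patient` value, which is how Sarek pairs them for somatic calling.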

4) Run pipeline (somatic calling)

Execute somatic calling on the tumour/normal pair.

bash scripts/02_run_sarek_somatic.sh

Outputs:

  • results/tn_demo_somatic/

Results summary

QC metrics (MultiQC)

Note: This is a low-coverage demonstration dataset (median coverage 0X), so treat the results as workflow validation rather than a full biological interpretation.

| Metric | Normal | Tumour |
|------------------------------|--------|---------|
| Reads | ~3.4M | ~4.0M |
| % mapped | ~99.9% | ~100.0% |
| Duplication (MarkDuplicates) | ~95.4% | ~95.6% |
| Median coverage | 0X | 0X |

Variant counts

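| Callset | Total | PASS | SNV | INDEL |
|--------------------------|-------|------|-------|-------|
| Mutect2 (filtered VCF) | 1,044 | 444 | 697 | 347 |
| Strelka (somatic SNVs) | 2,160 | 497 | 2,160 | 0 |
| Strelka (somatic indels) | 18 | 1 | 0 | 18 |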
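Counts of this kind can be derived directly from a bgzipped VCF. The sketch below is one way to do it and is not necessarily how the report computes them: `vcf_counts` is a hypothetical helper, and it classifies a record as an SNV only when REF and ALT are both a single base, so multi-allelic and multi-nucleotide records land in `other` alongside indels.

```shell
#!/usr/bin/env bash
# Sketch: tally total, PASS, SNV and non-SNV records in a gzipped VCF.
set -eu

# vcf_counts FILE.vcf.gz -> "total=N pass=N snv=N other=N"
vcf_counts() {
  gzip -dc "$1" | awk -F'\t' '
    /^#/ { next }                       # skip header lines
    {
      total++
      if ($7 == "PASS") pass++          # column 7 is FILTER
      if (length($4) == 1 && length($5) == 1) snv++; else other++
    }
    END { printf "total=%d pass=%d snv=%d other=%d\n", total, pass, snv, other }'
}

# Usage:
#   vcf_counts results/tn_demo_somatic/variant_calling/mutect2/TUMOUR_vs_NORMAL/TUMOUR_vs_NORMAL.mutect2.filtered.vcf.gz
```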

Caller overlap (SNVs)

  • Total overlap (all SNVs): 464
  • PASS overlap (consensus subset): 228
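One simple way to compute such an overlap is to reduce each VCF to `CHROM:POS:REF:ALT` keys and intersect the sorted key lists. The sketch below does this with standard tools only; `snv_keys` and `snv_overlap` are hypothetical helpers, "PASS overlap" is interpreted here as PASS in both callsets, and the report in this repository may define or compute the overlap differently.

```shell
#!/usr/bin/env bash
# Sketch: count SNVs shared by two gzipped VCFs, optionally restricted to PASS records.
set -eu

# snv_keys FILE.vcf.gz [PASS] -> sorted unique CHROM:POS:REF:ALT keys for single-base SNVs
snv_keys() {
  gzip -dc "$1" | awk -F'\t' -v p="${2:-}" '
    /^#/ { next }
    length($4) == 1 && length($5) == 1 && (p == "" || $7 == "PASS") {
      print $1 ":" $2 ":" $4 ":" $5
    }' | LC_ALL=C sort -u
}

# snv_overlap A.vcf.gz B.vcf.gz [PASS] -> number of keys present in both files
snv_overlap() {
  local a b n
  a="$(mktemp)"; b="$(mktemp)"
  snv_keys "$1" "${3:-}" > "$a"
  snv_keys "$2" "${3:-}" > "$b"
  n="$(LC_ALL=C comm -12 "$a" "$b" | wc -l | tr -d ' ')"   # lines common to both
  rm -f "$a" "$b"
  echo "$n"
}

# Usage:
#   snv_overlap mutect2.filtered.vcf.gz strelka.somatic_snvs.vcf.gz        # all SNVs
#   snv_overlap mutect2.filtered.vcf.gz strelka.somatic_snvs.vcf.gz PASS   # PASS in both
```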

Key outputs

| Category | Path |
|----------------|------|
| QC report | results/tn_demo_somatic/multiqc/multiqc_report.html |
| Mutect2 VCF | results/tn_demo_somatic/variant_calling/mutect2/TUMOUR_vs_NORMAL/TUMOUR_vs_NORMAL.mutect2.filtered.vcf.gz |
| Strelka SNVs | results/tn_demo_somatic/variant_calling/strelka/TUMOUR_vs_NORMAL/TUMOUR_vs_NORMAL.strelka.somatic_snvs.vcf.gz |
| Strelka indels | results/tn_demo_somatic/variant_calling/strelka/TUMOUR_vs_NORMAL/TUMOUR_vs_NORMAL.strelka.somatic_indels.vcf.gz |

Dependencies

Runtime

  • Java 11+ (for Nextflow)
  • Nextflow
  • Apptainer

Data download

  • SRA Toolkit (prefetch, fasterq-dump)
  • Entrez Direct (esearch, efetch)
  • pigz

Analysis

  • R
  • R Markdown
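A quick pre-flight check for the tools listed above could be sketched as follows. `missing_tools` is a hypothetical helper, and using `Rscript` as the indicator for an R installation is an assumption; the check only confirms that each command is on `PATH`, not that its version is sufficient (for example Java 11+).

```shell
#!/usr/bin/env bash
# Sketch: report which required command-line tools are not on PATH.
set -eu

# missing_tools TOOL... -> prints one line per tool that cannot be found
missing_tools() {
  local t
  for t in "$@"; do
    command -v "$t" >/dev/null 2>&1 || echo "$t"
  done
}

missing="$(missing_tools java nextflow apptainer prefetch fasterq-dump esearch efetch pigz Rscript)"
if [ -n "$missing" ]; then
  echo "$missing" | sed 's/^/missing: /' >&2
fi
```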

Scripts

  • Download_SRA.sh — download SRP162370 metadata + tumour/normal reads, convert to FASTQs, compress
  • run_sarek_test.sh — test profile run (environment validation)
  • 02_make_samplesheets.sh — build Sarek samplesheets from local FASTQs
  • 02_run_sarek_somatic.sh — run somatic calling (Mutect2 + Strelka)
  • 03_tn_demo_somatic_report.Rmd — analysis + interpretation report
