Matched tumour/normal (T/N) somatic variant calling workflow using nf-core/sarek, Nextflow, and Apptainer.
.
├── apptainer_cache/ # Apptainer images/cache
├── bin/ # User-space shims (e.g., singularity -> apptainer)
├── conda_cache/ # Nextflow conda cache (if used)
├── metadata/ # Runinfo + samplesheets (CSVs)
├── nxf_work/ # Nextflow work directory
├── results/ # Pipeline outputs (MultiQC, VCFs, pipeline_info)
├── scripts/ # Execution & analysis scripts
└── README.md
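
The cache and work directories above correspond to standard Nextflow environment variables. The following is a minimal sketch of how they could be wired up, assuming the scripts are launched from the project root. The helper name `setup_nxf_env` is hypothetical, and the source does not confirm that the repo's scripts set these variables this way.

```bash
# Hypothetical helper: point Nextflow at the project-local directories.
# Assumption: the directory names match the layout listed above.
setup_nxf_env() {
  local root="${1:-$PWD}"
  export NXF_SINGULARITY_CACHEDIR="$root/apptainer_cache"  # container image cache
  export NXF_CONDA_CACHEDIR="$root/conda_cache"            # only relevant for conda profiles
  export NXF_WORK="$root/nxf_work"                         # Nextflow work directory
  export PATH="$root/bin:$PATH"                            # user-space shims take precedence
  mkdir -p "$NXF_SINGULARITY_CACHEDIR" "$NXF_CONDA_CACHEDIR" "$NXF_WORK"
}
```

Calling `setup_nxf_env` with no argument uses the current directory as the project root.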
Study: SRP162370
Type: Paired-end FASTQs (matched T/N pair)
| Sample | Accession | Role |
|---|---|---|
| Tumour | SRR8955957 | Matched tumour |
| Normal | SRR8955958 | Matched normal |
Local FASTQs:
data/fastq_raw/
├── SRR8955957_1.fastq.gz # tumour R1
├── SRR8955957_2.fastq.gz # tumour R2
├── SRR8955958_1.fastq.gz # normal R1
└── SRR8955958_2.fastq.gz # normal R2
This repo includes a helper script: scripts/Download_SRA.sh.
It:
- pulls RunInfo metadata
- downloads `.sra` files via `prefetch`
- converts to paired FASTQs via `fasterq-dump`
- compresses with `pigz`
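
The steps above can be sketched roughly as follows. This is not the contents of `scripts/Download_SRA.sh`, only an illustration under stated assumptions: the function names (`run_cmd`, `fetch_runinfo`, `fetch_run`), the `DRY_RUN` switch, and the thread count are hypothetical, and the `.sra` path assumes the `<accession>/<accession>.sra` layout that recent `prefetch` versions produce.

```bash
# Run a command, or only print it when DRY_RUN=1 (hypothetical switch).
run_cmd() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Fetch RunInfo metadata for a study accession (e.g. SRP162370).
fetch_runinfo() {
  local study="$1"
  local out="metadata/${study}_runinfo.csv"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "esearch -db sra -query $study | efetch -format runinfo > $out"
  else
    mkdir -p metadata
    esearch -db sra -query "$study" | efetch -format runinfo > "$out"
  fi
}

# Download one run, convert it to paired FASTQs, then compress them.
fetch_run() {
  local acc="$1"
  local threads="${2:-4}"
  run_cmd prefetch "$acc" --output-directory data/sra_raw
  run_cmd fasterq-dump "data/sra_raw/$acc/$acc.sra" --split-files \
    --outdir data/fastq_raw --threads "$threads"
  run_cmd pigz -p "$threads" "data/fastq_raw/${acc}_1.fastq" "data/fastq_raw/${acc}_2.fastq"
}
```

Running `DRY_RUN=1 fetch_run SRR8955957` prints the three commands without contacting SRA, which is a cheap way to check paths before a long download.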
- Orchestration: Nextflow (local executor)
- Container execution: Apptainer via `-profile singularity` (shimmed if needed)
- Pipeline: `nf-core/sarek`
- Genome: `GATK.GRCh38`
- Callers: Mutect2, Strelka
- QC: MultiQC (FastQC, fastp, MarkDuplicates, etc.)
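
Where the `singularity` command is missing, one possible shim is a symlink in the project's `bin/` directory. This is a sketch under the assumption that a plain symlink is enough on the host. The helper name `make_singularity_shim` is hypothetical, and the source does not say whether the repo's shim is a symlink or a wrapper script.

```bash
# Hypothetical helper: expose `apptainer` under the name `singularity`.
make_singularity_shim() {
  local bin_dir="${1:-bin}"
  local target
  # Resolve the real apptainer binary; fail loudly if it is not installed.
  target="$(command -v apptainer)" || {
    echo "apptainer not found on PATH" >&2
    return 1
  }
  mkdir -p "$bin_dir"
  ln -sf "$target" "$bin_dir/singularity"
}
```

From the project root, `make_singularity_shim bin && export PATH="$PWD/bin:$PATH"` makes the shim visible to Nextflow for the current shell session.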
Key Sarek parameters:
--genome GATK.GRCh38
--tools mutect2,strelka
All scripts are in scripts/.
Start from the project root:
cd /mnt/vol1/WGS_variant_callling
bash scripts/Download_SRA.sh
Outputs:
- `metadata/SRP162370_runinfo.csv`
- `data/sra_raw/`
- `data/fastq_raw/`
Validates Nextflow + container execution end-to-end using a test profile.
bash scripts/run_sarek_test.sh
Outputs:
- `results/sarek_test/`
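
For orientation, a test-profile run of nf-core/sarek typically looks like the command printed below. This is a sketch, not the contents of `scripts/run_sarek_test.sh`: the function names are hypothetical, and the check for a non-empty `pipeline_info` directory is only a crude sign that the run wrote its reports.

```bash
# Print the test-profile command (uses the pipeline's bundled test data).
sarek_test_cmd() {
  echo "nextflow run nf-core/sarek -profile test,singularity --outdir ${1:-results/sarek_test}"
}

# Crude check: succeed if the run left a non-empty pipeline_info directory.
sarek_left_reports() {
  local outdir="${1:-results/sarek_test}"
  [ -d "$outdir/pipeline_info" ] && [ -n "$(ls -A "$outdir/pipeline_info")" ]
}
```

The printed command can be reviewed first and then executed with `sarek_test_cmd | bash`.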
Convert local FASTQs to Sarek CSV format.
bash scripts/02_make_samplesheets.sh
Outputs:
- `metadata/samplesheet_tn_demo.csv`
- `metadata/samplesheet_normal_only.csv`
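
Sarek's FASTQ samplesheet uses the columns `patient,sample,status,lane,fastq_1,fastq_2`, with `status` set to 0 for normal and 1 for tumour. A minimal sketch of the tumour/normal sheet follows. The function name, the `patient1` ID, and the `lane_1` label are placeholders, and the sample names `TUMOUR` and `NORMAL` are inferred from the `TUMOUR_vs_NORMAL` output directory rather than taken from the actual script.

```bash
# Hypothetical helper: write a tumour/normal samplesheet for Sarek.
# $1 = directory holding the FASTQs, $2 = output CSV path.
make_tn_samplesheet() {
  local fastq_dir="$1"
  local out="$2"
  {
    echo "patient,sample,status,lane,fastq_1,fastq_2"
    # status 0 = normal
    echo "patient1,NORMAL,0,lane_1,$fastq_dir/SRR8955958_1.fastq.gz,$fastq_dir/SRR8955958_2.fastq.gz"
    # status 1 = tumour
    echo "patient1,TUMOUR,1,lane_1,$fastq_dir/SRR8955957_1.fastq.gz,$fastq_dir/SRR8955957_2.fastq.gz"
  } > "$out"
}
```

Both rows share one `patient` value, which is how Sarek pairs the tumour with its matched normal.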
Execute somatic calling on the tumour/normal pair.
bash scripts/02_run_sarek_somatic.sh
Outputs:
- `results/tn_demo_somatic/`
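
Combining the key parameters with the samplesheet and output directory, the underlying Nextflow invocation would look roughly like the command printed below. This is a sketch, not the contents of `scripts/02_run_sarek_somatic.sh`: the function name is hypothetical, and the real script may pass further options, such as a pinned pipeline revision via `-r`.

```bash
# Print the somatic-calling command; nothing is executed here.
# $1 = samplesheet (optional), $2 = output directory (optional).
sarek_somatic_cmd() {
  local input="${1:-metadata/samplesheet_tn_demo.csv}"
  local outdir="${2:-results/tn_demo_somatic}"
  echo "nextflow run nf-core/sarek -profile singularity" \
       "--input $input --outdir $outdir" \
       "--genome GATK.GRCh38 --tools mutect2,strelka" \
       "-work-dir nxf_work"
}
```

Printing rather than running keeps the command easy to inspect; `sarek_somatic_cmd | bash` executes it from the project root.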
Note: This dataset is a low-coverage demonstration set (median coverage 0X), so treat the results as workflow validation, not as a full biological interpretation.
| Metric | Normal | Tumour |
|---|---|---|
| Reads | ~3.4M | ~4.0M |
| % mapped | ~99.9% | ~100.0% |
| Duplication (MarkDuplicates) | ~95.4% | ~95.6% |
| Median coverage | 0X | 0X |
| Callset | Total | PASS | SNV | INDEL |
|---|---|---|---|---|
| Mutect2 (filtered VCF) | 1,044 | 444 | 697 | 347 |
| Strelka (somatic SNVs) | 2,160 | 497 | 2,160 | 0 |
| Strelka (somatic indels) | 18 | 1 | 0 | 18 |
- Total overlap (all SNVs): 464
- PASS overlap (consensus subset): 228
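
The source does not state how these overlap counts were computed. One way to derive comparable numbers is sketched below, assuming that "overlap" means identical CHROM, POS, REF, and ALT between the Mutect2 and Strelka SNV records. The function names are hypothetical, and multi-allelic records are skipped for simplicity.

```bash
# Print sorted CHROM:POS:REF:ALT keys for single-base substitutions.
# $1 = VCF (.vcf or .vcf.gz), $2 = "pass" to keep only FILTER == PASS.
snv_keys() {
  local vcf="$1"
  local mode="${2:-all}"
  case "$vcf" in
    *.gz) gzip -cd "$vcf" ;;
    *)    cat "$vcf" ;;
  esac | awk -F'\t' -v mode="$mode" '
    /^#/ { next }                                   # skip header lines
    length($4) == 1 && length($5) == 1 && $5 != "." && $5 != "*" {
      if (mode == "pass" && $7 != "PASS") next      # optional PASS filter
      print $1 ":" $2 ":" $4 ":" $5
    }' | LC_ALL=C sort -u
}

# Count the keys shared by two VCFs.
snv_overlap() {
  local tmp_a tmp_b
  tmp_a="$(mktemp)"
  tmp_b="$(mktemp)"
  snv_keys "$1" "${3:-all}" > "$tmp_a"
  snv_keys "$2" "${3:-all}" > "$tmp_b"
  LC_ALL=C comm -12 "$tmp_a" "$tmp_b" | wc -l | tr -d ' '
  rm -f "$tmp_a" "$tmp_b"
}
```

Passing the Mutect2 filtered VCF and the Strelka somatic SNV VCF gives the all-SNV overlap; adding `pass` as a third argument restricts both sides to PASS records.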
| Category | Path |
|---|---|
| QC report | results/tn_demo_somatic/multiqc/multiqc_report.html |
| Mutect2 VCF | results/tn_demo_somatic/variant_calling/mutect2/TUMOUR_vs_NORMAL/TUMOUR_vs_NORMAL.mutect2.filtered.vcf.gz |
| Strelka SNVs | results/tn_demo_somatic/variant_calling/strelka/TUMOUR_vs_NORMAL/TUMOUR_vs_NORMAL.strelka.somatic_snvs.vcf.gz |
| Strelka indels | results/tn_demo_somatic/variant_calling/strelka/TUMOUR_vs_NORMAL/TUMOUR_vs_NORMAL.strelka.somatic_indels.vcf.gz |
- Java 11+ (for Nextflow)
- Nextflow
- Apptainer
- SRA Toolkit (`prefetch`, `fasterq-dump`)
- Entrez Direct (`esearch`, `efetch`)
- `pigz`
- R
- R Markdown
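
A quick preflight check for the command-line tools listed above could look like the sketch below. The function name `check_tools` is hypothetical, the check only confirms that each command is on `PATH`, and it does not verify versions such as Java 11+.

```bash
# Report which of the given commands are missing from PATH.
# Returns 1 if any are missing, 0 otherwise.
check_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For this project a call might be `check_tools java nextflow apptainer prefetch fasterq-dump esearch efetch pigz Rscript`, where `Rscript` stands in for the R requirement as an assumption.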
- `Download_SRA.sh`: download SRP162370 metadata and tumour/normal reads, convert to FASTQs, compress
- `run_sarek_test.sh`: test profile run (environment validation)
- `02_make_samplesheets.sh`: build Sarek samplesheets from local FASTQs
- `02_run_sarek_somatic.sh`: run somatic calling (Mutect2 + Strelka)
- `03_tn_demo_somatic_report.Rmd`: analysis and interpretation report