This Snakemake pipeline implements the GATK best-practices workflow for calling small genomic variants.
This workflow is adapted from this Snakemake pipeline: https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling
Many updates were made to the pipeline to support calling variants in duplicated regions, such as the NCF1 gene.
- Eric Karlins
You'll first need to create a fasta file in which the duplicated regions that match your region of interest are masked with Ns.
- For NCF1, using the hg38 reference fasta, I created a bed file with two lines to mask the regions of NCF1b and NCF1c. This bed file can be found in resources/NCF1_region_to_mask.bed
- Next I used bedtools maskfasta to create the masked reference file. The command was:
bedtools maskfasta -fi Homo_sapiens_assembly38_plus.fasta -bed NCF1_region_to_mask.bed -fo Homo_sapiens_assembly38_plus_NCF1_mask.fasta
- Create the bwa index files for your new fasta reference:
bwa index Homo_sapiens_assembly38_plus_NCF1_mask.fasta
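Before running the pipeline it can help to sanity-check the masked reference. The following is only a sketch, not part of the pipeline: the function names count_n_bases and check_bwa_index are hypothetical helpers, and the check assumes that bwa index has written its usual .amb, .ann, .bwt, .pac and .sa files next to the fasta. Comparing the N count of the original and the masked fasta should show an increase of at most the total length of the intervals in the bed file.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking a masked reference; not part of the pipeline.

# Print the number of N/n bases in a fasta file (header lines are skipped).
count_n_bases() {
    awk '!/^>/ { total += gsub(/[Nn]/, "") } END { printf "%d\n", total }' "$1"
}

# Return 0 if all five bwa index files exist and are non-empty, 1 otherwise.
check_bwa_index() {
    local fasta="$1" ext missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -s "${fasta}.${ext}" ]; then
            echo "missing index file: ${fasta}.${ext}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, run count_n_bases on Homo_sapiens_assembly38_plus.fasta and on Homo_sapiens_assembly38_plus_NCF1_mask.fasta and compare the two numbers, then run check_bwa_index on the masked fasta.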
Clone this repository to your local system, into the directory where you want to perform the data analysis.
Configure the workflow according to your needs by editing the file config.yaml.
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using $N cores or run it in a cluster environment via
snakemake --use-conda --cluster qsub --jobs 100
or
snakemake --use-conda --drmaa --jobs 100
If you want to pin not only the software stack but also the underlying OS, use
snakemake --use-conda --use-singularity
in combination with any of the modes above. See the Snakemake documentation for further details.
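The dry-run and local execution steps above can be chained so that the workflow only starts if the dry-run succeeds. This is a minimal sketch: run_workflow is a hypothetical wrapper name, it uses only the snakemake options shown above, and it assumes it is called from the directory containing the Snakefile.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: dry-run first, execute locally only if the dry-run passes.
run_workflow() {
    local cores="${1:-1}"   # number of cores, defaults to 1
    if ! snakemake --use-conda -n; then
        echo "Dry-run failed; check config.yaml before executing." >&2
        return 1
    fi
    snakemake --use-conda --cores "$cores"
}
```

Calling run_workflow 8 would then perform the dry-run followed by a local run on 8 cores.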