This workflow takes you from raw paired-end DNA sequencing FASTQ files to a filtered set of variant calls (SNPs and indels). It covers quality control, alignment, BAM processing, variant calling, and filtering.
```bash
# CLI tools
conda install -c bioconda fastp bwa-mem2 samtools bcftools
# For GATK path
conda install -c bioconda gatk4
```

Tell your AI agent what you want to do:
- "Call variants from my whole genome sequencing FASTQ files"
- "Run the FASTQ to variants pipeline on my exome data"
- "I have paired-end DNA reads, help me find variants"
- "I have FASTQ files for 5 samples, call variants jointly"
- "Process my WGS data from raw reads to VCF"
- "Use GATK HaplotypeCaller instead of bcftools"
- "Add BQSR to my variant calling pipeline"
- "Call variants only in the exome target regions"
- "Run the pipeline on a specific chromosome"
- "I already have BAM files, just call variants"
- "My BAMs are not duplicate-marked, process and call variants"
| Input | Format | Description |
|---|---|---|
| FASTQ files | .fastq.gz | Paired-end reads (R1 and R2 per sample) |
| Reference | FASTA | Reference genome (indexed for bwa-mem2) |
| Targets (optional) | BED | For exome/targeted sequencing |
| Known sites (GATK) | VCF | dbSNP for BQSR |
- Quality Control - Trim adapters and low-quality bases (fastp)
- Alignment - Map reads to the reference genome (bwa-mem2)
- BAM Processing - Sort, mark duplicates, index (samtools)
- Variant Calling - Identify SNPs and indels (bcftools or GATK)
- Filtering - Remove low-quality calls
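
The five steps above can be sketched as a single-sample bash script for the bcftools path. This is a minimal sketch under stated assumptions, not a definitive implementation: the file names, the `run` helper, the `DRY_RUN` switch, and the filter thresholds (`QUAL<20`, `INFO/DP<10`) are illustrative choices, not values taken from this workflow. By default it only prints the commands, so it is safe to run without the tools installed.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical inputs; override via environment variables.
SAMPLE="${SAMPLE:-sample1}"
REF="${REF:-ref.fa}"                  # assumed already indexed with: bwa-mem2 index ref.fa
R1="${R1:-${SAMPLE}_R1.fastq.gz}"
R2="${R2:-${SAMPLE}_R2.fastq.gz}"
THREADS="${THREADS:-8}"

# Print each step by default; set DRY_RUN=0 to execute it.
# Commands are passed as strings, so paths must not contain spaces.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$1"
  else
    bash -o pipefail -c "$1"
  fi
}

pipeline() {
  # 1. Quality control: trim adapters and low-quality bases
  run "fastp -i $R1 -I $R2 -o $SAMPLE.trim_R1.fastq.gz -O $SAMPLE.trim_R2.fastq.gz -w $THREADS -j $SAMPLE.fastp.json -h $SAMPLE.fastp.html"

  # 2-3. Align with a read group, add mate tags, sort, mark duplicates, index
  run "bwa-mem2 mem -t $THREADS -R '@RG\tID:$SAMPLE\tSM:$SAMPLE\tPL:ILLUMINA' $REF $SAMPLE.trim_R1.fastq.gz $SAMPLE.trim_R2.fastq.gz | samtools fixmate -m - - | samtools sort -@ $THREADS - | samtools markdup - $SAMPLE.markdup.bam"
  run "samtools index $SAMPLE.markdup.bam"

  # 4. Call SNPs and indels with the multiallelic caller
  run "bcftools mpileup -f $REF -a AD,DP -Ou $SAMPLE.markdup.bam | bcftools call -mv -Oz -o $SAMPLE.raw.vcf.gz"

  # 5. Drop low-quality calls (thresholds are placeholders to tune)
  run "bcftools filter -e 'QUAL<20 || INFO/DP<10' -Oz -o $SAMPLE.filtered.vcf.gz $SAMPLE.raw.vcf.gz"
  run "bcftools index -t $SAMPLE.filtered.vcf.gz"
}

pipeline
```

Review the printed commands first, then rerun with `DRY_RUN=0` to execute them. For exome or targeted data, one option is to pass the BED file to `bcftools mpileup` with `-T targets.bed`.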
| Use bcftools when | Use GATK when |
|---|---|
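| Speed matters more than maximum accuracy | Accuracy matters more than runtime |
| Germline variants only | Germline (HaplotypeCaller) or somatic (Mutect2) variants |
| Small cohort | Large cohort with VQSR |
| Compute and memory are limited | Ample compute and memory are available |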
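
For the GATK column, a germline run with BQSR and joint calling might look like the sketch below. It assumes duplicate-marked, indexed BAMs named `<sample>.markdup.bam` and a reference with `.fai` and `.dict` files; those names, the `run` helper, and the `DRY_RUN` switch are hypothetical. `CombineGVCFs` is used here for simplicity; for large cohorts `GenomicsDBImport` is the usual alternative.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical inputs; override via environment variables.
REF="${REF:-ref.fa}"
KNOWN_SITES="${KNOWN_SITES:-dbsnp.vcf.gz}"   # known sites for BQSR
TARGETS="${TARGETS:-}"                       # optional BED for exome/targeted data

# Print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Per-sample: recalibrate base qualities, then emit a GVCF.
gatk_sample() {
  local s="$1"
  local interval=()
  if [ -n "$TARGETS" ]; then interval=(-L "$TARGETS"); fi
  run gatk BaseRecalibrator -R "$REF" -I "$s.markdup.bam" \
    --known-sites "$KNOWN_SITES" -O "$s.recal.table"
  run gatk ApplyBQSR -R "$REF" -I "$s.markdup.bam" \
    --bqsr-recal-file "$s.recal.table" -O "$s.bqsr.bam"
  # The ${var[@]+...} form keeps empty arrays safe under `set -u` on older bash.
  run gatk HaplotypeCaller -R "$REF" -I "$s.bqsr.bam" -O "$s.g.vcf.gz" \
    -ERC GVCF ${interval[@]+"${interval[@]}"}
}

# Cohort: merge per-sample GVCFs and genotype them jointly.
gatk_joint() {
  if [ "$#" -eq 0 ]; then echo "gatk_joint: need at least one sample" >&2; return 1; fi
  local gvcfs=() s
  for s in "$@"; do gvcfs+=(-V "$s.g.vcf.gz"); done
  run gatk CombineGVCFs -R "$REF" "${gvcfs[@]}" -O cohort.g.vcf.gz
  run gatk GenotypeGVCFs -R "$REF" -V cohort.g.vcf.gz -O cohort.vcf.gz
}

# Sample names come from the command line, e.g.: ./gatk_path.sh s1 s2 s3
for s in "$@"; do gatk_sample "$s"; done
if [ "$#" -gt 0 ]; then gatk_joint "$@"; fi
```

Somatic calling with Mutect2 follows a different set of steps and is not covered by this sketch.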
- Read groups: Always add read group information (at least ID and SM) during alignment; GATK tools require it
- Duplicates: Mark duplicates for PCR-based libraries
- Depth: Check coverage before variant calling (aim for 30x WGS, 100x exome)
- Joint calling: Improves sensitivity, especially for rare variants
- Filtering: Start with default filters, then adjust based on the Ti/Tv ratio (roughly 2.0-2.1 for human WGS, about 3.0 for exome) and overlap with known sites such as dbSNP
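
A few of the checks above can be scripted with small helpers. The function names (`rg_string`, `mean_depth`, `titv_from_stats`) and the read group layout (ID built from sample and lane, LB set to the sample name) are hypothetical conventions chosen for illustration. The last helper assumes the tab-separated `TSTV` line that `bcftools stats` prints, with the overall ts/tv ratio in the fifth column.

```bash
#!/usr/bin/env bash

# Build an @RG line for `bwa-mem2 mem -R`. The literal "\t" sequences
# are converted to tabs by the aligner.
rg_string() {
  local sample="$1" lane="${2:-L001}" platform="${3:-ILLUMINA}"
  printf '@RG\\tID:%s.%s\\tSM:%s\\tLB:%s\\tPL:%s' \
    "$sample" "$lane" "$sample" "$sample" "$platform"
}

# Mean depth from `samtools depth -a` output (chrom, pos, depth) on stdin.
mean_depth() {
  awk '{ sum += $3 } END { if (NR > 0) printf "%.1f\n", sum / NR; else print "NA" }'
}

# Overall Ti/Tv ratio from `bcftools stats` output on stdin.
titv_from_stats() {
  awk -F'\t' '$1 == "TSTV" { print $5 }'
}

# Usage (requires the tools and real files):
#   bwa-mem2 mem -R "$(rg_string sample1)" ref.fa R1.fastq.gz R2.fastq.gz
#   samtools depth -a sample1.markdup.bam | mean_depth
#   samtools depth -a -b targets.bed sample1.markdup.bam | mean_depth   # exome
#   bcftools stats sample1.filtered.vcf.gz | titv_from_stats
```

With `-a`, positions with zero coverage count toward the mean, so for exome data restricting to the target BED gives a more meaningful number than a genome-wide average.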