Skip to content

Commit 3e70fea

Browse files
committed
adding Clair3 and bcftools mpileup calling
1 parent 43178e8 commit 3e70fea

10 files changed

Lines changed: 684 additions & 52 deletions
Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
name: Long Read Mapping Sanity
2+
3+
on:
4+
push:
5+
branches: [dev]
6+
pull_request:
7+
8+
jobs:
9+
sanity:
10+
runs-on: ubuntu-latest
11+
steps:
12+
- uses: actions/checkout@v4
13+
- name: Run long-read sanity checks
14+
run: bash ci/sanity_long_read_mapping.sh

Contributing.md

Lines changed: 19 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
# `sequence_handling` Design Principles
44

5-
This document records the guiding principles for developing and extending the [`sequence_handling`](https://www.google.com/search?q=%5Bhttps://github.com/MorrellLAB/sequence_handling%5D\(https://github.com/MorrellLAB/sequence_handling\)) pipeline. It is intended as an enduring reference for contributors, particularly during the ongoing modernization of the pipeline to support current GATK versions, long-read sequencing technologies (ONT and PacBio HiFi), and updated tooling.
5+
This document records the guiding principles for developing and extending the [`sequence_handling`](https://github.com/MorrellLAB/sequence_handling) pipeline. It is intended as an enduring reference for contributors, particularly during the ongoing modernization of the pipeline to support current GATK versions, long-read sequencing technologies (ONT and PacBio HiFi), and updated tooling.
66

77
These principles apply to all new handlers, to revisions of existing handlers, and to supporting infrastructure such as config files and Slurm job scripts.
88

@@ -43,7 +43,7 @@ Consistent terminology prevents ambiguity across documentation, code, and issues
4343
- **Config files** (e.g., `Config`) define all parameters for a run. A handler should source a single config and rely entirely on the variables defined there.
4444
- **HelperScripts** (in `HelperScripts`) are scripts for handling multiple samples.
4545
- **Sequence_Accessories** (e.g., `PanDepthCoverage.sh`) are tools that may be used occasionally or that supplement `sequence_handling`.
46-
- This separation of concerns -- job logic in Handlers, scheduling logic in SlurmJobScripts, parameters in Config -- should be preserved as the pipeline grows.
46+
- This separation of concerns - job logic in Handlers, scheduling logic in SlurmJobScripts, parameters in Config - should be preserved as the pipeline grows.
4747

4848
When adding support for new sequencing platforms or tools, follow this multi-layer structure. Do not mix scheduling directives into handler logic.
4949

@@ -54,7 +54,7 @@ When adding support for new sequencing platforms or tools, follow this multi-lay
5454
All shell code must be safe, readable, and linter-compliant.
5555

5656
- Use strict shell options at the top of every handler: `set -euo pipefail`. This ensures the script exits on errors, treats unset variables as errors, and catches failures in pipelines.
57-
- All handlers should pass [`shellcheck`](https://www.google.com/search?q=%5Bhttps://www.shellcheck.net/%5D\(https://www.shellcheck.net/\)) without warnings. `shellcheck` compliance is the baseline standard.
57+
- All handlers should pass [`shellcheck`](https://www.shellcheck.net/) without warnings. `shellcheck` compliance is the baseline standard.
5858
- Where applicable, code should also satisfy [DeepSource](https://deepsource.com/) static analysis alerts.
5959
- Variables that come from the config should be validated before use (e.g., check that a file path is non-empty and the file exists before passing it to a tool).
6060
- Avoid `eval`, unquoted variable expansions in paths, and other patterns that are fragile or unsafe in HPC environments.
@@ -102,7 +102,7 @@ Code that runs silently and produces incorrect results is worse than code that f
102102
Handlers should produce output that is useful for both humans and AI-assisted troubleshooting.
103103

104104
- Log the tool version, key parameters, input files, and output files at the start and end of each handler run. This information should appear in both stdout and the Slurm log.
105-
- Capture metrics that are meaningful for QC: read counts, mapping rates, coverage depth, duplication rates, variant counts, and so forth -- whichever are appropriate for the handler's task.
105+
- Capture metrics that are meaningful for QC: read counts, mapping rates, coverage depth, duplication rates, variant counts, and so forth - whichever are appropriate for the handler's task.
106106
- Use structured, grep-friendly log lines where possible (e.g., `[Fastplong] Sample: ${SAMPLE} | Reads before: ${N_BEFORE} | Reads after: ${N_AFTER}`).
107107
- Do not suppress stderr from tools unless you have explicitly handled the error conditions. Suppressed errors are a common source of silent failures in pipelines.
108108
- Metadata captured at runtime (tool versions, parameters, timestamps) should be written to a per-sample or per-run log file that persists after the job completes.
@@ -128,7 +128,7 @@ All active development occurs on the `dev` branch.
128128

129129
- New handlers, refactored handlers, and bug fixes should be developed on `dev` (or on feature branches that merge into `dev`). The `main` branch reflects the last stable release.
130130
- `dev` is the integration target: it should remain functional and pass basic testing at all times, even if individual features are incomplete.
131-
- When a meaningful set of changes has accumulated -- for example, the addition of long-read support, or a GATK version bump -- prepare a new versioned release. Update the changelog, tag the release, and archive via Zenodo so the version is citable.
131+
- When a meaningful set of changes has accumulated - for example, the addition of long-read support, or a GATK version bump - prepare a new versioned release. Update the changelog, tag the release, and archive via Zenodo so the version is citable.
132132
- Pull requests into `dev` should include: updated config stanzas (if new parameters are introduced), handler-level comments explaining non-obvious logic, and a brief description in the PR of what changed and why.
133133
- Breaking changes to the config format or handler interface should be clearly flagged in the changelog and in a migration note for existing users.
134134

@@ -139,7 +139,7 @@ The `dev` branch is where the next version of `sequence_handling` is being built
139139
## 8. Documentation
140140

141141
- Documentation of the workflow is provided in the `README.md` file. More specifics are available in a [wiki](https://github.com/MorrellLAB/sequence_handling/wiki).
142-
- [`Dependencies`](https://www.google.com/search?q=%5Bhttps://github.com/MorrellLAB/sequence_handling/wiki/Dependencies%5D\(https://github.com/MorrellLAB/sequence_handling/wiki/Dependencies\)) are listed on a dedicated page, which should be updated so users can easily identify the tools needed for successful execution.
142+
- [`Dependencies`](https://github.com/MorrellLAB/sequence_handling/wiki/Dependencies) are listed on a dedicated page, which should be updated so users can easily identify the tools needed for successful execution.
143143
- Where necessary, clarifying comments should be included in code, particularly for more complex operations.
144144
* * *
145145

@@ -157,21 +157,32 @@ _Last updated: March 2026. To be revised as the pipeline evolves._
157157

158158
* * *
159159

160-
## Next steps
160+
## Recently implemented
161161

162162
- Update from GATK v4.1 to GATK v4.6.
163163

164164
- Add an accessory script to generate an AllSites VCF for [pixy v2.0](https://github.com/ksamuk/pixy). The specific examples for using bcftools mpileup and GATK are [here](https://pixy.readthedocs.io/en/latest/generating_invar/generating_invar.html).
165165

166-
- Document all changes to the new version in a Release file similar to that for [v3.0.0](https://github.com/MorrellLAB/sequence_handling/releases/tag/v3.0.0). Recent updates to the dev branch include replacing' fastp' and' fastplong' with `fastp` and `fastplong` for quality assessment and adapter trimming. This replaces the full front end of the workflow. We have also added `minimap2` for long-read mapping and updated the `config` to include read-mapping presets for PacBio (HiFi reads) and ONT (Q20 reads).
166+
- Document all changes to the new version in a Release file similar to that for [v3.0.0](https://github.com/MorrellLAB/sequence_handling/releases/tag/v3.0.0). Recent updates to the `dev` branch include replacing `fastp` and `fastplong` for quality assessment and adapter trimming. This replaces the full front end of the workflow. We have also added `minimap2` for long-read mapping and updated the `config` to include read-mapping presets for PacBio (HiFi reads) and ONT (Q20 reads).
167167

168168
- Make sure that the concatenation of gzipped fastq files uses `zcat` rather than `cat`. They don't produce the same results.
169169

170170
- Determine if new GATK indel or SV callers require any changes in our workflow.
171171

172+
172173
- Assessment (March 2026): no immediate mandatory workflow changes for germline SNP/indel calling. The current `Haplotype_Caller -> Genomics_DB_Import -> Genotype_GVCFs` path remains valid under GATK 4.6.
173174
- Indel-specific note: GATK 4.6 includes HaplotypeCaller fixes (including long-deletion edge cases), but does not introduce a replacement germline indel caller that requires handler redesign.
174175
- SV-specific note: GATK 4.6 includes SV tooling improvements (for annotation/concordance), but full production SV calling is still typically handled by dedicated SV workflows/tools (e.g., GATK-SV WDL stack, pbsv, Sniffles2). Integrating a new SV-calling branch in `sequence_handling` is optional future work, not a blocker for this release.
175176
- Recommended follow-up: add a dedicated design note before implementing any optional SV branch (inputs, caller choice, output normalization, and filtering strategy).
176177

178+
## Next steps
179+
180+
- Implement a long-read variant caller, probably [`Clair3`](https://github.com/HKU-BAL/Clair3), that will pick appropriate error models given our various read types already specified in the updated `config` file. The read types we know we need to handle include ONT R9 reads, ONT Q20 reads, and PacBio HiFi. For `Clair3`, it may also be important to track which basecaller was used for ONT reads; that would need to be added to the `config` file. This long-read path should bypass most of the GATK germline workflow and land on a VCF that enters a dedicated downstream filtering step. The recommended joint calling path for Clair3 outputs is [`GLnexus`](https://github.com/dnanexus-rnd/GLnexus) rather than GATK's Genomics_DB_Import/Genotype_GVCFs (since GLnexus is designed for consolidating VCFs from non-GATK callers). If needed, we should also be able to create an "AllSites VCF" similar to that for pixy.
181+
182+
- There is also a need to integrate code for some steps in handling Ultima Genomics UG100 resequencing data. For now, this could probably be exclusively in [sequence_accessories](https://github.com/MorrellLAB/sequence_accessories/tree/master). This involves joint variant calling with [GLnexus](https://github.com/dnanexus-rnd/GLnexus) using [GLnexus.sh](https://github.com/pmorrell/Utilities/blob/030effbd0599dd0a0d823cfa19c3bf90bd5e150c/variant_calling/GLnexus.sh#L15) and filtering of those variants using [UG100_filter.sh](https://github.com/MorrellLAB/sequence_accessories/blob/master/Accessories/UG100_filter.sh).
183+
184+
- Integrate existing code for `bcftools mpileup` calling as an alternative for "SNP-only" calling from short-read sequencing. The [bcftools_mpileup.sh](https://github.com/pmorrell/Utilities/blob/030effbd0599dd0a0d823cfa19c3bf90bd5e150c/bcftools_mpileup.sh) script could be added as an additional branch in the workflow after BAM files are sorted and indexed.
185+
186+
- In `sequence_handling_fastp`, we should probably trim the opening number selection to eliminate " 13 | GBS_Demultiplex (in progress)" and all the Nanopore Workflow options. We are integrating long read protocols into the main workflow as side channels. We can move handlers not being deployed to a "Deprecated" directory.
187+
177188
* * *
Lines changed: 103 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,103 @@
1+
#!/bin/bash
2+
3+
# This script performs SNP-only variant calling using bcftools mpileup
4+
# as an alternative lightweight path after BAM files are sorted and indexed.
5+
# This handler generates only SNP variants (no indels) for short-read data.
6+
7+
set -e
8+
set -o pipefail
9+
10+
# What are the dependencies for Bcftools_Mpileup_Variant_Calling?
11+
declare -a Bcftools_Mpileup_Variant_Calling_Dependencies=(bcftools)
12+
13+
function Bcftools_Mpileup_Variant_Calling() {
14+
local sample_list="$1" # What is our sample list (BAM files)?
15+
local out_dir="$2" # Where are we storing our results?
16+
local ref="$3" # Where is the reference sequence?
17+
local threads="$4" # How many threads can we use?
18+
local tmp="${5:-}" # Temporary directory (optional)
19+
20+
declare -a sample_array=($(grep -E ".bam" "${sample_list}")) # Turn the list into an array
21+
22+
# Create output directories
23+
mkdir -p "${out_dir}/Bcftools_Mpileup_Variant_Calling"
24+
mkdir -p "${out_dir}/Bcftools_Mpileup_Variant_Calling/Intermediates"
25+
26+
# Check if temp variable is provided
27+
if [[ -z "${tmp}" ]]; then
28+
echo "Not using temp directory, proceeding..."
29+
else
30+
echo "Making temp directory: ${tmp}"
31+
mkdir -p "${tmp}"
32+
fi
33+
34+
# Check if reference genome is indexed for bcftools
35+
if ! [[ -f "${ref}.fai" ]]; then
36+
echo "Indexing reference genome for bcftools..."
37+
bcftools faidx "${ref}"
38+
fi
39+
40+
# Verify all BAM files are indexed
41+
for bam in "${sample_array[@]}"; do
42+
if ! [[ -f "${bam}.bai" ]]; then
43+
echo "Indexing BAM file: ${bam}"
44+
samtools index "${bam}"
45+
fi
46+
done
47+
48+
# Perform mpileup variant calling
49+
local mpileup_vcf="${out_dir}/Bcftools_Mpileup_Variant_Calling/Intermediates/mpileup_raw.vcf.gz"
50+
local filtered_vcf="${out_dir}/Bcftools_Mpileup_Variant_Calling/mpileup_snps_only.vcf.gz"
51+
local sample_log="${out_dir}/Bcftools_Mpileup_Variant_Calling/variant_calling.log"
52+
53+
echo "[Bcftools_Mpileup] Starting SNP-only variant calling with bcftools mpileup" | tee -a "${sample_log}"
54+
echo "[Bcftools_Mpileup] Input BAM files:" | tee -a "${sample_log}"
55+
for bam in "${sample_array[@]}"; do
56+
echo "[Bcftools_Mpileup] ${bam}" | tee -a "${sample_log}"
57+
done
58+
echo "[Bcftools_Mpileup] Reference genome: ${ref}" | tee -a "${sample_log}"
59+
60+
# Run bcftools mpileup
61+
if bcftools mpileup \
62+
--fasta-ref "${ref}" \
63+
--output "${mpileup_vcf}" \
64+
--output-type z \
65+
--threads "${threads}" \
66+
--annotate FORMAT/AD,FORMAT/DP \
67+
"${sample_array[@]}" >> "${sample_log}" 2>&1; then
68+
69+
echo "[Bcftools_Mpileup] Successfully completed mpileup" | tee -a "${sample_log}"
70+
71+
# Filter to SNPs only (exclude indels)
72+
echo "[Bcftools_Mpileup] Filtering to SNPs only..." | tee -a "${sample_log}"
73+
74+
if bcftools view \
75+
"${mpileup_vcf}" \
76+
--types snps \
77+
--output "${filtered_vcf}" \
78+
--output-type z >> "${sample_log}" 2>&1; then
79+
80+
echo "[Bcftools_Mpileup] Successfully filtered to SNPs only" | tee -a "${sample_log}"
81+
echo "[Bcftools_Mpileup] Output VCF: ${filtered_vcf}" | tee -a "${sample_log}"
82+
83+
# Index the final VCF
84+
if bcftools index "${filtered_vcf}" >> "${sample_log}" 2>&1; then
85+
echo "[Bcftools_Mpileup] Successfully indexed VCF" | tee -a "${sample_log}"
86+
else
87+
echo "[WARN] Failed to index VCF file" >&2
88+
fi
89+
else
90+
echo "[ERROR] Failed to filter VCF to SNPs only" >&2
91+
echo "[ERROR] Check log file: ${sample_log}" >&2
92+
return 1
93+
fi
94+
else
95+
echo "[ERROR] Bcftools mpileup failed" >&2
96+
echo "[ERROR] Check log file: ${sample_log}" >&2
97+
return 1
98+
fi
99+
100+
echo "[Bcftools_Mpileup] SNP-only variant calling complete"
101+
}
102+
103+
export -f Bcftools_Mpileup_Variant_Calling

0 commit comments

Comments
 (0)