Replace GATK with FreeBayes for variant calling; upgrade Java tools to eliminate CVEs#1053
Replace GATK with FreeBayes for variant calling; upgrade Java tools to eliminate CVEs#1053
Conversation
- picard=2.25.6 → picard>=3.1.1 (eliminates log4j-core 2.5 Log4Shell + 13H) - gatk=3.8 → gatk4>=4.5.0.0 (eliminates log4j 1.x) - fgbio>=2.2.1 → fgbio>=2.3.0 (eliminates commons-io 2.7 CVE) - snpeff=5.1 → snpeff=5.2 (eliminates jackson-databind, gson, commons-io CVEs) Conda solver resolves cleanly on both amd64 and arm64. Python code changes (GATK3→4 API migration) will follow in a separate commit.
- Add CVSS v4.0 vector parsing (get_v4_vector, has_v4_field helpers) - Update all existing rules (Sections 1-5) with v4 equivalents - Add Section 6: filter availability-only impact with scope unchanged (v3: C:N/I:N/S:U; v4: VC:N/VI:N/SC:N/SI:N/SA:N). In ephemeral batch containers, DoS = one failed job, equivalent to corrupted input. - Add CVE-2020-25649 to .trivyignore (jackson-databind XXE in snpEff JAR — DOMDeserializer code path never traversed by snpEff) Filters 18 of 26 remaining Java fat JAR HIGHs via Rego. The remaining 8 are server-only components (zookeeper, jetty, spark, beanutils) or have C/I impact requiring individual triage.
…rely FreeBayes is a Bayesian genotyper (like GATK3 UnifiedGenotyper) that produces standard VCF with AD/DP — the existing vcfrow_parse_and_call_snps() parser works unchanged. --report-monomorphic is a direct equivalent of EMIT_ALL_SITES, and --pooled-continuous replaces the ploidy=4 hack. This eliminates GATK4 and its 14 HIGH CVEs, removes ~400MB from images, and drops the Java dependency for variant calling (FreeBayes is C++). Changes: - Create src/viral_ngs/assemble/freebayes.py (FreeBayes tool wrapper) - Delete src/viral_ngs/core/gatk.py (GATK wrapper) - Remove gatk4 from docker/requirements/core.txt - Add freebayes>=1.3.6 to docker/requirements/assemble.txt - assembly.py: replace GATK calls with FreeBayes, remove IndelRealigner - read_utils.py: remove gatk_ug/gatk_realign commands, remove GATK realign from align_and_fix, remove --skipRealign/--GATK_PATH args - Update tests to remove --skipRealign references Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ions Update minimum versions in core.txt and assemble.txt to reflect what conda/bioconda actually resolves today, making our expectations explicit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
FreeBayes always writes plain VCF regardless of output filename, but downstream VcfReader (pysam.TabixFile) requires bgzipped + tabix-indexed input. Write to temp .vcf, then compress/index with pysam. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Harmonize mafft (>=7.525) and mummer4 (>=4.0.1) floors between assemble.txt and phylo.txt. Bump classify tool floors to match what bioconda actually resolves. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- bbmap, mvicuna, gap2seq, sequip, snpeff: exact pin → >= floor - muscle, novoalign: exact pin → >= floor with major version cap (<4) - illumina-interop: kept pinned at 1.5.0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
samtools>=1.23.1 and bcftools>=1.23.1 force htslib>=1.23.1, which conflicts with freebayes ARM64 conda package (needs htslib 1.19-1.21). Revert to the floors that succeeded on prior builds; the solver still installs the latest compatible versions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
blast>=2.17.0 does not exist on ARM64. Revert to the floors from the last successful build (blast>=2.15.0 and other classify tool floors). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use already_realigned_bam with minimap2 alignment instead of novoalign, making tests runnable on ARM64. Remove @unittest.skipIf(IS_ARM) guards. Expected output files are placeholders (copied from old GATK3+novoalign versions) — will be updated with actual FreeBayes+minimap2 output after running in Docker. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…hen not needed Generate new expected FASTA files from FreeBayes+minimap2 pipeline: - refine1: 6bp insertion at pos 1539 (82% of reads support it, QUAL=300) - refine2: 49bp longer consensus (minimap2 maps more reads to ends) Make novoalign instantiation conditional on already_realigned_bam so refine_assembly works on ARM64 when using pre-aligned BAMs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1053 +/- ##
==========================================
- Coverage 58.45% 58.25% -0.20%
==========================================
Files 66 66
Lines 14564 14504 -60
Branches 2687 2685 -2
==========================================
- Hits 8514 8450 -64
- Misses 5336 5337 +1
- Partials 714 717 +3
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR migrates the refine_assembly consensus-refinement variant calling from GATK3 UnifiedGenotyper to FreeBayes, removes GATK-specific code/CLI hooks, upgrades bioinformatics tool dependencies to address CVEs, and introduces a CVSS v4.0-aware Trivy ignore policy to reduce false positives.
Changes:
- Replace GATK UG calls in
refine_assembly()with a new FreeBayes wrapper and remove GATK CLI commands/flags fromread_utils. - Update integration/unit tests and expected FASTA fixtures to use minimap2-based alignment and FreeBayes outputs (including enabling ARM64 test execution).
- Upgrade conda tool dependencies (Picard/fgbio/SnpEff/etc.) and add Trivy ignore policy rules (including CVSS v4.0 vector support) plus a documented
.trivyignoreexception.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/viral_ngs/assembly.py |
Switch refine_assembly() variant calling from GATK to FreeBayes; adjust indexing and docs accordingly. |
src/viral_ngs/assemble/freebayes.py |
New FreeBayes tool wrapper with bgzip + tabix indexing support. |
src/viral_ngs/assemble/__init__.py |
Export freebayes module from the assemble package. |
src/viral_ngs/read_utils.py |
Remove GATK CLI commands and remove --skipRealign / --GATK_PATH from align_and_fix. |
src/viral_ngs/core/__init__.py |
Stop importing gatk from viral_ngs.core. |
src/viral_ngs/core/gatk.py |
Delete the GATK tool wrapper module. |
tests/unit/core/test_read_utils.py |
Remove --skipRealign tests and update align_and_fix invocations accordingly. |
tests/unit/assemble/test_assembly_integration.py |
Replace novoalign usage with minimap2 alignment + already_realigned_bam; update expected outputs to FreeBayes fixtures. |
tests/unit/assemble/test_assembly.py |
Update refine_assembly edge-case tests to avoid novoalign and use minimap2 / already_realigned_bam. |
tests/input/TestRefineAssembly/expected.ebov.refine1.freebayes.fasta |
New expected refine1 output fixture for FreeBayes pipeline. |
tests/input/TestRefineAssembly/expected.ebov.refine2.freebayes.fasta |
New expected refine2 output fixture for FreeBayes pipeline. |
docker/requirements/core.txt |
Remove GATK and upgrade core bioinformatics dependencies (Picard/fgbio/minimap2/etc.). |
docker/requirements/assemble.txt |
Add freebayes>=1.3.6 and upgrade some assemble-related tools. |
docker/requirements/core-x86.txt |
Relax pins for x86-only tools (novoalign/mvicuna). |
docker/requirements/assemble-x86.txt |
Relax pin for gap2seq. |
docker/requirements/phylo.txt |
Upgrade phylo-tool dependencies (bamtools/mafft/mummer4/muscle/snpeff). |
.trivyignore |
Add documented ignore entry for CVE-2020-25649 in snpEff fat JAR. |
.trivy-ignore-policy.rego |
Add CVSS v4.0 vector support and new ignore rules (incl. availability-only, scope-unchanged). |
Comments suppressed due to low confidence (2)
src/viral_ngs/assembly.py:764
tmpVcfis always created with a.vcf.gzsuffix and later copied verbatim tooutVcfregardless of theoutVcffilename/extension. If a caller passesoutVcfending in.vcf(not gz), they’ll still get gzipped content, which is surprising and can break downstream tools. Consider creatingtmpVcfwith the same compression asoutVcf(or write directly tooutVcf) and only generate/copy a.tbiwhen the destination is bgzipped.
# Modify original assembly with VCF calls from FreeBayes
tmpVcf = viral_ngs.core.file.mkstempfname('.vcf.gz')
tmpFasta = viral_ngs.core.file.mkstempfname('.fasta')
fb.call(realignBam, deambigFasta, tmpVcf)
if already_realigned_bam is None:
os.unlink(realignBam)
os.unlink(deambigFasta)
name_opts = []
if chr_names:
name_opts = ['--name'] + chr_names
main_vcf_to_fasta(
parser_vcf_to_fasta(argparse.ArgumentParser(
)).parse_args([
tmpVcf,
tmpFasta,
'--trim_ends',
'--min_coverage',
str(min_coverage),
'--major_cutoff',
str(major_cutoff)
] + name_opts)
)
if outVcf:
shutil.copyfile(tmpVcf, outVcf)
if outVcf.endswith('.gz'):
shutil.copyfile(tmpVcf + '.tbi', outVcf + '.tbi')
src/viral_ngs/assembly.py:710
- The inline comment still says "deambiguated genome for GATK" even though this function now calls FreeBayes and GATK has been removed. Updating this comment will help avoid confusion for future maintenance.
# Sanitize fasta header & create deambiguated genome for GATK
deambigFasta = viral_ngs.core.file.mkstempfname('.deambig.fasta')
with viral_ngs.core.file.fastas_with_sanitized_ids(inFasta, use_tmp=True) as sanitized_fastas:
deambig_fasta(sanitized_fastas[0], deambigFasta)
- Rename '# ========= GATK ==========' section header to 'align_and_fix' - Remove unused IS_ARM import from test_assembly_integration.py - Clean up .tbi temp file after FreeBayes VCF processing - Fix stale 'deambiguated genome for GATK' comment Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Regression testing planTwo Terra submissions are running
Both workspaces have recent What we will compare
Expected resultsBased on unit/integration test analysis, we expect:
Results will be posted here once the submissions complete (~1 day). |
All 9 measles-usa Terra failures were caused by the WDL still passing --skipRealign to align_and_fix. Accept the argument but ignore it since GATK realignment has been removed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Regression Testing Report: GATK → FreeBayes Assembly ComparisonCompared all assembly outputs between Measles (128 samples, 95 assembly comparisons)
Divergent assemblies (identity < 99.9%) — 9 assemblies
Respiratory (176 samples, 141 assembly comparisons)
Divergent assemblies (identity < 99.9%) — 59 assemblies
Key Takeaways
🤖 Generated with Claude Code |
VADR Annotation Quality: Old (GATK) vs New (FreeBayes) AssembliesTo assess whether the indel differences between old and new assemblies are biologically meaningful, we ran NCBI VADR on both versions of every assembly that had internal indel differences (excluding adenovirus and rhinovirus, which lack VADR models). VADR detects annotation errors including frameshifts, early stop codons, and other structural problems — fewer alerts generally indicates a more biologically plausible assembly. 15 assemblies × 2 versions = 30 VADR runs, executed via dsub on Google Batch using the Results
Summary
Alert Detail AnalysisWe examined the full VADR alert content ( Cases where GATK introduced frameshifts that FreeBayes avoids
In each case, GATK called a spurious 1bp indel in a coding region → frameshift → premature stop codon → cascading peptide translation failures across all downstream mature peptides. FreeBayes avoids these frameshifts entirely. The one case where FreeBayes is worseVGG_24727 (Measles, +5 alerts): Both old and new have the same hemagglutinin problem (stop codon at pos ~8884). But FreeBayes introduced a new frameshift in the large polymerase (1bp insert at 15261) plus a Cases where the difference is not frameshift-relatedMGH-23152 (SARS-CoV-2, +1 alert): The new assembly's only alert is VGG_24791 (Measles, +1 alert): Both versions share the same large polymerase frameshift (7bp deletion). The new assembly adds a ConclusionThe VADR analysis confirms that the indel differences between GATK and FreeBayes are predominantly frameshift-inducing errors in GATK assemblies that FreeBayes avoids. FreeBayes's haplotype-based reassembly approach produces more biologically plausible consensus sequences, particularly for coding regions in low-to-moderate coverage samples. 🤖 Generated with Claude Code |
Summary
This PR replaces the GATK variant caller with FreeBayes in the
refine_assemblypipeline, removes GATK from the codebase entirely, upgrades Java bioinformatics tools, and adds a CVSS v4.0-aware Trivy vulnerability policy. Together these changes eliminate 5 CRITICAL and 33 HIGH CVEs.1. GATK → FreeBayes migration
Variant caller evaluation
We evaluated four options for replacing GATK3 UnifiedGenotyper:
-out_mode EMIT_ALL_SITESEMIT_ALL_CONFIDENT_SITESmay omit sites)--report-monomorphic(exact equivalent)<NON_REF>(BP_RESOLUTION)--ploidy N--ploidy N--pooled-continuousFreeBayes is the closest algorithmic match to UG — both are Bayesian pileup-based genotypers. Its
--report-monomorphicflag is a direct equivalent ofEMIT_ALL_SITES, emitting every position in standard VCF with per-base DP. The existingvcfrow_parse_and_call_snps()parser works unchanged. HaplotypeCaller uses local reassembly (a bigger behavioral change) and its GVCF output with<NON_REF>alleles would have required parser changes. Mutect2 is a somatic caller and doesn't support all-sites output.Code changes
src/viral_ngs/assemble/freebayes.py— FreeBayes tool wrapper with bgzip+tabix support (FreeBayes writes plain VCF, so we compress and index after calling)src/viral_ngs/core/gatk.pygatk4from conda deps; addedfreebayes>=1.3.6to assemble imagealign_and_fix()andrefine_assembly()gatk_ug,gatk_realign--skipRealign,--GATK_PATHBehavioral differences
Validated by running integration tests in Docker, aligning old vs new consensus with mafft, and examining read pileups with samtools:
refine1 — 6bp insertion at position 1539:
FreeBayes correctly calls a 6bp insertion (
ACGACT) that GATK3 UG missed. Pileup shows 9/11 reads (82%) support the insertion (QUAL=300, AF=1.0, genotype 1/1). This is an improvement — the insertion is clearly real.refine2 — 49bp longer consensus (27bp at 5' + 22bp at 3'):
Root cause is the test aligner change (novoalign → minimap2): minimap2 maps more reads to the terminal regions than novoalign, so coverage at the ends meets the
min_coverage=3threshold andstrip('Nn')trimming has fewer terminal N's to remove. One interior A→N at a position right at the DP=3 boundary.Breaking changes
gatk_ugandgatk_realignCLI commands removed--skipRealignand--GATK_PATHarguments removed fromalign_and_fixandrefine_assemblyviral_ngs.core.gatkmodule deletedrefine_assembly()no longer acceptsgatk_pathparameter2. Java tool upgrades
3. Trivy vulnerability policy (CVSS v4.0 support)
Added
.trivy-ignore-policy.regowith rules that filter out CVEs that are architecturally inapplicable to our deployment model (ephemeral batch containers on cloud PaaS, no inbound listeners, no interactive sessions). Each rule targets a specific CVSS vector pattern and documents its rationale:Each rule supports both CVSS v3.1 and v4.0 vector string formats. The v4.0 support is needed because Trivy is transitioning to v4.0 for newer advisories and some CVEs now only have v4.0 vectors.
Rules explicitly do NOT filter: scope-changed CVEs (container escapes), any CVE with confidentiality or integrity impact, or network-accessible vulnerabilities beyond the DoS-only case.
Per-CVE exceptions (
.trivyignore)Two individual CVEs are excluded in
.trivyignorebecause they cannot be addressed by version bumps or by the Rego policy rules above:CVE-2026-23949 — jaraco.context path traversal in tarball extraction (HIGH)
setuptools(vendorsjaraco.context 5.3.0internally)tarball()context manager) is only used duringpip installof source distributions, which happens at build time only from trusted sources (PyPI, conda-forge). No pip installs happen at runtime.CVE-2020-25649 — jackson-databind XXE in DOMDeserializer (HIGH)
jackson-databind 2.10.5bundled insidesnpEff-5.2fat JARDOMDeserializerwithout disabling external entity resolution.4. Test changes
refine_assemblyintegration tests from novoalign to minimap2 (viaalready_realigned_bampath)@unittest.skipIf(IS_ARM)guards — integration tests now run on ARM64CVE impact
The remaining 1 HIGH is CVE-2026-27459 (pyOpenSSL DTLS) — unrelated to this branch. FreeBayes introduced zero CVEs.
Test plan
assemble_refbasedon known-good samples (in progress)🤖 Generated with Claude Code