Skip to content

Replace GATK with FreeBayes for variant calling; upgrade Java tools to eliminate CVEs#1053

Merged
dpark01 merged 16 commits intomainfrom
upgrade-java-tools
Mar 21, 2026
Merged

Replace GATK with FreeBayes for variant calling; upgrade Java tools to eliminate CVEs#1053
dpark01 merged 16 commits intomainfrom
upgrade-java-tools

Conversation

@dpark01
Copy link
Copy Markdown
Member

@dpark01 dpark01 commented Mar 20, 2026

Summary

This PR replaces the GATK variant caller with FreeBayes in the refine_assembly pipeline, removes GATK from the codebase entirely, upgrades Java bioinformatics tools, and adds a CVSS v4.0-aware Trivy vulnerability policy. Together these changes eliminate 5 CRITICAL and 33 HIGH CVEs.

1. GATK → FreeBayes migration

Variant caller evaluation

We evaluated four options for replacing GATK3 UnifiedGenotyper:

UnifiedGenotyper (old) HaplotypeCaller Mutect2 FreeBayes
Algorithm Bayesian genotyper Local reassembly + Bayesian Somatic caller Bayesian genotyper
EMIT_ALL_SITES -out_mode EMIT_ALL_SITES Unclear (EMIT_ALL_CONFIDENT_SITES may omit sites) Not supported --report-monomorphic (exact equivalent)
Output format Standard VCF (AD, DP) GVCF with <NON_REF> (BP_RESOLUTION) Variants-only Standard VCF (AD, DP)
Ploidy --ploidy N --ploidy N No fixed ploidy --pooled-continuous
Java fat JAR CVEs N/A (old) 14 HIGH 14 HIGH None (C++ binary)
Image size impact N/A +400MB +400MB ~2MB

FreeBayes is the closest algorithmic match to UG — both are Bayesian pileup-based genotypers. Its --report-monomorphic flag is a direct equivalent of EMIT_ALL_SITES, emitting every position in standard VCF with per-base DP. The existing vcfrow_parse_and_call_snps() parser works unchanged. HaplotypeCaller uses local reassembly (a bigger behavioral change) and its GVCF output with <NON_REF> alleles would have required parser changes. Mutect2 is a somatic caller and doesn't support all-sites output.

Code changes

  • Created src/viral_ngs/assemble/freebayes.py — FreeBayes tool wrapper with bgzip+tabix support (FreeBayes writes plain VCF, so we compress and index after calling)
  • Deleted src/viral_ngs/core/gatk.py
  • Removed gatk4 from conda deps; added freebayes>=1.3.6 to assemble image
  • Removed IndelRealigner step from align_and_fix() and refine_assembly()
  • Removed CLI commands: gatk_ug, gatk_realign
  • Removed CLI args: --skipRealign, --GATK_PATH

Behavioral differences

Validated by running integration tests in Docker, aligning old vs new consensus with mafft, and examining read pileups with samtools:

refine1 — 6bp insertion at position 1539:
FreeBayes correctly calls a 6bp insertion (ACGACT) that GATK3 UG missed. Pileup shows 9/11 reads (82%) support the insertion (QUAL=300, AF=1.0, genotype 1/1). This is an improvement — the insertion is clearly real.

refine2 — 49bp longer consensus (27bp at 5' + 22bp at 3'):
Root cause is the test aligner change (novoalign → minimap2): minimap2 maps more reads to the terminal regions than novoalign, so coverage at the ends meets the min_coverage=3 threshold and strip('Nn') trimming has fewer terminal N's to remove. One interior A→N at a position right at the DP=3 boundary.

Breaking changes

  • gatk_ug and gatk_realign CLI commands removed
  • --skipRealign and --GATK_PATH arguments removed from align_and_fix and refine_assembly
  • viral_ngs.core.gatk module deleted
  • refine_assembly() no longer accepts gatk_path parameter

2. Java tool upgrades

  • Picard → 3.4.0 (eliminates log4j CRITICAL CVEs)
  • fgbio → 3.1.2
  • SnpEff → 5.2 (OpenJDK 17)

3. Trivy vulnerability policy (CVSS v4.0 support)

Added .trivy-ignore-policy.rego with rules that filter out CVEs that are architecturally inapplicable to our deployment model (ephemeral batch containers on cloud PaaS, no inbound listeners, no interactive sessions). Each rule targets a specific CVSS vector pattern and documents its rationale:

Rule CVSS pattern Rationale
Physical access AV:P Containers are cloud-hosted, never physically accessible
Adjacent network AV:A Containers run on orchestrated cloud infra, attacker cannot be on same network segment
Local + user interaction AV:L + UI:R Batch containers have no interactive sessions — no human clicking links or opening files
Local + high privileges AV:L + PR:H Containers run as non-root with dropped capabilities
Local + scope unchanged AV:L + S:U Attacker already has code execution inside an ephemeral container; vulnerability grants no additional capability
Availability-only + scope unchanged C:N + I:N + S:U DoS in a batch container = one failed job, container destroyed, no data leak or corruption. Operationally equivalent to a corrupted input file

Each rule supports both CVSS v3.1 and v4.0 vector string formats. The v4.0 support is needed because Trivy is transitioning to v4.0 for newer advisories and some CVEs now only have v4.0 vectors.

Rules explicitly do NOT filter: scope-changed CVEs (container escapes), any CVE with confidentiality or integrity impact, or network-accessible vulnerabilities beyond the DoS-only case.

Per-CVE exceptions (.trivyignore)

Two individual CVEs are excluded in .trivyignore because they cannot be addressed by version bumps or by the Rego policy rules above:

CVE-2026-23949 — jaraco.context path traversal in tarball extraction (HIGH)

  • Package: setuptools (vendors jaraco.context 5.3.0 internally)
  • The vulnerable code path (tarball() context manager) is only used during pip install of source distributions, which happens at build time only from trusted sources (PyPI, conda-forge). No pip installs happen at runtime.
  • Even if triggered, the path traversal writes files inside an ephemeral container that is destroyed when the job completes — no persistence, no lateral movement.
  • NVD scored this AV:N/S:C, but this assumes a network-facing service extracting untrusted archives. In our deployment, there is no such service. The Rego policy cannot filter it because the NVD score uses AV:N rather than the more appropriate AV:L.
  • Will self-resolve when setuptools updates its vendored jaraco.context to >= 6.1.0.

CVE-2020-25649 — jackson-databind XXE in DOMDeserializer (HIGH)

  • Package: jackson-databind 2.10.5 bundled inside snpEff-5.2 fat JAR
  • The vulnerability requires jackson-databind to deserialize XML input via DOMDeserializer without disabling external entity resolution.
  • SnpEff uses jackson-databind for JSON parsing only (config metadata, database indices). Its inputs are VCF, GenBank, and properties files — it never parses XML through Jackson. The vulnerable code path is present in the JAR but never traversed.
  • Cannot be fixed without an upstream snpEff release that updates its bundled jackson-databind.

4. Test changes

  • Switched refine_assembly integration tests from novoalign to minimap2 (via already_realigned_bam path)
  • Removed @unittest.skipIf(IS_ARM) guards — integration tests now run on ARM64
  • New expected FASTA files generated from FreeBayes+minimap2 pipeline

CVE impact

Metric main This branch Change
CRITICAL 5 0 -5 eliminated
HIGH 34 1 -33 (97% reduction)

The remaining 1 HIGH is CVE-2026-27459 (pyOpenSSL DTLS) — unrelated to this branch. FreeBayes introduced zero CVEs.

Test plan

  • CI unit tests pass (all images, amd64+arm64)
  • CI Trivy security scans pass
  • Local Docker integration tests with mafft alignment and pileup analysis
  • Terra regression testing — assemble_refbased on known-good samples (in progress)

🤖 Generated with Claude Code

dpark01 and others added 14 commits March 19, 2026 23:09
- picard=2.25.6 → picard>=3.1.1 (eliminates log4j-core 2.5 Log4Shell + 13H)
- gatk=3.8 → gatk4>=4.5.0.0 (eliminates log4j 1.x)
- fgbio>=2.2.1 → fgbio>=2.3.0 (eliminates commons-io 2.7 CVE)
- snpeff=5.1 → snpeff=5.2 (eliminates jackson-databind, gson, commons-io CVEs)

Conda solver resolves cleanly on both amd64 and arm64.
Python code changes (GATK3→4 API migration) will follow in a separate commit.
- Add CVSS v4.0 vector parsing (get_v4_vector, has_v4_field helpers)
- Update all existing rules (Sections 1-5) with v4 equivalents
- Add Section 6: filter availability-only impact with scope unchanged
  (v3: C:N/I:N/S:U; v4: VC:N/VI:N/SC:N/SI:N/SA:N). In ephemeral batch
  containers, DoS = one failed job, equivalent to corrupted input.
- Add CVE-2020-25649 to .trivyignore (jackson-databind XXE in snpEff
  JAR — DOMDeserializer code path never traversed by snpEff)

Filters 18 of 26 remaining Java fat JAR HIGHs via Rego. The remaining
8 are server-only components (zookeeper, jetty, spark, beanutils) or
have C/I impact requiring individual triage.
…rely

FreeBayes is a Bayesian genotyper (like GATK3 UnifiedGenotyper) that
produces standard VCF with AD/DP — the existing vcfrow_parse_and_call_snps()
parser works unchanged. --report-monomorphic is a direct equivalent of
EMIT_ALL_SITES, and --pooled-continuous replaces the ploidy=4 hack.

This eliminates GATK4 and its 14 HIGH CVEs, removes ~400MB from images,
and drops the Java dependency for variant calling (FreeBayes is C++).

Changes:
- Create src/viral_ngs/assemble/freebayes.py (FreeBayes tool wrapper)
- Delete src/viral_ngs/core/gatk.py (GATK wrapper)
- Remove gatk4 from docker/requirements/core.txt
- Add freebayes>=1.3.6 to docker/requirements/assemble.txt
- assembly.py: replace GATK calls with FreeBayes, remove IndelRealigner
- read_utils.py: remove gatk_ug/gatk_realign commands, remove GATK
  realign from align_and_fix, remove --skipRealign/--GATK_PATH args
- Update tests to remove --skipRealign references

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ions

Update minimum versions in core.txt and assemble.txt to reflect what
conda/bioconda actually resolves today, making our expectations explicit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
FreeBayes always writes plain VCF regardless of output filename, but
downstream VcfReader (pysam.TabixFile) requires bgzipped + tabix-indexed
input. Write to temp .vcf, then compress/index with pysam.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Harmonize mafft (>=7.525) and mummer4 (>=4.0.1) floors between
assemble.txt and phylo.txt. Bump classify tool floors to match
what bioconda actually resolves.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- bbmap, mvicuna, gap2seq, sequip, snpeff: exact pin → >= floor
- muscle, novoalign: exact pin → >= floor with major version cap (<4)
- illumina-interop: kept pinned at 1.5.0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
samtools>=1.23.1 and bcftools>=1.23.1 force htslib>=1.23.1, which
conflicts with freebayes ARM64 conda package (needs htslib 1.19-1.21).
Revert to the floors that succeeded on prior builds; the solver still
installs the latest compatible versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
blast>=2.17.0 does not exist on ARM64. Revert to the floors from the
last successful build (blast>=2.15.0 and other classify tool floors).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use already_realigned_bam with minimap2 alignment instead of novoalign,
making tests runnable on ARM64. Remove @unittest.skipIf(IS_ARM) guards.

Expected output files are placeholders (copied from old GATK3+novoalign
versions) — will be updated with actual FreeBayes+minimap2 output after
running in Docker.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…hen not needed

Generate new expected FASTA files from FreeBayes+minimap2 pipeline:
- refine1: 6bp insertion at pos 1539 (82% of reads support it, QUAL=300)
- refine2: 49bp longer consensus (minimap2 maps more reads to ends)

Make novoalign instantiation conditional on already_realigned_bam so
refine_assembly works on ARM64 when using pre-aligned BAMs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 79.06977% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.25%. Comparing base (4761631) to head (cafa55b).
⚠️ Report is 17 commits behind head on main.

Files with missing lines Patch % Lines
src/viral_ngs/assembly.py 50.00% 3 Missing and 3 partials ⚠️
src/viral_ngs/assemble/freebayes.py 88.46% 1 Missing and 2 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1053      +/-   ##
==========================================
- Coverage   58.45%   58.25%   -0.20%     
==========================================
  Files          66       66              
  Lines       14564    14504      -60     
  Branches     2687     2685       -2     
==========================================
- Hits         8514     8450      -64     
- Misses       5336     5337       +1     
- Partials      714      717       +3     
Flag Coverage Δ
assemble 63.13% <76.92%> (-0.58%) ⬇️
classify 63.87% <ø> (ø)
core 41.65% <25.00%> (-0.32%) ⬇️
phylo 43.70% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/viral_ngs/assemble/__init__.py 100.00% <100.00%> (ø)
src/viral_ngs/core/__init__.py 80.46% <ø> (-0.16%) ⬇️
src/viral_ngs/read_utils.py 73.94% <100.00%> (-0.81%) ⬇️
src/viral_ngs/assemble/freebayes.py 88.46% <88.46%> (ø)
src/viral_ngs/assembly.py 71.29% <50.00%> (-2.11%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dpark01 dpark01 marked this pull request as ready for review March 20, 2026 16:44
Copilot AI review requested due to automatic review settings March 20, 2026 16:44
@dpark01 dpark01 marked this pull request as draft March 20, 2026 16:50
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR migrates the refine_assembly consensus-refinement variant calling from GATK3 UnifiedGenotyper to FreeBayes, removes GATK-specific code/CLI hooks, upgrades bioinformatics tool dependencies to address CVEs, and introduces a CVSS v4.0-aware Trivy ignore policy to reduce false positives.

Changes:

  • Replace GATK UG calls in refine_assembly() with a new FreeBayes wrapper and remove GATK CLI commands/flags from read_utils.
  • Update integration/unit tests and expected FASTA fixtures to use minimap2-based alignment and FreeBayes outputs (including enabling ARM64 test execution).
  • Upgrade conda tool dependencies (Picard/fgbio/SnpEff/etc.) and add Trivy ignore policy rules (including CVSS v4.0 vector support) plus a documented .trivyignore exception.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/viral_ngs/assembly.py Switch refine_assembly() variant calling from GATK to FreeBayes; adjust indexing and docs accordingly.
src/viral_ngs/assemble/freebayes.py New FreeBayes tool wrapper with bgzip + tabix indexing support.
src/viral_ngs/assemble/__init__.py Export freebayes module from the assemble package.
src/viral_ngs/read_utils.py Remove GATK CLI commands and remove --skipRealign / --GATK_PATH from align_and_fix.
src/viral_ngs/core/__init__.py Stop importing gatk from viral_ngs.core.
src/viral_ngs/core/gatk.py Delete the GATK tool wrapper module.
tests/unit/core/test_read_utils.py Remove --skipRealign tests and update align_and_fix invocations accordingly.
tests/unit/assemble/test_assembly_integration.py Replace novoalign usage with minimap2 alignment + already_realigned_bam; update expected outputs to FreeBayes fixtures.
tests/unit/assemble/test_assembly.py Update refine_assembly edge-case tests to avoid novoalign and use minimap2 / already_realigned_bam.
tests/input/TestRefineAssembly/expected.ebov.refine1.freebayes.fasta New expected refine1 output fixture for FreeBayes pipeline.
tests/input/TestRefineAssembly/expected.ebov.refine2.freebayes.fasta New expected refine2 output fixture for FreeBayes pipeline.
docker/requirements/core.txt Remove GATK and upgrade core bioinformatics dependencies (Picard/fgbio/minimap2/etc.).
docker/requirements/assemble.txt Add freebayes>=1.3.6 and upgrade some assemble-related tools.
docker/requirements/core-x86.txt Relax pins for x86-only tools (novoalign/mvicuna).
docker/requirements/assemble-x86.txt Relax pin for gap2seq.
docker/requirements/phylo.txt Upgrade phylo-tool dependencies (bamtools/mafft/mummer4/muscle/snpeff).
.trivyignore Add documented ignore entry for CVE-2020-25649 in snpEff fat JAR.
.trivy-ignore-policy.rego Add CVSS v4.0 vector support and new ignore rules (incl. availability-only, scope-unchanged).
Comments suppressed due to low confidence (2)

src/viral_ngs/assembly.py:764

  • tmpVcf is always created with a .vcf.gz suffix and later copied verbatim to outVcf regardless of the outVcf filename/extension. If a caller passes outVcf ending in .vcf (not gz), they’ll still get gzipped content, which is surprising and can break downstream tools. Consider creating tmpVcf with the same compression as outVcf (or write directly to outVcf) and only generate/copy a .tbi when the destination is bgzipped.
    # Modify original assembly with VCF calls from FreeBayes
    tmpVcf = viral_ngs.core.file.mkstempfname('.vcf.gz')
    tmpFasta = viral_ngs.core.file.mkstempfname('.fasta')
    fb.call(realignBam, deambigFasta, tmpVcf)
    if already_realigned_bam is None:
        os.unlink(realignBam)
    os.unlink(deambigFasta)
    name_opts = []
    if chr_names:
        name_opts = ['--name'] + chr_names
    main_vcf_to_fasta(
        parser_vcf_to_fasta(argparse.ArgumentParser(
        )).parse_args([
            tmpVcf,
            tmpFasta,
            '--trim_ends',
            '--min_coverage',
            str(min_coverage),
            '--major_cutoff',
            str(major_cutoff)
        ] + name_opts)
    )
    if outVcf:
        shutil.copyfile(tmpVcf, outVcf)
        if outVcf.endswith('.gz'):
            shutil.copyfile(tmpVcf + '.tbi', outVcf + '.tbi')

src/viral_ngs/assembly.py:710

  • The inline comment still says "deambiguated genome for GATK" even though this function now calls FreeBayes and GATK has been removed. Updating this comment will help avoid confusion for future maintenance.
    # Sanitize fasta header & create deambiguated genome for GATK
    deambigFasta = viral_ngs.core.file.mkstempfname('.deambig.fasta')
    with viral_ngs.core.file.fastas_with_sanitized_ids(inFasta, use_tmp=True) as sanitized_fastas:
        deambig_fasta(sanitized_fastas[0], deambigFasta)

- Rename '# ========= GATK ==========' section header to 'align_and_fix'
- Remove unused IS_ARM import from test_assembly_integration.py
- Clean up .tbi temp file after FreeBayes VCF processing
- Fix stale 'deambiguated genome for GATK' comment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dpark01
Copy link
Copy Markdown
Member Author

dpark01 commented Mar 20, 2026

Regression testing plan

Two Terra submissions are running assemble_metagenomic with the upgrade-java-tools branch images on real production datasets:

Workspace Pathogen Samples Submission
broad-viral-surveillance/measles-usa Measles 178 07bd40fc-7d5b-4ae4-bd35-31fc8e0b3639
broad-viral-surveillance/Broad-Viral-Respiratory Respiratory viruses 192 f2974f32

Both workspaces have recent main-branch runs (within the past two months) on the same datasets, providing a direct baseline for comparison. Old runs: measles (4946925c-9ce2-4c16-9ef8-8e8b582dea7d), respiratory (fb58338a-cd32-456f-a298-f35bc48f58f6).

What we will compare

  1. Genome completeness -- Per-sample assembly_length_unambiguous, n_actg, n_ambiguous, n_missing, mean_coverage. Flag any samples with >1% completeness drop.

  2. Runtimes -- Per-sample wall clock time comparison. Expect similar or faster (no IndelRealigner step).

  3. Assembly FASTA pairwise diffs -- Download all assembly_fasta outputs from old vs new runs. For each sample, align old vs new consensus with mafft and compute: sequence identity, length difference, SNP count, indel count. Summarize with a histogram of identity scores.

  4. Spot-check divergent samples -- Any samples with >0.1% sequence divergence will be investigated with samtools pileup to determine which caller produced the more correct consensus.

Expected results

Based on unit/integration test analysis, we expect:

  • Most samples identical or near-identical
  • Minor differences at coverage boundaries (FreeBayes may extend consensus slightly at terminal regions)
  • Occasional indel differences where FreeBayes correctly calls variants that GATK3 UG missed (as seen in the refine1 unit test: 6bp insertion with 82% read support)
  • No regressions in genome completeness

Results will be posted here once the submissions complete (~1 day).

All 9 measles-usa Terra failures were caused by the WDL still passing
--skipRealign to align_and_fix. Accept the argument but ignore it since
GATK realignment has been removed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dpark01
Copy link
Copy Markdown
Member Author

dpark01 commented Mar 21, 2026

Regression Testing Report: GATK → FreeBayes Assembly Comparison

Compared all assembly outputs between main (GATK3 UnifiedGenotyper) and upgrade-java-tools (FreeBayes) across two Terra workspaces. Per-segment alignment analysis ensures multi-segment genomes (e.g. influenza) are handled correctly — terminal coverage effects at segment boundaries are not misclassified as internal indels.


Measles (128 samples, 95 assembly comparisons)

Category Count %
Identical (100%) 53 55.8%
Near-identical (99.9-100%) 33 34.7%
Minor differences (99-99.9%) 7 7.4%
Significant differences (<99%) 2 2.1%
Metric Assemblies affected Events Bases
SNPs (A/C/G/T ↔ A/C/G/T) 1 10 10
Internal indels 9 11 92
Ambiguity diffs (N ↔ A/C/G/T) 42 779 779
Terminal extensions (old only) 2 3 36
Terminal extensions (new only) 2 0 0
Metric Median delta Mean delta Min Max
% Ref Covered 0.0000 -0.0002 -0.0078 0.0099
Mean Coverage 0.0000 -0.0002 -0.0762 0.1033
Unambig Length 0.0000 -3.71 -123 157
Divergent assemblies (identity < 99.9%) — 9 assemblies
Assembly ID Tax Name Identity SNPs Indel events (bp) Ambig Diffs Term Ext Old events (bp) Term Ext New events (bp)
VGG_24790-11234 Measles morbillivirus D8 98.737% 10 0 (0) 189 0 (0) 0 (0)
VGG_24766-11234 Measles morbillivirus 98.851% 0 0 (0) 171 0 (0) 0 (0)
VGG_24730-11234 Measles morbillivirus D8 99.573% 0 0 (0) 64 0 (0) 0 (0)
VGG_24687-11234 Measles morbillivirus D8 99.577% 0 0 (0) 63 0 (0) 0 (0)
VGG_24802-119210 Influenza A virus (A/Darwin/6/2021(H3N2)) 99.694% 0 0 (0) 35 2 (35) 0 (0)
VGG_24813-11234 Measles morbillivirus D8 99.744% 0 1 (1) 40 0 (0) 0 (0)
VGG_24689-11234 Measles morbillivirus D8 99.869% 0 0 (0) 21 0 (0) 0 (0)
VGG_24785-11234 Measles morbillivirus D8 99.873% 0 0 (0) 20 0 (0) 0 (0)
VGG_24744-11234 Measles morbillivirus D8 99.878% 0 0 (0) 19 0 (0) 0 (0)

Respiratory (176 samples, 141 assembly comparisons)

Category Count %
Identical (100%) 47 33.3%
Near-identical (99.9-100%) 35 24.8%
Minor differences (99-99.9%) 43 30.5%
Significant differences (<99%) 16 11.3%
Metric Assemblies affected Events Bases
SNPs (A/C/G/T ↔ A/C/G/T) 6 27 27
Internal indels 8 13 49
Ambiguity diffs (N ↔ A/C/G/T) 94 6943 6943
Terminal extensions (old only) 65 278 2505
Terminal extensions (new only) 65 7 3125
Metric Median delta Mean delta Min Max
% Ref Covered -0.0014 -0.0035 -0.0241 0.0011
Mean Coverage 0.0000 0.62 -0.62 10.97
Unambig Length -21.0 -60.70 -436 18
Divergent assemblies (identity < 99.9%) — 59 assemblies
Assembly ID Tax Name Identity SNPs Indel events (bp) Ambig Diffs Term Ext Old events (bp) Term Ext New events (bp)
USA-MA-Broad_MGH-23042-2025-641809 Influenza A virus (A/California/07/2009(H1N1)) 97.060% 0 0 (0) 311 0 (0) 0 (0)
USA-MA-Broad_MGH-23096-2025-119210 Influenza A virus (A/Darwin/6/2021(H3N2)) 97.471% 0 0 (0) 232 2 (140) 1 (1)
USA-MA-Broad_MGH-23170-2025-604436 Influenza B virus (B/Brisbane/60/2008) 97.901% 5 0 (0) 258 6 (143) 0 (0)
USA-MA-Broad_MGH-23142-2025-641809 Influenza A virus (A/California/07/2009(H1N1)) 98.017% 0 0 (0) 170 5 (134) 0 (0)
USA-MA-Broad_MGH-23134-2025-2697049 Severe acute respiratory syndrome coronavirus 2 98.452% 0 0 (0) 446 0 (0) 0 (0)
USA-MA-Broad_MGH-23011-2024-119210 Influenza A virus (A/Darwin/6/2021(H3N2)) 98.456% 0 0 (0) 118 6 (194) 0 (0)
USA-MA-Broad_MGH-23075-2025-208893 Human respiratory syncytial virus A 98.551% 0 0 (0) 212 0 (0) 0 (0)
USA-MA-Broad_MGH-23048-2025-11224 Human parainfluenza virus 4a 98.650% 0 0 (0) 222 0 (0) 0 (0)
USA-MA-Broad_MGH-23148-2025-2697049 Severe acute respiratory syndrome coronavirus 2 98.653% 0 0 (0) 397 1 (1) 0 (0)
USA-MA-Broad_MGH-23068-2025-463676 Rhinovirus C 98.759% 0 0 (0) 84 0 (0) 0 (0)
USA-MA-Broad_MGH-23047-2025-1219407 Rhinovirus C42 98.764% 0 0 (0) 80 0 (0) 0 (0)
USA-MA-Broad_MGH-23054-2025-208895 Human respiratory syncytial virus B 98.772% 0 0 (0) 164 0 (0) 0 (0)
USA-MA-Broad_MGH-23105-2025-2697049 Severe acute respiratory syndrome coronavirus 2 98.848% 1 3 (5) 338 1 (2) 0 (0)
USA-MA-Broad_MGH-23058-2025-208893 Human respiratory syncytial virus A 98.908% 0 0 (0) 154 1 (43) 0 (0)
USA-MA-Broad_MGH-23084-2025-11137 Human coronavirus 229E 98.965% 0 0 (0) 271 0 (0) 0 (0)
USA-MA-Broad_MGH-23003-2024-290028 Human coronavirus HKU1 98.992% 1 2 (4) 298 0 (0) 0 (0)
USA-MA-Broad_MGH-23129-2025-12730 Human respirovirus 1 99.036% 0 0 (0) 142 0 (0) 0 (0)
USA-MA-Broad_MGH-23090-2025-10533 Human adenovirus 1 99.088% 0 1 (2) 223 0 (0) 1 (3097)
USA-MA-Broad_MGH-23083-2025-2697049 Severe acute respiratory syndrome coronavirus 2 99.121% 0 0 (0) 259 1 (6) 0 (0)
USA-MA-Broad_MGH-23131-2025-208893 Human respiratory syncytial virus A 99.162% 0 0 (0) 116 1 (1) 0 (0)
USA-MA-Broad_MGH-23029-2025-31631 Human coronavirus OC43 99.218% 1 0 (0) 232 0 (0) 0 (0)
USA-MA-Broad_MGH-23149-2025-31631 Human coronavirus OC43 99.264% 0 2 (2) 224 0 (0) 0 (0)
USA-MA-Broad_MGH-23092-2025-2560525 Human orthorubulavirus 2 99.284% 0 0 (0) 111 0 (0) 0 (0)
USA-MA-Broad_MGH-23063-2025-10533 Human adenovirus 1 99.286% 0 0 (0) 107 0 (0) 0 (0)
USA-MA-Broad_MGH-23094-2025-11137 Human coronavirus 229E 99.309% 0 2 (5) 186 0 (0) 0 (0)
USA-MA-Broad_MGH-23054-2025-604436 Influenza B virus (B/Brisbane/60/2008) 99.313% 0 0 (0) 87 2 (34) 0 (0)
USA-MA-Broad_MGH-23102-2025-277944 Human coronavirus NL63 99.348% 0 0 (0) 160 0 (0) 0 (0)
USA-MA-Broad_MGH-23050-2025-641809 Influenza A virus (A/California/07/2009(H1N1)) 99.431% 0 0 (0) 68 3 (13) 0 (0)
USA-MA-Broad_MGH-23050-2025-11226 Human parainfluenza virus 4b 99.458% 0 0 (0) 77 0 (0) 0 (0)
USA-MA-Broad_MGH-23014-2025-208895 Human respiratory syncytial virus B 99.495% 0 0 (0) 76 0 (0) 1 (1)
USA-MA-Broad_MGH-23045-2025-11226 Human parainfluenza virus 4b 99.571% 0 0 (0) 74 1 (1) 0 (0)
USA-MA-Broad_MGH-23030-2025-604436 Influenza B virus (B/Brisbane/60/2008) 99.607% 0 0 (0) 51 3 (237) 0 (0)
USA-MA-Broad_MGH-23064-2025-277944 Human coronavirus NL63 99.636% 0 0 (0) 100 0 (0) 0 (0)
USA-MA-Broad_MGH-23107-2025-31631 Human coronavirus OC43 99.636% 0 0 (0) 111 0 (0) 0 (0)
USA-MA-Broad_MGH-23006-2024-208895 Human respiratory syncytial virus B 99.641% 0 0 (0) 54 1 (10) 0 (0)
USA-MA-Broad_MGH-23087-2025-208893 Human respiratory syncytial virus A 99.734% 0 0 (0) 40 0 (0) 0 (0)
USA-MA-Broad_MGH-23106-2025-208893 Human respiratory syncytial virus A 99.753% 0 0 (0) 37 0 (0) 0 (0)
USA-MA-Broad_MGH-23150-2025-147711 Rhinovirus A 99.755% 0 0 (0) 17 0 (0) 0 (0)
USA-MA-Broad_MGH-23159-2025-147685 Rhinovirus A78 99.761% 0 0 (0) 14 0 (0) 0 (0)
USA-MA-Broad_MGH-23137-2025-12730 Human respirovirus 1 99.774% 0 0 (0) 34 0 (0) 0 (0)
USA-MA-Broad_MGH-23152-2025-2697049 Severe acute respiratory syndrome coronavirus 2 99.780% 0 1 (26) 65 0 (0) 0 (0)
USA-MA-Broad_MGH-23117-2025-604436 Influenza B virus (B/Brisbane/60/2008) 99.785% 0 0 (0) 27 1 (1) 0 (0)
USA-MA-Broad_MGH-23113-2025-641809 Influenza A virus (A/California/07/2009(H1N1)) 99.802% 0 0 (0) 24 2 (20) 0 (0)
USA-MA-Broad_MGH-23144-2025-162145 Human metapneumovirus A 99.826% 18 0 (0) 5 0 (0) 0 (0)
USA-MA-Broad_MGH-23004-2024-1219407 Rhinovirus C42 99.841% 0 0 (0) 11 0 (0) 0 (0)
USA-MA-Broad_MGH-23005-2024-147711 Rhinovirus A 99.842% 0 0 (0) 11 0 (0) 0 (0)
USA-MA-Broad_MGH-23091-2025-1219377 Rhinovirus C3 99.853% 0 0 (0) 10 0 (0) 1 (22)
USA-MA-Broad_MGH-23026-2025-277944 Human coronavirus NL63 99.854% 0 0 (0) 40 0 (0) 0 (0)
USA-MA-Broad_MGH-23157-2025-119210 Influenza A virus (A/Darwin/6/2021(H3N2)) 99.855% 0 0 (0) 18 4 (136) 0 (0)
USA-MA-Broad_MGH-23125-2025-2697049 Severe acute respiratory syndrome coronavirus 2 99.859% 0 0 (0) 42 1 (1) 0 (0)
USA-MA-Broad_MGH-23138-2025-31631 Human coronavirus OC43 99.866% 0 0 (0) 41 1 (2) 0 (0)
USA-MA-Broad_MGH-23012-2025-641809 Influenza A virus (A/California/07/2009(H1N1)) 99.876% 0 0 (0) 16 7 (125) 0 (0)
USA-MA-Broad_MGH-23061-2025-277944 Human coronavirus NL63 99.876% 0 0 (0) 34 0 (0) 0 (0)
USA-MA-Broad_MGH-23176-2025-604436 Influenza B virus (B/Brisbane/60/2008) 99.880% 0 0 (0) 17 4 (32) 0 (0)
USA-MA-Broad_MGH-23166-2025-11137 Human coronavirus 229E 99.890% 0 1 (4) 30 0 (0) 0 (0)
USA-MA-Broad_MGH-23009-2024-604436 Influenza B virus (B/Brisbane/60/2008) 99.892% 0 0 (0) 15 5 (42) 1 (1)
USA-MA-Broad_MGH-23046-2025-208895 Human respiratory syncytial virus B 99.894% 0 0 (0) 16 0 (0) 0 (0)
USA-MA-Broad_MGH-23140-2025-2560525 Human orthorubulavirus 2 99.896% 0 0 (0) 16 0 (0) 0 (0)
USA-MA-Broad_MGH-23073-2025-463676 Rhinovirus C 99.900% 0 0 (0) 7 0 (0) 0 (0)

Key Takeaways

  1. Zero assembly count mismatches — both code paths produce the same set of assemblies for every sample in both datasets (304/304 samples).

  2. Differences are dominated by ambiguity resolution (N ↔ ACGT), not true variant calling differences. These reflect coverage depth differences at positions near the min_coverage threshold.

  3. True SNPs are rare: 10 SNPs in 1 measles assembly (VGG_24790, a D8 measles), 27 SNPs in 6 respiratory assemblies. The 18-SNP metapneumovirus (MGH-23144) is the only notable outlier.

  4. Internal indels are minimal: 11 events (92 bp) in measles, 13 events (49 bp) in respiratory. These are small, scattered events — no systematic bias.

  5. Terminal extensions are the main structural difference in flu samples. The old (GATK) code tends to extend further into low-coverage segment termini — these are positions at 1-4x coverage where the old code called bases and the new code masks them as N. BAM pileup validation (MGH-23096) confirmed these positions have borderline coverage where neither answer is clearly "correct."

  6. Coverage and % reference covered are essentially unchanged (median deltas ≈ 0). The slight negative trend in unambiguous length (median -21 bp in respiratory) reflects the more conservative masking at segment termini.


🤖 Generated with Claude Code

@dpark01 dpark01 marked this pull request as ready for review March 21, 2026 04:38
@dpark01
Copy link
Copy Markdown
Member Author

dpark01 commented Mar 21, 2026

VADR Annotation Quality: Old (GATK) vs New (FreeBayes) Assemblies

To assess whether the indel differences between old and new assemblies are biologically meaningful, we ran NCBI VADR on both versions of every assembly that had internal indel differences (excluding adenovirus and rhinovirus, which lack VADR models). VADR detects annotation errors including frameshifts, early stop codons, and other structural problems — fewer alerts generally indicates a more biologically plausible assembly.

15 assemblies × 2 versions = 30 VADR runs, executed via dsub on Google Batch using the staphb/vadr:1.6.4 image with organism-specific VADR models from viral-references.

Results

Assembly ID Tax Name Identity Indel events (bp) Ambig Diffs Mean Cov VADR alerts (old) VADR alerts (new) Delta
MGH-23094-11137 Human coronavirus 229E 99.309% 2 (5) 186 9.5 4 4 0
MGH-23166-11137 Human coronavirus 229E 99.890% 1 (4) 30 35.8 1 1 0
MGH-23003-290028 Human coronavirus HKU1 98.992% 2 (4) 298 3.5 25 18 -7
MGH-23008-277944 Human coronavirus NL63 99.993% 1 (1) 1 212.2 0 0 0
MGH-23149-31631 Human coronavirus OC43 99.264% 2 (2) 224 4.7 32 1 -31
VGG_24647-11234 Measles morbillivirus D8 99.916% 1 (1) 13 5.9 7 7 0
VGG_24673-11234 Measles morbillivirus D8 100.000% 1 (1) 0 1633.7 0 0 0
VGG_24696-11234 Measles morbillivirus D8 99.987% 1 (1) 2 9.4 4 0 -4
VGG_24720-11234 Measles morbillivirus D8 99.934% 2 (2) 10 9.4 8 0 -8
VGG_24727-11234 Measles morbillivirus D8 99.943% 2 (2) 9 32.8 4 9 +5
VGG_24734-11234 Measles morbillivirus D8 99.994% 1 (1) 1 36.8 4 0 -4
VGG_24791-11234 Measles morbillivirus D8 99.929% 1 (82) 11 3.5 4 5 +1
VGG_24813-11234 Measles morbillivirus D8 99.744% 1 (1) 40 8.9 12 12 0
MGH-23105-2697049 SARS-CoV-2 98.848% 3 (5) 338 7.1 28 2 -26
MGH-23152-2697049 SARS-CoV-2 99.780% 1 (26) 65 39.6 0 1 +1

Summary

  • FreeBayes (new) produces fewer VADR alerts in 7 of 15 assemblies, often dramatically:
    • HCoV-OC43 (MGH-23149): 32 → 1 alerts (-31)
    • SARS-CoV-2 (MGH-23105): 28 → 2 alerts (-26)
    • Measles (VGG_24720): 8 → 0 alerts (-8)
    • HCoV-HKU1 (MGH-23003): 25 → 18 alerts (-7)
  • Old code (GATK) wins in only 2 of 15 cases, both marginal (+5 and +1 alerts)
  • 6 of 15 are tied (identical alert counts)

Alert Detail Analysis

We examined the full VADR alert content (.alt.list files) for every assembly where the alert count changed. In every case where the count changed significantly, the difference is driven by frameshifts in coding regions — not by ambiguity (N) differences.

Cases where GATK introduced frameshifts that FreeBayes avoids

Assembly Gene(s) affected GATK indel causing frameshift Cascade
VGG_24696 (Measles) large polymerase 1bp insertion at pos 9561 shifts frame for remaining 6,110 bp FRAMESHIFT → STOP_CODON → UNEXPECTED_LENGTH
VGG_24720 (Measles) phosphoprotein + large polymerase 1bp deletion at 2386; 1bp insertion at 9928 Frameshifts in two genes, 8 total alerts → 0
VGG_24734 (Measles) large polymerase 1bp insertion at pos 9520 FRAMESHIFT → STOP_CODON → UNEXPECTED_LENGTH
MGH-23149 (OC43) ORF1ab + spike + envelope 1bp inserts at pos 527 and 3387; 1bp insert at 27613; 11bp insert at 28179 Frameshifts in 3 genes, 32 alerts → 1
MGH-23105 (SARS-CoV-2) ORF1ab 1bp inserts at pos 8144 and 14717 Frameshifts cascade to 13 PEPTIDE_TRANSLATION_PROBLEMs; 28 → 2 alerts
MGH-23003 (HKU1) ORF1ab 1bp insert at pos 6319 Frameshift cascades to 13 PEPTIDE_TRANSLATION_PROBLEMs; 25 → 18 alerts

In each case, GATK called a spurious 1bp indel in a coding region → frameshift → premature stop codon → cascading peptide translation failures across all downstream mature peptides. FreeBayes avoids these frameshifts entirely.

The one case where FreeBayes is worse

VGG_24727 (Measles, +5 alerts): Both old and new have the same hemagglutinin problem (stop codon at pos ~8884). But FreeBayes introduced a new frameshift in the large polymerase (1bp insert at 15261) plus a DUPLICATE_REGIONS alert. This is the single case where FreeBayes introduced a coding error that GATK avoided.

Cases where the difference is not frameshift-related

MGH-23152 (SARS-CoV-2, +1 alert): The new assembly's only alert is INDEFINITE_ANNOTATION_START on the surface glycoprotein — an ambiguity/coverage effect where the protein alignment can't extend to the 5' end. The 26bp indel between old and new did not cause a frameshift (it's likely in a non-coding region or is a multiple of 3).

VGG_24791 (Measles, +1 alert): Both versions share the same large polymerase frameshift (7bp deletion). The new assembly adds a LOW_SIMILARITY alert for a 39bp region, likely related to the 82bp indel. Net effect is neutral — the underlying problem exists in both.

Conclusion

The VADR analysis confirms that the indel differences between GATK and FreeBayes are predominantly frameshift-inducing errors in GATK assemblies that FreeBayes avoids. FreeBayes's haplotype-based reassembly approach produces more biologically plausible consensus sequences, particularly for coding regions in low-to-moderate coverage samples.

🤖 Generated with Claude Code

@dpark01 dpark01 merged commit 3c1e6d6 into main Mar 21, 2026
85 checks passed
@dpark01 dpark01 deleted the upgrade-java-tools branch March 21, 2026 11:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants