Releases: broadinstitute/gatk
4.0.3.0
This release brings a major update to our experimental neural-network-based VariantRecalibrator replacement, initial MAF support in Funcotator, as well as some updates to Mutect2 and the CNV tools.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
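For example (the 4.0.3.0 tag is assumed to match this release's published image tag; this is a sketch, not part of the release notes):

```bash
# Pull the release image from Docker Hub (tag assumed to match this release)
docker pull broadinstitute/gatk:4.0.3.0

# Start an interactive container with the GATK tools available
docker run -it broadinstitute/gatk:4.0.3.0
```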
Summary of changes in this release:
- A major update to our experimental neural-network-based suite of variant scoring tools, which will eventually replace the VariantRecalibrator (#4245)
  - The NeuralNetInferenceTool has been renamed to CNNScoreVariants
  - Baseline models are now included in the distribution.
  - Added additional tools to write tensors and to train your own models given a VCF of validated calls, an unfiltered VCF and a confident region: CNNVariantTrain, CNNVariantWriteTensors and FilterVariantTranches
  - Read-level 2D models are now supported via the tensor-type read_tensor argument. 2D models at present are significantly slower than the 1D models. (See the first sketch following this change list.)
- Funcotator:
  - Added prototype support for outputting MAF files (and many bug fixes) (#4472)
- Mutect2:
- CNV tools:
  - Replaced CollectFragmentCounts with CollectReadCounts. (#4564)
  - Allowed use of zero eigensamples in DenoiseReadCounts. (#4411)
  - Changed filtering of normal hets on overlap with copy-ratio intervals in ModelSegments to be consistent with filtering of case hets. (#4510)
  - Updated PostprocessGermlineCNVCalls (segments VCF writing, WDL scripts, unit tests, integration tests) (#4396)
- Miscellaneous changes:
  - Concordance: added option to analyze contributions of different filters (#4520)
  - Exposed the -pairHMM/--pair-hmm-implementation argument in HaplotypeCaller, which was previously hidden (#4494) (see the second sketch following this list)
  - Set the default samjdk.compression_level to 2 (was previously 1) (#4547)
  - Upgraded to Spark 2.2.0 (#4314)
  - Changed Spark sharding of queryname-sorted bams to better handle secondary and supplementary reads (#4473)
  - Added logging output to the bam writing step for spark tools (#4501)
  - git-lfs is now required to compile the GATK
  - Added a registry for deprecated/unported tools. (#4505)
  - Updated the Hadoop GCS connector from 1.6.1 to 1.6.3. (#4590)
  - Added a large runtime resource directory to git-lfs, and exposed it to the Docker build. (#4530)
  - We now include full tool documentation in the GATK binary distribution zip (#4377)
  - Made our maven artifacts much smaller by preventing gradle uploadArchives from including distZip and distTar (#4569)
  - Added chr20 and chr21 alt contigs to the GRCh38 reference snippet used for testing (#4548)
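As a rough usage sketch of the renamed scoring tool: only the CNNScoreVariants name and the tensor-type read_tensor argument come from these notes; the -R/-V/-I/-O arguments and file names are assumptions based on general GATK4 conventions.

```bash
# 1D (annotation-based) scoring of an input VCF with a bundled baseline model
# (sketch; file names are placeholders)
gatk CNNScoreVariants \
    -R reference.fasta \
    -V input.vcf.gz \
    -O scored_1d.vcf.gz

# Read-level 2D scoring: also supply the aligned reads and request read tensors;
# this is significantly slower than the 1D model
gatk CNNScoreVariants \
    -R reference.fasta \
    -V input.vcf.gz \
    -I sample.bam \
    --tensor-type read_tensor \
    -O scored_2d.vcf.gz
```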
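And a sketch of the newly exposed pair-HMM argument together with the compression-level default mentioned above; the AVX_LOGLESS_CACHING value, the --java-options wrapper flag, and the remaining arguments are assumptions, not taken from these notes.

```bash
# Select a pair-HMM implementation explicitly in HaplotypeCaller
# (the value shown here is an assumption)
gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample.bam \
    -O output.vcf.gz \
    --pair-hmm-implementation AVX_LOGLESS_CACHING

# Override the new samjdk.compression_level default (2) via a JVM system property
gatk --java-options "-Dsamjdk.compression_level=5" HaplotypeCaller \
    -R reference.fasta -I sample.bam -O output.vcf.gz
```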
4.0.2.1
This is a small bug fix release containing fixes for the following issues:
- HaplotypeCaller: fix the -contamination/-contamination-file arguments, which were not working properly, and add tests (#4455)
- Fixes/improvements to the GATK configuration file mechanism (#4445)
  - If a Java system property is specified explicitly on the user's command line, allow it to override the corresponding value in the GATK config file
  - Bundle an example GATK configuration file with the GATK binary distribution. This config file can be edited and passed to the GATK via the --gatk-config-file argument. (A usage sketch follows this change list.)
  - There are still some configuration-related TODOs/known issues: in particular, the gatk front-end script currently bakes in some system properties internally, which will always override the corresponding values in the config file. We plan to patch the gatk script to no longer set these system properties internally, and delegate to the config file instead.
- Mutect2: minor bug fixes and improvements (#4466)
  - Fix "FilterMutectCalls trips on non-int value in MFRL tag" (#4363)
  - Fix ordering of allele trimming vs. variant annotation (#4402)
  - Fix "CalculateContamination gives >100% results" (#3889)
  - Disable the MateOnSameContigOrNoMappedMateReadFilter by default (#3514)
  - Make mapping quality threshold in GetPileupSummaries modifiable (#4011)
- SV Tools: Add a scan for intervals of high depth, and exclude reads from those regions from SV evidence (#4438)
- In the GATK docker image, run the GATK using the fully-packaged binary distribution jars, rather than the unpackaged jars (#4476). This fixes a number of minor issues reported by users of the docker image.
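As a rough sketch of the configuration-file mechanism described above: only --gatk-config-file and the command-line-override behavior come from these notes; the file names, the tool shown, and its -R/-I/-O arguments are assumptions.

```bash
# Pass an edited copy of the bundled example config file to a tool
# (file names are placeholders)
gatk HaplotypeCaller \
    --gatk-config-file my-gatk.properties \
    -R reference.fasta -I sample.bam -O output.vcf.gz

# A Java system property given explicitly on the command line overrides the
# corresponding value from the config file
gatk --java-options "-Dsamjdk.compression_level=1" HaplotypeCaller \
    --gatk-config-file my-gatk.properties \
    -R reference.fasta -I sample.bam -O output.vcf.gz
```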
4.0.2.0
This is a small release that includes a new beta tool, a port of VariantAnnotator from GATK3, as well as some bug fixes and other improvements. Mutect2 is no longer beta.
- Mutect2 and FilterMutectCalls are now no longer beta! (#4384)
- New tool VariantAnnotator (#3803):
  - ported tool from GATK3
  - first beta release
- Spark Improvements:
- new CNV Tumor only WDL (#4414)
- Viterbi segmentation and segment quality calculation for gcnvkernel (#4335)
- Other Bug Fixes and Improvements:
  - update to latest GKL, improves performance of GZIP level 2 compression (#4379)
  - CalculateGenotypePosteriors: fixed bug that caused duplicates in the output VCF as well as several other issues (#4352, #4431)
  - Display a more prominent warning message for Beta and Experimental tools. (#4429)
  - non-zero Picard tool exit codes now cause a non-zero exit from gatk (#4437)
  - removed support for deprecated Google Reference API (#4266)
  - Improve evidence info dumps and SV pipeline management (#4385)
  - oncotator docker uses default docker if not specified (#4394)
  - Added check for non-finite copy ratios in ModelSegments pipeline. (#4292)
  - make FASTQ reader remove phred bias from quals (#4415)
4.0.1.2
This is a small bug fix release to fix issues in the WDLs for Mutect2 and the CNV tools. It also includes a newer version of the GKL (Genomics Kernel Library) with some compression-related performance improvements.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
4.0.1.1
This is a small bug fix release that fixes the following:
- Fix sorting bug in GatherTranches. Gathered tranches should now be closer to target truth sensitivity in the lower range (~90%).
- Mutect2 WDL: fix memory requests to request MB instead of GB.
- CNV somatic pair workflow WDL: added missing Oncotator optional arguments
- Prevent printing a stack trace when the user specifies the name of a tool that doesn't exist. Instead print suggestions for similar tool names.
4.0.1.0
Highlights of this release include a preview version of a future neural-network-based VQSR replacement, the ability to generate a VCF from the GermlineCNVCaller output, allele-specific annotation support in GenomicsDBImport, as well as a number of important post-4.0 bug fixes. See below for the full list of changes.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
Changes in this release:
- New experimental tool NeuralNetInference (#4097)
  - An eventual VQSR replacement.
  - Performs variant score inference with a 1D Convolutional Neural Network with a pre-trained model. This is faster but not as high quality as the 2D model, which is coming along with training and tranche-style filtering in the next GATK release (#4245).
  - Tool name subject to change!
- GenomicsDBImport:
  - Add support for allele-specific annotations (#4261) (#3707)
  - Allow sample names with whitespace in the sample name map file (#3982)
  - Fix segfault crash on long path names (#4160)
  - Allow multiple import commands to be run in the same workspace directory (#4106)
  - Fix segfault crash during import when flag fields are not declared in the VCF header (#3736)
  - Improve warning message when PLs are dropped for records with too many alleles (#3745)
- CNV tools:
- HaplotypeCaller:
  - Fix the --min-base-quality-score/-mbq argument, which previously had no effect (#4128). This fix also affects Mutect2.
  - Fix a "contig must be non-null and not equal to *, and start must be >= 1" error by patching an edge case in the ReadClipper code: when reverting soft-clipped bases of a read at the start of a contig, don't explode if you end up with an empty read (#4203)
- Mutect2:
  - Smarter contamination model (#4195)
  - Removed the --dbsnp and --comp arguments. The best practice now is to pass in gnomAD as the germline-resource.
  - Removed a number of other arguments that were HaplotypeCaller-specific and not appropriate for Mutect2, such as --emit-ref-confidence.
  - Mutect2 WDL: CRAM support (#4297)
  - Mutect2 WDL: Compressed vcf output and Funcotator options (#4271)
  - Miscellaneous WDL cleanup
- HaplotypeCallerSpark:
  - Fixes to the tool that make its output much closer to that of the non-Spark HaplotypeCaller (#4278). Note that this tool (unlike the non-Spark HaplotypeCaller) is still in beta, and should not be used for any real work. There are still major performance issues with the tool that in practice prevent running on certain kinds of large data and in certain modes.
  - Disallow writing a .vcf.gz when in GVCF mode, as this combination currently doesn't work (#4277)
- BwaSpark:
  - set more reasonable default set of read filters (#4286)
- PathSeq:
  - Add WDL for running the PathSeq pipeline with a README and example JSON input. (#4143)
- Fix piping between Picard tools run via the GATK by changing logging output to stderr (#4167)
- Disallow unindexed block-compressed tribble files as input to walkers (#4240) (#4224). This works around a bug in HTSJDK that could cause such files to appear truncated. Until the HTSJDK bug is fixed, block-compressed .vcf.gz files (and similar files) will need to be accompanied by an index, which can be generated using the IndexFeatureFile tool. (A usage sketch follows this list.)
- Restore .list as an allowed extension for files containing multiple values for command-line arguments (#4270). The previous extension .args is also still allowed. This feature allows users to provide a file ending in .list or .args containing all of the values for an argument that accepts multiple values (for example: a list of BAM files), instead of typing all the values individually on the command line.
- Fix conda environment creation to work better with the release distribution. (#4233)
- IndexFeatureFile: more informative error message when trying to index a malformed file (#4187)
- Suggest using BED files as a way to resolve ambiguous interval queries. (#4183)
- Set Spark parameter userClassPathFirst = false #3933 (#3946)
- Update to HTSJDK 2.14.1 (#4210)
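A minimal sketch of the two command-line conveniences noted above: the -F argument name for IndexFeatureFile and the HaplotypeCaller arguments are assumptions based on GATK 4.0-era conventions, and all file names are placeholders.

```bash
# Create an index for a block-compressed VCF so it can be used as walker input
# (the -F argument name is an assumption for this GATK version)
gatk IndexFeatureFile -F calls.vcf.gz

# Supply many values for a multi-value argument from a .list (or .args) file
# instead of typing each one on the command line
cat > bams.list <<EOF
sample1.bam
sample2.bam
sample3.bam
EOF
gatk HaplotypeCaller -R reference.fasta -I bams.list -O cohort.vcf.gz
```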
4.0.0.0
4.beta.6
This release brings a critical bug fix to the GenomicsDBImport tool related to sample ordering, plus a new tool FixCallSetSampleOrdering to repair vcfs generated using the pre-4.beta.6 version of the tool. See the description of the bug in #3682 to determine whether you are affected. Do not run FixCallSetSampleOrdering unless you are sure that you are affected by the bug in #3682.
Other highlights include upgrading to the latest version of the Picard tools, and adding engine support for reading Gencode GTF files.
A docker image for this release can be found in the broadinstitute/gatk repository on dockerhub. Within the image, cd into /gatk, then run gatk-launch commands as usual.
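For example, a minimal sketch of using the image; the image tag and the PrintReads invocation are assumptions for illustration, not taken from these notes.

```bash
# Pull and enter the release image (tag assumed to match this release)
docker run -it broadinstitute/gatk:4.beta.6

# Inside the container:
cd /gatk
./gatk-launch PrintReads -I input.bam -O output.bam
```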
Note: Due to our current dependency on a snapshot of google-cloud-java, this release cannot be published to maven central.
Full list of changes for this release:
- Fixed sample name reordering bug in GenomicsDBImport (#3667)
- New tool FixCallSetSampleOrdering to repair vcfs affected by #3682 (#3675)
- Integrate latest Picard tools via Picard jar. (#3620)
- Adding in codec to read from Gencode GTF files. Fixes #3277 (#3410)
- Upgrade to HTSJDK version 2.12.0 (#3634)
- Upgrade to GKL version 0.7 (#3615)
- Upgrade to GenomicsDB version 0.7.0 (#3575)
- Upgrade Mockito from 1.10.19 -> 2.10.0. (#3581)
- Add GVCF support to VariantsSparkSink (#3450)
- Fix writing variants to GCS buckets (#3485)
- Support unmapped reads in Spark. (#3369)
- Correct gVCF header lines (#3472)
- Dump more evidence info for SV pipeline debugging (#3691)
- Add omitFromCommandLine=true for example tools (#3696)
- Change gatkDoc and gatkTabComplete build tasks to include Picard. (#3683)
- Adding data.table R package. (#3693)
- Added a missing newline in ParamUtils method. (#3685)
- Fix minor HTML issues in ReadFilter documentation (#3654)
- Add CRAM integration tests for HaplotypeCaller. (#3681)
- Fix SamAssertionUtils SortSam call. (#3665)
- Add ExtremeReadsTest (#3070)
- removing required FASTA reference input that was needed before (for its dict) for sorting variants in output VCF, now using header in input SAM/BAM (#3673)
- re-enable snappy use in htsjdk (#3635)
- fix 3612 (#3613)
- pass read metadata to all code that needs to translate contig ids using read metadata (#3671)
- quick fix for broken read (mapped to no ref bases) (#3662)
- Fix log4j logging by removing extra copy from the classpath. #2622 (#3652)
- add suggestion to regularly update gcloud to README (#3663)
- Automatically distribute the BWA-MEM index image file to executors for BwaSpark (#3643)
- Have PSFilter strip mate number from read names (#3640)
- Added the tool PreprocessIntervals that bins the intervals given by the user to be used for coverage collection. (#3597)
- Cpx SV PR series, part-4 (#3464)
- fixed bug in which F1R2 and F2R1 annotation kept discarded alleles (#3636)
- imprecise deletion calling (#3628)
- Significant improvements to CalculateContamination (#3638)
- Adds supplementary alignment info into fastq output, also additional… (#3630)
- Adding tool to annotate with pair orientation info (#3614)
- add elapsed time to assembly info in intervals file (#3629)
- Created a VariantAnnotationArgumentCollection to reduce code duplication and added a StandardM2Annotation group (#3621)
- Docs for turning assembled haplotypes into variant alleles (#3577)
- Simplify spark_eval scripts and improve documentation. (#3580)
- Renames StructuralVariantContext to SVContext. (#3617)
- Added KernelSegmenter. (#3590)
- Fix bug in allele order independent comparison (#3616)
- Docs for local assembly (#3363)
- Added a method to VariantContextUtils which supports alt allele order independent comparison of variant contexts. (#3598)
- Fixed incorrect logger in CollectAllelicCounts and RecalibrationReport. (#3606)
- updating to newer htsjdk snapshot (#3588)
- clear diffuse high frequency kmers (#3604)
- update SmithWatermanAligner in preparation for native optimized aligner (#3600)
- added spark tool for extracting original SAM records based on a file containing read names (#3589)
- update README with correct path to install_R_packages.R #3601 (#3602)
- HostAlignmentReadFilter and PSScorer use only identity scores and exp… (#3537)
- Fixed alt-allele count in AllelicCountCollector and changed unspecified alleles in AllelicCount to N. (#3550)
- Fix bad version check in manage_sv_pipeline.sh (#3595)
- Use a handmade TestReferenceMultiSource in tests instead of a mock. (#3586)
- Repackage ReadFilter plugin tests (#3525)
- BamOut in M2 WDL and unsupported version with NIO for SpecOps Team (#3582)
- Changed the path for posting the test reports
- updates sv manager and cluster creation scripts to utilize dataproc cluster timed self-termination feature (#3579)
- Implemented watershed algorithm for finding local minima in 1D data based on topological persistence. (#3515)
- Reduce number of output partitions in PathSeqPipelineSpark (#3545)
- add gathering of imprecise evidence links and extend evidence intervals to make links coherent in most cases (#3469)
- Refactor PrimaryAlignmentReadFilter to PrimaryLineReadFilter (#3195)
- Update ReadFilters documentation (#3128)
- Changes in BwaMemIntegrationTest to avoid a 3-4 minute runtime. (#3563)
- Make error informative for non-diploid family likelihoods #3320 (#3329)
- TableFeature javadoc and more tests (#3175)
- Re-enable ancient BED test in IndexFeatureFile. (#3507)
- add external evidence stream for CNVs (#3542)
- clip M2 alleles before emitting in case some alleles were dropped (#3509)
- Docs for M2 filtering (#3560)
- Fix static test blocks and @BeforeSuite usages to prevent excessive code execution when tests aren't included in a suite. (#3551)
- hide prototyping tools in sv package from help message (but still runnable if knowing their existence) (#3556)
- Add support for running tools with omitFromCommandLine=true (#3486)
- Adds utility methods to ReadUtils and CigarUtils. (#3531)
- Cpx SV PR series, part-3 (#3457)
4.beta.5
Small release; highlights include an update to our BWA-MEM version, an experimental PythonScriptExecutor, and an important bugfix for ValidateVariants -gvcf mode.
Note: this still includes snapshot dependencies that prevent us from releasing to Maven central.
Complete change list:
- Make directory name unique for BucketUtilsTest#testDirSizeGCS to avoid unwanted test interaction. (#3547)
- Simple PythonScriptExecutor. #3501 (#3536)
- Fix BucketUtils#dirSize on GCS. #3437 (#3539)
- code duplication in read pos rank sum and its allele-specific version #1882 (#2657)
- validatevariants -gvcf fix (#3530)
- Added GetSampleName as stopgap until we have named parameters (#3538)
- Pair HMM docs (#3433)
- Fix MissingReferenceDictFile exception constructor. #3492 #2922 (#3524)
- Extend ReadsPipelineSpark to run HaplotypeCallerSpark (#3452)
- Updates bwamem-jni dependency to 1.0.2 and adds the possibility of aligning singletons to BwaEngine classes. (#3474)
- Structural Variant Context (#3476)
4.beta.4
Highlights of this release include fixes to the GATK4 HaplotypeCaller to bring it closer to the output of the GATK3 HaplotypeCaller (although many of these fixes still need to be applied to HaplotypeCallerSpark), fixes for longstanding indexing and CRAM-related bugs in htsjdk, bash tab completion support for GATK commands, and many improvements to Mutect2 and the SV tools.
A docker image for this release can be found in the broadinstitute/gatk repository on dockerhub. Within the image, cd into /gatk, then run gatk-launch commands as usual.
Note: Due to our current dependency on a snapshot of google-cloud-java, this release cannot be published to maven central.
Changes in this release:
- HaplotypeCaller: a number of important updates and fixes to bring it closer to GATK 3.x's output (most of these fixes apply only to HaplotypeCaller, not HaplotypeCallerSpark) (#3519)
  - reduce memory usage of the AssemblyRegion traversal by an order of magnitude
  - create empty pileup objects for uncovered loci internally (fixes occasional gaps between GVCF blocks as well as some calling artifacts)
  - when determining active regions, only consider loci within the user's intervals
  - port some additional changes to the GATK 3.x HaplotypeCaller to GATK4
  - fix bug with handling of the MQ annotation
- Added bash tab completion support for GATK commands (#3424)
- Updated to Intel GKL 0.5.8, which fixes a bug in AVX detection that caused incorrect behavior on some AMD systems (#3513)
- Upgrade htsjdk to 2.11.0-4-g958dc6e-SNAPSHOT to pick up an important VCF header performance fix. (#3504)
- Updated google-cloud-nio dependency to 0.20.4-alpha-20170727.190814-1:shaded (#3373)
- Fix tabix indexing bugs in htsjdk, and reenable the IndexFeatureFile tool (#3425)
- Fix longstanding issue with CRAM MD5 slice calculation in htsjdk (#3430)
- Started publishing nightly builds
- Performance improvements to allow MD+BQSR+HC Spark pipeline to scale to a full genome (#3106)
- Eliminate expensive toString() call in GenotypeGVCFs (#3478)
- ValidateVariants gvcf memory optimization (#3445)
- Simplified Mutect2 annotations (#3351)
- Fix MuTect2 INFO field types in the VCF header (#3422)
- SV tools: fixed possibility of a negative fragment length that shouldn't have happened (#3463)
- Added command line argument for IntervalMerging based on GATK3 (#3254)
- Added 'nio_max_retries' option as a command line accessible option for GATK tools (#3328)
- Fix aligned PathSeq input getting filtered by WellformedReadFilter (#3453)
- Patch the ReferenceBases annotation to handle the case where no reference is present (#3299)
- Honor index/MD5 creation for HaplotypeCaller/Mutect2 bamouts. (#3374)
- Fix SV pipeline default init script handling (#3467)
- SV tools: improve the test bam (#3455)
- SV tools: improved filtering for smallish indels (#3376)
- Extends BwaMemImageSingleton into a cache, BwaMemImageCache, that can… (#3359)
- Try installing R packages from multiple CRAN repos in case some are down (#3451)
- Run Oncotator (optional) in the CNV case WDL. (#3408)
- Add option to run Spark tests only (#3377)
- Added a .dockerignore file (#3418)
- Code cleanup in the sv discovery package (#3361) and fixes #3224
- Implement PathSeq taxon hit scoring in Spark (#3406)
- Add option to skip pre-Bwa repartitioning in PSFilter (#3405)
- Update the GQ after PLs get subset (#3409)
- Removed the explicit System.exit(0) from Main (#3400)
- build_docker.sh can run tests again #3191 #3160 (#3323)
- Minor doc fixes #3173 (#3332)
- Use ReadClipper in BaseQualityClipReadTransformer (#3388)
- PathSeq adapter trimming and simple repeat masking (#3354)
- Add scripts to manage SV spark jobs and copy result (#3370)
- Output empty VQSLOD tranches in scatterTranches mode if no variant has VQSLOD high enough for requested threshold (#3397)
- Option to filter short pathogen reference contigs (#3355)
- Rewrote hapmap autoval wdl (#3379)
- fixed contamination calculation, added error bars to output (#3385)
- wrote wdl for Mutect panel of normals (#3386)
- Turn off tranches plots if no output Rscript is specified (for annotation plots) (#3383)
- Mutect2 wdls output the contamination (#3375)
- Increased maximum copy-ratio variance slice-sampling bound. (#3378)
- Replace --allowMissingData with --errorIfMissingData (gives opposite default behavior as previously) and print NA for null object in VariantsToTable (#3190)
- docs for proposed tumor-in-normal tool (#3264)
- Fixed the git version for the output jar on docker automatic builds (#3496)
- Use correct logger class in MathUtils (#3479)
- Make ShardBoundaryShard implement Serializable (#3245)