Releases: broadinstitute/gatk
4.1.5.0
Download release: gatk-4.1.5.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.1.5.0 release:
- A new, improved version of the --linked-de-bruijn-graph mode for HaplotypeCaller and Mutect2 that has better sensitivity compared to the previous linked DeBruijn graph implementation (#6394)
- A new version of GenomicsDB that fixes many frequently-reported issues
- LeftAlignIndels now works for multiple indels
- VariantAnnotator and Concordance are now out of beta
- A significant number of bug fixes to major tools like GenotypeGVCFs and SelectVariants
Full list of changes:
- HaplotypeCaller
  - New, improved version of the --linked-de-bruijn-graph mode for HaplotypeCaller and Mutect2 that has better sensitivity compared to the previous linked DeBruijn graph implementation (#6394)
    - Running HaplotypeCaller in this mode will reduce the number of erroneous haplotypes discovered, which can improve genotyping, phasing, and runtime.
    - Changed the haplotype recovery step to check that it covers all paths through the graph even if there are poorly supported paths in the JunctionTrees. Added the argument --disable-artificial-haplotype-recovery to disable this behavior.
    - Added the ability to expand graph kmer size after haplotype recovery in the event that there was a failure due to overcomplicated assembly graphs.
    - Added code to squeeze extra sensitivity out of the junction trees by tolerating SNP errors when threading the junction trees themselves
  - Realigning to best haplotype handles indels better (#6461)
  - Fixed issue #5434 on inconsistent selection of reads for the PL, AD, and DP calculations. (#6055)
  - Fixed bug where SNP and indel pseudocounts were swapped in the AlleleFrequencyCalculator (#6401)
  - The qual used in HaplotypeCaller's isActive() method now matches that of GenotypeGVCFs. That is, they both now use the new qual. (#6343)
  - Skip non-nucleotide alleles in force-calling mode, fixing bug (#6405)
  - Fixed the hidden/experimental --error-correct-reads argument to actually correct the bases and qualities (#6366)
  - Removed the deprecated and obsolete --use-new-qual-calculator argument (#6398)
  - Refactored code related to windows and padding for assembly and genotyping, with slight changes to HMM padding for indels (#6358)
- Mutect2
  - Improved SomaticClusteringModel (#6337)
  - Sped up Mutect2 reference confidence model with fast likelihoods model (#6457)
  - Modified Fragment creation for Mutect2 to not fail for supplementary reads (#6327)
  - Uniqify PG IDs in FilterAlignmentArtifacts (#6304)
  - Fixed error in RealignmentEngine due to converting from exclusive to inclusive interval ends (#6404)
  - Added an error message for no callable sites in Mutect2 (#6445)
  - Changed filter reporting in Mutect2 (#6288)
  - Fixed force-calling mode in M2 mito WDL (#6359)
  - Pass the reference to the realignment filter in the Mutect2 WDL (#6360)
  - Deleted the old orientation bias filter (#6408)
  - Made callable sites a Long to avoid integer overflow (#6303)
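The callable-sites change (#6303) is easy to motivate with arithmetic: a signed 32-bit integer tops out at 2,147,483,647, which is less than the roughly 3.1 billion bases of a human genome. The sketch below emulates 32-bit wraparound in Python, whose own integers never overflow. The function name and the site count are illustrative assumptions, not GATK code.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit int (Java int) can hold


def wrap_int32(value: int) -> int:
    """Emulate two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


# Hypothetical number of callable sites, on the order of a whole human genome.
callable_sites = 3_100_000_000

as_int = wrap_int32(callable_sites)  # what a 32-bit counter would end up holding
as_long = callable_sites             # a 64-bit counter stores the value exactly

print("32-bit counter:", as_int)
print("64-bit counter:", as_long)
```

Any count above INT32_MAX wraps to a negative number in the 32-bit case, which is the kind of failure a 64-bit counter avoids.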
- GenomicsDB
  - Move to GenomicsDB 1.2.0 (#6305)
    - Fixes an issue with GenomicsDBImport erroring out due to duplicate fields in the Info, Format, and/or Filter fields. (#6158)
    - Fixes an issue with GenomicsDBImport not completing for mixed ploidy samples (#6275)
    - This version uses a 64-bit htslib to work around overflow issues when computed annotation sizes exceed the 32-bit integer space
- Joint Calling
  - GenotypeGVCFs: improved checking for upstream deletions in the GenotypingEngine (#6429)
    - Fixes rare cases where GenotypeGVCFs could emit a variant with a spanned allele (*), and a genotype that references the spanned allele, but fail to emit the upstream spanning variant.
  - GenotypeGVCFs: Don't call the NON_REF allele in genotypes or ADs (#6437)
  - Parse combined AS_QUALapprox values from older reblocked GVCFs properly (#6442)
  - Added a force output sites argument to GenotypeGVCFs (#6263)
  - Remove extraneous alleles in GenotypeGVCFs force-output mode (#6406)
- CNV Calling
  - Copy temporary files early in gcnvkernel to avoid inadvertent temporary directory cleanup. (#6297)
  - Enabled streaming of counts.tsv/counts.tsv.gz files in gCNV CLIs. (#6266)
  - Fixed shard index in PostprocessGermlineCNVCalls log message. (#6313)
  - gCNV vcf cleanup (#6352)
  - Index output VCFs for GCNV postprocessing (#6330)
- Notable Enhancements
  - VariantAnnotator is now out of beta (#6402)
  - Concordance is out of beta (#6397)
  - LeftAlignIndels now works for multiple indels (#6427)
  - FilterVariantTranches can now handle cases where there are only SNPs or only indels, and not both (#6411)
  - Added new read filters for NotProperlyPaired and for MateDistant (#6295)
  - Made the .git directory optional during build (#6450)
- Bug Fixes
  - Handle zero-weight Gaussians correctly in VariantRecalibrator (#6425)
  - Fixed the --invalidate-previous-filters argument in VariantFiltration to work as intended (i.e., roll back all variants to unfiltered status) (#6412)
  - Fixed a bug where SelectVariants takes forever on many-allelic somatic samples (#6446)
  - Make sure SelectVariants outputs variants in correct order (assuming input vcf is correctly sorted) (#6444)
  - Fixed an NPE crash in VariantEval when run with no intervals/reference (#6283)
  - Fixed an NPE crash in FastaReferenceMaker (#6435)
  - Fixed an out-of-bounds error in CountNs annotation (#6355)
  - Fixed a bug in hardClipCigar function that caused incorrect cigar calculation (#6280)
  - AnalyzeSaturationMutagenesis: fixed bug in codon calling for in-frame inserts (#6332)
- Miscellaneous Changes
  - Collect split read and paired end evidence files for GATK-SV pipeline (#6356)
  - Add "PASS" filter line for ApplyVQSR and FilterMutectCalls (#6436)
  - Added engine functionality for accessing the user defined intervals without merging them (#5887)
  - Trim intervals loaded from interval files. (#6375)
  - Propagate read group filters in ReadGroupBlackListReadFilter. (#6300)
  - Modified ANDed read filter output message for readability (#6315)
  - Clearly label the number of reads processed in the BaseRecalibrator log output (#6447)
  - Clearly label the CountReads tool output (#6449)
  - Improved the error messages for missing contigs in the reference (#6469)
  - Avoid a copy and reverse operation in CigarUtils.isGood() (#6439)
  - Fixed GenotypeAlleleCount's toString() method (#6376)
  - Minor Funcotator WDL updates. (#6326)
  - Added a getPairOrientation() method to GATKRead (#6420)
  - Merged GATKProtectedVariantContextUtils methods into other classes (#6409)
  - Deleted a lot of unused VCF constants (#6361)
  - Deleted some unused genotyping code (#6354)
  - Fixed incoherent unit test cases in allele subsetting utils (#6448)
  - Add Python script executor error message for SIGKILL exit code 137. (#6414)
  - Pip install pinned numpy. (#6413)
  - Do not install R on travis, and only run the R tests on the Docker. (#6454)
  - Fixes for IndexFeatureFile error reporting. (#6367)
  - Temporarily remove dead Berkeley mirror to unblock builds. (#6422)
  - Disable CNNVariantPipelineTest.testTrainingReadModel until failures are resolved. (#6331)
  - Delete unused JsonSerializer (#6415)
  - Delete empty file SparkToggleCommandLineProgram.java. (#6311)
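The exit code 137 named in the Python script executor entry (#6414) follows the common POSIX shell convention that a process terminated by signal N is reported with exit code 128 + N; SIGKILL is signal 9, so 128 + 9 = 137. The sketch below decodes such codes. The function name and message wording are hypothetical and are not the text GATK prints; it assumes a POSIX system where the standard signal numbers are defined.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Return a readable description of a process exit code (POSIX convention)."""
    if code > 128:
        try:
            # Codes above 128 conventionally mean "killed by signal (code - 128)".
            name = signal.Signals(code - 128).name
        except ValueError:
            # Not a known signal number on this platform.
            return f"exit code {code}"
        return f"exit code {code}: terminated by {name}"
    return f"exit code {code}"


print(describe_exit_code(137))
```

A SIGKILL of this kind is frequently the result of the operating system or a container runtime killing a process that ran out of memory, which is why a dedicated message is useful.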
- Documentation
  - Clarify the definition of the NON_REF allele (#6431)
  - Clarify behavior of SplitIntervals for lists of adjacent intervals (#6423)
  - Update docs to reflect the fact that TandemRepeat works with HaplotypeCaller (#5943)
  - Update LeftAlignIndels documentation (#6177)
  - Update hyperlink to new GATK forum page in the README (#6381)
  - Add minValue/minRecommended value to ApplyBQSRArgumentCollection (#6438)
  - Small README fixes (#6451)
  - Fix some GATK doc issues (#6318)
  - Update copyright date in LICENSE.TXT (#6383)
- Dependencies
4.1.4.1
Download release: gatk-4.1.4.1.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.1.4.1 release:
- New experimental HaplotypeCaller assembly mode which improves phasing, reduces false positives, improves calling at complex sites, and has 15-20% speedup vs the current assembler. It is enabled with option --linked-de-bruijn-graph. This mode is still experimental and not recommended for production use yet.
- IndexFeatureFile improvements:
  - now cloud enabled
  - changed controversial F argument to I instead.
- Bug fixes and improvements in GenomicsDB, Mutect2, variant annotation, and more!
Full list of changes:
- New Tools
  - PrintReadsHeader: a new tool to print a BAM/SAM/CRAM header to a file (#6153)
- HaplotypeCaller
- Mutect2
  - Mutect2 now warns but does not fail when three or more reads have the same name. (#6240)
  - Fixed the random seed at the beginning of FilterMutectCalls (#6208)
  - GetSampleName and GetPileupSummaries in the M2 pipeline are no longer beta. (#6215)
  - Increase number of iterations in CalculateContamination to 30. (#6282)
  - Handled an edge case with high scatter count in M2 WDL. (#6216)
  - Use ArgumentsBuilder in M2 tests. (#6219)
- Joint Calling
- CNV Calling
  - Fixed model parameter assignment typo in gCNV ploidy model (#6285)
  - Added docker option to the gcnv QC tasks. (#6185)
  - Added epsilons to overdispersion in gCNV models to avoid NaNs. (#6245) #4824 #6226 #6227
  - Assert that ELBO did not become NaN during each step of inference of gCNV. (#6186)
  - Added ability to override THEANO_FLAGS environment variable in gCNV tools. (#6244) #6235
  - Removed erroneous short argument names in R scripts for CNV plotting. (#6197)
- GenomicsDB
  - Allow GATK to configure annotation processing instead of hardcoding values in GenomicsDB GDB-39
  - High ploidy sites with many genotypes no longer cause an overflow error. GDB-54
  - Add missing libcurl in the native GenomicsDB library. #6122 GDB-66
  - No longer crashes when vcfbufferstream from htslib appears to be invalid. GDB-67
  - Propagated native GenomicsDB exceptions as java IOExceptions. GDB-68
  - Fix issue when using vid protobuf interface and there is more than 1 config. GDB-70
  - Cleanup GenomicsDB vid combine protobuf mapping overrides. #6190
- Miscellaneous Changes
  - Cloud-enable IndexFeatureFile and change input arg name from -F to -I. (#6246) #6161
  - WDL to run ReadsPipelineSpark on a multicore machine (#6213)
  - Replace TwoPassReadWalker with more general MultiplePassReadWalker. (#6154)
  - Abolish unfilled likelihoods and revamp VariantAnnotator. (#6172)
  - Improve exception message in ValidateVariants. (#6076)
  - Fix Syntax Warning when running GATK with python 3.8 (#6231)
- Developer / Testing
- Documentation
- Dependencies
4.1.4.0
Download release: gatk-4.1.4.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.1.4.0 release:
- Major improvements and fixes to Mutect2, including more intelligent handling of paired reads during genotyping and better filtering.
- Important bug fixes to HaplotypeCaller, the joint calling pipeline, and Funcotator
- Beta support for building/testing on Java 11 (#6119) (#6145)
  - We encourage you to try this out and give us feedback!
Full list of changes:
- New Tools
  - AlleleFrequencyQC: a QC tool that uses VariantEval to bin variants in 1000 Genomes by allele frequency. For each bin, we compare the expected allele frequency from 1000 Genomes with the observed allele frequency in the input VCF. This was designed with arrays in mind, as a way to discover potential bugs in our pipeline. (#6039)
- Mutect2
  - Mutect2 genotyping now forces paired reads to support the same haplotype (#5831)
  - New FilterAlignmentArtifacts now realigns a locally-assembled unitig of all variant read pairs (#6143)
  - Fixed a Mutect2 bug that overfiltered by one variant (#6101)
  - Fixed a small gene panel edge case for CalculateContamination (#6137)
  - Fixed a small gene panel edge case in orientation bias filter (#6141)
  - Unified the NIO and non-NIO M2 WDLs (call-caching will now work on Terra) (#6108)
  - Updated Mutect2 PON WDL to WDL 1.0 (#6187)
  - Removed Oncotator from the M2 WDL (Funcotator is still there) (#6144)
  - Fixed an issue in the M2 WDL that could cause the Funcotate task to be ignored by tools such as dxWDL (#6077)
  - Some miscellaneous code refactoring/improvements (#6184) (#6136) (#6107) (#6159)
- HaplotypeCaller
  - HaplotypeCaller now force-calls like Mutect2: the -genotyping-mode GENOTYPE_GIVEN_ALLELES argument is gone (now you only need to specify --alleles force-calls.vcf) and alleles are now force-called in addition to any other alleles (#6090)
  - Renamed --output-mode EMIT_ALL_SITES to --output-mode EMIT_ALL_ACTIVE_SITES, and clarified the documentation for the argument (#6181)
  - Fixed a rare bug in the genotyping engine where it could emit untrimmed alleles for SNP sites (#6044)
  - Fixed some sources of non-determinism in the HaplotypeCaller that in rare cases could cause the output to vary slightly given the same inputs (#6195) (#6104)
  - Deleted the old exact AF calculation model (#6099)
- Joint Calling
  - Fixed a regression in GATK 4.1.3.0 that caused us to not emit the AS_QD annotation when running a joint calling pipeline with CombineGVCFs (GenomicsDB was unaffected) (#6168)
  - Fixed allele-specific annotation array length issues when alleles are subset in tools such as GenotypeGVCFs (#6079)
  - Changed AS_RankSum outputs to "." for missing values rather than "nul" (#6079)
- Funcotator
  - Fixed a bug that caused Funcotator to output fields in the wrong order in some cases when writing a VCF (#6178)
    - Specifically, Funcotator would output funcotation fields in the wrong order when there was more than 1 site in a VCF data source with the exact same position and alleles and it matched one of the variants being annotated
- Mitochondrial pipeline
  - Renamed the output vcf with the name of the sample and supplied a default value for autosomal_median_coverage (meaning you'll now get the NuMT filter even if you don't provide the actual autosomal coverage) (#6160)
- Miscellaneous Changes
  - Beta support for building/testing on Java 11 (#6119) (#6145)
  - UpdateVCFSequenceDictionary now supports replacing an invalid sequence dictionary in a VCF (#6140)
  - CountFalsePositives now requires an intervals file (#6120)
  - AnalyzeSaturationMutagenesis: use supplementary alignments to identify large deletions (#6092)
  - AnalyzeSaturationMutagenesis: an insert at the start codon is not in the ORF (#6121)
  - Added a check for null sequence dictionaries in the dictionary validation code (#6147)
  - Update SV Spark pipeline example shell scripts saving results to GCS (#6114)
  - Update public key for installing R in docker (#6116)
  - Log exceptions during deletion on JVM exit instead of throwing (#6125)
  - Don't fail the build if we're in a git worktree folder (#6169)
  - Free a bit of memory for the test suite by disabling mysql and postgres on travis (#6085)
  - Delete bogus index files for queryname sorted CRAMs. (#6149)
  - Cleanup GenomicsDB debugging test output (#6089)
- Documentation
  - Fixed mitochondria mode documentation in FilterMutectCalls (#6174)
- Dependencies
4.1.3.0
Download release: gatk-4.1.3.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.1.3.0 release:
- GnarlyGenotyper, a new beta joint genotyping tool which, along with ReblockGVCF, forms part of a forthcoming more scalable version of our joint genotyping pipeline that we call the "GATK Biggest Practices" pipeline
- FuncotateSegments, a new beta companion tool to Funcotator that performs functional annotation on a segment file (.seg) rather than a VCF
- GenomicsDBImport now has the ability to incrementally update an existing GenomicsDB workspace
- Several important bug fixes to HaplotypeCaller and Mutect2
Compatibility notes:
- GermlineCNVCaller models built in cohort mode with previous releases are no longer compatible. Users should rebuild these models with this release before running GermlineCNVCaller in case mode. See the CNV Tools section below for more details.
Full list of changes:
- New Tools
  - GnarlyGenotyper (beta tool) (#4947) (#6075)
    - The GnarlyGenotyper is designed to perform joint genotyping on cohorts of at least tens of thousands of samples called with HaplotypeCaller and post-processed with ReblockGVCF to produce a multi-sample callset in a super highly scalable manner.
    - Caveats:
      - GnarlyGenotyper is intended to be used with GVCFs for which low quality variants have already been removed, derived from post-processing HaplotypeCaller GVCFs with ReblockGVCF. See the "Biggest Practices" usage example in the ReblockGVCF docs for details.
      - GnarlyGenotyper does not subset alternate alleles and can return some highly multi-allelic sites. PLs will not be output for sites with more than 6 alts to save space.
      - GnarlyGenotyper assumes all diploid genotypes
    - Annotations:
      - To generate all the annotations necessary for VQSR, input variants to the GnarlyGenotyper must include the QUALapprox and VarDP annotations along with the latest RAW_MQandDP annotation.
      - If allele-specific annotations are present, they will be used appropriately and a new AS_AltDP annotation giving the total depth across samples for each alternate allele will be added.
    - A GATK "Biggest Practices" pipeline including the GnarlyGenotyper is forthcoming pending some fixes improving on the above caveats.
  - FuncotateSegments (beta tool) (#5941)
    - A companion tool to Funcotator that performs functional annotation on a segment file (.seg) rather than a VCF
    - The Somatic CNV pipeline can optionally run this tool for functional annotation
- HaplotypeCaller/Mutect2
  - Fixed a regression in HaplotypeCaller/Mutect2 that caused some variants to be lost at sites with high complexity (#5952)
  - Fixed a GGA (GENOTYPE_GIVEN_ALLELES) mode bug in HaplotypeCaller/Mutect2 where added alleles' cigars could have soft clips (#6047)
    - This bug would manifest as a "Cigar cannot be null" error
  - Fixed a bug where cached indel informativeness values could be incorrectly applied to the wrong sites in HaplotypeCaller/Mutect2 (#5911)
  - Fixed an edge case in HaplotypeCaller/Mutect2 where dangling end merging creates cycles (#5960)
  - Added hidden arguments to the assembly engine to track found haplotype counts and kmers used (#6049)
  - Fixed a bug in CalculateContamination when contamination is indistinguishable from zero (#5971)
  - Fixed a bug where normal p value argument in FilterMutectCalls was declared static (#5982)
- CNV Tools
  - Added FuncotateSegments as an option to the Somatic CNV WDL (#5967)
  - Added QC metrics to the Germline CNV workflow (#6017)
  - Enabled GC-bias correction by default in CNV workflows (#5966)
  - Added denoised coverage file concatenation output to gCNV postprocessor (#5823) Note: The addition of this feature breaks compatibility with gCNV cohort-mode models built with previous releases.
  - Changed cr.igv.seg output of ModelSegments to give log2 Segment_Mean. (#5976)
  - Fixed CNV plotting script to allow spaces in input filenames. (#5983)
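To make the log2 Segment_Mean change (#5976) concrete: on a log2 scale a copy ratio of 1.0 maps to 0, a doubling maps to +1, and a halving maps to -1, so gains and losses become symmetric around zero. The sketch below assumes the value being transformed is a positive linear copy ratio; the function name is hypothetical and this is not the ModelSegments implementation.

```python
import math


def log2_copy_ratio(copy_ratio: float) -> float:
    """Convert a linear copy ratio to the log2 scale."""
    if copy_ratio <= 0:
        raise ValueError("copy ratio must be positive to take a logarithm")
    return math.log2(copy_ratio)


# Halved, neutral, and doubled copy ratios.
for cr in (0.5, 1.0, 2.0):
    print(cr, log2_copy_ratio(cr))
```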
- GenomicsDBImport
  - Added support for making incremental updates to existing workspaces (#5970)
    - This can be done using the new --genomicsdb-update-workspace-path argument
  - Fixed a crash in GenomicsDBImport on queries at positions inside deletions (#5899)
  - Treat AS_QUALapprox and AS_VarDP strings as array of int vectors (#5933)
- Mitochondrial Calling Pipeline
  - Added NIO support and updated to WDL 1.0 (#6074)
- Spark Tools
  - Removed the beta label from many simple Spark tools (#5991)
  - Bug fix for reading references from GCS on Spark (#6070)
  - Eliminated an unnecessary sort step in HaplotypeCallerSpark (#5909)
  - Fixed BaseRecalibratorSpark failure on a cluster due to system classloader issue (#5979)
  - Added a WDL for ReadsPipelineSpark (#5904)
  - Added a command-line argument to toggle using NIO on reading for Spark (#6010)
  - Added advanced arguments to MarkDuplicatesSpark to allow non-queryname sorted inputs when specifying multiple input bams and to treat unsorted inputs as queryGroup-sorted (#5974)
  - Clarified the behavior of MarkDuplicatesSpark when given multiple input bams, and improved the sorting behavior if given a mix of queryname-sorted and query-grouped bams (#5901)
  - Changed spark.yarn.executor.memoryOverhead to spark.executor.memoryOverhead as promoted by Spark 2.3 (#6032)
  - Handle newly-added arguments in ApplyBQSRUniqueArgumentCollection (#5949)
- Miscellaneous Changes
  - Added a new BaseQualityHistogram variant annotation to generate base quality histograms (#5986)
  - Added a new SoftClippedReadFilter that can filter out reads where the ratio of soft-clipped bases to total bases exceeds some given value (#5995)
  - Fixed a serious bug in ValidateVariants where the tool would silently do no validation in the default case when a DBSNP file was not provided (#5984)
  - Fixed a "Record covers a position previously traversed" error in ValidateVariants for GVCFs with multiple contigs (#6028)
  - The RMSMappingQuality annotation now requires the --allow-old-rms-mapping-quality-annotation-data argument to run with GVCFs created by older versions of the GATK (#6060)
  - Added a simple TSV/CSV/XSV writer with cloud write support as an alternative to TableWriter (#5930)
  - Funcotator: added Funcotator stand-alone WDL to supported area (#5999)
  - Extracted the GenotypeGVCFs engine into publicly accessible class/function (#6004)
  - Refactored VariantEval methods to allow subclasses to override (#5998)
  - AnalyzeSaturationMutagenesis: arbitrarily choose 1 read for disjoint pairs, dump rejected reads, and various other improvements (#5926) (#6043)
  - Normalized some AssemblyRegion args in HaplotypeCallerSpark (#5977)
  - Don't redundantly delete temporary directories in RScriptExecutor (#5894)
  - Treat all source files as UTF-8 for java, javadoc (#5946)
  - Updated an out-of-date argument name in an error message for the CycleCovariate
  - Changed an error about "duplicate feature inputs" to be a UserException (#5951)
  - Got rid of ExpandingArrayList in favor of ArrayList (#6069)
  - Disabled Codecov for now on travis due to spurious errors (#6052)
  - Lowered the Xms value in the test JVM (#6087)
  - Updated the travis installed R version to 3.2.5, matching our base docker image (#6073)
  - Fixed an erroneous warning about GCS test configuration (#5987)
  - Added a code of conduct (#6036)
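The SoftClippedReadFilter entry (#5995) describes a ratio test: soft-clipped bases divided by total bases, compared against a threshold. The sketch below computes that ratio from a CIGAR string. It is an illustration under assumptions: "total bases" is taken to mean the bases present in the read sequence (CIGAR operators M, I, S, = and X), and the function names and threshold parameter are hypothetical, not the filter's actual arguments.

```python
import re

# One CIGAR element: a length followed by an operator from the SAM specification.
_CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")
# Operators that consume bases of the read sequence.
_CONSUMES_READ = set("MIS=X")


def soft_clip_ratio(cigar: str) -> float:
    """Fraction of the read's bases that are soft-clipped."""
    soft_clipped = 0
    total = 0
    for length, op in _CIGAR_ELEMENT.findall(cigar):
        n = int(length)
        if op in _CONSUMES_READ:
            total += n
        if op == "S":
            soft_clipped += n
    if total == 0:
        raise ValueError(f"CIGAR {cigar!r} contains no read bases")
    return soft_clipped / total


def keep_read(cigar: str, max_ratio: float) -> bool:
    """Keep a read only if its soft-clip ratio does not exceed max_ratio."""
    return soft_clip_ratio(cigar) <= max_ratio


print(soft_clip_ratio("10S80M10S"))
print(keep_read("10S80M10S", max_ratio=0.1))
```

Deletions and hard clips do not count toward the denominator here because they contribute no bases to the stored read sequence; a different definition of "total bases" would change the ratio.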
- Documentation
  - FilterVariantTranches documentation fix and improvement (#5837)
  - Updated FilterMutectCalls usage examples (#5890)
  - Added --max-mnp-distance 0 to usage example in CreateSomaticPanelOfNormals docs (#5972)
  - Updated the MarkDuplicatesSpark documentation to no longer contain a misleading usage example (#5938)
  - Added a clarification to the README to warn users to set their Gradle JVM properly in IntelliJ after setup (#6066)
  - Added links to download Java 8 to the README (#6025)
  - Remove non-ascii chars from javadoc (#5936)
- Dependencies
4.1.2.0
Download release: gatk-4.1.2.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.1.2.0 release:
- Two new tools, MethylationTypeCaller and AnalyzeSaturationMutagenesis (see below for descriptions)
- Significant improvements to GENOTYPE_GIVEN_ALLELES mode in Mutect2 and HaplotypeCaller
- Fixed a serious bug in Funcotator that could cause END positions to be wrong for some deletions in MAF output
- Significant updates to the mitochondrial calling pipeline
Full list of changes:
- New Tools
  - MethylationTypeCaller (#5762)
    - Identifies methylated bases from bisulfite sequencing data. Given a bisulfite sequenced, methylation-aware aligned BAM and a reference, it outputs methylation-site coverage to a specified output vcf file.
  - AnalyzeSaturationMutagenesis (#5803) (#5883)
    - Processes reads from a saturation mutagenesis experiment, an experiment that systematically perturbs a mini-gene to ascertain which amino-acid variations are tolerable at each codon of the open reading frame. Its main job is to discover variations from wild-type sequence among the reads, and to summarize the variations observed.
- Mutect2
  - Made significant improvements to GENOTYPE_GIVEN_ALLELES mode in Mutect2 and HaplotypeCaller (#5874). These improvements are described in more detail in #5857
  - CalculateContamination now works much better for very small gene panels (#5873)
  - We now correctly handle inputs with 100% contamination in Mutect2 filtering (#5853)
  - Mutect2 now uses natural logarithms internally (#5858). This does not change any outputs.
  - Minor update to the Mutect2 PON WDL (#5859)
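The note that switching Mutect2 to natural logarithms "does not change any outputs" (#5858) follows from the change-of-base identity log10(p) = ln(p) / ln(10): a probability stored in either base converts to the same reported value. The sketch below checks this for a Phred-scaled quality. It assumes the earlier internal representation was base 10, which the notes do not state, and the function names are hypothetical.

```python
import math


def phred_from_log10(log10_p: float) -> float:
    """Phred-scaled quality from a base-10 log probability."""
    return -10.0 * log10_p


def phred_from_ln(ln_p: float) -> float:
    """Phred-scaled quality from a natural-log probability."""
    return -10.0 * ln_p / math.log(10.0)


error_probability = 1e-3
q_from_log10 = phred_from_log10(math.log10(error_probability))
q_from_ln = phred_from_ln(math.log(error_probability))

# Both internal representations lead to the same reported quality,
# up to floating-point rounding.
print(q_from_log10, q_from_ln)
```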
- Funcotator
  - Fixed a serious bug that could cause END positions to be wrong for some deletions in MAF output (#5876)
  - The tool now throws a user error for an AD field with only 1 value in MAF mode (#5860)
  - Added a new filter to FilterFuncotations. For two autosomal recessive genes, MUTYH and ATP7B, homozygous variants and compound heterozygous variants will be tagged and added to the output vcf. (#5843)
- Mitochondrial Calling Pipeline
  - Updated the pipeline for the new Mutect2 filtering scheme and pulled filtering after the liftover and recombining of the VCF. (#5847)
  - Made the subsetting of the WGS bam fast by using PrintReads over just chrM instead of traversing the whole bam for NuMT mates. (#5847)
  - Moved polymorphic NuMTs based on autosomal coverage to a filter (it was an annotation before) (#5847)
  - Added an option to hard filter by VAF (#5847)
  - Bug fix for large input files to the mitochondrial pipeline (we now include the size of the input BAM/CRAM when calculating disk size, when necessary) (#5861)
- Structural Variation Calling Pipeline
  - Bug fix to QNameFinder to handle reads with negative unclipped starts (#5864)
- Miscellaneous Changes
  - Added a --min-fragment-length argument to the FragmentLengthReadFilter (#5886)
  - Added a --spark-verbosity argument to control verbosity of Spark-generated logs (#5825)
  - Added a new WalkerBase abstract class to be used for all built-in walkers (#4964)
  - Exposed transient attributes in the GATKRead API (#5664)
  - Convert more code to use GATKPathSpecifier (#5870) (#5832). This also fixes an InvalidPathException on Windows machines.
  - Fixes to the test suite related to the recent introduction of a codec for Picard interval lists (#5879)
  - Eliminated an error message during the Docker build in Travis logs by creating a directory before copying to it. (#5878)
- Documentation
4.1.1.0
Highlights of the 4.1.1.0 release:
- A substantial (~33%) speedup to the HaplotypeCaller in GVCF mode (-ERC GVCF)
- Major updates to Mutect2, including completely overhauled filtering and smarter handling of overlapping read pairs.
- A tensorflow update for CNNScoreVariants that speeds up the tool by roughly 2X when using the 2D model.
- Important updates to the mitochondrial calling pipeline, and improved memory usage in the CNV pipeline.
- Important bug fixes to Funcotator, VariantEval, GenomicsDBImport, and other tools, as well as to the --pedigree argument for annotations.
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Full list of changes:
- HaplotypeCaller
  - Greatly improved the performance of the ReferenceConfidenceModel using dynamic programming and caching (#5607)
    - This speeds up whole-genome GVCF mode calling (-ERC GVCF) by ~33% in our tests!
  - Optimized some additional performance hotspots in the ReferenceConfidenceModel (#5616) (#5469) (#5652)
  - Can now write VCF outputs to Google Cloud Storage (GCS) (#5378)
  - Don't output variants with no ALT allele if the * (spanning deletion) allele gets dropped (#5844)
  - Added a --force-active argument that marks all regions as active. Useful for debugging/diagnostics. (#5635)
  - HaplotypeCallerSpark: made performance improvements to allow the tool to run on WGS in strict mode (#5721)
  - Fixed rare infinite recursion bug in KBestHaplotypeFinder (also affects Mutect2) (#5786)
- Mutect2
  - Overhaul of FilterMutectCalls, which now applies a single threshold to an overall error probability (#5688)
    - FilterMutectCalls automatically determines the optimal threshold.
    - The new somatic clustering model learns tumors' allele fraction spectra and overall SNV and indel mutation rates in order to improve filtering.
    - Includes a rewrite of Mutect2 documentation, with better organization and command line examples in addition to math.
  - Mutect2 now modifies base and indel qualities of overlapping paired reads to account for PCR error rather than discarding reads (#5794)
    - This especially improves indel sensitivity.
  - Optimized Mutect2 read orientation filtering by collecting F1R2 counts from within Mutect2 itself, greatly reducing wall-clock and CPU time (#5840)
  - New Mutect2 panel of normals workflow using GenomicsDB for scalability (#5675)
    - Panel of normals removes germline variants in order to contain only technical artifacts, and contains information about artifact prevalence
  - Rewrote Mutect2 active region likelihood as special case of full somatic likelihoods model, which reduces runtime by 5% (#5814)
  - Funcotator updates in Mutect2 WDL (#5742) (#5735)
  - Prune assembly graph before checking for cycles (#5562)
  - Refactor Mutect2 inheritance so that it doesn't have inactive arguments (#5758)
  - Added CRAM support to the Mutect2 WDL (#5668)
  - Split MNPs in Mutect2 PON WDL, fixing a potential bug (#5706)
  - Handle negative infinity log likelihoods from PairHMM in Mutect2 (#5736)
  - Fixed overfiltering in Mutect2 in GGA alleles mode with no reads (#5743)
  - Correct some Mutect2 VCF header lines (#5792)
  - Handle unmarked duplicates with mate MQ = 0 in Mutect2 (#5734)
  - Output sample names in Mutect2 PON header (#5733)
  - Avoid error due to finite precision error in Mutect2 PON creation (#5797)
  - Update Mutect2 javadoc to reflect v4.1 changes. (#5769)
  - Renamed the OxoGReadCounts annotation to OrientationBiasReadCounts (#5840)
- CNNScoreVariants
  - We now use the latest Intel-optimized tensorflow (#5725)
    - This speeds up the 2D CNN by roughly 2X in our tests!
  - FilterVariantTranches is out of beta (#5628)
  - Fixed CNNScoreVariants hanging when the conda environment is not set up (#5819)
    - We now make sure that the GATK tool Python package is present before executing streaming Python commands.
  - Extensive updates to the CNN WDLs (#5251)
- Mitochondrial Calling Pipeline
  - Added an option to recover all dangling branches, on by default for MT calling (#5693)
    - Fixes a large number of missed calls
  - Use adaptive pruning in the mitochondria pipeline (#5669)
  - Changed defaults in mitochondria mode in response to Mutect2 filtering overhaul (#5827)
  - Allowed the MT pipeline to work on bams with a mix of single and paired-end reads (#5818)
  - Added a hard filter to M2 for polymorphic NuMTs and low VAF sites (#5842)
  - Updated the haplochecker version to 0.1.2 to fix a bug with flipping the major and minor hg headers in its output (#5760)
  - Added the rest of the mitochondria joint-calling pipeline (#5673)
    - Merging and genotyping "somatic" GVCFs from Mutect2
  - Added a read filter for unmapped reads and their mates (#5826)
  - Refactored the MT WDL to make validations easier (#5708)
  - Updated a variable name in MT WDL to match gatk-workflows version (#5694)
- GenotypeGVCFs
  - Added an option to merge intervals for better GenotypeGVCFs performance on GenomicsDB exome input (#5741)
  - Trim per-allele FORMAT annotations and optionally retain raw AS annotations (#5833)
    - GenotypeGVCFs now uses the header info to determine if FORMAT lists need to be subset when alleles are dropped
    - Fixes "F1R2 and F2R2 annotations not updated by GenotypeGvcfs" (#5704)
- Funcotator
  - Non-locatable data sources can create funcotations again (#5774)
    - Fixes a bug where `Funcotator` was not adding funcotations from non-locatable data sources
  - Fixed handling of symbolic alleles when determining best transcript for `GencodeFuncotation` creation. (#5834)
  - `FilterFuncotations`: support for multi-allelic variants (#5588)
  - `FilterFuncotations`: support for gnomAD for allele frequency in `ClinVarFilter` and `LofFilter`, with a new argument telling it which dataset of gnomAD or ExAC to use (#5691)
  - Added `#` as a character to be sanitized by `VCFOutputRenderer` (#5817)
  - Added in Markdown files for Funcotator forum posts (#5630)
  - Updated `Funcotator` documentation with a FAQ section to respond to user comments (#5755)
- CNV Tools
  - Improved memory usage in gCNV (#5781)
  - Improved memory requirements of `CollectReadCounts` (#5715)
  - Added some fixes for minor CNV issues (#5699)
  - Added io_commons.read_csv to address issues with formatting of sample names in gCNV (#5811)
  - Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation (#5732)
- Miscellaneous Changes
  - `SelectVariants` can now write VCF outputs to Google Cloud Storage (GCS) (#5378)
  - `VariantEval` bug fix: don't require the output file to already exist (#5681)
  - Fixed the `--pedigree` argument in the `PossibleDeNovo` annotation (#5663)
  - `GenomicsDBImport`: fixed a core dump when querying overlapping deletions (#5799)
  - `GatherPileupSummaries`: a new tool that combines the output of `GetPileupSummaries` from disjoint scatter jobs (#5599)
  - `VariantsToTable`: add splitting for allele-specific annotations and ADs (#5697)
  - `CalculateGenotypePosteriors`: fix reported bug where no-call genotypes with no reads get genotype posterior probabilities and calls (#5667)
  - Added a new argument to Spark tools enabling the user to control whether to sort the reads on output (#4874)
  - `ReadsPipelineSpark`: fixed an "Interval not within the bounds of a contig" error (#5645)
  - `Concordance`: fixed the tool to allow for no variation alleles in the truth data. (#5718)
  - `ReblockGVCF`: fix sites with zero AD to actually use SITE-level DP value as intended in (#5835)
  - Change `UpdateVCFSequenceDictionary` to use the specified dictionary uniformly (#5093)
  - Fixed gatk-nightly Docker builds (https://hub.docker.com/r/broadinstitute/gatk-nightly/) (#5759)
  - Print the Picard/HTSJDK versions in addition to the GATK version when running with `--version` (#5757)
  - `IndexFeatureFile`: fixed a crash on VCFs with 0 records (#5795)
  - `PrintBGZFBlockInformation`: removed the file extension check so that we can accept bams (#5801)
  - Added a new read filter: `IntervalOverlapReadFilter` (#5656)
  - Add NIO Path support to `TableReader` and `TableWriter` (#5785)
  - Replaced `IntervalsSkipList` with `OverlapDetector` (#4154)
  - Removed some unused arguments in VCF merging code (#5745)
  - Kebab-case some arguments in `LocusWalker` and `LocusWalkerSpark` (#5770)
  - Removed an unnecessary IllegalArgumentException in `PairHMM` (#5705)
  - Removed accidental uses of log4j v1 (#5682)
  - Improvements to Spark evaluation scripts (#5815)
  - Extract tests from `PrintReadsIntegrationTest` to share with the Spark version. (#5689)
- Documentation
  - Improved the documentation for the `StrandOddsRatio` annotation (#5703)
  - Fixed the descriptions of some `HaplotypeCaller` arguments (#5658)
  - Update `VariantRecalibrator` example code to reflect new tagged argument syntax (#5710)
  - Corrected javadoc for the `InbreedingCoeff` annotation (#5768)
  - `CalculateGenotypePosteriors`: minor updates to javadoc and logger type (#5601)
  - Added and updated javadoc for `SortSamSpark` and `MarkDuplicatesSpark` (#5672)
  - Added a link to a "GitHub basics for researchers" article at top of the GATK README (#5643)
  - Updated the main GATK README to remove outdated references to the Intel conda environment (#5753)
  - Trimmed overly-long tool...
4.1.0.0
It's been a year since the GATK 4.0.0.0 release in January 2018, and we decided that it was time to package up the past year's worth of GATK improvements into a new major release, which we're calling version 4.1.0.0!
To commemorate this milestone, we'll be publishing a series of in-depth technical articles and blog posts covering the major new features in version 4.1.0.0 on the official GATK blog.
Below we've compiled the highlights of the new features added between versions 4.0.0.0 and 4.1.0.0. If you're interested in seeing only the changes between the last release (4.0.12.0) and this release (4.1.0.0), click here instead.
Official docker image is here: https://hub.docker.com/r/broadinstitute/gatk/
Major changes between versions 4.0.0.0 and 4.1.0.0 (January 2018 to January 2019):
- Next-Gen VQSR Replacement For Single-Sample
  - New suite of tools `CNNScoreVariants`, `CNNVariantTrain`, `CNNVariantWriteTensors`, and `FilterVariantTranches`
  - `CNNScoreVariants` is now out of beta and ready for production use
  - Performs variant training and scoring using a convolutional neural network.
  - Single-sample only
  - Produces better results than the legacy `VariantRecalibrator` (VQSR) and comparable or better results to third-party tools like `DeepVariant`
  - Sophisticated 2D model that uses the reads
- Major HaplotypeCaller Improvements
  - Now genotypes and outputs spanning deletions
  - Now outputs VCF spec-compliant phased variants
  - Can emit MNPs via a new `--max-mnp-distance` argument
  - Important fix to the reference confidence calculation upstream of indels
  - New `HaplotypeCaller` priors for variant sites and homRef blocks
    - Added new `--population-callset` argument allowing an external panel of variants to be specified to inform the frequency distribution underlying the genotype priors
    - Added new `--num-reference-samples-if-no-call` argument to control whether to infer (and with what effective strength) that only reference alleles were observed at sites not seen in any panel
- Major Mutect2 Improvements
  - `Mutect2` is now out of beta
  - Support for multi-sample calling
  - Lots of support for high-depth calling such as cfDNA, UMIs, mitochondria, including a new active region likelihood, probabilistic assembly graph pruning that adjusts to the local depth, a new mitochondria mode, and new filters for blood biopsy and mitochondria
  - Now outputs VCF spec-compliant phased variants
  - Can emit MNPs via a new `--max-mnp-distance` argument
  - Added a genotype given alleles (GGA) mode
  - New STR indel error model that improves sensitivity and precision in STR (short-tandem repeat) contexts
  - Many new/improved filters to reduce false positives (e.g., `FilterAlignmentArtifacts`)
  - Mutect2 now automatically recognizes and removes end repair artifacts in regions with inverted tandem repeats. This is extremely important for some FFPE samples.
  - New probabilistic orientation bias tool
  - Got rid of many questionable indels showing up in bamout of Mutect2 and the HaplotypeCaller
  - Big improvements to CalculateContamination, especially when tumor has lots of CNVs
  - NIO support in Mutect2 WDL
  - Significant speed improvements
  - Improved allele fraction estimation
  - Initial GVCF output support
- Mitochondrial Calling
  - Added `--mitochondria-mode` to `Mutect2` and `FilterMutectCalls`. This increases sensitivity and only applies filters that are optimized for mitochondria.
- New allele frequency / qual score model
  - Is now the default in `HaplotypeCaller` and `GenotypeGVCFs`
  - Optimized for greater speed, should resolve many `GenotypeGVCFs` memory issues
  - Rare numerical finite precision issues in the allele-specific qual have been resolved
- Major Improvements to the CNV (Copy Number Variation) tools
  - The CNV tools are now out of beta.
    - This includes the tools: `AnnotateIntervals`, `CallCopyRatioSegments`, `CollectAllelicCounts`, `CollectReadCounts`, `CreateReadCountPanelOfNormals`, `DenoiseReadCounts`, `DetermineGermlineContigPloidy`, `FilterIntervals`, `GermlineCNVCaller`, `ModelSegments`, `PostprocessGermlineCNVCalls`, `PreprocessIntervals`, `PlotDenoisedCopyRatios`, and `PlotModeledSegments`
  - Completed the `GermlineCNVCaller` (gCNV) pipeline and made various performance/runtime improvements to both the methods and WDLs.
  - Major changes include the addition of new tools (`PostprocessGermlineCNVCalls`, `FilterIntervals`, and `CollectReadCounts`, which replaces `CollectFragmentCounts`), as well as improvements to existing tools (notably, `AnnotateIntervals`).
  - Improved support for various formats, namely VCF output in the gCNV pipeline, IGV-compatible .seg output in the `ModelSegments` somatic CNV pipeline, and CRAM support for all CNV WDLs.
  - Developed tools and WDLs for tagging and filtering of germline events in the `ModelSegments` somatic CNV pipeline.
- Funcotator Official Release
  - Funcotator is now out of beta
  - Huge number of bug fixes and accuracy improvements. Output for several fields is now more correct than other well-known functional annotation tools.
  - Some new features include:
    - MAF output support
    - NIO support for datasources
    - gnomAD support
    - dbsnp support
    - Support for Mitochondrial amino acid sequence/protein change strings
    - 5'/3' flank support
    - Major performance improvements due to added caching
    - Added ALL mode for transcript selection (`--transcript-selection-mode ALL`) which will output full annotation fields for all transcripts
  - Created a new `FuncotatorDataSourceDownloader` tool to download data sources
  - Added an experimental `FilterFuncotations` tool
- MarkDuplicatesSpark is now a Validated, Scalable Replacement for MarkDuplicates
  - MarkDuplicatesSpark is now out of beta
  - Rewritten version of the tool matches Picard `MarkDuplicates` output and has greatly improved performance and scalability
  - Supports multiple BAM inputs
  - Indexes BAM outputs on-the-fly in parallel on a cluster
- Additional Tools Ported from GATK3
  - Ported `VariantAnnotator`
  - Ported `VariantEval`
  - Ported `FastaAlternateReferenceMaker` and `FastaReferenceMaker`
  - Ported `LeftAlignAndTrimVariants`
  - Restored `GenotypeGVCFs` `--include-non-variant-sites` argument
- Major Improvements to the SV (Structural Variation) Tools
  - Improvements to collection and calling of events based on discordant read pair evidence.
  - A new scaffolding algorithm greatly improves the contiguity of local assemblies, increasing sensitivity.
  - Regions of excessive sequencing depth are excluded from evidence collection and assembly, improving runtime performance.
  - A major overhaul of our algorithm for calling events based on local assemblies improves accuracy and allows for the accurate reporting of small complex SVs.
  - A machine learning (xgBoost) based classifier for SV evidence improves runtime and increases accuracy by determining which regions should be fed into the local assembly workflow.
- Spark Improvements
  - New Disq Spark library allows faster and more accurate loading of formats like BAM and VCF
  - `HaplotypeCallerSpark` now has a "strict mode" that closely matches the regular `HaplotypeCaller`
  - Created `RevertSamSpark`, a parallelized Spark version of Picard's `RevertSam` tool
  - Migrated most Spark tools that take a reference and/or VCF to use Spark's intrinsic file copying mechanism instead of broadcast to distribute the reference and VCFs to worker nodes -- a big performance win!
- GenomicsDB Improvements
  - Allele-specific annotation support
  - Multi-interval support (with some performance caveats)
  - Support for sites-only queries
  - Support for returning the GT field in queries
  - New protobuf-based API to allow configuration without editing JSON files
  - Added in machinery to allow per-annotation combine operations to be specified
  - Allow for hdfs and gcs URIs to be passed to GenomicsDB
  - Migrated from `com.intel.genomicsdb` to `org.genomicsdb`
- "Goodies" Worth Mentioning
  - Added fasta.gz support to the `-R/--reference` argument in walker tools
  - `SelectVariants` can now drop specific annotation fields from the output vcf
  - `CalculateGenotypePosteriors` now supports indels
  - New tool `ReblockGVCF` to merge reference blocks in single-sample GVCFs for smaller filesizes
  - Improved MQ calculation accuracy, especially at sites with many uninformative reads; concomitant with new annotation tag and format
  - The `-L` argument now supports GCS (Google Cloud Storage) for interval list files / bed / vcf files in walker tools
  - Added support for "Requester Pays" GCS (Google Cloud Storage) buckets via new `--gcs-project-for-requester-pays` argument
  - Added GCS (Google Cloud Storage) output (-O) support to more tools
  - Improved Python integration (eliminated timeouts and reliance on prompt synchronization) means fewer glitches during runs of ML-based tools
  - A significantly (~33%) smaller GATK docker image
  - Changed argument tagging syntax from "--arg tag:value" to "--arg:tag value"
    - Affects command-line interface for `VariantRecalibrator`, `VariantEval`, `VariantFiltration`, and `VariantAnnotator`
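For wrapper scripts that still assemble command lines in the old tagged form, the tagging change listed above ("--arg tag:value" to "--arg:tag value") can be illustrated with a small Python sketch. The helper name `migrate_tagged_args` and the placeholder tool and argument names are hypothetical and not part of GATK; the sketch assumes the tag is everything before the first colon of the value, so it should only be given argument names that are known to carry a tag.

```python
from typing import Iterable, List


def migrate_tagged_args(argv: List[str], tagged: Iterable[str]) -> List[str]:
    """Rewrite '--arg tag:value' pairs as '--arg:tag value'.

    Only argument names listed in `tagged` are touched. This is an
    illustrative helper, not a GATK API.
    """
    tagged_names = set(tagged)
    out: List[str] = []
    i = 0
    while i < len(argv):
        token = argv[i]
        has_value = i + 1 < len(argv)
        if token in tagged_names and has_value and ":" in argv[i + 1]:
            # Split on the first colon only, so colons inside the value survive.
            tag, value = argv[i + 1].split(":", 1)
            out.extend([f"{token}:{tag}", value])
            i += 2
        else:
            out.append(token)
            i += 1
    return out


# Placeholder command line in the old style.
old = ["SomeTool", "--arg", "tag:value", "-O", "out.vcf"]
new = migrate_tagged_args(old, tagged={"--arg"})
print(" ".join(new))
```

Restricting the rewrite to an explicit set of argument names avoids mangling untagged values that happen to contain a colon, such as `gs://` paths.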
Changes between versions 4.0.12.0 and 4.1.0.0 only:
- Many tools are now out of beta and ready for production use!
- `CNNScor...
4.0.12.0
Highlights of this release include support for outputting phased variants in HaplotypeCaller/Mutect2, restoring the --include-non-variant-sites argument to GenotypeGVCFs, a port of the GATK3 tool VariantEval, a new library (Disq, https://github.com/disq-bio/disq) for working with BAM/CRAM/VCF/etc. formats on Spark, and GCS (Google Cloud Storage) support in Funcotator.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
Full list of changes in this release:
- `HaplotypeCaller/Mutect2`
  - Output VCF spec-compliant phased variants in HaplotypeCaller and Mutect2
  - Added an experimental adaptive pruning option for local assembly (#5473)
  - Improved implementation of allele-specific new qual (#5460)
  - Use cigar complexity to break ties in uninformative reads' best haplotypes (#5359)
  - Improved handling of regions that are too short after trimming in HaplotypeCaller and in Mutect2 (Closes issue #5079)
  - Optimization in `CigarUtils` to shortcut to M-only CIGAR when provably optimal (#5466)
  - Changed SUPPORTED_ALLELES_TAG from SA to XA (#5418)
- `HaplotypeCaller`
- `Mutect2`
  - Big improvements to CalculateContamination's model for determining hom alt sites (#5413)
  - Reduce false negatives from mapping quality filter on long indels in Mutect2 (#5497)
  - Added a mismatch ratio option in realignment filter (#5501)
  - Made Mutect2 read position filter default much less stringent (#5487)
  - Fixed M2 bug for germline resources with AF=. (#5442)
  - Fix read position annotation bug in M2 filter (#5495)
  - Cleaner Mutect2 VCF fields (#5510)
  - Moved PerAlleleAnnotations to the INFO field (#5518)
  - Removed unnecessary inheritance of M2 filtering arguments collection (#5498)
- `GenotypeGVCFs`
  - Restored the --include-non-variant-sites argument from GATK3 to GenotypeGVCFs (#5219)
- Ported the GATK3 tool `VariantEval` to GATK4 (#5043)
- Replaced the Hadoop-BAM library with the newly-developed Disq library (https://github.com/disq-bio/disq) for efficiently working with BAM/CRAM/VCF/etc. formats on Spark (#5138)
  - Improves Spark performance across-the-board, and fixes many edge-case bugs in Hadoop-BAM
- `Funcotator`
  - Added GCS support to Funcotator data sources, so that data sources can now be accessed directly from GCS buckets (#5425)
  - Added support for annotating 5'/3' flanks (#5403)
  - Funcotator now creates default annotations for difficult variants. (#5374)
  - Funcotator now can create annotations for symbolic alleles and masked alleles (#5406)
  - Funcotator now can match between hg19 and b37 data sources. (#5491)
  - Added in regression tests and fixes for correctness of many annotations (#5302)
  - Now DE_NOVO_START_IN_FRAME and DE_NOVO_START_OUT_FRAME are correct. (#5357)
  - Added cDNA Strings for Intronic Variants (#5321)
  - VCF data sources create an ID field for the ID of the variant used for the annotation (#5327)
  - Funcotator now computes MT protein changes. (#5361)
  - Funcotator now correctly populates transcript position. (#5380)
  - Added a script that can create data sources from BED files. (#5438)
  - Updated testing Gencode data sources to fully exercise test data set (#5423)
  - Moved validation test data out of large files area. (#5381)
  - Updated top-level class documentation for Funcotator. (#4655)
  - Added scripts to liftover gnomAD. Also bugfixes for Funcotator NIO. (#5514)
- `HaplotypeCallerSpark`
- `MarkDuplicatesSpark`: Added a few of the remaining unimplemented useful features from Picard (#5377)
- CNV workflows
  - Changed `FilterIntervals` to operate on the intersection of intervals in all inputs. (#5408)
  - Fixed RAM usage parameter error in combine_tracks.wdl (#5358)
  - Various other improvements to combine_tracks.wdl (#5384)
  - Fixed gCNV WDL broken by Cromwell update on FireCloud. (#5407)
  - Replaced bash script in gCNV ScatterIntervals task with updated version of IntervalListTools. (#5414)
- `CNNScoreVariants`
  - Check for and require hardware AVX support (#5291)
- Changed `SelectVariants` so that it can handle multiple rsIDs separated by ';' in a VCF file (#5464)
- Miscellaneous Changes
  - Added `setIsUnplaced()` to the `GATKRead` API to distinguish reads with no mapping information (#5320)
  - Fixed an integer overflow bug in the `RMSMappingQuality` annotation (#5435)
  - Fixed floating-point bug in MannWhitneyU on some JVMs. (#5371)
  - Standardized the output argument for `LeftAlignIndels` (#5474)
  - `SplitIntervals` now produces an `.interval_list` file (#5392)
  - Fixed a bug with GATK_GCS_STAGING in the GATK launcher script #1338 (#5452)
  - Added ExampleReadWalkerWithVariantsSpark.java and tests (#5289)
  - Add description getter and javadoc in GATKReportTable (#5443)
  - Fixed message in GATKAnnotationPluginDescription (#5444)
  - Replaced some uses of PrintWriter (#5461)
  - Refactor GVCFWriter to allow push/pull iteration. (#5311)
  - Add scripts/dataproc-cluster-ui to release bundle. (#5401)
  - Marked `VariantAnnotator` as a `@DocumentedFeature` (#5480)
  - Removed obsolete intel conda environment references. (#5482)
  - Deleted the CountSet class (#5467)
  - Test framework: disabled gcloud login on travis for non-cloud non-wdl tests (#5335)
  - Updated Spark scripts to reflect changes from #5386 and #5127. (#5415)
  - Fixed jexl logging and updated VariantFiltration doc. (#5422)
  - Fixed some dead links in the README (#5405)
- Dependencies
4.0.11.0
A release which includes major improvements to Mitochondrial calling in Mutect2 as well as bug fixes and improvements:
As always a docker is available here: https://hub.docker.com/r/broadinstitute/gatk/
Mutect2 and HaplotypeCaller changes:
- Added `--mitochondria-mode` to `Mutect2` and `FilterMutectCalls`. This increases sensitivity and only applies filters that are optimized for mitochondria. A best practices WDL for calling mitochondrial variants on WGS data will be available in the future. (#5193)
- Strand based annotations will use both reads in an overlapping read pair (#5286)
- Realignment filter annotates the VCF with passing and failing read counts (#5328)
- New filters and annotation to support blood biopsy that count and filter based on N's at variant sites (#5317)
- Fixed bug for M2 GGA alleles with zero coverage (#5303)
- Fixed error in genotype given alleles mode when input alleles have genotypes (#5341) #5336
- Add new annotations to bamout to make understanding calls easier (#5215)
- Fixed a typo.
CNV Pipeline:
- Added FilterIntervals to perform annotation-based and count-based filtering in the gCNV pipeline. (#5307) closes #2992 #4558
Spark:
- Removed WellformedReadFilter from CountReadsSpark (#5329)
- Support fasta.gz in GATKSparkTool (#5290) closes #5258
Other:
- CNN variant update models validate scores cleanup training (#5175)
- combine_tracks.wdl supports GISTIC2 conversion (and bugfix) (#5287) closes #5284 #5283
- handle normal reads in validation sample in BasicSomaticValidator (#5322)
GenomicsDB:
- Allow for hdfs and gcs URIs to be passed to GenomicsDB (#5197)
SelectVariants:
SplitNCigarReads:
- Added defensive check to OverhangFixingManager splices for non-reference spanning reads (#5298) closes #5293
- Fixed SplitNCigarReads ArrayIndexOutOfBounds error for reads with long deletions (#5285) closes #5230
Testing:
4.0.10.1
This is a small release that improves the calculation of the MQ (mapping quality) annotation, which provides an estimate of the overall mapping quality of reads supporting a variant call. It also introduces a number of experimental improvements to the CNV workflows, as well as a bug fix to LocusWalkerSpark.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
Full list of changes in this release:
- Improve MQ calculation accuracy (#4969)
  - Change raw MQ to a tuple of (sumSquaredMQs, totalDepth) for better accuracy where there are lots of uninformative reads or called single-sample variants with homRef genotypes.
  - Note that incorporating this change into a pipeline will require a concomitant update to this version for GenomicsDBImport and GenotypeGVCFs.
- Updated `SimpleGermlineTagger` and somatic CNV experimental post-processing workflow with several experimental changes that improve precision results, and expand possible evaluations, of GATK CNV (#5252)
  - New script `combine_tracks.wdl` for post-processing somatic CNV calls. This wdl will perform two operations:
    - Increases precision by removing:
      - germline segments. As a result, the WDL requires the matched normal segments.
      - Areas of common germline activity or error from other cancer studies.
    - Converts the tumor model seg file to the same format as AllelicCapSeg, which can be read by ABSOLUTE. This is currently done inline in the WDL.
      - This is not a trivial conversion, since each segment must be called whether it is balanced or not (MAF =? 0.5). The current algorithm relies on hard filtering and may need updating pending evaluation.
      - For more information about AllelicCapSeg and ABSOLUTE, see:
        - Carter et al. Absolute quantification of somatic DNA alterations in human cancer, Nat Biotechnol. 2012 May; 30(5): 413–421
        - https://software.broadinstitute.org/cancer/cga/absolute
        - Brastianos, P.K., Carter S.L., et al. Genomic Characterization of Brain Metastases Reveals Branched Evolution and Potential Therapeutic Targets (2015) Cancer Discovery PMID:26410082
  - Changes to GATK tools to support the above:
    - `SimpleGermlineTagger` now uses reciprocal overlap in addition to breakpoint matching when determining a possible germline event. This greatly improved results in areas near centromeres.
    - Added tool `MergeAnnotatedRegionsByAnnotation`. This simple tool will merge genomic regions (specified in a tsv) when given annotations (columns) contain exact values in neighboring segments and the segments are within a specified maximum genomic distance.
  - New scripts `multi_combine_tracks.wdl` and `aggregate_combine_tracks.wdl` which run `combine_tracks.wdl` on multiple pairs and combine the results into one seg file for easy consumption by IGV.
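As a rough illustration of the reciprocal overlap criterion mentioned for `SimpleGermlineTagger`, the sketch below uses the common definition: the overlap length divided by the length of each segment, with a match requiring both fractions to reach a threshold. The function names, the half-open integer coordinates, and the 0.75 threshold are assumptions made for this example; the values and conventions actually used by the tool are not stated in these notes.

```python
def reciprocal_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Return the smaller of (overlap / len(a)) and (overlap / len(b)).

    Coordinates are treated as half-open [start, end) for this sketch.
    """
    len_a = a_end - a_start
    len_b = b_end - b_start
    if len_a <= 0 or len_b <= 0:
        return 0.0
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    return min(overlap / len_a, overlap / len_b)


def is_reciprocal_match(a_start: int, a_end: int, b_start: int, b_end: int,
                        threshold: float = 0.75) -> bool:
    """True when each segment covers at least `threshold` of the other.

    The default threshold is a hypothetical value chosen for illustration.
    """
    return reciprocal_overlap(a_start, a_end, b_start, b_end) >= threshold


# Two similar-sized segments that mostly coincide.
print(is_reciprocal_match(100, 200, 120, 220))
# A short segment sitting inside the edge of a much longer one.
print(is_reciprocal_match(100, 200, 150, 1000))
```

Unlike breakpoint matching alone, this criterion can pair segments whose endpoints differ somewhat, while still rejecting a short segment that merely falls inside a much longer one.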
- `LocusWalkerSpark`: fix issue where intervals with no reads were being dropped (#5222)
  - This fixes the bug reported in #3823
- Added `SparkTestUtils.roundTripThroughJavaSerialization()` method for better serialization testing on Spark (#5257)
- Build system: set the same compiler flags for all gradle JavaCompile tasks (#5256)
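The raw MQ tuple of (sumSquaredMQs, totalDepth) described under "Improve MQ calculation accuracy" lends itself to a short worked example: a root-mean-square mapping quality is the square root of the sum of squared qualities divided by the number of reads counted. The Python sketch below assumes that tuples from several samples are combined by element-wise addition before finalizing; the function names are illustrative and do not correspond to GATK classes or methods.

```python
import math
from typing import Iterable, Tuple

RawMQ = Tuple[int, int]  # (sumSquaredMQs, totalDepth)


def raw_mq(mapping_qualities: Iterable[int]) -> RawMQ:
    """Build the raw tuple from the mapping qualities of the counted reads."""
    quals = list(mapping_qualities)
    return sum(q * q for q in quals), len(quals)


def combine(raw_values: Iterable[RawMQ]) -> RawMQ:
    """Combine raw tuples by adding each component (assumed for this sketch)."""
    total_squares = 0
    total_depth = 0
    for sum_squares, depth in raw_values:
        total_squares += sum_squares
        total_depth += depth
    return total_squares, total_depth


def finalize_mq(raw: RawMQ) -> float:
    """Root-mean-square mapping quality; NaN when no reads were counted."""
    sum_squares, depth = raw
    if depth <= 0:
        return float("nan")
    return math.sqrt(sum_squares / depth)


sample_a = raw_mq([60, 60, 60, 60])  # four well-mapped reads
sample_b = raw_mq([20, 60])          # one poorly mapped read, one good
print(finalize_mq(combine([sample_a, sample_b])))
```

Carrying the depth alongside the sum of squares means the divisor always matches the reads that contributed to the numerator, which fits the note that the change helps at sites with many uninformative reads.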