
new PasteDepthEvidence tool #8031


Open · tedsharpe wants to merge 1 commit into master from tws_PasteDepthEvidence

Conversation

tedsharpe (Contributor)

Bincov generation in one step.
Need to decide whether we should sample the input to discover the bin size (as the current SetBins task does), whether we'll always supply it as an explicit parameter of the workflow, or whether the approach I took here (set it from the first row, unless explicitly provided) is adequate. If not, I'll write code to buffer some rows to figure out the bin size before writing anything to the output file.
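
(For illustration only — a minimal sketch of the "set it from the first row" fallback, with hypothetical names that are not the tool's actual code:)

// Hypothetical sketch: take the bin size from an explicit argument when
// given; otherwise infer it from the length of the first interval seen.
public final class BinSizeTracker {
    private int binSize; // 0 means "not yet determined"

    public BinSizeTracker( final int explicitBinSize ) {
        binSize = explicitBinSize;
    }

    public int binSize( final int start, final int end ) {
        if ( binSize == 0 ) {
            binSize = end - start + 1; // first row defines the bin size
        }
        return binSize;
    }
}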

codecov bot commented Sep 22, 2022

Codecov Report

Merging #8031 (8bf70b4) into master (0261d43) will decrease coverage by 0.014%.
The diff coverage is 81.051%.

❗ Current head 8bf70b4 differs from pull request most recent head 190ff8d. Consider uploading reports for the commit 190ff8d to get more accurate results.

Additional details and impacted files
@@               Coverage Diff               @@
##              master     #8031       +/-   ##
===============================================
- Coverage     86.648%   86.635%   -0.013%     
- Complexity     38974     38992       +18     
===============================================
  Files           2337      2341        +4     
  Lines         182791    182958      +167     
  Branches       20067     20114       +47     
===============================================
+ Hits          158385    158505      +120     
- Misses         17364     17398       +34     
- Partials        7042      7055       +13     
Impacted Files Coverage Δ
...e/hellbender/engine/FeatureDataSourceUnitTest.java 95.146% <ø> (+0.612%) ⬆️
...tute/hellbender/engine/FeatureManagerUnitTest.java 90.826% <ø> (ø)
...llbender/engine/FeatureSupportIntegrationTest.java 100.000% <ø> (ø)
...ute/hellbender/utils/codecs/AbstractTextCodec.java 28.571% <28.571%> (ø)
...adinstitute/hellbender/tools/IndexFeatureFile.java 61.333% <38.298%> (-38.667%) ⬇️
...lbender/utils/codecs/FeatureOutputCodecFinder.java 73.077% <50.000%> (-6.090%) ⬇️
...roadinstitute/hellbender/engine/FeatureWalker.java 92.308% <66.667%> (ø)
...stitute/hellbender/utils/io/TextFeatureReader.java 78.125% <78.125%> (ø)
...titute/hellbender/tools/sv/PasteDepthEvidence.java 78.788% <78.788%> (ø)
...e/hellbender/engine/TabularMultiFeatureWalker.java 80.000% <80.000%> (ø)
... and 76 more

tedsharpe force-pushed the tws_PasteDepthEvidence branch from bedf8d3 to fe03229 on October 15, 2022 12:36
gatk-bot commented Oct 15, 2022

GitHub Actions tests reported job failures from Actions build 3255713647
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud 8 3255713647.10 logs
cloud 11 3255713647.11 logs
unit 11 3255713647.13 logs
integration 11 3255713647.12 logs
unit 8 3255713647.1 logs
integration 8 3255713647.0 logs

gatk-bot commented Oct 15, 2022

GitHub Actions tests reported job failures from Actions build 3255830437
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 11 3255830437.13 logs
integration 11 3255830437.12 logs
unit 8 3255830437.1 logs
integration 8 3255830437.0 logs

gatk-bot commented Oct 15, 2022

GitHub Actions tests reported job failures from Actions build 3257265858
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 11 3257265858.13 logs
integration 11 3257265858.12 logs
unit 8 3257265858.1 logs
integration 8 3257265858.0 logs

gatk-bot commented Oct 16, 2022

GitHub Actions tests reported job failures from Actions build 3259842595
Failures in the following jobs:

Test Type JDK Job ID Logs
integration 11 3259842595.12 logs
integration 8 3259842595.0 logs

tedsharpe force-pushed the tws_PasteDepthEvidence branch from d6a5912 to 6da6c2a on October 18, 2022 21:00
tedsharpe (Contributor, Author) commented Oct 19, 2022

Rationale for engine changes:
This tool opens a large number of feature files (TSVs, not VariantContexts) and iterates over them simultaneously. No querying, just a single pass through each.
Issue 1: When a feature file lives in the cloud, it takes unacceptably long (typically several seconds) to initialize. A few seconds doesn't seem like a long time, but when there are large numbers of feature files to open, it adds up. The cost comes from a large number of codecs (mostly the VCF-processing codecs) opening and reading the first few bytes of the file in their canDecode methods. To avoid this I've reversed the order in which we test each codec: we first check whether it produces the correct subtype of Feature, and only then call canDecode. If you don't know what specific subtype you need, you can just ask for any Feature by passing Feature.class. It's much faster that way.
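
(In miniature, the reordering amounts to the sketch below; DISCOVERED_CODECS, featureType, and featurePath stand in for the engine's actual fields and parameters:)

// Sketch: do the cheap, in-memory type check first; the expensive
// canDecode() call, which may open and read a (possibly remote) file,
// runs only for codecs that produce the requested Feature subtype.
final List<FeatureCodec<? extends Feature, ?>> candidates = new ArrayList<>();
for ( final FeatureCodec<? extends Feature, ?> codec : DISCOVERED_CODECS ) {
    if ( featureType.isAssignableFrom(codec.getFeatureType()) &&
            codec.canDecode(featurePath.toString()) ) {
        candidates.add(codec);
    }
}
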
Issue 2: Each open feature source soaks up a huge amount of memory. That's because text-based feature reading is optimized for VCFs, which can have enormously long lines, so huge buffers are allocated. The problem is compounded for cloud-based feature files, for which we allocate a large cloud prefetch buffer. (That feature can be turned off, which helps a little.) But the biggest memory hog is the TabixReader, which always reads in the index, whether it's used or not, and tabix indices are very large. To avoid this, I've created a smaller, simpler FeatureReader subclass called TextFeatureReader that loads the index only when necessary. The revisions allow the new tool to run using an order of magnitude less memory. Faster, too.
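
(The lazy-loading idea, reduced to a sketch — illustrative names, not the actual TextFeatureReader internals:)

// Sketch: don't touch the (large) index at construction time; load it on
// first query, so a simple streaming pass never pays the cost.
private Index index; // null until a query actually needs it

private Index getIndex() {
    if ( index == null ) {
        index = IndexFactory.loadIndex(indexPath.toString());
    }
    return index;
}
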
Issue 3: The code in FeatureDataSource that creates a FeatureReader is brittle: it tests for various subclasses. To allow use of the new TextFeatureReader, I added a FeatureReaderFactory interface that lets one ask the codec for an appropriate FeatureReader.
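
(Roughly, the shape of the new interface — simplified here; the diff hunks below show that the real getReader takes additional arguments:)

// Sketch: a codec implementing this interface hands the engine a suitable
// FeatureReader itself, so FeatureDataSource needn't test for subclasses.
public interface FeatureReaderFactory<F extends Feature> {
    FeatureReader<F> getReader( FeatureInput<F> input );
}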

droazen (Contributor) commented Dec 6, 2022

Reassigning to @lbergelson to review the engine changes in this PR in my absence.

lbergelson (Member) left a comment:

@tedsharpe I think it looks good, but I have some minor comments. I'm not totally clear on the reimplementation of the feature reader, but I assume there's a good reason.

Do you not run into issues streaming these files without the cloud buffer enabled? I've seen absolutely horrible performance reading from GCS in other cases when we don't have it on.

I didn't review the actual content of the tools or codecs, just the infrastructure around them.

this.hasIndex = false;
this.supportsRandomAccess = true;
} else if (featureReader instanceof AbstractFeatureReader) {
this.hasIndex = ((AbstractFeatureReader<T, ?>)featureReader).hasIndex();
this.supportsRandomAccess = hasIndex;
} else {
throw new GATKException("Found a feature input that was neither GenomicsDB or a Tribble AbstractFeatureReader. Input was " + featureInput.toString() + ".");
lbergelson (Member):

This is a good change. These classes predated the addition of the isQueryable() method, I believe, and we apparently never got around to updating them. We should get the GenomicsDBReader to implement it, and then we can drop all the special cases.
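
(Hypothetically, once every reader implements it, the cascade above shrinks to something like:)

// Hypothetical simplification once GenomicsDB readers implement isQueryable():
this.supportsRandomAccess = featureReader.isQueryable();
// (hasIndex would likewise come from an accessor instead of instanceof casts)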

}
// Due to a bug in HTSJDK, unindexed block compressed input files may fail to parse completely. For safety,
// these files have been disabled. See https://github.com/broadinstitute/gatk/issues/4224 for discussion
if (!hasIndex && IOUtil.hasBlockCompressedExtension(featureInput.getFeaturePath())) {
lbergelson (Member):

I think we fixed this but I really don't remember all the details.

tedsharpe (Contributor, Author):

If you go to the referenced issue, you'll see that it's been addressed and closed.

-final List<FeatureCodec<? extends Feature, ?>> candidateCodecs = getCandidateCodecsForFile(featurePath);
+// Gather all discovered codecs that produce the right feature subtype and claim to be able
+// to decode the given file according to their canDecode() methods
+final List<FeatureCodec<? extends Feature, ?>> candidateCodecs = getCandidateCodecsForFile(featurePath, featureType);
lbergelson (Member):

👍

@@ -521,18 +510,22 @@ private <T extends Feature> FeatureDataSource<T> lookupDataSource( final Feature
* @param featureFile file for which to find potential codecs
* @return A List of all codecs in DISCOVERED_CODECS for which {@link FeatureCodec#canDecode(String)} returns true on the specified file
*/
-private static List<FeatureCodec<? extends Feature, ?>> getCandidateCodecsForFile( final Path featureFile ) {
+private static List<FeatureCodec<? extends Feature, ?>> getCandidateCodecsForFile( final Path featureFile, final Class<? extends Feature> featureType ) {
lbergelson (Member):

I think it makes a lot of sense to move the filtering here but the javadoc needs to be updated.

import java.util.*;

/**
* A MergingMultiFeatureWalker is a base class for a tool that processes one {@link Feature} at a
lbergelson (Member):

Someday we'll rebase variant walkers on top of Feature walkers...

tedsharpe (Contributor, Author):

Some dreamy day maybe we'll rewrite the FeatureCodec interface to eliminate the goofy, unused methods, and require the necessary methods that currently only appear in subtypes.

lbergelson (Member):

We did! At least for Reads/Variants/Reference. FeatureCodec wasn't implemented yet... Check out htsjdk.beta.plugin.

final InputStream is =
new BlockCompressedInputStream(
new BufferedInputStream(Channels.newInputStream(sbc), 64*1024));
is.skip(virtualFileOffset & 0xffffL);
lbergelson (Member):

I think this is allowed to skip fewer bytes than you specified. You should probably check the return value and throw if it's short.
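
(For instance — a sketch of a defensive skip loop, not the PR's actual fix:)

// InputStream.skip() may skip fewer bytes than requested, so loop until
// the whole offset is consumed, and fail loudly if no progress is made.
long remaining = virtualFileOffset & 0xffffL;
while ( remaining > 0 ) {
    final long skipped = is.skip(remaining);
    if ( skipped <= 0 ) {
        throw new IOException("failed to skip to virtual file offset; " + remaining + " bytes remaining");
    }
    remaining -= skipped;
}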

public void close() {}

@Override
public List<String> getSequenceNames() {
lbergelson (Member):

I'm assuming the codecs this supports don't include a sequence dictionary in their header? Should they?

tedsharpe (Contributor, Author):

You noted that the TextFeatureReader returns an empty list from its getSequenceNames method: not to despair, this method isn't used by anyone.
You asked, "Shouldn't the files that this reader supports include sequence dictionary metadata?" Answer: these are legacy SV pipeline file formats. Might've been nice, but they don't, and that horse has left the barn. Someday we may be able to switch the SV pipeline over to BlockCompressedInterval files, which do have metadata.

this.path = path;
try {
InputStream in = new BufferedInputStream(Files.newInputStream(path), 256*1024);
if ( path.toString().endsWith(".gz") ) {
lbergelson (Member):

It's safer to use htsjdk's IOUtil.isBlockCompressed() because it handles things like http query parameter stripping.
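
(For instance — a sketch using the closely related extension check; htsjdk's IOUtil also offers a content-based isBlockCompressed(Path):)

// Sketch: let htsjdk decide whether the path names a block-compressed
// file (it knows .gz/.bgz/etc. and strips URI query parameters first).
if ( IOUtil.hasBlockCompressedExtension(path.toUri()) ) {
    in = new BlockCompressedInputStream(in);
}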

}
} while ( nextFeature != null );
} catch ( final IOException ioe ) {
throw new UserException("Can't read feature from " + path);
lbergelson (Member):

Attach the IOException here.
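
(i.e., presumably via UserException's cause-taking constructor:)

throw new UserException("Can't read feature from " + path, ioe);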

try {
queryable = AbstractFeatureReader.isTabix(path.toString(), null);
} catch ( final IOException ioe ) {
throw new UserException("Error while checking for existence of index for " + path);
lbergelson (Member):

Attach the exception here, too.

tedsharpe (Contributor, Author) commented Dec 19, 2022

Results of merging 3 files with ~10,000,000 features each from a GCS bucket, using my home Linux box:
CPB (cloud prefetch buffer) 1 is not quite twice as fast as CPB 0, so you were right. CPB > 1 just slows things down.

(Earlier comments erased, because they were stupid: I forgot that I hadn't hooked up prefetch.)

lbergelson (Member):

@tedsharpe Thanks for checking. In general I've seen that the CPB tends to help a lot when reading through long contiguous stretches of a BAM file, and less when doing anything on smaller or fragmented data. I'm surprised it didn't make any difference here, but it seems like it doesn't, so that's fine.

I've seen catastrophic interactions where an insufficiently buffered index input, read with prefetch disabled, ended up performing an HTTP request for every byte, but hopefully that's avoided here just by using a buffered reader.

I have a plan to someday enable a smarter -L-aware prefetcher that will use the list of actual positions of interest to buffer more intelligently, but that's not happening on any specific schedule.

tedsharpe (Contributor, Author):

I believe all your comments have been addressed.
Thanks very much for the review. It was helpful.

tedsharpe (Contributor, Author):

Oh: the main reason for reimplementing a text-based feature reader is that the existing Tabix reader reads the entire index (which is huge) into memory right in the constructor. Also, the abstract tabix parsing code doesn't work on some of the weird files people have invented with odd combinations of header lines. I think there were other reasons, too, that I've forgotten. I studied it for a while to see if I could fix it, but eventually concluded that it was hopeless.

tedsharpe requested review from lbergelson and removed the request for droazen on December 23, 2022 16:48
lbergelson (Member) left a comment:

@tedsharpe Thanks for the changes. I appreciate your respecting the horrible wrapper arguments for parity's sake. Those are on the infinite list of things to improve so they're not as hideous, but I don't know if we'll ever get there.

I think autorename mangled your example walker's name. Should be good to merge when that's fixed.

* To use this walker you need only implement the abstract {@link #traverse()} method in a class
* that declares a collection of FeatureInputs as an argument.
* There are two subclasses that may be convenient:
* The MergingMultiFeatureWalker presents one feature at a time, in sorted order, by merging
lbergelson (Member):

Thanks, this is super helpful for future readers.

programGroup = ExampleProgramGroup.class
)
-public class ExampleMergingMultiFeatureWalker extends MergingMultiFeatureWalker<Feature> {
+public class ExampleMergingMultiFeatureWalkerBase extends MergingMultiFeatureWalker<Feature> {
lbergelson (Member):

I think this is a casualty of auto rename.

@@ -17,8 +17,10 @@
implements FeatureOutputCodec<F, Writer<F>>, FeatureCodec<F, Reader<F>>, FeatureReaderFactory<F> {

@Override
-public Reader<F> getReader( final FeatureInput<F> input ) {
-    return new Reader<F>(input, this);
+public Reader<F> getReader( final FeatureInput<F> input,
lbergelson (Member):

Thanks, I appreciate you including these even if they're not super helpful, just for parity. A low-priority future task is to wrap these in an options class of some sort so it's less hideous.

@@ -9,7 +9,7 @@
import org.testng.Assert;
import org.testng.annotations.Test;

-public class ExampleMergingMultiFeatureWalkerIntegrationTest extends CommandLineProgramTest {
+public class ExampleMergingMultiFeatureWalkerBaseIntegrationTest extends CommandLineProgramTest {
lbergelson (Member):

autorename?

lbergelson (Member):

@tedsharpe So we've addressed a bunch of the issues you've mentioned in the new htsjdk.beta reader APIs. In particular, we now optimize the decoding checks so that we scan the beginning of each file exactly once and then give the necessary bytes to all available codecs. That should work in combination with your optimization to push down the filtering, which makes a lot of sense as well.

Lazily loading the indexes when you need to query is a good idea. I think the thought behind aggressively loading them was probably to surface potential errors up front instead of waiting until later, and it's confounded by the bad choice of automatically discovering indexes by default, so the engine can't tell whether there SHOULD be an index until it tries and fails to load it. We've also addressed that in the new APIs by giving the client more control over how indexes are discovered, which should allow for cleaner lazy loading and the like.

lbergelson assigned tedsharpe and unassigned lbergelson on Dec 23, 2022
tedsharpe force-pushed the tws_PasteDepthEvidence branch from 8bf70b4 to 190ff8d on December 24, 2022 13:20