
new PasteDepthEvidence tool #8031


Open · tedsharpe wants to merge 1 commit into master from tws_PasteDepthEvidence

Conversation

tedsharpe (Contributor)

Bincov generation in one step.
Need to decide whether we should sample the input to discover the bin size (as the current SetBins task does), whether we'll always supply it as an explicit parameter of the workflow, or whether the approach I took here (set it from the first row, unless explicitly provided) is adequate. If not, I'll write code to buffer some rows to figure out the bin size before writing anything to the output file.
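
(For illustration only — a minimal sketch of the "set it from the first row" fallback, with hypothetical names that are not the tool's actual code:)

// Hypothetical sketch: take the bin size from an explicit argument when
// given; otherwise infer it from the length of the first interval seen.
public final class BinSizeTracker {
    private int binSize; // 0 means "not yet determined"

    public BinSizeTracker( final int explicitBinSize ) {
        binSize = explicitBinSize;
    }

    public int binSize( final int start, final int end ) {
        if ( binSize == 0 ) {
            binSize = end - start + 1; // first row defines the bin size
        }
        return binSize;
    }
}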

codecov bot commented Sep 22, 2022

Codecov Report

Merging #8031 (8bf70b4) into master (0261d43) will decrease coverage by 0.014%.
The diff coverage is 81.051%.

❗ Current head 8bf70b4 differs from pull request most recent head 190ff8d. Consider uploading reports for the commit 190ff8d to get more accurate results.

Additional details and impacted files
@@               Coverage Diff               @@
##              master     #8031       +/-   ##
===============================================
- Coverage     86.648%   86.635%   -0.013%     
- Complexity     38974     38992       +18     
===============================================
  Files           2337      2341        +4     
  Lines         182791    182958      +167     
  Branches       20067     20114       +47     
===============================================
+ Hits          158385    158505      +120     
- Misses         17364     17398       +34     
- Partials        7042      7055       +13     
Impacted Files Coverage Δ
...e/hellbender/engine/FeatureDataSourceUnitTest.java 95.146% <ø> (+0.612%) ⬆️
...tute/hellbender/engine/FeatureManagerUnitTest.java 90.826% <ø> (ø)
...llbender/engine/FeatureSupportIntegrationTest.java 100.000% <ø> (ø)
...ute/hellbender/utils/codecs/AbstractTextCodec.java 28.571% <28.571%> (ø)
...adinstitute/hellbender/tools/IndexFeatureFile.java 61.333% <38.298%> (-38.667%) ⬇️
...lbender/utils/codecs/FeatureOutputCodecFinder.java 73.077% <50.000%> (-6.090%) ⬇️
...roadinstitute/hellbender/engine/FeatureWalker.java 92.308% <66.667%> (ø)
...stitute/hellbender/utils/io/TextFeatureReader.java 78.125% <78.125%> (ø)
...titute/hellbender/tools/sv/PasteDepthEvidence.java 78.788% <78.788%> (ø)
...e/hellbender/engine/TabularMultiFeatureWalker.java 80.000% <80.000%> (ø)
... and 76 more

tedsharpe force-pushed the tws_PasteDepthEvidence branch from bedf8d3 to fe03229 on October 15, 2022 12:36
gatk-bot commented Oct 15, 2022

GitHub Actions tests reported job failures from Actions build 3255713647
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud 8 3255713647.10 logs
cloud 11 3255713647.11 logs
unit 11 3255713647.13 logs
integration 11 3255713647.12 logs
unit 8 3255713647.1 logs
integration 8 3255713647.0 logs

gatk-bot commented Oct 15, 2022

GitHub Actions tests reported job failures from Actions build 3255830437
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 11 3255830437.13 logs
integration 11 3255830437.12 logs
unit 8 3255830437.1 logs
integration 8 3255830437.0 logs

gatk-bot commented Oct 15, 2022

GitHub Actions tests reported job failures from Actions build 3257265858
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 11 3257265858.13 logs
integration 11 3257265858.12 logs
unit 8 3257265858.1 logs
integration 8 3257265858.0 logs

gatk-bot commented Oct 16, 2022

GitHub Actions tests reported job failures from Actions build 3259842595
Failures in the following jobs:

Test Type JDK Job ID Logs
integration 11 3259842595.12 logs
integration 8 3259842595.0 logs

tedsharpe force-pushed the tws_PasteDepthEvidence branch from d6a5912 to 6da6c2a on October 18, 2022 21:00
tedsharpe (Contributor, Author) commented Oct 19, 2022

Rationale for engine changes:
This tool opens a large number of feature files (TSVs, not VariantContexts) and iterates over them simultaneously. No querying, just a single pass through each.
Issue 1: When a feature file lives in the cloud, it takes unacceptably long (typically several seconds) to initialize. A few seconds doesn't seem like a long time, but when there are large numbers of feature files to open, it adds up. The cost comes from a large number of codecs (mostly the VCF-processing codecs) opening and reading the first few bytes of the file in their canDecode methods. To avoid this I've reversed the order in which we test each codec: we first check whether it produces the correct subtype of Feature, and only then call canDecode. If you don't know what specific subtype you need, you can just ask for any Feature by passing Feature.class. It's much faster that way.
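
(In miniature, the reordering amounts to the sketch below; DISCOVERED_CODECS, featureType, and featurePath stand in for the engine's actual fields and parameters:)

// Sketch: do the cheap, in-memory type check first; the expensive
// canDecode() call, which may open and read a (possibly remote) file,
// runs only for codecs that produce the requested Feature subtype.
final List<FeatureCodec<? extends Feature, ?>> candidates = new ArrayList<>();
for ( final FeatureCodec<? extends Feature, ?> codec : DISCOVERED_CODECS ) {
    if ( featureType.isAssignableFrom(codec.getFeatureType()) &&
            codec.canDecode(featurePath.toString()) ) {
        candidates.add(codec);
    }
}
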
Issue 2: Each open feature source soaks up a huge amount of memory. That's because text-based feature reading is optimized for VCFs, which can have enormously long lines, so huge buffers are allocated. The problem is compounded for cloud-based feature files, for which we allocate a large cloud prefetch buffer. (That feature can be turned off, which helps a little.) But the biggest memory hog is the TabixReader, which always reads in the index, whether it's used or not, and tabix indices are very large. To avoid this, I've created a smaller, simpler FeatureReader subclass called TextFeatureReader that loads the index only when necessary. The revisions allow the new tool to run using an order of magnitude less memory. Faster, too.
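
(The lazy-loading idea, reduced to a sketch — illustrative names, not the actual TextFeatureReader internals:)

// Sketch: don't touch the (large) index at construction time; load it on
// first query, so a simple streaming pass never pays the cost.
private Index index; // null until a query actually needs it

private Index getIndex() {
    if ( index == null ) {
        index = IndexFactory.loadIndex(indexPath.toString());
    }
    return index;
}
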
Issue 3: The code in FeatureDataSource that creates a FeatureReader is brittle: it tests for various subclasses. To allow use of the new TextFeatureReader, I added a FeatureReaderFactory interface that lets one ask the codec for an appropriate FeatureReader.
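
(Roughly, the shape of the new interface — simplified here; the diff hunks below show that the real getReader takes additional arguments:)

// Sketch: a codec implementing this interface hands the engine a suitable
// FeatureReader itself, so FeatureDataSource needn't test for subclasses.
public interface FeatureReaderFactory<F extends Feature> {
    FeatureReader<F> getReader( FeatureInput<F> input );
}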

droazen (Contributor) commented Dec 6, 2022

Reassigning to @lbergelson to review the engine changes in this PR in my absence.

lbergelson (Member) left a comment:

@tedsharpe I think it looks good, but I have some minor comments. I'm not totally clear on the reimplementation of the feature reader, but I assume there's a good reason.

Do you not run into issues streaming these files without the cloud buffer enabled? I've seen absolutely horrible performance reading from GCS in other cases when we don't have it on.

I didn't review the actual content of the tools or codecs, just the infrastructure around them.

this.hasIndex = false;
this.supportsRandomAccess = true;
} else if (featureReader instanceof AbstractFeatureReader) {
this.hasIndex = ((AbstractFeatureReader<T, ?>)featureReader).hasIndex();
this.supportsRandomAccess = hasIndex;
} else {
throw new GATKException("Found a feature input that was neither GenomicsDB or a Tribble AbstractFeatureReader. Input was " + featureInput.toString() + ".");
lbergelson (Member):

This is a good change. These classes predated the addition of the isQueryable() method, I believe, and we apparently never got around to updating them. We should get the GenomicsDBReader to implement it, and then we can drop all the special cases.
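
(Hypothetically, once every reader implements it, the cascade above shrinks to something like:)

// Hypothetical simplification once GenomicsDB readers implement isQueryable():
this.supportsRandomAccess = featureReader.isQueryable();
// (hasIndex would likewise come from an accessor instead of instanceof casts)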

}
// Due to a bug in HTSJDK, unindexed block compressed input files may fail to parse completely. For safety,
// these files have been disabled. See https://github.com/broadinstitute/gatk/issues/4224 for discussion
if (!hasIndex && IOUtil.hasBlockCompressedExtension(featureInput.getFeaturePath())) {
lbergelson (Member):

I think we fixed this but I really don't remember all the details.

tedsharpe (Contributor, Author):

If you go to the referenced issue, you'll see that it's been addressed and closed.

-final List<FeatureCodec<? extends Feature, ?>> candidateCodecs = getCandidateCodecsForFile(featurePath);
+// Gather all discovered codecs that produce the right feature subtype and claim to be able
+// to decode the given file according to their canDecode() methods
+final List<FeatureCodec<? extends Feature, ?>> candidateCodecs = getCandidateCodecsForFile(featurePath, featureType);
lbergelson (Member):

👍

@@ -521,18 +510,22 @@ private <T extends Feature> FeatureDataSource<T> lookupDataSource( final Feature
* @param featureFile file for which to find potential codecs
* @return A List of all codecs in DISCOVERED_CODECS for which {@link FeatureCodec#canDecode(String)} returns true on the specified file
*/
-private static List<FeatureCodec<? extends Feature, ?>> getCandidateCodecsForFile( final Path featureFile ) {
+private static List<FeatureCodec<? extends Feature, ?>> getCandidateCodecsForFile( final Path featureFile, final Class<? extends Feature> featureType ) {
lbergelson (Member):

I think it makes a lot of sense to move the filtering here but the javadoc needs to be updated.

import java.util.*;

/**
* A MergingMultiFeatureWalker is a base class for a tool that processes one {@link Feature} at a
lbergelson (Member):

Someday we'll rebase variant walkers on top of Feature walkers...

tedsharpe (Contributor, Author):

Some dreamy day maybe we'll rewrite the FeatureCodec interface to eliminate the goofy, unused methods, and require the necessary methods that currently only appear in subtypes.

lbergelson (Member):

We did! At least for Reads/Variants/Reference. FeatureCodec wasn't implemented yet... Check out htsjdk.beta.plugin.

final InputStream is =
new BlockCompressedInputStream(
new BufferedInputStream(Channels.newInputStream(sbc), 64*1024));
is.skip(virtualFileOffset & 0xffffL);
lbergelson (Member):

I think this is allowed to skip fewer bytes than you specified. You should probably check the return value and throw if it's short.
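
(For instance — a sketch of a defensive skip loop, not the PR's actual fix:)

// InputStream.skip() may skip fewer bytes than requested, so loop until
// the whole offset is consumed, and fail loudly if no progress is made.
long remaining = virtualFileOffset & 0xffffL;
while ( remaining > 0 ) {
    final long skipped = is.skip(remaining);
    if ( skipped <= 0 ) {
        throw new IOException("failed to skip to virtual file offset; " + remaining + " bytes remaining");
    }
    remaining -= skipped;
}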

public void close() {}

@Override
public List<String> getSequenceNames() {
lbergelson (Member):

I'm assuming the codecs this supports don't include a sequence dictionary in their header? Should they?

tedsharpe (Contributor, Author):

You noted that the TextFeatureReader returns an empty list from its getSequenceNames method: not to despair, this method isn't used by anyone.
You asked, "Shouldn't the files that this reader supports include sequence dictionary metadata?" Answer: these are legacy SV pipeline file formats. Might've been nice, but they don't, and that horse has left the barn. Someday we may be able to switch the SV pipeline over to BlockCompressedInterval files, which do have metadata.

this.path = path;
try {
InputStream in = new BufferedInputStream(Files.newInputStream(path), 256*1024);
if ( path.toString().endsWith(".gz") ) {
lbergelson (Member):

It's safer to use htsjdk's IOUtil.isBlockCompressed() because it handles things like http query parameter stripping.
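
(For instance — a sketch using the closely related extension check; htsjdk's IOUtil also offers a content-based isBlockCompressed(Path):)

// Sketch: let htsjdk decide whether the path names a block-compressed
// file (it knows .gz/.bgz/etc. and strips URI query parameters first).
if ( IOUtil.hasBlockCompressedExtension(path.toUri()) ) {
    in = new BlockCompressedInputStream(in);
}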

}
} while ( nextFeature != null );
} catch ( final IOException ioe ) {
throw new UserException("Can't read feature from " + path);
lbergelson (Member):

Attach the IOException here.
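
(i.e., presumably via UserException's cause-taking constructor:)

throw new UserException("Can't read feature from " + path, ioe);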

try {
queryable = AbstractFeatureReader.isTabix(path.toString(), null);
} catch ( final IOException ioe ) {
throw new UserException("Error while checking for existence of index for " + path);
lbergelson (Member):

Attach the exception here, too.

tedsharpe (Contributor, Author) commented Dec 19, 2022

Results of merging 3 files with ~10,000,000 features each from a GCS bucket, using my home Linux box:
CPB (cloud prefetch buffer) 1 is not quite twice as fast as CPB 0, so you were right. CPB > 1 just slows things down.

(Earlier comments erased, because they were stupid: I forgot that I hadn't hooked up prefetch.)

lbergelson (Member):

@tedsharpe Thanks for checking. In general I've seen that the CPB tends to help a lot when reading through long contiguous stretches of a BAM file, and less when doing anything on smaller or fragmented data. I'm surprised it didn't make any difference here, but it seems like it doesn't, so that's fine.

I've seen catastrophic interactions where an insufficiently buffered index input, read with prefetch disabled, ended up performing an HTTP request for every byte, but hopefully that's avoided here just by using a buffered reader.

I have a plan to someday enable a smarter -L-aware prefetcher that will use the list of actual positions of interest to buffer more intelligently, but that's not happening on any specific schedule.

tedsharpe (Contributor, Author):

I believe all your comments have been addressed.
Thanks very much for the review. It was helpful.

tedsharpe (Contributor, Author):

Oh: the main reason for reimplementing a text-based feature reader is that the existing Tabix reader reads the entire index (which is huge) into memory right in the constructor. Also, the abstract tabix parsing code doesn't work on some of the weird files people have invented with odd combinations of header lines. I think there were other reasons, too, that I've forgotten. I studied it for a while to see if I could fix it, but eventually concluded that it was hopeless.

tedsharpe requested review from lbergelson and removed the request for droazen on December 23, 2022 16:48
lbergelson (Member) left a comment:

@tedsharpe Thanks for the changes. I appreciate your respecting the horrible wrapper arguments for parity's sake. Those are on the infinite list of things to improve so they're not as hideous, but I don't know if we'll ever get there.

I think autorename mangled your example walker's name. Should be good to merge when that's fixed.

* To use this walker you need only implement the abstract {@link #traverse()} method in a class
* that declares a collection of FeatureInputs as an argument.
* There are two subclasses that may be convenient:
* The MergingMultiFeatureWalker presents one feature at a time, in sorted order, by merging
lbergelson (Member):

Thanks, this is super helpful for future readers.

programGroup = ExampleProgramGroup.class
)
-public class ExampleMergingMultiFeatureWalker extends MergingMultiFeatureWalker<Feature> {
+public class ExampleMergingMultiFeatureWalkerBase extends MergingMultiFeatureWalker<Feature> {
lbergelson (Member):

I think this is a casualty of auto rename.

@@ -17,8 +17,10 @@
implements FeatureOutputCodec<F, Writer<F>>, FeatureCodec<F, Reader<F>>, FeatureReaderFactory<F> {

@Override
-public Reader<F> getReader( final FeatureInput<F> input ) {
-    return new Reader<F>(input, this);
+public Reader<F> getReader( final FeatureInput<F> input,
lbergelson (Member):

Thanks, I appreciate you including these even if they're not super helpful, just for parity. A low-priority future task is to wrap these in an options class of some sort so it's less hideous.

@@ -9,7 +9,7 @@
import org.testng.Assert;
import org.testng.annotations.Test;

-public class ExampleMergingMultiFeatureWalkerIntegrationTest extends CommandLineProgramTest {
+public class ExampleMergingMultiFeatureWalkerBaseIntegrationTest extends CommandLineProgramTest {
lbergelson (Member):

autorename?

lbergelson (Member):

@tedsharpe So we've addressed a bunch of the issues you've mentioned in the new htsjdk.beta reader APIs. In particular, we now optimize the decoding checks so that we scan the beginning of each file exactly once and then give the necessary bytes to all available codecs. That should work in combination with your optimization to push down the filtering, which makes a lot of sense as well.

Lazily loading the indexes when you need to query is a good idea. I think the thought behind aggressively loading them was probably to surface potential errors up front instead of waiting until later, and it's confounded by the bad choice of automatically discovering indexes by default, so the engine can't tell whether there SHOULD be an index until it tries and fails to load it. We've also addressed that in the new APIs by giving the client more control over how indexes are discovered, which should allow for cleaner lazy loading and the like.

lbergelson assigned tedsharpe and unassigned lbergelson on Dec 23, 2022
tedsharpe force-pushed the tws_PasteDepthEvidence branch from 8bf70b4 to 190ff8d on December 24, 2022 13:20