new PasteDepthEvidence tool #8031
base: master
Conversation
Codecov Report
@@             Coverage Diff              @@
##              master     #8031      +/-  ##
===============================================
- Coverage     86.648%   86.635%   -0.013%
- Complexity     38974     38992       +18
===============================================
  Files           2337      2341        +4
  Lines         182791    182958      +167
  Branches       20067     20114       +47
===============================================
+ Hits          158385    158505      +120
- Misses         17364     17398       +34
- Partials        7042      7055       +13
Force-pushed from bedf8d3 to fe03229.
Github actions tests reported job failures from actions build 3255713647
Github actions tests reported job failures from actions build 3255830437
Github actions tests reported job failures from actions build 3257265858
Github actions tests reported job failures from actions build 3259842595
Force-pushed from d6a5912 to 6da6c2a.
Rationale for engine changes:
Reassigning to @lbergelson to review the engine changes in this PR in my absence.
@tedsharpe I think it looks good, but I have some minor comments. I'm not totally clear on the reimplementation of the feature reader, but I assume there's a good reason.
Do you not run into issues streaming these files without having the cloud buffer enabled? I've seen absolutely horrible performance reading from GCS in other cases when we don't have it on.
I didn't review the actual content of the tools or codecs, just the infrastructure around them.
    this.hasIndex = false;
    this.supportsRandomAccess = true;
} else if (featureReader instanceof AbstractFeatureReader) {
    this.hasIndex = ((AbstractFeatureReader<T, ?>)featureReader).hasIndex();
    this.supportsRandomAccess = hasIndex;
} else {
    throw new GATKException("Found a feature input that was neither GenomicsDB or a Tribble AbstractFeatureReader. Input was " + featureInput.toString() + ".");
This is a good change. These classes predated the addition of the isQueryable() method, I believe, and we apparently never got around to updating them. We should get the GenomicsDBReader to implement it, and then we can drop all the special cases.
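For illustration only, a minimal sketch of what this branch could collapse to once GenomicsDBReader (and every other reader constructed here) implements isQueryable(); the field names follow the snippet above, and isQueryable() being available on the common reader type is the premise of this comment rather than something verified here:

// Hypothetical simplification: random-access support no longer needs
// per-implementation special cases once isQueryable() is universal.
this.supportsRandomAccess = featureReader.isQueryable();
this.hasIndex = featureReader instanceof AbstractFeatureReader
        && ((AbstractFeatureReader<T, ?>)featureReader).hasIndex();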
}
// Due to a bug in HTSJDK, unindexed block compressed input files may fail to parse completely. For safety,
// these files have been disabled. See https://github.com/broadinstitute/gatk/issues/4224 for discussion
if (!hasIndex && IOUtil.hasBlockCompressedExtension(featureInput.getFeaturePath())) {
I think we fixed this but I really don't remember all the details.
If you go to the referenced issue, you'll see that it's been addressed and closed.
final List<FeatureCodec<? extends Feature, ?>> candidateCodecs = getCandidateCodecsForFile(featurePath);
// Gather all discovered codecs that produce the right feature subtype and claim to be able
// to decode the given file according to their canDecode() methods
final List<FeatureCodec<? extends Feature, ?>> candidateCodecs = getCandidateCodecsForFile(featurePath, featureType);
👍
@@ -521,18 +510,22 @@ private <T extends Feature> FeatureDataSource<T> lookupDataSource( final Feature
 * @param featureFile file for which to find potential codecs
 * @return A List of all codecs in DISCOVERED_CODECS for which {@link FeatureCodec#canDecode(String)} returns true on the specified file
 */
private static List<FeatureCodec<? extends Feature, ?>> getCandidateCodecsForFile( final Path featureFile ) {
private static List<FeatureCodec<? extends Feature, ?>> getCandidateCodecsForFile( final Path featureFile, final Class<? extends Feature> featureType ) {
I think it makes a lot of sense to move the filtering here but the javadoc needs to be updated.
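For example, the javadoc might be extended roughly like this (suggested wording only, not text from the PR):

/**
 * @param featureFile file for which to find potential codecs
 * @param featureType the Feature subtype that candidate codecs must produce
 * @return A List of all codecs in DISCOVERED_CODECS that produce featureType and for which
 *         {@link FeatureCodec#canDecode(String)} returns true on the specified file
 */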
import java.util.*;

/**
 * A MergingMultiFeatureWalker is a base class for a tool that processes one {@link Feature} at a
Someday we'll rebase variant walkers on top of Feature walkers...
Some dreamy day maybe we'll rewrite the FeatureCodec interface to eliminate the goofy, unused methods, and require the necessary methods that currently only appear in subtypes.
We did! At least for Reads/Variants/Reference. FeatureCodec wasn't implemented yet... Check out htsjdk.beta.plugin.
final InputStream is =
        new BlockCompressedInputStream(
                new BufferedInputStream(Channels.newInputStream(sbc), 64*1024));
is.skip(virtualFileOffset & 0xffffL);
I think this is allowed to skip less than you specified. You should probably check the return value and throw if it's off.
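A minimal sketch of that guard, reusing the names from the snippet above; the loop-until-done handling is just one way to do it, and the GATKException is a placeholder for whatever exception type the class already throws:

// skip() may consume fewer bytes than requested, so keep skipping until the
// offset within the block has been fully consumed, and fail loudly otherwise.
long toSkip = virtualFileOffset & 0xffffL;
while ( toSkip > 0 ) {
    final long skipped = is.skip(toSkip);
    if ( skipped <= 0 ) {
        throw new GATKException("Unable to skip to the requested offset within the compressed block");
    }
    toSkip -= skipped;
}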
public void close() {}

@Override
public List<String> getSequenceNames() {
I'm assuming the codecs this supports don't include a sequence dictionary in their header? Should they?
You noted that the TextFeatureReader returns an empty list from its getSequenceNames method: not to despair, this method is not used by anyone.
You asked, "Shouldn't the files that this reader supports include sequence dictionary metadata?" Answer: these are legacy SV pipeline file formats. It might've been nice, but they don't, and that horse has left the barn. Someday we may be able to switch the SV pipeline over to using BlockCompressedInterval files that do have metadata.
this.path = path;
try {
    InputStream in = new BufferedInputStream(Files.newInputStream(path), 256*1024);
    if ( path.toString().endsWith(".gz") ) {
It's safer to use htsjdk's IOUtil.isBlockCompressed() because it handles things like http query parameter stripping.
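As a sketch, the ".gz" test above might become something like the following, assuming htsjdk.samtools.util.IOUtil.isBlockCompressed(Path), which can throw an IOException that the surrounding try block would need to handle:

// Hypothetical replacement: check the file contents rather than the name,
// so oddly named or query-parameter-suffixed paths are still detected.
if ( IOUtil.isBlockCompressed(path) ) {
    // ... wrap the buffered stream exactly as the original ".gz" branch does
}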
        }
    } while ( nextFeature != null );
} catch ( final IOException ioe ) {
    throw new UserException("Can't read feature from " + path);
Attach the IOException here.
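A minimal sketch of the suggested fix, assuming UserException has a (String, Throwable) constructor; the same pattern applies to the isTabix() catch block in the next snippet:

} catch ( final IOException ioe ) {
    // Attach the underlying IOException so its message and stack trace aren't lost.
    throw new UserException("Can't read feature from " + path, ioe);
}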
try {
    queryable = AbstractFeatureReader.isTabix(path.toString(), null);
} catch ( final IOException ioe ) {
    throw new UserException("Error while checking for existence of index for " + path);
Attach the exception here, too.
Results of merging 3 files with ~10,000,000 features each from a GCS bucket, using my home Linux box: (Earlier comments erased, because they were stupid: I forgot that I hadn't hooked up prefetch.)
@tedsharpe Thanks for checking. In general I've seen that the CPB tends to help a lot when reading through long contiguous stretches of a BAM file, and less when doing anything on smaller or fragmented data. I'm surprised it didn't make any difference here, but it seems that it doesn't, so that's fine. I've seen catastrophic interactions with insufficiently buffered index inputs when it's disabled, where it ended up performing an http request for every byte, but hopefully that's avoided just by using a buffered reader for it. I have a plan to someday enable the more intelligent -L-aware prefetcher that will use the list of actual positions of interest to buffer more intelligently, but that's not happening on any specific schedule.
I believe all your comments have been addressed.
Oh: the main reason for reimplementing a text-based feature reader is that the existing Tabix reader reads the entire index (which is huge) into memory right in the constructor. Also, the abstract tabix parsing code doesn't work on some of the weird files people have invented that have odd combinations of header lines. I think there were other reasons, too, that I've forgotten. I studied it for a while to see if I could fix it, but eventually came to the conclusion that it was hopeless.
@tedsharpe Thanks for the changes. I appreciate you respecting the horrible wrapper arguments for parity's sake. Those are on the infinite list of things to improve so they're not as hideous, but I don't know if we'll ever get there.
I think autorename mangled your example walker's name. Should be good to merge when that's fixed.
 * To use this walker you need only implement the abstract {@link #traverse()} method in a class
 * that declares a collection of FeatureInputs as an argument.
 * There are two subclasses that may be convenient:
 * The MergingMultiFeatureWalker presents one feature at a time, in sorted order, by merging
Thanks, this is super helpful for future readers.
    programGroup = ExampleProgramGroup.class
)
public class ExampleMergingMultiFeatureWalker extends MergingMultiFeatureWalker<Feature> {
public class ExampleMergingMultiFeatureWalkerBase extends MergingMultiFeatureWalker<Feature> {
I think this is a casualty of auto rename.
@@ -17,8 +17,10 @@
        implements FeatureOutputCodec<F, Writer<F>>, FeatureCodec<F, Reader<F>>, FeatureReaderFactory<F> {

    @Override
    public Reader<F> getReader( final FeatureInput<F> input ) {
        return new Reader<F>(input, this);
    public Reader<F> getReader( final FeatureInput<F> input,
Thanks, I appreciate you including these even if they're not super helpful, just for parity. A low-priority future task is to wrap these in an options class of some sort so it's less hideous.
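Purely as an illustration of that future cleanup (nothing in this PR defines such a class, and the class and field names are invented), the wrapper arguments could be bundled along these lines:

import java.nio.channels.SeekableByteChannel;
import java.util.function.Function;

// Hypothetical holder for the channel-wrapper arguments that getReader() now
// takes individually; illustrative only, not part of this PR.
public final class ReaderWrapperOptions {
    private final Function<SeekableByteChannel, SeekableByteChannel> dataWrapper;
    private final Function<SeekableByteChannel, SeekableByteChannel> indexWrapper;

    public ReaderWrapperOptions( final Function<SeekableByteChannel, SeekableByteChannel> dataWrapper,
                                 final Function<SeekableByteChannel, SeekableByteChannel> indexWrapper ) {
        this.dataWrapper = dataWrapper;
        this.indexWrapper = indexWrapper;
    }

    public Function<SeekableByteChannel, SeekableByteChannel> getDataWrapper() { return dataWrapper; }
    public Function<SeekableByteChannel, SeekableByteChannel> getIndexWrapper() { return indexWrapper; }
}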
@@ -9,7 +9,7 @@
import org.testng.Assert;
import org.testng.annotations.Test;

public class ExampleMergingMultiFeatureWalkerIntegrationTest extends CommandLineProgramTest {
public class ExampleMergingMultiFeatureWalkerBaseIntegrationTest extends CommandLineProgramTest {
autorename?
@tedsharpe So we've addressed a bunch of the issues you've mentioned in the new htsjdk.beta reader APIs. In particular, we now optimize the decoding checks so that we only scan the beginning of the files exactly once and then give the necessary bytes to all available codecs. That should work in combination with your optimization to push down the filtering, which makes a lot of sense as well. Lazily loading the indexes when you need to query for them is a good idea. I think the thought behind aggressively loading them was probably to handle potential errors up front instead of waiting until later, and it's confounded by the bad choice of automatically discovering indexes by default, so the engine can't tell if there SHOULD be an index until it tries and fails to load it. We've also addressed that in the new APIs to give the client more control over how indexes are discovered, which should allow for cleaner lazy loading and things like that.
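Paraphrasing that single-scan idea as a sketch (this is not the htsjdk.beta API; discoveredCodecs and claimsToDecode are assumed names):

// Read one prefix of the file and let every candidate codec inspect the same
// bytes, instead of re-opening the input once per codec.
final byte[] signature = new byte[4096];                      // probe size chosen arbitrarily
final int nBytes;
try ( final InputStream in = Files.newInputStream(featurePath) ) {
    nBytes = in.read(signature);
}
final List<FeatureCodec<? extends Feature, ?>> matches = new ArrayList<>();
for ( final FeatureCodec<? extends Feature, ?> codec : discoveredCodecs ) {   // assumed codec registry
    if ( claimsToDecode(codec, signature, nBytes) ) {                         // hypothetical helper
        matches.add(codec);
    }
}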
Force-pushed from 8bf70b4 to 190ff8d.
Bincov generation in one step.
Need to think about whether we need to sample to discover the full bin size (like in the current SetBins task), or whether we'll always supply it as an explicit parameter of the workflow, or whether the approach I took here (set it from the first row, unless explicitly provided) is adequate. If not, I'll write code to buffer some rows to figure out the bin size before writing anything to the output file.
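As a sketch of the "set it from the first row, unless explicitly provided" fallback described above, using htsjdk's Locatable for the record (the method and argument names are invented, not taken from the tool):

// Use the explicit workflow parameter when provided; otherwise infer the bin
// size from the span of the first record seen.
static int resolveBinSize( final Integer explicitBinSize, final Locatable firstRow ) {
    if ( explicitBinSize != null ) {
        return explicitBinSize;
    }
    return firstRow.getEnd() - firstRow.getStart() + 1;
}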