implement tabix region query natively in python #297
Conversation
This is an amazing piece of work, huge kudos.
I added a couple of minor comments, and one question.
Also I wonder if it's worth adding a test that involves reading a region that spans at least two blocks of a bgzipped VCF, so we can be sure that reading across multiple blocks is working. (Apologies if this is already tested somehow.)
Thanks again!
@@ -1957,7 +1954,7 @@ def test_genotype_ac():
def test_region_truncate():
    vcf_path = fixture_path('test54.vcf.gz')
    for tabix in 'tabix', None:
Variable `tabix` is now unused here.
@@ -2116,11 +2113,11 @@ def test_vcf_to_npz():
for vcf_path, region, tabix, samples, string_type in param_matrix:
Variable `tabix` is now unused, could be removed from `param_matrix`.
@@ -2187,11 +2184,11 @@ def test_vcf_to_zarr():
for vcf_path, region, tabix, samples, string_type in param_matrix:
Variable `tabix` is now unused, could be removed from `param_matrix`.
@@ -2372,11 +2369,11 @@ def test_vcf_to_hdf5():
for vcf_path, region, tabix, samples, string_type in param_matrix:
Variable `tabix` is now unused, could be removed from `param_matrix`.
# binning index: record cluster in large interval
overlap = np.concatenate([chunks for bin_key, chunks in bidx.items() if chunks is not None])
# coupled binning and linear indices, filter out low level bins
chunk_begin, *_, chunk_end = np.sort(np.ravel(overlap[overlap[:, 0] >= min_ioff]))
I'd be very grateful if you could explain what's happening here, for my own understanding. What do we gain by using the binning and linear indices together?
Nice question, I am not completely clear here either.
The code here is supposed to get the `chunk_end` value, because we want to stop reading the VCF as soon as we get past the target region. To do this, in my opinion, we have 3 solutions:
- The best solution: parse and check each record, and stop reading once we get past the target region. But this parse-and-check process is already implemented in the following step, so it would be duplicated. PASS.
- The easy-to-understand solution: using the linear index, read until the block that the target region's end position falls into. But this solution is a bit annoying to me, because the linear index only tracks the begin offset; to get the end offset, we might need to parse the next chromosome's index and/or check whether the block fully contains the records that overlap our target region. PASS.
- The solution the code uses here: the binning index tracks both begin and end offsets, so sort them and take the largest offset, which is the point where we stop reading the VCF (see the sketch below).
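To illustrate solution 3, here is a commented sketch of my reading of the snippet under discussion (the names `bidx` and `min_ioff` are taken from the diff; this is not a definitive rewrite):

```python
import numpy as np

def find_read_bounds(bidx, min_ioff):
    # bidx maps bin number -> array of (chunk_begin, chunk_end) virtual
    # file offsets, for bins overlapping the query region
    overlap = np.concatenate([chunks for bin_key, chunks in bidx.items()
                              if chunks is not None])
    # drop chunks that begin before the linear index lower bound, then
    # sort all remaining offsets; the smallest is where reading starts,
    # the largest is where reading can safely stop
    offsets = np.sort(np.ravel(overlap[overlap[:, 0] >= min_ioff]))
    return offsets[0], offsets[-1]
```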
There is another question related to this code, too. Here, I filter out the lower level bins, which might contain records that start before the target region but overlap it, for example:
-------------------------|---query-region---|-----------------
------|---record-will-be-ignored---|------------------------
But if I understand the comment
# N.B., still pass the region parameter through so we get strictly only
# variants that start within the requested region. See also
# https://github.com/alimanfoo/vcfnp/issues/54
correctly, these records will be filtered out, so the solution here should be OK.
Hi @yangmqglobe, thanks so much for following up here, many apologies for slow response.
This is tricky. I will try to discuss this first:
There is another question related to this code, too. Here, I filter out the lower level bins, which might contain records that start before the target region but overlap it.
From Li (2011): "In the linear index, we keep for each tiling 16 kb window the virtual file offset of the leftmost record (i.e. having the smallest start coordinate) that overlaps the window."
E.g., what if you have a VCF with a large duplication that starts at position 20 kb and ends at 40 kb, and all other records are SNPs. And then you query for variants in the region 35-50 kb. You take the query start (35 kb), floor divide by 16 kb, to get the corresponding window position within the linear index, in this case 2, i.e., the third window, which spans 32-48 kb. You then access the third element within the linear index, which gives you the virtual file offset for the first variant overlapping the window 32-48 kb. This will be the virtual file offset for the record containing the large duplication starting at position 20 kb, because it is the first record overlapping the 32-48 kb window.
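A tiny sketch of that window arithmetic (`lidx` stands in for the parsed linear index array; a hypothetical helper, not the PR's code):

```python
TBI_LINEAR_SHIFT = 14  # 16 kb tiling windows: 2 ** 14 == 16384

def linear_index_offset(lidx, query_begin):
    # floor-divide the 0-based query start by the window size to find
    # the window index, then read off its virtual file offset
    window = query_begin >> TBI_LINEAR_SHIFT  # e.g. 35_000 >> 14 == 2
    return lidx[window]
```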
In other words, I expect the linear index may give you variants that start before the query region. So filtering out results from the binning index won't stop this from happening.
But as you also point out, the underlying VCF parsing code in scikit-allel will filter these out when it starts reading, so I don't think we have to worry about this. The code that handles this is here. (As an aside, we might want to consider making this behaviour optional, so a user can retrieve all variants overlapping the query region like tabix does, not just those which strictly start within the query region, but that's a separate issue that could be addressed later.)
Regarding this point:
So the code here is supposed to get the `chunk_end` value, because we want to stop reading the VCF as soon as we get past the target region. To do this, in my opinion, we have 3 solutions.
If the binning index gives the end offset, then this sounds like a reasonable option to figure out when to stop reading. To be honest I still don't fully understand the binning index, so I cannot make a perfect judgement here.
An alternative I think is to use the linear index to find the offset to start reading, and then just keep reading until the parser finds a record which starts beyond the query end. I.e., let the underlying parser figure out when to stop reading.
Currently the parser does check whether the current variant record starts beyond the end of the query region (code here). However, that code is designed for the case where a VCF is not sorted (yes, that happens), so it only skips the variant, it does not terminate parsing. However, we could pass down a piece of information that the VCF is sorted (which we know it will be if querying via a tabix index), and the parser could then use this to terminate parsing rather than just skipping the record.
In other words, I think it would be possible to just use the linear index to find a suitable file offset to start reading from, and then let the parser figure out when to stop reading based on parsing records.
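To make this concrete, a rough sketch of that strategy (all names here are hypothetical, just to illustrate the control flow):

```python
def iter_region(records, chrom, begin, end, is_sorted=True):
    # records: an iterator over parsed VCF records, starting from the
    # file offset given by the linear index
    for record in records:
        if record.chrom != chrom or record.pos > end:
            if is_sorted:
                break     # sorted file: no later record can qualify
            continue      # unsorted file: skip, but keep scanning
        if record.pos >= begin:
            yield record  # strictly starts within the query region
```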
Does that fit your understanding? Do you have any thoughts on which strategy is better?
Yes, I wonder why there is not a `sorted` parameter all the time.
cc @jeromekelleher, @benjeffery, I don't suppose either of you would have a chance to look at this? It looks good to me, but I haven't grokked every detail and would much appreciate another pair of eyes.
@alimanfoo I will check all of these. By the way, we definitely need more tests. I tested the function with a WGS result which contains more than 1000 mutations in a specific region, and it works well. But this is a clinical patient result which cannot go public due to privacy issues. I am not sure how to get a proper VCF to test these functions, any idea of that?
This LGTM @alimanfoo, but I don't know enough about the tabix index format to give any real technical input. In terms of testing, an easy option would be to generate some VCFs using msprime, and then do some random queries. If you get the same results as pysam, then we're probably good. It'll be a lot of work testing out all the corner cases in the binary decoding steps though.
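For what it's worth, a rough sketch of that idea (the simulation parameters are placeholders; assumes msprime, pysam, and this branch of scikit-allel are installed):

```python
import allel
import msprime
import pysam

# simulate a tree sequence and write it out as VCF (contig name is '1')
ts = msprime.simulate(sample_size=20, length=1_000_000,
                      recombination_rate=1e-8, mutation_rate=1e-7)
with open('sim.vcf', 'w') as f:
    ts.write_vcf(f, ploidy=2)
pysam.tabix_compress('sim.vcf', 'sim.vcf.gz', force=True)
pysam.tabix_index('sim.vcf.gz', preset='vcf', force=True)

# compare a random query against pysam; with SNP-only data the
# strict start-within-region semantics should agree with tabix
region = '1:200000-400000'
expected = [int(line.split('\t')[1])
            for line in pysam.TabixFile('sim.vcf.gz').fetch(region=region)]
actual = allel.read_vcf('sim.vcf.gz', region=region)['variants/POS']
assert expected == list(actual)
```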
Hi @brentp, we're trying to do a pure Python implementation of querying a tabix-indexed VCF file, and thrashing around a bit trying to figure out the best approach. (We're avoiding binding to the tabix C implementation because it doesn't compile on Windows.) I know you've implemented tabix for Go and so I wondered if you might be able to share any wisdom?
The 1000 genomes phase 3 data might be useful, could download a chromosome, then pull out a subset (e.g., enough variants to at least span a couple of bgzip blocks), then recompress and tabix index?
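A sketch of that recipe in Python (file names are placeholders; assumes the phase 3 chr22 file and its .tbi have been downloaded):

```python
import itertools
import pysam

# copy the header plus enough records to span several bgzip blocks
with pysam.TabixFile('ALL.chr22.phase3.vcf.gz') as src, \
        open('fixture.vcf', 'w') as out:
    for line in src.header:
        out.write(line + '\n')
    for rec in itertools.islice(src.fetch('22'), 20000):
        out.write(rec + '\n')

pysam.tabix_compress('fixture.vcf', 'fixture.vcf.gz', force=True)
pysam.tabix_index('fixture.vcf.gz', preset='vcf', force=True)
```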
    generator

"""
begin, end = begin, end
This line doesn't seem to do anything?
My fault, the original purpose was to convert to a zero-based region, but I do that in another place and forgot to delete this. I will fix that.
Hi, happy to try and help if I can. Is there anything in particular?
Actually, it might help to look at the tabix-related issues in biogo as there were some edge-cases that differed from the spec.
After a bit more reading, I think I have grokked this. Here's my understanding...

Given a query region, the binning index gives you a set of file chunks, taken from all bins that overlap the query region. Note that any of these file chunks can contain records that do not overlap the query region. This can include records that end before the query start, and records that start after the query end. So, if using the binning index alone, we would read through each file chunk, but still need to parse each record and check whether it overlaps the query.

It would be possible to just use the binning index by itself. In some files, if there are some records that span a relatively large genome region, if we just used the binning index we might end up seeking to and reading chunks associated with higher level bins, even though none of the records in those chunks actually overlap the query region. So the idea of the linear index is to give you a file offset that is close to the start of the first record that overlaps the query region, so you can skip over some of the chunks from higher level bins returned by the binning index.

It would also be possible to just use the linear index by itself. For files where all records span a small genome region (e.g., VCF with few or no large structural variants), this works fine, because the linear index gives you a file offset to start reading from, and it will be quite efficient to just start scanning the file from that point to find all the records overlapping the query region, and stop scanning once records are beyond the query. However, for files like GFF where there may be some records that span large genome regions (e.g., whole chromosomes), this is not efficient, because you will always end up scanning basically from the start of the chromosome. The binning index is better, because it will give a set of chunks that can be seeked to, avoiding a lot of scanning and parsing.

In the case of a VCF with few or no large SVs, it strikes me that either the binning index by itself or the linear index by itself will perform very similarly. If using the binning index, there will still be some high level bins overlapping most queries, but they won't have any chunks associated with them (n_chunks=0) and so they won't cause any unnecessary seeking.

In the general case, given any type of file, the strategy is to use the binning index to obtain a set of file chunks, then use the linear index to exclude some of the earlier file chunks which don't contain any records that overlap the query, then start reading and parsing records within the remaining chunks, checking and returning records if they overlap the query. Also in general, these chunks might not be contiguous in the file, so you would want to read a chunk, seek to the start of the next chunk, read a chunk, seek to the next chunk, etc. I.e., it would not be good to just start reading from the start of the first chunk and stop reading at the end of the last chunk. (A sketch of this general strategy follows below.)

For this implementation, I think this means that we need to decide whether to either (a) implement the general strategy for querying a tabix-indexed file using the combined binning and linear indices, which will work well for VCF or GFF or whatever, or (b) implement a simpler strategy tuned for VCF that just uses the linear index, given that we know VCF files tend to have very few if any records that span larger genome regions.
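If it helps, here is a sketch of the chunk handling under strategy (a); the merging step is concrete, while the I/O is only outlined in comments since it depends on the bgzf reader (names are assumptions, not the PR's code):

```python
def merge_chunks(chunks, min_ioff):
    """Filter and coalesce (chunk_begin, chunk_end) virtual offsets from
    the binning index, using the linear index offset as a lower bound."""
    live = sorted((b, e) for b, e in chunks if e > min_ioff)
    merged = []
    for b, e in live:
        if merged and b <= merged[-1][1]:
            # contiguous or overlapping: extend the previous chunk
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((b, e))
    return merged

# Reading then proceeds chunk by chunk: seek to each chunk_begin, parse
# records until chunk_end, seek to the next chunk, and let the record
# level check decide which records actually overlap the query.
```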
for _ in range(struct.unpack('<i', f.read(4))[0]):  # n_bins
    bin_key = struct.unpack('<I', f.read(4))[0]  # bin
    n_chunk = struct.unpack('<i', f.read(4))[0]  # n_chunk
    if bin_key in bidx:
`n_chunk` might be zero? If so, could use that to simplify a little...
- if bin_key in bidx:
+ if bin_key in bidx and n_chunk > 0:
I am not sure whether `n_chunk` can be zero? But the `for` loop already behaves like the `if` (it simply does nothing when `n_chunk` is zero), so the extra check would change nothing.
Thanks a lot @brentp, the biogo links are very useful. I'd really appreciate just a check that my understanding of how to use a .tbi index is right. On a different note, do you know where there is a spec for .csi?
Here is the spec for CSI. The implementation in biogo might be a good reference.
Sorry for the slow response, too. Chinese New Year is coming, happy new year!
Happy new year to you! That sounds right to me, with a small edit:
I think it is possible that some bins could have zero chunks. E.g., for a VCF file with only SNPs and small indels, I would expect all the higher level bins to have no records associated with them, and hence no chunks.
... and one more step I think ...
I think that bins with 0 chunks are not kept in the index file.
Truncating might lose some records that start before the query region. I am planning to implement GFF random access, too.
IIUC
I think that we might miss some records, tabix C code did not truncate the offset. We might need some case to prove that, and that would be a necessary test case.
OK, that's good to know, maybe it's not a good idea, or at least not worth it (marginal performance gain).
I have implemented tabix region query natively in Python. This will fix #284 and #292; moreover, it drops the dependency on tabix and allows users on all OSes to randomly access a VCF file if they have the index file.
Implementation information can be found at:
https://academic.oup.com/bioinformatics/article/27/5/718/262743
http://samtools.github.io/hts-specs/tabix.pdf
http://samtools.github.io/hts-specs/SAMv1.pdf
These changes affect a lot of the API, mostly `read_vcf` related: the `tabix` parameter has been replaced by `index`, which is used to specify the path of the index file. It defaults to `None`, meaning it will try to find the index at the VCF path with a `.tbi` suffix.
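A sketch of the changed call signature as described above (see the diff for the definitive behaviour):

```python
import allel

# explicit index path
callset = allel.read_vcf('data.vcf.gz', region='20:100000-200000',
                         index='data.vcf.gz.tbi')

# index=None (the default): look for 'data.vcf.gz.tbi' next to the VCF
callset = allel.read_vcf('data.vcf.gz', region='20:100000-200000')
```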