
implement tabix region query natively in python #297


Draft
wants to merge 2 commits into base: master

Conversation

yangmqglobe

I have implemented the tabix region query natively in Python. This will fix #284 and #292. Moreover, this drops the dependency on tabix and allows users on all OSes to randomly access a VCF file if they have the index file.

Implementation details can be found in:
https://academic.oup.com/bioinformatics/article/27/5/718/262743
http://samtools.github.io/hts-specs/tabix.pdf
http://samtools.github.io/hts-specs/SAMv1.pdf

These changes affect a lot of the API, mostly read_vcf related. The tabix parameter has been replaced by index, which is used to specify the path of the index file. It defaults to None, which means it will look for the index at the VCF path with a .tbi suffix appended.
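For illustration, a minimal sketch of the proposed API (the file names here are made up):

```python
import allel

# default index=None: looks for example.vcf.gz.tbi alongside the VCF
callset = allel.read_vcf('example.vcf.gz', region='20:100000-200000')

# or point at an index file explicitly via the new parameter
callset = allel.read_vcf('example.vcf.gz', region='20:100000-200000',
                         index='/elsewhere/example.vcf.gz.tbi')
```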

@alimanfoo (Contributor) left a comment


This is an amazing piece of work, huge kudos.

I added a couple of minor comments, and one question.

Also I wonder if it's worth adding a test that involves reading a region that spans at least two blocks of a bgzipped VCF, so we can be sure that reading across multiple blocks is working. (Apologies if this is already tested somehow.)

Thanks again!

@@ -1957,7 +1954,7 @@ def test_genotype_ac():
def test_region_truncate():
vcf_path = fixture_path('test54.vcf.gz')
for tabix in 'tabix', None:
Contributor

Variable tabix is now unused here.

@@ -2116,11 +2113,11 @@ def test_vcf_to_npz():
for vcf_path, region, tabix, samples, string_type in param_matrix:
Contributor

Variable tabix is now unused, could be removed from param_matrix.

@@ -2187,11 +2184,11 @@ def test_vcf_to_zarr():
for vcf_path, region, tabix, samples, string_type in param_matrix:
Contributor

Variable tabix is now unused, could be removed from param_matrix.

@@ -2372,11 +2369,11 @@ def test_vcf_to_hdf5():
for vcf_path, region, tabix, samples, string_type in param_matrix:
Contributor

Variable tabix is now unused, could be removed from param_matrix.

# binning index: record cluster in large interval
overlap = np.concatenate([chunks for bin_key, chunks in bidx.items() if chunks is not None])
# coupled binning and linear indices, filter out low level bins
chunk_begin, *_, chunk_end = np.sort(np.ravel(overlap[overlap[:, 0] >= min_ioff]))
Contributor

I'd be very grateful if you could explain what's happening here, for my own understanding. What do we gain by using the binning and linear indices together?

Author

Nice question; I am not completely clear here either.

So the code here is supposed to get the chunk_end value, because we want to stop reading the VCF as soon as we get past the target region. To do this, in my opinion, we have three solutions:

  1. The best solution: parse and check each record, and once we get past the target region, stop reading. But the parse-and-check process is already implemented in the downstream code, so that would be duplication. PASS.

  2. The easy-to-understand solution: using the linear index, read until the block that the target region's end position falls into. But this solution is a bit annoying to me, because the linear index only tracks the begin offset. To get the end offset, we might need to parse the next chromosome's index and/or check whether a block fully contains the records overlapping our target region. PASS.

  3. The solution the code here uses: the binning index tracks both begin and end offsets, so sort them and take the largest offset, which is the point where we stop reading the VCF (see the sketch below).
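A minimal plain-Python equivalent of what that line of code does (the names and values here are illustrative, not from the PR):

```python
# chunks: (chunk_beg, chunk_end) virtual file offsets gathered from the
# bins overlapping the query; virtual offsets compare correctly as ints
chunks = [(100, 250), (230, 400), (390, 520)]

chunk_begin = min(beg for beg, end in chunks)   # where to start reading
chunk_end = max(end for beg, end in chunks)     # where to stop reading
```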

There is another question related to this code, too. Here, I filter out the lower-level bins, which might contain records that start before the target region but overlap it, for example:

-------------------------|---query-region---|-----------------
------|---record-will-be-ignored---|-------------------------

But if I understand the comment

# N.B., still pass the region parameter through so we get strictly only
# variants that start within the requested region. See also
# https://github.com/alimanfoo/vcfnp/issues/54

correctly, these records will be filtered out, so the solution here should be OK.

Contributor

Hi @yangmqglobe, thanks so much for following up here, many apologies for slow response.

This is tricky, I will try to discuss this first:

There is another question related to this code, too. Here, I filter out the lower-level bins, which might contain records that start before the target region but overlap it.

From Li (2011): "In the linear index, we keep for each tiling 16 kb window the virtual file offset of the leftmost record (i.e. having the smallest start coordinate) that overlaps the window."

E.g., what if you have a VCF with a large duplication that starts at position 20 kb and ends at 40 kb, and all other records are SNPs. And then you query for variants in the region 35-50 kb. You take the query start (35 kb), floor divide by 16 kb, to get the corresponding window position within the linear index, in this case 2, i.e., the third window, which spans 32-48 kb. You then access the third element within the linear index, which gives you the virtual file offset for the first variant overlapping the window 32-48 kb. This will be the virtual file offset for the record containing the large duplication starting at position 20 kb, because it is the first record overlapping the 32-48 kb window.
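The window arithmetic in that example, written out (a small illustration, not code from the PR):

```python
TBI_LINEAR_SHIFT = 14            # linear index windows are 16 kb = 2**14 bp

query_start = 35_000             # 35 kb
window = query_start >> TBI_LINEAR_SHIFT
assert window == 2               # the third window, spanning 32-48 kb

# min_off = linear_index[window] would then be the virtual file offset of
# the leftmost record overlapping the 32-48 kb window -- here, the record
# for the large duplication that starts at 20 kb.
```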

In other words, I expect the linear index may give you variants that start before the query region. So filtering out results from the binning index won't stop this from happening.

But as you also point out, the underlying VCF parsing code in scikit-allel will filter these out when it starts reading, so I don't think we have to worry about this. The code that handles this is here. (As an aside, we might want to consider making this behaviour optional, so a user can retrieve all variants overlapping the query region like tabix does, not just those which strictly start within the query region, but that's a separate issue that could be addressed later.)

Contributor

Regarding this point:

So the code here is supposed to get the chunk_end value, because we want to stop reading the VCF as soon as we get past the target region. To do this, in my opinion, we have three solutions.

If the binning index gives the end offset, then this sounds like a reasonable option to figure out when to stop reading. To be honest I still don't fully understand the binning index, so I cannot make a perfect judgement here.

An alternative I think is to use the linear index to find the offset to start reading, and then just keep reading until the parser finds a record which starts beyond the query end. I.e., let the underlying parser figure out when to stop reading.

Currently the parser does check whether the current variant record starts beyond the end of the query region, code here. However, that code currently is designed for the case where a VCF is not sorted (yes that happens) and so it only skips the variant, it does not terminate parsing. However, we could pass down a piece of information that the VCF is sorted (which we know it will be if querying via tabix index), and the parser could then use this to terminate parsing rather than just skipping the record.

In other words, I think it would be possible to just use the linear index to find a suitable file offset to start reading from, and then let the parser figure out when to stop reading based on parsing records.

Does that fit your understanding? Do you have any thoughts on which strategy is better?
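A sketch of what that strategy might look like (pseudocode with assumed names; the real parsing happens in the compiled scikit-allel parser):

```python
# start from the offset given by the linear index, then let parsing decide
# when to stop; is_sorted would be passed down as discussed above
def scan(records, query_start, query_end, is_sorted):
    for record in records:
        if record.pos > query_end:
            if is_sorted:
                break        # sorted input: nothing relevant can follow
            continue         # unsorted input: skip, but keep scanning
        if record.pos >= query_start:
            yield record     # starts strictly within the query region
```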

Author

Yes. I wonder why there isn't a sorted parameter already.

@alimanfoo (Contributor)

cc @jeromekelleher, @benjeffery, I don't suppose either of you would have a chance to look at this, it looks good to me but I haven't grokked every detail and would much appreciate another pair of eyes.

@yangmqglobe (Author) commented Jan 8, 2020

@alimanfoo I will fix all the unused tabix variables, rather than replying to each comment individually.
Once we solve all these problems, I might open another PR, if that is OK.

By the way, we definitely need more tests. I tested the function with a WGS result containing more than 1000 mutations in a specific region, and it works well. But that is a clinical patient result which cannot be made public due to privacy issues. I am not sure how to get a proper VCF to test these functions, any ideas?

@jeromekelleher (Contributor)

This LGTM @alimanfoo, but I don't know enough about the tabix index format to give any real technical input. In terms of testing, an easy option would be to generate some VCFs using msprime, and then do some random queries. If you get the same results as pysam, then we're probably good. It'll be a lot of work testing out all the corner cases in the binary decoding steps though.
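A rough sketch of that kind of property test, assuming a bgzipped, tabix-indexed, SNP-only VCF for contig '1' (so tabix's overlap semantics and scikit-allel's starts-within semantics agree), and assuming the read_vcf region query goes through the new pure-Python index code:

```python
import random

import allel
import pysam

vcf_path = 'sim.vcf.gz'            # e.g. generated with msprime, then
tbx = pysam.TabixFile(vcf_path)    # bgzipped and tabix-indexed

for _ in range(100):
    start = random.randint(1, 900_000)
    end = start + random.randint(1, 100_000)
    region = '1:%d-%d' % (start, end)
    # POS column from tabix (field 2 of each record)
    expected = [int(rec.split('\t')[1]) for rec in tbx.fetch(region=region)]
    callset = allel.read_vcf(vcf_path, region=region)
    actual = [] if callset is None else list(callset['variants/POS'])
    assert actual == expected, region
```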

@alimanfoo (Contributor)

Hi @brentp, we're trying to do a pure Python implementation of querying a tabix-indexed VCF file, and thrashing around a bit trying to figure out the best approach. (We're avoiding binding to the tabix C implementation because it doesn't compile on Windows.) I know you've implemented tabix for Go and so I wondered if you might be able to share any wisdom?

@alimanfoo (Contributor)

I am not sure how to get a proper VCF to test these functions, any ideas?

The 1000 genomes phase 3 data might be useful, could download a chromosome, then pull out a subset (e.g., enough variants to at least span a couple of bgzip blocks), then recompress and tabix index?
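One way to prepare such a fixture with pysam (a sketch; the file names are made up, and the line count is just a guess at something spanning several bgzip blocks):

```python
import gzip
import itertools

import pysam

# bgzf is gzip-compatible, so the downloaded chromosome can be read directly
with gzip.open('ALL.chr22.phase3.vcf.gz', 'rt') as f, \
        open('fixture.vcf', 'w') as out:
    for line in itertools.islice(f, 100_000):  # headers + first ~100k lines
        out.write(line)

pysam.tabix_compress('fixture.vcf', 'fixture.vcf.gz', force=True)
pysam.tabix_index('fixture.vcf.gz', preset='vcf', force=True)
```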

generator

"""
begin, end = begin, end
Contributor

This line doesn't seem to do anything?

Author

My fault. The original purpose was to convert to a zero-based region, but I do that in another place and forgot to delete this line. I will fix that.

@brentp commented Jan 22, 2020

hi, happy to try and help if I can. Is there anything in particular?
I implemented bix, which is the higher level wrapper, but the lower-level stuff is in biogo/tabix and biogo/csi.

@brentp commented Jan 22, 2020

Actually, it might help to look at the tabix-related issues in biogo as there were some edge-cases that differed from the spec.

@alimanfoo (Contributor)

After a bit more reading, I think I have grokked this. Here's my understanding...

Given a query region, reg2bins will return a set of bins that overlap the query region. The binning index can then be used to obtain a set of file chunks for each bin, where each file chunk contains a contiguous sequence of records that all fall into that bin.
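For reference, reg2bins from the tabix spec (Li 2011), translated directly into Python (the spec gives the equivalent C code):

```python
def reg2bins(beg, end):
    # bins that may overlap the zero-based, half-open region [beg, end)
    bins = [0]
    end -= 1
    for shift, offset in ((26, 1), (23, 9), (20, 73), (17, 585), (14, 4681)):
        bins.extend(range(offset + (beg >> shift), offset + (end >> shift) + 1))
    return bins

# e.g. for the 35-50 kb query discussed earlier:
# reg2bins(35_000, 50_000) -> [0, 1, 9, 73, 585, 4683, 4684]
```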

Note that any of these file chunks can contain records that do not overlap the query region. This can include records that end before the query start, and records that start after the query end. So, if using the binning index alone, we would read through each file chunk, but still need to parse each record and check whether it overlaps the query.

It would be possible to just use the binning index by itself. However, in files where some records span a relatively large genome region, using the binning index alone might mean seeking to and reading chunks associated with higher level bins, even though none of the records in those chunks actually overlap the query region. So the idea of the linear index is to give you a file offset that is close to the start of the first record that overlaps the query region, so you can skip over some of the chunks from higher level bins returned by the binning index.

It would also be possible to just use the linear index by itself. For files where all records span a small genome region (e.g., VCF with few or no large structural variants), this works fine, because the linear index gives you a file offset to start reading from, and it will be quite efficient to just start scanning the file from that point to find all the records overlapping the query region, and stop scanning once records are beyond the query.

However, for files like GFF where there may be some records that span large genome regions (e.g., whole chromosomes), this is not efficient, because you will always end up scanning basically from the start of the chromosome. The binning index is better, because it will give a set of chunks that can be seeked to, avoiding a lot of scanning and parsing.

In the case of a VCF with few or no large SVs, it strikes me that either the binning index by itself or the linear index by itself will perform very similarly. If using the binning index, there will still be some high level bins overlapping most queries, but they won't have any chunks associated with them (n_chunks=0) and so they won't cause any unnecessary seeking.

In the general case, given any type of file, the strategy is to use the binning index to obtain a set of file chunks, then use the linear index to exclude some of the earlier file chunks which don't contain any records that overlap the query, then start reading and parsing records within the remaining chunks, checking and returning records that overlap the query. Also in general, these chunks might not be contiguous in the file, so you would want to read a chunk, seek to the start of the next chunk, read a chunk, seek to the next chunk, etc. I.e., it would not be good to just start reading from the start of the first chunk and stop reading at the end of the last chunk.

For this implementation, I think this means that we need to decide whether to either (a) implement the general strategy for querying a tabix indexed file using the combined binning and linear indices, which will work well for VCF or GFF or whatever, or (b) implement a simpler strategy tuned for VCF that just uses the linear index, given that we know VCF files tend to have very few if any records that span larger genome regions.

for _ in range(struct.unpack('<i', f.read(4))[0]): # n_bins
bin_key = struct.unpack('<I', f.read(4))[0] # bin
n_chunk = struct.unpack('<i', f.read(4))[0] # n_chunk
if bin_key in bidx:
Contributor

n_chunk might be zero? If so, could use that to simplify a little...

Suggested change
if bin_key in bidx:
if bin_key in bidx and n_chunk > 0:

Author

I am not sure whether n_chunk can ever be zero. But the for loop already acts as the if: with zero chunks, it simply does nothing.

@alimanfoo (Contributor)

Thanks a lot @brentp, the biogo links are very useful.

I'd really appreciate just a check that my understanding of how to use a .tbi index is right.

On a different note, do you know where there is a spec for .csi?

@brentp commented Jan 22, 2020

Here is the spec for CSI. The implementation in biogo might be a good reference.
I'll try to have a look at your text in the next couple of days. Dan Kortschak, the main biogo author, did 99.9% of the tabix and csi stuff, so I have a pretty shallow understanding of it.

@yangmqglobe (Author)

Sorry for the slow response, too. Chinese New Year is coming, happy new year!
@alimanfoo did you read http://samtools.github.io/hts-specs/SAMv1.pdf? Page 17 has an example that shows how to achieve random access (.bai and .tbi use the same strategy). In my opinion the process is:

  1. use reg2bins to get bins that might contain target records
  2. use the linear index to filter out bins that won't contain target records (chunk_end < min_off)
  3. merge all adjacent bins
  4. seek and read

@alimanfoo (Contributor)

Happy new year to you! That sounds right to me, with a small edit:

  1. use reg2bins to get bins that overlap the query
  2. use binning index to find chunks containing records for each bin overlapping the query
  3. use linear index to filter out chunks that won't contain target records (chunk_end < min_off) (and possibly also truncate some chunks, e.g., where chunk_start < min_off < chunk_end?)
  4. merge all adjacent chunks
  5. seek and read

I think it is possible that some bins could have zero chunks. E.g., for a VCF file with only SNPs and small indels, I would expect all the higher level bins to have no records associated with them, and hence no chunks.

@alimanfoo (Contributor)

... and one more step I think ...

  6. read each record, parse columns containing start and end coordinates, and return only records that overlap the query.
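Putting those steps together, a minimal sketch under assumed data structures (bidx mapping bin number to a list of (chunk_beg, chunk_end) virtual offsets, lidx the linear index array; the optional truncation from step 3 is left out):

```python
def query_chunks(bidx, lidx, beg, end):
    # min_off from the linear index (16 kb windows)
    min_off = lidx[min(beg >> 14, len(lidx) - 1)]
    chunks = []
    for b in reg2bins(beg, end):                     # steps 1-2
        for chunk_beg, chunk_end in bidx.get(b, ()):
            if chunk_end > min_off:                  # step 3: linear filter
                chunks.append((chunk_beg, chunk_end))
    chunks.sort()
    merged = []                                      # step 4: merge chunks
    for chunk_beg, chunk_end in chunks:
        if merged and chunk_beg <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], chunk_end)
        else:
            merged.append([chunk_beg, chunk_end])
    # step 5: seek to and read each merged chunk; step 6: parse records
    # and keep only those overlapping [beg, end)
    return merged
```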

@yangmqglobe (Author)

I think that bins with 0 chunks are not kept in the .tbi file ...

@yangmqglobe (Author) commented Jan 23, 2020

Truncating might lose some records that start before the query region. I am planning to implement GFF random access, too.

@alimanfoo (Contributor)

Truncating might lose some records that start before the query region. I am planning to implement GFF random access, too.

IIUC min_off gives the virtual offset of the first record that overlaps the linear index bin that overlaps the query start. I.e., if you start reading from min_off you are guaranteed to find all records overlapping the query, there will be no records before. Therefore it should be safe to truncate any chunk that spans min_off. This would apply to any type of file, GFF or VCF or whatever.
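I.e., the truncation would just be (an illustrative fragment, not code from the PR):

```python
def truncate_chunk(chunk_beg, chunk_end, min_off):
    # clamp the chunk start to min_off; safe iff no record overlapping the
    # query begins before min_off, as argued above
    return max(chunk_beg, min_off), chunk_end
```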

@yangmqglobe (Author)

I think that we might miss some records; the tabix C code does not truncate the offset. We would need a concrete case to prove it either way, and that would make a necessary test case.

@alimanfoo (Contributor)

alimanfoo commented Jan 23, 2020 via email
