
implement tabix region query natively in python #297


Draft
wants to merge 2 commits into base: master

Conversation

yangmqglobe

I have implemented the tabix region query natively in Python. This will fix #284 and #292. Moreover, this drops the dependency on tabix and allows users on all OSes to randomly access a VCF file if they have the index file.

Implementation details can be found in:
https://academic.oup.com/bioinformatics/article/27/5/718/262743
http://samtools.github.io/hts-specs/tabix.pdf
http://samtools.github.io/hts-specs/SAMv1.pdf

These changes affect a lot of the API, mostly read_vcf related. The tabix parameter has been replaced by index, which is used to specify the path of the index file. It defaults to None, which means it will look for the index at the VCF path with a .tbi suffix appended.
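For illustration, a minimal sketch of the proposed API (the file names here are made up):

```python
import allel

# default index=None: looks for example.vcf.gz.tbi alongside the VCF
callset = allel.read_vcf('example.vcf.gz', region='20:100000-200000')

# or point at an index file explicitly via the new parameter
callset = allel.read_vcf('example.vcf.gz', region='20:100000-200000',
                         index='/elsewhere/example.vcf.gz.tbi')
```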

@alimanfoo (Contributor) left a comment


This is an amazing piece of work, huge kudos.

I added a couple of minor comments, and one question.

Also I wonder if it's worth adding a test that involves reading a region that spans at least two blocks of a bgzipped VCF, so we can be sure that reading across multiple blocks is working. (Apologies if this is already tested somehow.)

Thanks again!

@@ -1957,7 +1954,7 @@ def test_genotype_ac():
def test_region_truncate():
vcf_path = fixture_path('test54.vcf.gz')
for tabix in 'tabix', None:
Contributor

Variable tabix is now unused here.

@@ -2116,11 +2113,11 @@ def test_vcf_to_npz():
for vcf_path, region, tabix, samples, string_type in param_matrix:
Contributor

Variable tabix is now unused, could be removed from param_matrix.

@@ -2187,11 +2184,11 @@ def test_vcf_to_zarr():
for vcf_path, region, tabix, samples, string_type in param_matrix:
Contributor

Variable tabix is now unused, could be removed from param_matrix.

@@ -2372,11 +2369,11 @@ def test_vcf_to_hdf5():
for vcf_path, region, tabix, samples, string_type in param_matrix:
Contributor

Variable tabix is now unused, could be removed from param_matrix.

# binning index: record cluster in large interval
overlap = np.concatenate([chunks for bin_key, chunks in bidx.items() if chunks is not None])
# coupled binning and linear indices, filter out low level bins
chunk_begin, *_, chunk_end = np.sort(np.ravel(overlap[overlap[:, 0] >= min_ioff]))
Contributor

I'd be very grateful if you could explain what's happening here, for my own understanding. What do we gain by using the binning and linear indices together?

Author

Nice question; I am not completely clear here either.

So the code here is supposed to get the chunk_end value, because we want to stop reading the VCF as soon as we get past the target region. To do this, in my opinion, we have three solutions:

  1. The best solution: parse and check each record, and once we get past the target region, stop reading. But the parse-and-check process is already implemented in the downstream code, so that would be duplication. PASS.

  2. The easy-to-understand solution: using the linear index, read until the block that the target region's end position falls into. But this solution is a bit annoying to me, because the linear index only tracks the begin offset. To get the end offset, we might need to parse the next chromosome's index and/or check whether a block fully contains the records overlapping our target region. PASS.

  3. The solution the code here uses: the binning index tracks both begin and end offsets, so sort them and take the largest offset, which is the point where we stop reading the VCF (see the sketch below).
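A minimal plain-Python equivalent of what that line of code does (the names and values here are illustrative, not from the PR):

```python
# chunks: (chunk_beg, chunk_end) virtual file offsets gathered from the
# bins overlapping the query; virtual offsets compare correctly as ints
chunks = [(100, 250), (230, 400), (390, 520)]

chunk_begin = min(beg for beg, end in chunks)   # where to start reading
chunk_end = max(end for beg, end in chunks)     # where to stop reading
```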

There is another question related to this code, too. Here, I filter out the lower-level bins, which might contain records that start before the target region but overlap it, for example:

-------------------------|---query-region---|-----------------
------|---record-will-be-ignored---|-------------------------

But if I understand the comment

# N.B., still pass the region parameter through so we get strictly only
# variants that start within the requested region. See also
# https://github.com/alimanfoo/vcfnp/issues/54

correctly, these records will be filtered out, so the solution here should be OK.

Contributor

Hi @yangmqglobe, thanks so much for following up here, many apologies for slow response.

This is tricky, I will try to discuss this first:

There is another question related to this code, too. Here, I filter out the lower-level bins, which might contain records that start before the target region but overlap it.

From Li (2011): "In the linear index, we keep for each tiling 16 kb window the virtual file offset of the leftmost record (i.e. having the smallest start coordinate) that overlaps the window."

E.g., what if you have a VCF with a large duplication that starts at position 20 kb and ends at 40 kb, and all other records are SNPs. And then you query for variants in the region 35-50 kb. You take the query start (35 kb), floor divide by 16 kb, to get the corresponding window position within the linear index, in this case 2, i.e., the third window, which spans 32-48 kb. You then access the third element within the linear index, which gives you the virtual file offset for the first variant overlapping the window 32-48 kb. This will be the virtual file offset for the record containing the large duplication starting at position 20 kb, because it is the first record overlapping the 32-48 kb window.
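The window arithmetic in that example, written out (a small illustration, not code from the PR):

```python
TBI_LINEAR_SHIFT = 14            # linear index windows are 16 kb = 2**14 bp

query_start = 35_000             # 35 kb
window = query_start >> TBI_LINEAR_SHIFT
assert window == 2               # the third window, spanning 32-48 kb

# min_off = linear_index[window] would then be the virtual file offset of
# the leftmost record overlapping the 32-48 kb window -- here, the record
# for the large duplication that starts at 20 kb.
```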

In other words, I expect the linear index may give you variants that start before the query region. So filtering out results from the binning index won't stop this from happening.

But as you also point out, the underlying VCF parsing code in scikit-allel will filter these out when it starts reading, so I don't think we have to worry about this. The code that handles this is here. (As an aside, we might want to consider making this behaviour optional, so a user can retrieve all variants overlapping the query region like tabix does, not just those which strictly start within the query region, but that's a separate issue that could be addressed later.)

Contributor

Regarding this point:

So the code here is supposed to get the chunk_end value, because we want to stop reading the VCF as soon as we get past the target region. To do this, in my opinion, we have three solutions.

If the binning index gives the end offset, then this sounds like a reasonable option to figure out when to stop reading. To be honest I still don't fully understand the binning index, so I cannot make a perfect judgement here.

An alternative I think is to use the linear index to find the offset to start reading, and then just keep reading until the parser finds a record which starts beyond the query end. I.e., let the underlying parser figure out when to stop reading.

Currently the parser does check whether the current variant record starts beyond the end of the query region, code here. However, that code currently is designed for the case where a VCF is not sorted (yes that happens) and so it only skips the variant, it does not terminate parsing. However, we could pass down a piece of information that the VCF is sorted (which we know it will be if querying via tabix index), and the parser could then use this to terminate parsing rather than just skipping the record.

In other words, I think it would be possible to just use the linear index to find a suitable file offset to start reading from, and then let the parser figure out when to stop reading based on parsing records.

Does that fit your understanding? Do you have any thoughts on which strategy is better?
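A sketch of what that strategy might look like (pseudocode with assumed names; the real parsing happens in the compiled scikit-allel parser):

```python
# start from the offset given by the linear index, then let parsing decide
# when to stop; is_sorted would be passed down as discussed above
def scan(records, query_start, query_end, is_sorted):
    for record in records:
        if record.pos > query_end:
            if is_sorted:
                break        # sorted input: nothing relevant can follow
            continue         # unsorted input: skip, but keep scanning
        if record.pos >= query_start:
            yield record     # starts strictly within the query region
```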

Author

Yes. I wonder why there isn't a sorted parameter already.

@alimanfoo (Contributor)

cc @jeromekelleher, @benjeffery, I don't suppose either of you would have a chance to look at this, it looks good to me but I haven't grokked every detail and would much appreciate another pair of eyes.

@yangmqglobe (Author) commented Jan 8, 2020

@alimanfoo I will fix all the unused tabix variables, rather than replying to each comment individually.
Once we solve all these problems, I might open another PR, if that is OK.

By the way, we definitely need more tests. I tested the function with a WGS result containing more than 1000 mutations in a specific region, and it works well. But that is a clinical patient result which cannot be made public due to privacy issues. I am not sure how to get a proper VCF to test these functions, any ideas?

@jeromekelleher (Contributor)

This LGTM @alimanfoo, but I don't know enough about the tabix index format to give any real technical input. In terms of testing, an easy option would be to generate some VCFs using msprime, and then do some random queries. If you get the same results as pysam, then we're probably good. It'll be a lot of work testing out all the corner cases in the binary decoding steps though.
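A rough sketch of that kind of property test, assuming a bgzipped, tabix-indexed, SNP-only VCF for contig '1' (so tabix's overlap semantics and scikit-allel's starts-within semantics agree), and assuming the read_vcf region query goes through the new pure-Python index code:

```python
import random

import allel
import pysam

vcf_path = 'sim.vcf.gz'            # e.g. generated with msprime, then
tbx = pysam.TabixFile(vcf_path)    # bgzipped and tabix-indexed

for _ in range(100):
    start = random.randint(1, 900_000)
    end = start + random.randint(1, 100_000)
    region = '1:%d-%d' % (start, end)
    # POS column from tabix (field 2 of each record)
    expected = [int(rec.split('\t')[1]) for rec in tbx.fetch(region=region)]
    callset = allel.read_vcf(vcf_path, region=region)
    actual = [] if callset is None else list(callset['variants/POS'])
    assert actual == expected, region
```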

@alimanfoo (Contributor)

Hi @brentp, we're trying to do a pure Python implementation of querying a tabix-indexed VCF file, and thrashing around a bit trying to figure out the best approach. (We're avoiding binding to the tabix C implementation because it doesn't compile on Windows.) I know you've implemented tabix for Go and so I wondered if you might be able to share any wisdom?

@alimanfoo (Contributor)

I am not sure how to get a proper VCF to test these functions, any ideas?

The 1000 genomes phase 3 data might be useful, could download a chromosome, then pull out a subset (e.g., enough variants to at least span a couple of bgzip blocks), then recompress and tabix index?
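One way to prepare such a fixture with pysam (a sketch; the file names are made up, and the line count is just a guess at something spanning several bgzip blocks):

```python
import gzip
import itertools

import pysam

# bgzf is gzip-compatible, so the downloaded chromosome can be read directly
with gzip.open('ALL.chr22.phase3.vcf.gz', 'rt') as f, \
        open('fixture.vcf', 'w') as out:
    for line in itertools.islice(f, 100_000):  # headers + first ~100k lines
        out.write(line)

pysam.tabix_compress('fixture.vcf', 'fixture.vcf.gz', force=True)
pysam.tabix_index('fixture.vcf.gz', preset='vcf', force=True)
```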

generator

"""
begin, end = begin, end
Contributor

This line doesn't seem to do anything?

Author

My fault. The original purpose was to convert to a zero-based region, but I do that in another place and forgot to delete this line. I will fix that.

@brentp commented Jan 22, 2020

hi, happy to try and help if I can. Is there anything in particular?
I implemented bix, which is the higher level wrapper, but the lower-level stuff is in biogo/tabix and biogo/csi.

@brentp commented Jan 22, 2020

Actually, it might help to look at the tabix-related issues in biogo as there were some edge-cases that differed from the spec.

@alimanfoo (Contributor)

After a bit more reading, I think I have grokked this. Here's my understanding...

Given a query region, reg2bins will return a set of bins that overlap the query region. The binning index can then be used to obtain a set of file chunks for each bin, where each file chunk contains a contiguous sequence of records that all fall into that bin.
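For reference, reg2bins from the tabix spec (Li 2011), translated directly into Python (the spec gives the equivalent C code):

```python
def reg2bins(beg, end):
    # bins that may overlap the zero-based, half-open region [beg, end)
    bins = [0]
    end -= 1
    for shift, offset in ((26, 1), (23, 9), (20, 73), (17, 585), (14, 4681)):
        bins.extend(range(offset + (beg >> shift), offset + (end >> shift) + 1))
    return bins

# e.g. for the 35-50 kb query discussed earlier:
# reg2bins(35_000, 50_000) -> [0, 1, 9, 73, 585, 4683, 4684]
```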

Note that any of these file chunks can contain records that do not overlap the query region. This can include records that end before the query start, and records that start after the query end. So, if using the binning index alone, we would read through each file chunk, but still need to parse each record and check whether it overlaps the query.

It would be possible to just use the binning index by itself. However, in files where some records span a relatively large genome region, using the binning index alone might mean seeking to and reading chunks associated with higher level bins, even though none of the records in those chunks actually overlap the query region. So the idea of the linear index is to give you a file offset that is close to the start of the first record that overlaps the query region, so you can skip over some of the chunks from higher level bins returned by the binning index.

It would also be possible to just use the linear index by itself. For files where all records span a small genome region (e.g., VCF with few or no large structural variants), this works fine, because the linear index gives you a file offset to start reading from, and it will be quite efficient to just start scanning the file from that point to find all the records overlapping the query region, and stop scanning once records are beyond the query.

However, for files like GFF where there may be some records that span large genome regions (e.g., whole chromosomes), this is not efficient, because you will always end up scanning basically from the start of the chromosome. The binning index is better, because it will give a set of chunks that can be seeked to, avoiding a lot of scanning and parsing.

In the case of a VCF with few or no large SVs, it strikes me that either the binning index by itself or the linear index by itself will perform very similarly. If using the binning index, there will still be some high level bins overlapping most queries, but they won't have any chunks associated with them (n_chunks=0) and so they won't cause any unnecessary seeking.

In the general case, given any type of file, the strategy is to use the binning index to obtain a set of file chunks, then use the linear index to exclude some of the earlier file chunks which don't contain any records that overlap the query, then start reading and parsing records within the remaining chunks, checking and returning records that overlap the query. Also in general, these chunks might not be contiguous in the file, so you would want to read a chunk, seek to the start of the next chunk, read a chunk, seek to the next chunk, etc. I.e., it would not be good to just start reading from the start of the first chunk and stop reading at the end of the last chunk.

For this implementation, I think this means that we need to decide whether to either (a) implement the general strategy for querying a tabix indexed file using the combined binning and linear indices, which will work well for VCF or GFF or whatever, or (b) implement a simpler strategy tuned for VCF that just uses the linear index, given that we know VCF files tend to have very few if any records that span larger genome regions.

for _ in range(struct.unpack('<i', f.read(4))[0]): # n_bins
bin_key = struct.unpack('<I', f.read(4))[0] # bin
n_chunk = struct.unpack('<i', f.read(4))[0] # n_chunk
if bin_key in bidx:
Contributor

n_chunk might be zero? If so, could use that to simplify a little...

Suggested change
if bin_key in bidx:
if bin_key in bidx and n_chunk > 0:

Author

I am not sure whether n_chunk can ever be zero. But the for loop already acts as the if: with zero chunks, it simply does nothing.

@alimanfoo (Contributor)

Thanks a lot @brentp, the biogo links are very useful.

I'd really appreciate just a check that my understanding of how to use a .tbi index is right.

On a different note, do you know where there is a spec for .csi?

@brentp commented Jan 22, 2020

Here is the spec for CSI. The implementation in biogo might be a good reference.
I'll try to have a look at your text in the next couple of days. Dan Kortschak, the main biogo author, did 99.9% of the tabix and csi stuff, so I have a pretty shallow understanding of it.

@yangmqglobe (Author)

Sorry for the slow response, too. Chinese New Year is coming, happy new year!
@alimanfoo did you read http://samtools.github.io/hts-specs/SAMv1.pdf? Page 17 has an example that shows how to achieve random access (.bai and .tbi use the same strategy). In my opinion the process is:

  1. use reg2bins to get bins that might contain target records
  2. use the linear index to filter out bins that won't contain target records (chunk_end < min_off)
  3. merge all adjacent bins
  4. seek and read

@alimanfoo (Contributor)

Happy new year to you! That sounds right to me, with a small edit:

  1. use reg2bins to get bins that overlap the query
  2. use binning index to find chunks containing records for each bin overlapping the query
  3. use linear index to filter out chunks that won't contain target records (chunk_end < min_off) (and possibly also truncate some chunks, e.g., where chunk_start < min_off < chunk_end?)
  4. merge all adjacent chunks
  5. seek and read

I think it is possible that some bins could have zero chunks. E.g., for a VCF file with only SNPs and small indels, I would expect all the higher level bins to have no records associated with them, and hence no chunks.

@alimanfoo (Contributor)

... and one more step I think ...

  6. read each record, parse columns containing start and end coordinates, and return only records that overlap the query.
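Putting those steps together, a minimal sketch under assumed data structures (bidx mapping bin number to a list of (chunk_beg, chunk_end) virtual offsets, lidx the linear index array; the optional truncation from step 3 is left out):

```python
def query_chunks(bidx, lidx, beg, end):
    # min_off from the linear index (16 kb windows)
    min_off = lidx[min(beg >> 14, len(lidx) - 1)]
    chunks = []
    for b in reg2bins(beg, end):                     # steps 1-2
        for chunk_beg, chunk_end in bidx.get(b, ()):
            if chunk_end > min_off:                  # step 3: linear filter
                chunks.append((chunk_beg, chunk_end))
    chunks.sort()
    merged = []                                      # step 4: merge chunks
    for chunk_beg, chunk_end in chunks:
        if merged and chunk_beg <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], chunk_end)
        else:
            merged.append([chunk_beg, chunk_end])
    # step 5: seek to and read each merged chunk; step 6: parse records
    # and keep only those overlapping [beg, end)
    return merged
```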

@yangmqglobe (Author)

I think that bins with 0 chunks are not kept in the .tbi file ...

@yangmqglobe (Author) commented Jan 23, 2020

Truncating might lose some records that start before the query region. I am planning to implement GFF random access, too.

@alimanfoo (Contributor)

Truncating might lose some records that start before the query region. I am planning to implement GFF random access, too.

IIUC min_off gives the virtual offset of the first record that overlaps the linear index bin that overlaps the query start. I.e., if you start reading from min_off you are guaranteed to find all records overlapping the query, there will be no records before. Therefore it should be safe to truncate any chunk that spans min_off. This would apply to any type of file, GFF or VCF or whatever.
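I.e., the truncation would just be (an illustrative fragment, not code from the PR):

```python
def truncate_chunk(chunk_beg, chunk_end, min_off):
    # clamp the chunk start to min_off; safe iff no record overlapping the
    # query begins before min_off, as argued above
    return max(chunk_beg, min_off), chunk_end
```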

@yangmqglobe (Author)

I think that we might miss some records; the tabix C code does not truncate the offset. We would need a concrete case to prove it either way, and that would make a necessary test case.

@alimanfoo (Contributor)

alimanfoo commented Jan 23, 2020 via email
