Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ntHash with a Phred filter #107

Open
jwcodee opened this issue Jul 7, 2023 · 6 comments
Open

ntHash with a Phred filter #107

jwcodee opened this issue Jul 7, 2023 · 6 comments
Assignees
Labels
enhancement New feature or request

Comments

@jwcodee
Copy link
Member

jwcodee commented Jul 7, 2023

Enhancement:
I am thinking we could have a ntHash iterator with a phred filter. basically we could add a member which is a rolling sum of the Phred quality rather than the average. Every time we compute the k-mer hash, we check the quality score. If pass, we output, otherwise roll to the next hash.

@jwcodee jwcodee added the enhancement New feature or request label Jul 7, 2023
@parham-k
Copy link
Member

parham-k commented Jul 7, 2023

Good idea, I like it. Is it something we'd like to implement directly in ntHash though? I think it can be a separate class (as a template, e.g., PhredFilter<NtHash>) that rolls ntHash and updates the Phred sum in each iteration.

@jwcodee
Copy link
Member Author

jwcodee commented Jul 7, 2023

Yeah that would be my approach too. The new class checks the phred sum / avg and rolls the nthash iterator as necessary. Are you thinking of template because of seednthash?

The other thing I'm not 100% on is if we should do rolling sum or keep an array of quality score and make sure no single quality score is below the threshold.

@jwcodee jwcodee self-assigned this Jul 7, 2023
@parham-k
Copy link
Member

parham-k commented Jul 7, 2023

Are you thinking of template because of seednthash?

Yeah basically I was thinking that since we have four ntHash classes, it'd be unusual to have a new feature implemented in only one of them. Although I'm not sure if Phread scores in spaced seeds make sense.

The other thing I'm not 100% on is if we should do rolling sum or keep an array of quality score and make sure no single quality score is below the threshold.

For that we can just keep a rolling minimum.

@jwcodee
Copy link
Member Author

jwcodee commented Jul 7, 2023

Are you thinking of template because of seednthash?

Yeah basically I was thinking that since we have four ntHash classes, it'd be unusual to have a new feature implemented in only one of them. Although I'm not sure if Phread scores in spaced seeds make sense.

Aren't they all the same nthash class under the hood with different constructor? We could just have four different constructor for phredntHash too

The other thing I'm not 100% on is if we should do rolling sum or keep an array of quality score and make sure no single quality score is below the threshold.

For that we can just keep a rolling minimum.

Hmm don't think that can work because what happens if you have multiple bases that violate your threshold. I think the best way is to skip k bases when you encounter low quality bases.

@parham-k
Copy link
Member

parham-k commented Jul 7, 2023

Are you thinking of template because of seednthash?

Yeah basically I was thinking that since we have four ntHash classes, it'd be unusual to have a new feature implemented in only one of them. Although I'm not sure if Phread scores in spaced seeds make sense.

Aren't they all the same nthash class under the hood with different constructor? We could just have four different constructor for phredntHash too

They have the same methods, but there's no OO relationship between them. They're more like four separate classes, so we'd have to implement something like SeedPhreadNtHash and also two blind versions-BlindPhreadNtHash and BlindSeedPhreadNtHash :D

I'm more or less on the template class side, it'll be much cleaner. We can chat about this more, your call eventually.

The other thing I'm not 100% on is if we should do rolling sum or keep an array of quality score and make sure no single quality score is below the threshold.

For that we can just keep a rolling minimum.

Hmm don't think that can work because what happens if you have multiple bases that violate your threshold. I think the best way is to skip k bases when you encounter low quality bases.

I was thinking of the range minimum query (RMQ) problem, but since we're sliding a window, we should be able to use a deque even in a case of multiple low-quality bases (e.g., this solution?)

@jwcodee
Copy link
Member Author

jwcodee commented Jul 7, 2023

Are you thinking of template because of seednthash?

Yeah basically I was thinking that since we have four ntHash classes, it'd be unusual to have a new feature implemented in only one of them. Although I'm not sure if Phread scores in spaced seeds make sense.

Aren't they all the same nthash class under the hood with different constructor? We could just have four different constructor for phredntHash too

They have the same methods, but there's no OO relationship between them. They're more like four separate classes, so we'd have to implement something like SeedPhreadNtHash and also two blind versions-BlindPhreadNtHash and BlindSeedPhreadNtHash :D

I'm more or less on the template class side, it'll be much cleaner. We can chat about this more, your call eventually.

In that case, let's do template then.

The other thing I'm not 100% on is if we should do rolling sum or keep an array of quality score and make sure no single quality score is below the threshold.

For that we can just keep a rolling minimum.

Hmm don't think that can work because what happens if you have multiple bases that violate your threshold. I think the best way is to skip k bases when you encounter low quality bases.

I was thinking of the range minimum query (RMQ) problem, but since we're sliding a window, we should be able to use a deque even in a case of multiple low-quality bases (e.g., this solution?)

This will need some discussion too since we need to decide which one to go with. I do worry that on a base lvl it might be too stringent.

Of course we could implement both

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants