Implement Hamming distance for binary strings #124
some logic
Brain dumping here so the PR has a bit of context :)

why a new codec?
The In the
`bits = scalar.to_bits();`
`bits < 0x8000_0000 && bits > 0x0000_0000`
I benchmarked a few approaches and this one was faster than the naive check by a nontrivial amount (~550 ps vs ~750 ps on my old macbook). Might be overkill but hey 🤠

how are splits created with hamming distance
Here I took inspiration from how it was done in annoy. Rather than building a locality-sensitive hash using separating hyperplanes as in the other distance impls, bit sampling can accomplish this. The normal vector that comes out of the

where this deviates from the source implementation

interpretation of the margin
Not geometric like with other distances; we just need to know if a particular bit is set for a target vector. To achieve this,

hamming-bitwise-fast
In the original issue a great blog post on an efficient implementation of the hamming distance is linked. Basically, if you structure the code right the compiler will auto-vectorize things pretty well.
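
For readers following along, here is a minimal sketch of the three ideas mentioned above: the raw-bits positivity check, an XOR-and-popcount Hamming distance, and the bit test used for bit sampling. It assumes vectors are packed into `u64` words; the function names (`is_strictly_positive`, `hamming_distance`, `bit_is_set`) are hypothetical and are not taken from the PR or from arroy.

```rust
/// Raw-bits check from the comment above: true when the sign bit is clear
/// and the value is not +0.0. Note that positive NaN bit patterns also pass.
fn is_strictly_positive(scalar: f32) -> bool {
    let bits = scalar.to_bits();
    bits < 0x8000_0000 && bits > 0x0000_0000
}

/// Hamming distance between two bit-packed vectors of equal length.
/// XOR marks the differing bits and `count_ones` counts them; a plain
/// loop like this is the kind of code compilers tend to auto-vectorize.
fn hamming_distance(a: &[u64], b: &[u64]) -> u32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same number of words");
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Bit sampling: instead of a geometric margin, test whether the bit at
/// `index` is set (assumed layout: bit 0 is the least significant bit of word 0).
fn bit_is_set(words: &[u64], index: usize) -> bool {
    (words[index / 64] >> (index % 64)) & 1 == 1
}
```

Under these assumptions, a split node only needs to store the sampled bit index, and a vector is routed to one side or the other depending on what `bit_is_set` returns for that index.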
Hey @nnethercott, sorry for the huge delay. I was on holiday for the past few weeks, and I'm currently in a bit of a rush: I really need to finish #118 before the next Meilisearch release, so I won't be able to give you a proper review for some time, sorry 🙈
I quickly skimmed through your code, and it looks pretty good and well-documented. Thanks a lot for that, I think we'll be able to merge it with little change!
@irevoire no worries, take your time :) Looking forward to seeing how you solve #118!
Override side impl for Hamming
Using the provided implementation randomly selects a side if the margin_no_header is 0.
in annoy it's [defined like](https://github.com/spotify/annoy/blob/main/src/annoylib.h#L718C5-L718C59)
```
distance - (margin != (unsigned int) child_nr);
```
where margin is a bool and child_nr=0 => Side::Left and child_nr=1 => Side::Right in arroy terms. Let's call the second term in the above the "penalty". Reading this we have 4 cases for margin x side:
- margin = 0, child_nr = 0 => penalty = 0
- margin = 1, child_nr = 0 => penalty = 1
- margin = 0, child_nr = 1 => penalty = 1
- margin = 1, child_nr = 1 => penalty = 0

so it's something like margin XOR child_nr. For side=Left we get penalty = margin, and for side=Right we get penalty = 1-margin.
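
The four cases above can be sketched as a small function. This is only an illustration of the XOR reasoning: the `Side` enum here is a local stand-in defined for the example, and `penalty` is a hypothetical name, neither is claimed to match arroy's actual types or signatures.

```rust
/// Local stand-in for the two children of a split node (an assumption
/// for this example, not arroy's own definition).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Side {
    Left,
    Right,
}

/// Penalty term from the annoy expression `margin != child_nr`,
/// with child_nr = 0 for Left and child_nr = 1 for Right.
fn penalty(margin: bool, side: Side) -> u32 {
    let child_nr = matches!(side, Side::Right);
    // For booleans, `!=` is exactly XOR.
    u32::from(margin != child_nr)
}
```

With this mapping, `penalty(margin, Side::Left)` equals `margin` and `penalty(margin, Side::Right)` equals `1 - margin`, matching the summary above.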
Pull Request
Had some time this weekend and wanted to try to incorporate hamming into arroy !
Check comments in the PR to explain some of the logic
Related issue
Fixes #102
What does this PR do?
Hamming distance and a new {0,1}-quantized UnalignedVectorCodec impl for binary strings
PR checklist
Please check if your PR fulfills the following requirements:
Thank you so much for contributing to Meilisearch!