Implement Hamming distance for binary strings #124

Open: wants to merge 10 commits into base: main

Conversation

nnethercott

Pull Request

Had some time this weekend and wanted to try incorporating Hamming distance into arroy!

Check comments in the PR to explain some of the logic

Related issue

Fixes #102

What does this PR do?

  • Adds Hamming distance and a new {0,1}-quantized UnalignedVectorCodec impl for binary strings
  • Updates corresponding SIMD instructions for binary ops

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

@nnethercott
Author

nnethercott commented Apr 21, 2025

some logic

Brain dumping here so the PR has a bit of context :)

why a new codec?

The BinaryQuantized codec maps an element to +1 or -1 by checking the sign bit in its IEEE 754 f32 representation using elem.is_sign_positive(). This means that +0.0 gets quantized to 1, while -0.0 becomes -1. If users are indexing binary sequences (vectors whose values are strictly 1.0 or 0.0), this previous codec would quantize everything to 1s.
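
To make the problem concrete, here is a minimal sketch contrasting the two quantization rules. The function names are hypothetical stand-ins written for this comment, not the actual codec methods in arroy.

```rust
/// Sign-based rule as described above: +1 if the sign bit is clear, -1 otherwise.
/// (Hypothetical helper, not arroy's API.)
fn quantize_by_sign(x: f32) -> i8 {
    if x.is_sign_positive() { 1 } else { -1 }
}

/// Strict-positivity rule mapping to {0, 1}.
/// (Hypothetical helper, not arroy's API.)
fn quantize_strictly_positive(x: f32) -> u8 {
    (x > 0.0) as u8
}
```

With the sign-based rule both 1.0 and 0.0 map to +1, so a binary input loses all its information; the strict-positivity rule keeps them apart.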

In the Binary codec I added, we check whether a value x: f32 is strictly positive (i.e. x > 0.0) by doing something like

 let bits = scalar.to_bits();
 // sign bit clear and at least one other bit set
 bits < 0x8000_0000 && bits > 0x0000_0000 

I benchmarked a few approaches and this one was faster than the naive check by a nontrivial amount (~550 ps vs ~750 ps on my old macbook). Might be overkill but hey 🤠
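
For reference, a small self-contained sketch of the bit check next to the naive comparison. The function name is hypothetical. One caveat worth stating: the two checks agree on every non-NaN value, but a NaN whose sign bit is clear passes the bit check while `x > 0.0` is false, so this only matters if inputs can contain NaN.

```rust
/// Bit-level strict positivity check (hypothetical name).
/// True when the sign bit is clear and the value is not +0.0.
fn is_strictly_positive_bits(x: f32) -> bool {
    let bits = x.to_bits();
    // Below 0x8000_0000 means the sign bit is clear; above 0 excludes +0.0.
    bits < 0x8000_0000 && bits > 0x0000_0000
}

/// The naive check, for comparison.
fn is_strictly_positive_naive(x: f32) -> bool {
    x > 0.0
}
```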

how are splits created with hamming distance

Here I took inspiration from how it was done in annoy. Rather than building a locality-sensitive hash using separating hyperplanes as in the other distance impls, bit sampling can accomplish this. The normal vector that comes out of the create_split function is just a one-hot encoded vector representing a valid splitting index.

where this deviates from the source implementation

  • our normal vector is an indicator mask (all zeros with a random bit set) while in annoy they store the random index as a u32 in the first entry of the normal vector
  • the annoy code loops through all leaves in a subset multiple times, but in arroy we only have a public API for randomly choosing nodes from an ImmutableSubsetLeafs, so instead we randomly sample nodes with replacement to achieve similar behaviour
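
A rough sketch of the bit-sampling idea, under several assumptions: vectors are packed into u64 words, the tiny xorshift generator stands in for a real RNG, and the acceptance rule (a bit is kept if the sample contains both set and unset values) is just one plausible choice. None of these names are arroy's or annoy's actual code.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crate (hypothetical helper).
struct XorShift64(u64);

impl XorShift64 {
    /// Returns a pseudo-random index in 0..n. `n` must be non-zero.
    fn next_below(&mut self, n: usize) -> usize {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // Use the high bits, which are better distributed than the low ones.
        ((self.0 >> 33) as usize) % n
    }
}

/// Returns true if bit `i` is set in a vector packed into u64 words.
fn bit_is_set(words: &[u64], i: usize) -> bool {
    (words[i / 64] >> (i % 64)) & 1 == 1
}

/// Try to find a bit that separates a random sample of `leaves`.
/// Returns a one-hot mask with that bit set, or None if no attempt succeeded.
fn create_split_sketch(
    leaves: &[Vec<u64>],
    n_bits: usize,
    rng: &mut XorShift64,
    attempts: usize,
    samples: usize,
) -> Option<Vec<u64>> {
    if leaves.is_empty() || n_bits == 0 {
        return None;
    }
    let n_words = (n_bits + 63) / 64;
    for _ in 0..attempts {
        let bit = rng.next_below(n_bits);
        let (mut set, mut unset) = (0usize, 0usize);
        // Sample leaves with replacement instead of iterating over all of them.
        for _ in 0..samples {
            let leaf = &leaves[rng.next_below(leaves.len())];
            if bit_is_set(leaf, bit) {
                set += 1;
            } else {
                unset += 1;
            }
        }
        // Accept the bit only if it puts sampled leaves on both sides.
        if set > 0 && unset > 0 {
            let mut mask = vec![0u64; n_words];
            mask[bit / 64] |= 1u64 << (bit % 64);
            return Some(mask);
        }
    }
    None
}
```

If every leaf agrees on every bit, no attempt can succeed and the sketch returns None; a real implementation needs some fallback for that case.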

interpretation of the margin

The margin is not geometric as it is with the other distances; we just need to know whether a particular bit is set in the target vector. To achieve this, create_split returns a one-hot mask vector with a random bit set to 1, and we take the logical AND of that mask with the target vector.
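
As a minimal sketch of that check, assuming both the mask and the target are packed into u64 words of equal length (the function name is hypothetical):

```rust
/// Returns true when the bit selected by the one-hot `mask` is set in `target`.
/// Assumes both slices use the same packed u64 layout (an assumption of this sketch).
fn margin_sketch(mask: &[u64], target: &[u64]) -> bool {
    // ANDing word by word leaves a non-zero word only where the masked bit is set.
    mask.iter().zip(target).any(|(m, t)| m & t != 0)
}
```

For example, with a mask selecting bit 3, a target of 0b1000 gives true and a target of 0b0111 gives false.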

hamming-bitwise-fast

In the original issue a great blog post on an efficient implementation of the Hamming distance is linked. Roughly speaking, if you structure the code right, the compiler will auto-vectorize things pretty well.
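
To illustrate the general shape, here is a sketch of a chunked XOR-and-popcount loop over byte slices. It is written for this comment under the assumption that both inputs have equal length; it is not the blog post's code nor the code in this PR, and whether it auto-vectorizes depends on the compiler and target.

```rust
use std::convert::TryInto;

/// Hamming distance between two equal-length byte slices (hypothetical helper).
fn hamming_bytes(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len(), "inputs must have the same length");
    let mut a_chunks = a.chunks_exact(8);
    let mut b_chunks = b.chunks_exact(8);
    let mut dist = 0u32;
    // Process 8 bytes at a time as u64 words: XOR marks differing bits,
    // count_ones counts them.
    for (x, y) in a_chunks.by_ref().zip(b_chunks.by_ref()) {
        let x = u64::from_ne_bytes(x.try_into().unwrap());
        let y = u64::from_ne_bytes(y.try_into().unwrap());
        dist += (x ^ y).count_ones();
    }
    // Handle the trailing bytes that did not fill a whole u64.
    for (x, y) in a_chunks.remainder().iter().zip(b_chunks.remainder()) {
        dist += (x ^ y).count_ones();
    }
    dist
}
```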

@irevoire
Member

Hey @nnethercott, sorry for the huge delay. I was on holiday for the past few weeks and I'm currently in a bit of a rush: I really need to finish #118 before the next Meilisearch release, so I won't be able to give you a proper review for a while, sorry 🙈

I quickly skimmed through your code, and it looks pretty good and well-documented. Thanks a lot for that, I think we'll be able to merge it with little change!

@nnethercott
Author

@irevoire no worries, take your time :)

Looking forward to seeing how you solve #118 !

Override side impl for Hamming

The provided implementation randomly selects a side if
margin_no_header is 0.
In annoy it's [defined like](https://github.com/spotify/annoy/blob/main/src/annoylib.h#L718C5-L718C59)
```
distance - (margin != (unsigned int) child_nr);
```
where margin is a bool and child_nr=0 => Side::Left and child_nr=1 =>
Side::Right in arroy terms. Let's call the second term in the expression
above the "penalty".

Reading this, we have 4 cases for margin x side:
- margin = 0, child_nr = 0 => penalty = 0
- margin = 1, child_nr = 0 => penalty = 1
- margin = 0, child_nr = 1 => penalty = 1
- margin = 1, child_nr = 1 => penalty = 0

So it's effectively margin XOR child_nr. For side=Left we get
penalty = margin, and for side=Right we get penalty = 1 - margin.
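
The four cases can be sketched as a small function. The `Side` enum here is a local stand-in defined for illustration, and `penalty` is a hypothetical name rather than a method in arroy.

```rust
/// Local stand-in for the two children of a split node.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Side {
    Left,
    Right,
}

/// Penalty term: 1 when the margin bit disagrees with the chosen side, else 0.
fn penalty(margin: bool, side: Side) -> u32 {
    // child_nr = 0 for Left and 1 for Right, expressed as a bool.
    let child_nr = matches!(side, Side::Right);
    // XOR on bools is the same as `margin != child_nr`.
    (margin ^ child_nr) as u32
}
```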
Successfully merging this pull request may close these issues.

Introduce Hamming distance for binary quantization
2 participants