Implement Hamming distance for binary strings #124


Open · wants to merge 51 commits into main

Conversation

nnethercott (Contributor)

Pull Request

Had some time this weekend and wanted to try incorporating Hamming distance into arroy!

Check the comments in the PR for explanations of some of the logic.

Related issue

Fixes #102

What does this PR do?

  • Adds Hamming distance and a new {0,1}-quantized UnalignedVectorCodec impl for binary strings
  • Updates corresponding SIMD instructions for binary ops

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

@nnethercott (Contributor, Author) commented Apr 21, 2025

some logic

Brain dumping here so the PR has a bit of context :)

why a new codec?

The BinaryQuantized codec maps an element to +1 or -1 by checking the sign bit in its IEEE 754 f32 representation using elem.is_sign_positive(). This means that +0.0 gets quantized to 1, while -0.0 becomes -1. If users are indexing binary sequences (a Vec containing strictly 1.0 and 0.0 values), this previous codec would quantize everything to 1s.

In the Binary codec I added, we check whether a value x: f32 is strictly positive (i.e. x > 0.0) by doing something like:

```rust
// sign bit clear (bits < 0x8000_0000) and not +0.0 (bits > 0) => strictly positive
let bits = scalar.to_bits();
bits < 0x8000_0000 && bits > 0x0000_0000
```

I benchmarked a few approaches and this one was faster than the naive check by a nontrivial amount (~550 ps vs ~750 ps on my old macbook). Might be overkill but hey 🤠
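As a self-contained illustration of the check above (the function name is mine, not arroy's API), this is roughly what the quantization predicate boils down to:

```rust
/// Sketch of the bit-trick check above: true iff `x` is strictly positive.
/// Both +0.0 and -0.0 map to false, unlike the sign-bit check used by the
/// BinaryQuantized codec. (It agrees with `x > 0.0` except for NaN inputs.)
fn is_strictly_positive(x: f32) -> bool {
    let bits = x.to_bits();
    bits < 0x8000_0000 && bits > 0x0000_0000
}

fn main() {
    assert!(is_strictly_positive(1.0));
    assert!(!is_strictly_positive(0.0));
    assert!(!is_strictly_positive(-0.0));
    assert!(!is_strictly_positive(-1.0));
}
```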

how are splits created with hamming distance

Here I took inspiration from how it was done in annoy. Rather than building a locality-sensitive hash from separating hyperplanes as in the other distance impls, bit sampling accomplishes the same thing. The normal vector that comes out of the create_split function is just a one-hot encoded vector representing a valid splitting index.

To make sure the same index isn't selected at various depths in a given tree, we require the normal to be a valid splitting index, i.e. there's at least one element on each of Side::Left and Side::Right. This is accomplished through sampling with replacement here.

where this deviates from the source implementation

  • our normal vector is an indicator mask (all zeros with a random bit set) while in annoy they store the random index as a u32 in the first entry of the normal vector
  • the annoy code loops through all leaves in a subset multiple times, but in arroy we only have a public API for randomly choosing nodes from an ImmutableSubsetLeafs. So instead we randomly sample nodes with replacement to achieve similar behaviour
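Putting the points above together, a rough sketch of the bit-sampling idea (illustrative only, not arroy's actual create_split signature; vectors are assumed packed into bytes):

```rust
use rand::Rng;

/// Pick a random bit index that actually separates the sampled vectors and
/// return it as a one-hot "normal" mask (all zeros with that single bit set).
/// `sample` stands in for leaves drawn with replacement from the subset.
fn one_hot_split(sample: &[Vec<u8>], dim: usize, rng: &mut impl Rng) -> Option<Vec<u8>> {
    for _ in 0..200 {
        let idx = rng.gen_range(0..dim);
        let (byte, bit) = (idx / 8, idx % 8);
        let ones = sample.iter().filter(|v| (v[byte] >> bit) & 1 == 1).count();
        // a valid splitting index sends at least one vector to each side
        if ones > 0 && ones < sample.len() {
            let mut normal = vec![0u8; (dim + 7) / 8];
            normal[byte] = 1 << bit;
            return Some(normal);
        }
    }
    None // give up after a fixed budget of attempts
}
```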

interpretation of the margin

The margin isn't geometric like with the other distances; we just need to know whether a particular bit is set in a target vector. To achieve this, create_split returns a one-hot mask vector with a random bit set to 1 and we take the bitwise AND with the target vector.
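In other words, with a one-hot normal the margin reduces to a single bit test; a minimal sketch (illustrative names, not arroy's actual trait method):

```rust
/// With a one-hot `normal`, the margin is just "is the selected bit set in
/// `target`?", computed as a bitwise AND over the packed bytes.
fn bit_margin(normal: &[u8], target: &[u8]) -> bool {
    normal.iter().zip(target).any(|(n, t)| (n & t) != 0)
}
```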

hamming-bitwise-fast

In the original issue a great blog post on an efficient implementation of the Hamming distance is linked. In short, if you structure the code right the compiler will auto-vectorize things pretty well.
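For reference, the shape of code that auto-vectorizes well is essentially a XOR plus popcount loop over the packed words; a generic sketch (not the hamming-bitwise-fast source itself):

```rust
/// Hamming distance between two equal-length bit strings packed into u64 words:
/// XOR the words and count the set bits. Written this way, the compiler tends
/// to auto-vectorize the loop nicely.
fn hamming_distance(a: &[u64], b: &[u64]) -> u32 {
    debug_assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```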

@irevoire (Member)

Hey @nnethercott, sorry for the huge delay. I was on holiday for the past few weeks and I'm currently a bit in a rush: I really need to finish #118 before the next Meilisearch release, so I won't be able to give you a proper review for a little while, sorry 🙈

I quickly skimmed through your code, and it looks pretty good and well-documented. Thanks a lot for that, I think we'll be able to merge it with little change!

@nnethercott (Contributor, Author) commented May 16, 2025

> Hey @nnethercott, sorry for the huge delay. I was on holiday for the past few weeks and I'm currently a bit in a rush: I really need to finish #118 before the next Meilisearch release, so I won't be able to give you a proper review for a little while, sorry 🙈
>
> I quickly skimmed through your code, and it looks pretty good and well-documented. Thanks a lot for that, I think we'll be able to merge it with little change!

@irevoire no worries, take your time :)

Override side impl for Hamming

Using the provided implementation randomly selects a side if margin_no_header is 0.
In annoy it's [defined like](https://github.com/spotify/annoy/blob/main/src/annoylib.h#L718C5-L718C59):
```
distance - (margin != (unsigned int) child_nr);
```
where margin is a bool and, in arroy terms, child_nr=0 => Side::Left and child_nr=1 => Side::Right. Let's call the second term in the above the "penalty".

Reading this we have 4 cases for margin x side:
- margin = 0, child_nr = 0 => penalty = 0
- margin = 1, child_nr = 0 => penalty = 1
- margin = 0, child_nr = 1 => penalty = 1
- margin = 1, child_nr = 1 => penalty = 0

so it's something like margin XOR child_nr. For side=Left we get penalty = margin, and for side=Right we get penalty = 1 - margin.
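A minimal sketch of the rule this truth table implies (the names and the exact side convention are illustrative, not arroy's actual trait methods):

```rust
#[derive(Clone, Copy, PartialEq)]
enum Side {
    Left,  // annoy's child_nr = 0
    Right, // annoy's child_nr = 1
}

/// annoy's penalty term `margin != (unsigned int) child_nr`, i.e. margin XOR child_nr:
/// penalty = margin for Side::Left and penalty = 1 - margin for Side::Right.
fn penalty(margin: bool, side: Side) -> u32 {
    let child_nr = side == Side::Right;
    (margin != child_nr) as u32
}

/// With a boolean margin there is no zero case to tie-break randomly:
/// the side is fully determined by whether the selected bit is set.
fn side(margin: bool) -> Side {
    if margin { Side::Right } else { Side::Left }
}
```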
@nnethercott nnethercott marked this pull request as draft June 2, 2025 06:45
irevoire and others added 16 commits June 2, 2025 13:47
margin_no_header is a footgun here: since the signature for margin is the same, it's easy to confuse the two. This is bad since we may build a tree without using the hyperplane bias but then configure the reader to use it (for instance).

Different platforms show varying numbers of digits in the f32 Debug impl, leading to failing tests on Linux vs macOS runners. To fix this we ensure only 4 digits are printed.

Performance was terrible before, but changing margin to map to {-1,1} instead of {0,1} and keeping the trait defaults for side and pq_distance fixed things.
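Presumably the issue with a {0,1} margin is that every unset bit lands on the margin == 0 case, which the default implementation resolves by picking a random side (per the earlier commit message); mapping to {-1,1} avoids that while keeping the trait defaults. A tiny illustrative sketch, not arroy's actual signature:

```rust
/// Illustrative only: map the "is this bit set?" test to a signed margin so
/// that the generic `side`/`pq_distance` defaults, which only fall back to a
/// random side when the margin is exactly 0, behave sensibly for Hamming.
fn signed_margin(bit_is_set: bool) -> f32 {
    if bit_is_set { 1.0 } else { -1.0 }
}
```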
@nnethercott nnethercott mentioned this pull request Jun 2, 2025
@nnethercott (Contributor, Author)

Update: I'm keeping this PR as a draft until #132 is merged (hopefully), because that PR would allow us to store the splitting index as a usize in the node header instead of requiring it to be stored as a one-hot vector in the index. This means we can use way less space to store the splitting node since no vector info is needed!

A branch based off #132 with the Hamming distance implemented can be found here.

In the meantime, here are some results on the vector-store-relevancy-benchmark for 100k vectors and an index of 100 trees (recall for Hamming often significantly outperforms all other quantized formats across several datasets):
[screenshot: benchmark results, 2025-06-03]

@nnethercott nnethercott marked this pull request as ready for review June 10, 2025 09:36
@nnethercott nnethercott marked this pull request as draft June 10, 2025 09:36
@nnethercott nnethercott marked this pull request as ready for review June 10, 2025 10:12
@nnethercott (Contributor, Author) commented Jun 10, 2025

@irevoire Hamming's back on the menu 🍝

SplitPlaneNormals now store a random index in the node header and an empty vec. Turns out Hamming is pretty good on recall too (pic below on 100k vectors, didn't update the print statement :p).

Modified vector-store-relevancy-benchmark code here for Hamming (used a dummy for Qdrant since it isn't supported).

[screenshot: benchmark results, 2025-06-10]

Sorry for the 50 commits also lol, I just merged main into this. I can work some git rebase magic later, or we just squash merge this bad boy.

@irevoire (Member)

Hey, the parallelization work is almost done, it's just missing some error handling on rayon, and then it'll be ready to review.
If all the benchmarks I'll do this weekend give good results, you can expect it to be merged by Tuesday, basically, and then I'll have the time to review your PR 😁

@nnethercott (Contributor, Author) commented Jun 12, 2025

> Hey, #130 is almost done

Ayy this is massive! I tried looking into this a few days back but hit a wall, I'm glad a better mind was able to crack it ahah. Gonna check out your PR this weekend because I'm insanely curious now.

> then I'll have the time to review your PR

woot woot! Again, no rush :) I'll try to tackle #134 in the meantime!

Edit: just seeing this now, but is your pfp a Cowboy Bebop ref ahah

@irevoire (Member)

> Ayy this is massive! I tried looking into this a few days back but hit a wall, I'm glad a better mind was able to crack it ahah. Gonna check out your PR this weekend because I'm insanely curious now.

I'm so happy we didn't really lose much performance in the process! I didn't think that was possible, ahah

> Edit: just seeing this now, but is your pfp a Cowboy Bebop ref ahah

[image]

That would be me if my dog were smaller

@nnethercott (Contributor, Author)

> That would be me if my dog were smaller

see you space cowboy...
