Implement Hamming distance for binary strings #124


Open · wants to merge 51 commits into main

Conversation

nnethercott (Contributor)

Pull Request

Had some time this weekend and wanted to try incorporating Hamming distance into arroy!

Check the comments in the PR for explanations of some of the logic.

Related issue

Fixes #102

What does this PR do?

  • Adds Hamming distance and a new {0,1}-quantized UnalignedVectorCodec impl for binary strings
  • Updates corresponding SIMD instructions for binary ops

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

@nnethercott (Contributor, Author) commented Apr 21, 2025

some logic

Brain dumping here so the PR has a bit of context :)

why a new codec?

The BinaryQuantized codec maps an element to +1 or -1 by checking the sign bit in its IEEE 754 f32 representation using elem.is_sign_positive(). This means that +0.0 gets quantized to 1, while -0.0 becomes -1. If users are indexing binary sequences (a Vec containing strictly 1.0 and 0.0 values), this previous codec would quantize everything to 1s.

In the Binary codec I added, we check whether a value x: f32 is strictly positive (i.e. x > 0.0) by doing something like:

```rust
// sign bit clear (bits < 0x8000_0000) and not +0.0 (bits > 0) => strictly positive
let bits = scalar.to_bits();
bits < 0x8000_0000 && bits > 0x0000_0000
```

I benchmarked a few approaches and this one was faster than the naive check by a nontrivial amount (~550 ps vs ~750 ps on my old macbook). Might be overkill but hey 🤠
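As a self-contained illustration of the check above (the function name is mine, not arroy's API), this is roughly what the quantization predicate boils down to:

```rust
/// Sketch of the bit-trick check above: true iff `x` is strictly positive.
/// Both +0.0 and -0.0 map to false, unlike the sign-bit check used by the
/// BinaryQuantized codec. (It agrees with `x > 0.0` except for NaN inputs.)
fn is_strictly_positive(x: f32) -> bool {
    let bits = x.to_bits();
    bits < 0x8000_0000 && bits > 0x0000_0000
}

fn main() {
    assert!(is_strictly_positive(1.0));
    assert!(!is_strictly_positive(0.0));
    assert!(!is_strictly_positive(-0.0));
    assert!(!is_strictly_positive(-1.0));
}
```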

how are splits created with hamming distance

Here I took inspiration from how it was done in annoy. Rather than building a locality-sensitive hash from separating hyperplanes as in the other distance impls, bit sampling accomplishes the same thing. The normal vector that comes out of the create_split function is just a one-hot encoded vector representing a valid splitting index.

To make sure the same index isn't selected at various depths in a given tree, we require the normal to be a valid splitting index, i.e. there's at least one element on each of Side::Left and Side::Right. This is accomplished through sampling with replacement here.

where this deviates from the source implementation

  • our normal vector is an indicator mask (all zeros with a random bit set) while in annoy they store the random index as a u32 in the first entry of the normal vector
  • the annoy code loops through all leaves in a subset multiple times, but in arroy we only have a public API for randomly choosing nodes from an ImmutableSubsetLeafs. So instead we randomly sample nodes with replacement to achieve similar behaviour
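Putting the points above together, a rough sketch of the bit-sampling idea (illustrative only, not arroy's actual create_split signature; vectors are assumed packed into bytes):

```rust
use rand::Rng;

/// Pick a random bit index that actually separates the sampled vectors and
/// return it as a one-hot "normal" mask (all zeros with that single bit set).
/// `sample` stands in for leaves drawn with replacement from the subset.
fn one_hot_split(sample: &[Vec<u8>], dim: usize, rng: &mut impl Rng) -> Option<Vec<u8>> {
    for _ in 0..200 {
        let idx = rng.gen_range(0..dim);
        let (byte, bit) = (idx / 8, idx % 8);
        let ones = sample.iter().filter(|v| (v[byte] >> bit) & 1 == 1).count();
        // a valid splitting index sends at least one vector to each side
        if ones > 0 && ones < sample.len() {
            let mut normal = vec![0u8; (dim + 7) / 8];
            normal[byte] = 1 << bit;
            return Some(normal);
        }
    }
    None // give up after a fixed budget of attempts
}
```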

interpretation of the margin

The margin isn't geometric like with the other distances; we just need to know whether a particular bit is set in a target vector. To achieve this, create_split returns a one-hot mask vector with a random bit set to 1 and we take the bitwise AND with the target vector.
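In other words, with a one-hot normal the margin reduces to a single bit test; a minimal sketch (illustrative names, not arroy's actual trait method):

```rust
/// With a one-hot `normal`, the margin is just "is the selected bit set in
/// `target`?", computed as a bitwise AND over the packed bytes.
fn bit_margin(normal: &[u8], target: &[u8]) -> bool {
    normal.iter().zip(target).any(|(n, t)| (n & t) != 0)
}
```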

hamming-bitwise-fast

In the original issue a great blog post on an efficient implementation of the Hamming distance is linked. In short, if you structure the code right the compiler will auto-vectorize things pretty well.
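For reference, the shape of code that auto-vectorizes well is essentially a XOR plus popcount loop over the packed words; a generic sketch (not the hamming-bitwise-fast source itself):

```rust
/// Hamming distance between two equal-length bit strings packed into u64 words:
/// XOR the words and count the set bits. Written this way, the compiler tends
/// to auto-vectorize the loop nicely.
fn hamming_distance(a: &[u64], b: &[u64]) -> u32 {
    debug_assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```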

@irevoire (Member)

Hey @nnethercott, sorry for the huge delay. I was on holiday for the past few weeks and I'm currently a bit in a rush: I really need to finish #118 before the next Meilisearch release, so I won't be able to give you a proper review for a little while, sorry 🙈

I quickly skimmed through your code, and it looks pretty good and well-documented. Thanks a lot for that, I think we'll be able to merge it with little change!

@nnethercott (Contributor, Author) commented May 16, 2025

> Hey @nnethercott, sorry for the huge delay. I was on holiday for the past few weeks and I'm currently a bit in a rush: I really need to finish #118 before the next Meilisearch release, so I won't be able to give you a proper review for a little while, sorry 🙈
>
> I quickly skimmed through your code, and it looks pretty good and well-documented. Thanks a lot for that, I think we'll be able to merge it with little change!

@irevoire no worries, take your time :)

Override side impl for Hamming

Using the provided implementation randomly selects a side if margin_no_header is 0.
In annoy it's [defined like](https://github.com/spotify/annoy/blob/main/src/annoylib.h#L718C5-L718C59):
```
distance - (margin != (unsigned int) child_nr);
```
where margin is a bool and, in arroy terms, child_nr=0 => Side::Left and child_nr=1 => Side::Right. Let's call the second term in the above the "penalty".

Reading this we have 4 cases for margin x side:
- margin = 0, child_nr = 0 => penalty = 0
- margin = 1, child_nr = 0 => penalty = 1
- margin = 0, child_nr = 1 => penalty = 1
- margin = 1, child_nr = 1 => penalty = 0

so it's something like margin XOR child_nr. For side=Left we get penalty = margin, and for side=Right we get penalty = 1 - margin.
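A minimal sketch of the rule this truth table implies (the names and the exact side convention are illustrative, not arroy's actual trait methods):

```rust
#[derive(Clone, Copy, PartialEq)]
enum Side {
    Left,  // annoy's child_nr = 0
    Right, // annoy's child_nr = 1
}

/// annoy's penalty term `margin != (unsigned int) child_nr`, i.e. margin XOR child_nr:
/// penalty = margin for Side::Left and penalty = 1 - margin for Side::Right.
fn penalty(margin: bool, side: Side) -> u32 {
    let child_nr = side == Side::Right;
    (margin != child_nr) as u32
}

/// With a boolean margin there is no zero case to tie-break randomly:
/// the side is fully determined by whether the selected bit is set.
fn side(margin: bool) -> Side {
    if margin { Side::Right } else { Side::Left }
}
```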
@nnethercott nnethercott marked this pull request as draft June 2, 2025 06:45
irevoire and others added 16 commits June 2, 2025 13:47
margin_no_header is a footgun here: since the signature for margin is the same, it's easy to confuse the two. This is bad since we may build a tree without using the hyperplane bias but then configure the reader to use it (for instance).

Different platforms show varying numbers of digits in the f32 Debug impl, leading to failing tests on Linux vs macOS runners. To fix this we ensure only 4 digits are printed.

Performance was terrible before, but changing margin to map to {-1,1} instead of {0,1} and keeping the trait defaults for side and pq_distance fixed things.
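Presumably the issue with a {0,1} margin is that every unset bit lands on the margin == 0 case, which the default implementation resolves by picking a random side (per the earlier commit message); mapping to {-1,1} avoids that while keeping the trait defaults. A tiny illustrative sketch, not arroy's actual signature:

```rust
/// Illustrative only: map the "is this bit set?" test to a signed margin so
/// that the generic `side`/`pq_distance` defaults, which only fall back to a
/// random side when the margin is exactly 0, behave sensibly for Hamming.
fn signed_margin(bit_is_set: bool) -> f32 {
    if bit_is_set { 1.0 } else { -1.0 }
}
```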
@nnethercott nnethercott mentioned this pull request Jun 2, 2025
@nnethercott (Contributor, Author)

Update: I'm keeping this PR as a draft until #132 is merged (hopefully), because that PR would allow us to store the splitting index as a usize in the node header instead of requiring it to be stored as a one-hot vector in the index. This means we can use way less space to store the splitting node since no vector info is needed!

A branch based off #132 with the Hamming distance implemented can be found here.

In the meantime, here are some results on the vector-store-relevancy-benchmark for 100k vectors and an index of 100 trees (recall for Hamming often significantly outperforms all other quantized formats across several datasets):
[screenshot: benchmark results, 2025-06-03]

@nnethercott nnethercott marked this pull request as ready for review June 10, 2025 09:36
@nnethercott nnethercott marked this pull request as draft June 10, 2025 09:36
@nnethercott nnethercott marked this pull request as ready for review June 10, 2025 10:12
@nnethercott (Contributor, Author) commented Jun 10, 2025

@irevoire Hamming's back on the menu 🍝

SplitPlaneNormals now store a random index in the node header and an empty vec. Turns out Hamming is pretty good on recall too (pic below on 100k vectors, didn't update the print statement :p).

Modified vector-store-relevancy-benchmark code here for Hamming (used a dummy for Qdrant since it isn't supported).

[screenshot: benchmark results, 2025-06-10]

Sorry for the 50 commits also lol, I just merged main into this. I can work some git rebase magic later, or we just squash merge this bad boy.

@irevoire (Member)

Hey, the parallelization work is almost done, it's just missing some error handling on rayon, and then it'll be ready to review.
If all the benchmarks I'll do this weekend give good results, you can expect it to be merged by Tuesday, basically, and then I'll have the time to review your PR 😁

@nnethercott (Contributor, Author) commented Jun 12, 2025

> Hey, #130 is almost done

Ayy this is massive! I tried looking into this a few days back but hit a wall, I'm glad a better mind was able to crack it ahah. Gonna check out your PR this weekend because I'm insanely curious now.

> then I'll have the time to review your PR

woot woot! Again, no rush :) I'll try to tackle #134 in the meantime!

Edit: just seeing this now, but is your pfp a Cowboy Bebop ref ahah

@irevoire (Member)

> Ayy this is massive! I tried looking into this a few days back but hit a wall, I'm glad a better mind was able to crack it ahah. Gonna check out your PR this weekend because I'm insanely curious now.

I'm so happy we didn't really lose much performance in the process! I didn't think that was possible, ahah

> Edit: just seeing this now, but is your pfp a Cowboy Bebop ref ahah

[image]

That would be me if my dog were smaller

@nnethercott (Contributor, Author)

> That would be me if my dog were smaller

see you space cowboy...
