
Implement Hamming distance for binary strings #124


Open: wants to merge 51 commits into base `main`

Changes from all commits (51 commits)
7b45b21
add zero-one binary codec
nnethercott Apr 21, 2025
ad8d2cf
update instructions for to_vec_neon
nnethercott Apr 21, 2025
e541ed8
implement hamming distance
nnethercott Apr 21, 2025
217a982
update tests
nnethercott Apr 21, 2025
18aeda2
remove copy pasta
nnethercott Apr 23, 2025
870fa98
cargo fmt
nnethercott Apr 25, 2025
1bfe320
Override side impl for Hamming
nnethercott May 16, 2025
be52d8e
add pq_distance for hamming
nnethercott May 19, 2025
9b5d161
make normalized_distance like in annoy
nnethercott May 19, 2025
bd90930
oops untrack benches
nnethercott May 19, 2025
bc9ed83
replace binary heap with median-based top k
nnethercott May 20, 2025
e620482
style: rename function
nnethercott May 20, 2025
e0fb20c
clean up helper
nnethercott May 21, 2025
1076d9e
Update rustc to 1.82
irevoire May 15, 2025
7860c28
handle none as a valid splitplane normal
irevoire May 15, 2025
e2abf8b
add an upgrade function
irevoire May 19, 2025
b174c8d
add a test on the dumpless upgrade
irevoire May 20, 2025
29f2725
fmt
irevoire May 20, 2025
49ac5f6
max to min
nnethercott May 22, 2025
b96afab
add trait bounds to median-based top k
nnethercott May 23, 2025
5d9d4ad
add proptest
nnethercott May 25, 2025
35d39e2
rename helper function
nnethercott May 25, 2025
c8b7edb
fix range in proptest
nnethercott May 26, 2025
40d5484
update the writer
irevoire May 21, 2025
219b914
Update the version in the cargo toml
irevoire May 21, 2025
a2604e4
Handle the new kind of nodes in the reader as well
irevoire May 21, 2025
014d3ac
implement the upgrade process with the new split nodes
irevoire May 21, 2025
27032cc
fmt
irevoire May 21, 2025
4e4cfb8
improve the upgrade test a lot
irevoire May 21, 2025
1ed168c
make clippy happy
irevoire May 21, 2025
41b4103
Update src/node.rs
irevoire May 22, 2025
f8444eb
apply review comment
irevoire May 22, 2025
24ee88d
Add a failing test triggering the bug when we don't update all the sp…
irevoire May 22, 2025
22544a4
Fix the broken update and test
irevoire May 22, 2025
2440418
Add link to quickwit
nnethercott May 26, 2025
9e4997f
apply review suggestions, simplify top k fn
nnethercott May 26, 2025
27663eb
feat: make normal a Leaf<'_, Distance>
nnethercott May 29, 2025
61f2cc6
kill margin_no_header
nnethercott May 29, 2025
b021b42
feat: add quantized bias
nnethercott May 30, 2025
87e0dfe
cargo fmt
nnethercott May 31, 2025
b09d689
bump tests
nnethercott May 31, 2025
de76c36
add debug impls for all node headers
nnethercott May 31, 2025
6b3459a
revert clippy fix for this line
nnethercott May 31, 2025
ae043a1
reupdate snapshots after new debug impl
nnethercott May 31, 2025
63b5426
Merge branch 'add-header-to-normal' into hamming-with-header
nnethercott Jun 2, 2025
05063f8
move sampled bit to header
nnethercott Jun 2, 2025
4e79990
feat: modify margin, reuse mod::side and mod::pq_distance
nnethercott Jun 2, 2025
bb71c68
don't expand vec in margin
nnethercott Jun 2, 2025
84d3f80
Merge branch 'main' into hamming-with-header
nnethercott Jun 10, 2025
57ca678
rebase on main
nnethercott Jun 10, 2025
039ecab
appease the almighty clippy gods
nnethercott Jun 10, 2025
7 changes: 7 additions & 0 deletions proptest-regressions/unaligned_vector/binary_test.txt
@@ -0,0 +1,7 @@
# Seeds for failure cases proptest has generated in the past. It is
# automatically read and these particular cases re-run before any
# novel cases are generated.
#
# It is recommended to check this file in to source control so that
# everyone who runs the test benefits from these saved cases.
cc 4044d46b46fbadeb98b1c37a1b7ec57f00fe21008401479d30f2f39aa83e3195 # shrinks to original = [0.0]
148 changes: 148 additions & 0 deletions src/distance/hamming.rs
@@ -0,0 +1,148 @@
use std::fmt;

use crate::distance::Distance;
use crate::node::Leaf;
use crate::parallel::ImmutableSubsetLeafs;
use crate::unaligned_vector::{Binary, UnalignedVector};
use bytemuck::{Pod, Zeroable};
use rand::Rng;

/// The Hamming distance between two vectors is the number of positions at
/// which the corresponding symbols are different.
///
/// `d(u,v) = ||u ^ v||₁`
///
/// /!\ This distance function is binary, meaning vectors lose all precision:
/// each scalar value is converted to `0` or `1` under the rule
/// `x > 0.0 => 1`, otherwise `0`.
#[derive(Debug, Clone)]
pub enum Hamming {}

/// The header of Hamming leaf nodes.
#[repr(C)]
#[derive(Pod, Zeroable, Clone, Copy)]
pub struct NodeHeaderHamming {
idx: usize,
}
impl fmt::Debug for NodeHeaderHamming {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
f.debug_struct("NodeHeaderHamming").field("idx", &format!("{}", self.idx)).finish()
}
}

impl Distance for Hamming {
const DEFAULT_OVERSAMPLING: usize = 3;

type Header = NodeHeaderHamming;
type VectorCodec = Binary;

fn name() -> &'static str {
"hamming"
}

fn new_header(_vector: &UnalignedVector<Self::VectorCodec>) -> Self::Header {
NodeHeaderHamming { idx: 0 }
}

fn built_distance(p: &Leaf<Self>, q: &Leaf<Self>) -> f32 {
hamming_bitwise_fast(p.vector.as_bytes(), q.vector.as_bytes())
}

fn normalized_distance(d: f32, _: usize) -> f32 {
d
}

fn norm_no_header(v: &UnalignedVector<Self::VectorCodec>) -> f32 {
v.as_bytes().iter().map(|b| b.count_ones() as i32).sum::<i32>() as f32
}

fn init(_node: &mut Leaf<Self>) {}

fn create_split<'a, R: Rng>(
children: &'a ImmutableSubsetLeafs<Self>,
rng: &mut R,
) -> heed::Result<Leaf<'a, Self>> {
// Unlike other distances, which build a separating hyperplane, we
// construct an LSH by bit sampling, storing the sampled bit index
// in the node header.
// https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Bit_sampling_for_Hamming_distance

const ITERATION_STEPS: usize = 200;

let is_valid_split = |n: &Leaf<'a, Self>, rng: &mut R| {
let mut count = 0;
for _ in 0..ITERATION_STEPS {
let u = children.choose(rng)?.unwrap();
if <Self as Distance>::margin(n, &u).is_sign_positive() {
count += 1;
}
}
Ok::<bool, heed::Error>(count > 0 && count < ITERATION_STEPS)
};

// first try random index
let dim = children.choose(rng)?.unwrap().vector.len();
let idx = rng.gen_range(0..dim);
let mut normal =
Leaf { header: NodeHeaderHamming { idx }, vector: UnalignedVector::from_vec(vec![]) };

if is_valid_split(&normal, rng)? {
return Ok(normal);
}

// otherwise brute-force search for a splitting coordinate
for j in 0..dim {
normal.header.idx = j;
if is_valid_split(&normal, rng)? {
return Ok(normal);
}
}

// fallback
Ok(normal)
}

fn margin(n: &Leaf<Self>, q: &Leaf<Self>) -> f32 {
let v = q.vector.as_bytes();
let byte = n.header.idx / 8;
let bit = n.header.idx % 8;
if (v[byte] >> bit) & 1 == 1 {
1.0
} else {
-1.0
}
}
}

#[inline]
pub fn hamming_bitwise_fast(u: &[u8], v: &[u8]) -> f32 {
// Based on: https://github.com/emschwartz/hamming-bitwise-fast
// Explicitly structuring the code as below lends itself to SIMD optimizations
// by the compiler: https://matklad.github.io/2023/04/09/can-you-trust-a-compiler-to-optimize-your-code.html
assert_eq!(u.len(), v.len());

type BitPackedWord = u64;
const CHUNK_SIZE: usize = std::mem::size_of::<BitPackedWord>();

let mut distance = u
.chunks_exact(CHUNK_SIZE)
.zip(v.chunks_exact(CHUNK_SIZE))
.map(|(u_chunk, v_chunk)| {
let u_val = BitPackedWord::from_ne_bytes(u_chunk.try_into().unwrap());
let v_val = BitPackedWord::from_ne_bytes(v_chunk.try_into().unwrap());
(u_val ^ v_val).count_ones()
})
.sum::<u32>();

if u.len() % CHUNK_SIZE != 0 {
distance += u
.chunks_exact(CHUNK_SIZE)
.remainder()
.iter()
.zip(v.chunks_exact(CHUNK_SIZE).remainder())
.map(|(u_byte, v_byte)| (u_byte ^ v_byte).count_ones())
.sum::<u32>();
}

distance as f32
}
2 changes: 2 additions & 0 deletions src/distance/mod.rs
Expand Up @@ -11,6 +11,7 @@ use bytemuck::{Pod, Zeroable};
pub use cosine::{Cosine, NodeHeaderCosine};
pub use dot_product::{DotProduct, NodeHeaderDotProduct};
pub use euclidean::{Euclidean, NodeHeaderEuclidean};
pub use hamming::{Hamming, NodeHeaderHamming};
use heed::{RwPrefix, RwTxn};
pub use manhattan::{Manhattan, NodeHeaderManhattan};
use rand::Rng;
Expand All @@ -27,6 +28,7 @@ mod binary_quantized_manhattan;
mod cosine;
mod dot_product;
mod euclidean;
mod hamming;
mod manhattan;

fn new_leaf<D: Distance>(vec: Vec<f32>) -> Leaf<'static, D> {
Expand Down
4 changes: 2 additions & 2 deletions src/lib.rs
Expand Up @@ -114,7 +114,7 @@ pub mod internals {
pub use crate::distance::{
NodeHeaderBinaryQuantizedCosine, NodeHeaderBinaryQuantizedEuclidean,
NodeHeaderBinaryQuantizedManhattan, NodeHeaderCosine, NodeHeaderDotProduct,
NodeHeaderEuclidean, NodeHeaderManhattan,
NodeHeaderEuclidean, NodeHeaderHamming, NodeHeaderManhattan,
};
pub use crate::key::KeyCodec;
pub use crate::node::{Leaf, NodeCodec};
Expand Down Expand Up @@ -145,7 +145,7 @@ pub mod internals {
pub mod distances {
pub use crate::distance::{
BinaryQuantizedCosine, BinaryQuantizedEuclidean, BinaryQuantizedManhattan, Cosine,
DotProduct, Euclidean, Manhattan,
DotProduct, Euclidean, Hamming, Manhattan,
};
}

Expand Down
55 changes: 55 additions & 0 deletions src/tests/binary.rs
@@ -0,0 +1,55 @@
use crate::{
distance::Hamming,
tests::{create_database, rng},
Writer,
};

#[test]
fn write_and_retrieve_binary_vector() {
let handle = create_database::<Hamming>();
let mut wtxn = handle.env.write_txn().unwrap();
let writer = Writer::new(handle.database, 0, 16);
writer
.add_item(
&mut wtxn,
0,
&[
-2.0, -1.0, 0.0, -0.1, 2.0, 2.0, -12.4, 21.2, -2.0, -1.0, 0.0, 1.0, 2.0, 2.0,
-12.4, 21.2,
],
)
.unwrap();
let vec = writer.item_vector(&wtxn, 0).unwrap().unwrap();
insta::assert_debug_snapshot!(vec, @r###"
[
0.0,
0.0,
0.0,
0.0,
1.0,
1.0,
0.0,
1.0,
0.0,
0.0,
0.0,
1.0,
1.0,
1.0,
0.0,
1.0,
]
"###);

writer.builder(&mut rng()).n_trees(1).build(&mut wtxn).unwrap();
wtxn.commit().unwrap();

insta::assert_snapshot!(handle, @r#"
==================
Dumping index 0
Root: Metadata { dimensions: 16, items: RoaringBitmap<[0]>, roots: [0], distance: "hamming" }
Version: Version { major: 0, minor: 7, patch: 0 }
Tree 0: Descendants(Descendants { descendants: [0] })
Item 0: Leaf(Leaf { header: NodeHeaderHamming { idx: "0" }, vector: [0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 1.0000, 0.0000, 1.0000, 0.0000, 0.0000, "other ..."] })
"#);
}
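The snapshot above follows from the codec's binarization rule. The actual `Binary` codec is not part of this diff; the sketch below only illustrates the documented `x > 0.0 => 1, otherwise 0` conversion, and the LSB-first bit packing is an assumption inferred from how `margin` indexes bits.

```rust
// Illustrative only: the crate's `Binary` codec is not shown in this PR.
// Applies the rule `x > 0.0 => 1, otherwise 0`, packing 8 scalars per
// byte, least-significant bit first (assumption, consistent with `margin`).
fn binarize(v: &[f32]) -> Vec<u8> {
    let mut out = vec![0u8; (v.len() + 7) / 8];
    for (i, &x) in v.iter().enumerate() {
        if x > 0.0 {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}

fn main() {
    // First half of the test vector above: only indices 4, 5 and 7 are > 0,
    // matching the leading [0, 0, 0, 0, 1, 1, 0, 1] of the snapshot.
    let bits = binarize(&[-2.0, -1.0, 0.0, -0.1, 2.0, 2.0, -12.4, 21.2]);
    assert_eq!(bits, vec![0b1011_0000]);
    println!("ok");
}
```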
1 change: 1 addition & 0 deletions src/tests/mod.rs
Expand Up @@ -11,6 +11,7 @@ use tempfile::TempDir;
use crate::version::VersionCodec;
use crate::{Database, Distance, MetadataCodec, NodeCodec, NodeMode, Reader};

mod binary;
mod binary_quantized;
mod reader;
mod upgrade;
Expand Down