SIMD-optimize multi-bit RaBitQ inner product #4850
Summary: The multi-bit RaBitQ distance computation (`compute_full_multibit_distance`) previously extracted each code value bit by bit using `extract_code_inline`, which iterated `ex_bits` times per dimension, for O(d × ex_bits) total work with a data-dependent branch per bit. This diff replaces it with two complementary optimizations:

**1. Improved scalar extraction (all platforms):** Replaces the per-bit extraction loop with a 64-bit window read (`memcpy` + shift + mask) that extracts each code value in O(1) regardless of `ex_bits`. This alone gives a 25–142% QPS improvement (higher gains at more bits).

**2. SIMD bit-plane decomposition (AVX2 + BMI2):** Instead of extracting per-element multi-bit codes, decomposes the inner product into `(1 + ex_bits)` bit-plane dot products. Each plane is a float × bit-vector dot product computed via bit→mask→float conversion. For `ex_bits == 1`, both sign and ex are 1-bit packed, enabling zero-extraction kernels (AVX-512 and AVX2). For `ex_bits` 2–7, BMI2 PEXT extracts each bit plane in one instruction per 8 dimensions.

Also adds `-mbmi2` to the AVX2 compiler flags in `xplat.bzl`.

Recall@10 is identical across all `nb_bits` before and after.

Differential Revision: D94587233
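For readers who want to see the two ideas in isolation, here is a minimal scalar sketch. It is not the code from this diff. It assumes codes are packed back to back and LSB-first, that the host is little-endian, and that the buffer carries 8 bytes of padding so the 64-bit read stays in bounds. The names `pack_codes`, `extract_code_window`, `ip_direct`, and `ip_bitplane` are hypothetical. The sketch covers only the `ex_bits` planes of the extra code, not the sign plane, and uses no intrinsics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical packer: stores each value in ex_bits bits, LSB-first,
// and appends 8 zero bytes so a 64-bit window read never overruns.
std::vector<std::uint8_t> pack_codes(
        const std::vector<std::uint32_t>& values, int ex_bits) {
    std::vector<std::uint8_t> out((values.size() * ex_bits + 7) / 8 + 8, 0);
    for (std::size_t i = 0; i < values.size(); ++i) {
        for (int b = 0; b < ex_bits; ++b) {
            std::size_t p = i * ex_bits + b;
            if ((values[i] >> b) & 1u) {
                out[p >> 3] |= static_cast<std::uint8_t>(1u << (p & 7));
            }
        }
    }
    return out;
}

// Window read: one memcpy, one shift, one mask, independent of ex_bits.
// Assumes a little-endian host and ex_bits in [1, 7].
inline std::uint32_t extract_code_window(
        const std::uint8_t* codes, std::size_t i, int ex_bits) {
    assert(ex_bits >= 1 && ex_bits <= 7);
    std::size_t bit_pos = i * static_cast<std::size_t>(ex_bits);
    std::uint64_t window;
    std::memcpy(&window, codes + (bit_pos >> 3), sizeof(window));
    std::uint64_t mask = (std::uint64_t(1) << ex_bits) - 1;
    return static_cast<std::uint32_t>((window >> (bit_pos & 7)) & mask);
}

// Direct inner product: sum_i q[i] * code[i].
float ip_direct(const float* q, const std::uint8_t* codes,
                std::size_t d, int ex_bits) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < d; ++i) {
        sum += q[i] * static_cast<float>(extract_code_window(codes, i, ex_bits));
    }
    return sum;
}

// Bit-plane form: sum_b 2^b * (sum of q[i] over elements whose bit b is set).
// Each inner loop is a float x bit-vector dot product.
float ip_bitplane(const float* q, const std::uint8_t* codes,
                  std::size_t d, int ex_bits) {
    float total = 0.0f;
    for (int b = 0; b < ex_bits; ++b) {
        float plane = 0.0f;
        for (std::size_t i = 0; i < d; ++i) {
            if ((extract_code_window(codes, i, ex_bits) >> b) & 1u) {
                plane += q[i];
            }
        }
        total += plane * static_cast<float>(1u << b);
    }
    return total;
}
```

The two inner products agree exactly when the inputs are small integers. With general floats they can differ by rounding, because the summation order changes. A SIMD kernel would replace the inner loop of `ip_bitplane` with a mask-and-add over 8 or 16 lanes at a time.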
@alibeklfc has exported this pull request. If you are a Meta employee, you can view the originating Diff in D94587233.
@alibeklfc BMI2 PEXT is VERY slow on AMD CPUs before Zen 3 (Zen 2 and below), where it is microcoded.
This pull request has been merged in 8af77fe.
Hi @alibeklfc and @alexanderguzhva, Does Faiss support BMI2 in the open-source build? I noticed BMI2 checks in the code guards, but I do not see it enabled in the public compile targets. I imagine The reason I ask is that I am experimenting with a feature that relies on PEXT (specifically for PQ with Panorama). For now I have been adding the Would it be acceptable to add this flag to the OSS build configuration as well, or is there a reason it is intentionally omitted? What would be the best workaround? Thanks.