
SIMD-optimize multi-bit RaBitQ inner product#4850

Closed
alibeklfc wants to merge 1 commit into
facebookresearch:mainfrom
alibeklfc:export-D94587233

Conversation

@alibeklfc
Contributor

Summary:
The multi-bit RaBitQ distance computation (`compute_full_multibit_distance`) previously extracted each code value bit-by-bit using `extract_code_inline`, which iterated `ex_bits` times per dimension — O(d × ex_bits) total with a data-dependent branch per bit.

This diff replaces it with two complementary optimizations:

**1. Improved scalar extraction (all platforms):**
Replaces the per-bit extraction loop with a 64-bit window read (`memcpy` + shift + mask) that extracts each code value in O(1) regardless of `ex_bits`. This alone gives 25–142% QPS improvement (higher gains at more bits).
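The difference between the two scalar approaches can be sketched as follows. This is an illustration only, not the Faiss code: the names `extract_per_bit` and `extract_window` are hypothetical, and it assumes codes are packed back to back with the least-significant bit first, a little-endian host, and `ex_bits` of at most 8.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Baseline (hypothetical): assemble the code for dimension i one bit at a time.
// Assumes codes are packed back to back, least-significant bit first.
inline uint32_t extract_per_bit(const uint8_t* codes, size_t i, int ex_bits) {
    uint32_t v = 0;
    size_t bit = i * ex_bits;
    for (int b = 0; b < ex_bits; ++b, ++bit) {
        uint32_t one = (codes[bit >> 3] >> (bit & 7)) & 1u;
        v |= one << b;
    }
    return v;
}

// Windowed read (hypothetical): copy up to 8 bytes that cover the code, then
// shift and mask. n_bytes is the size of the packed buffer; the copy is
// clamped so it never reads past the end. Assumes a little-endian host.
inline uint32_t extract_window(
        const uint8_t* codes, size_t n_bytes, size_t i, int ex_bits) {
    size_t bit = i * ex_bits;
    size_t byte = bit >> 3;
    uint64_t window = 0;
    std::memcpy(&window, codes + byte, std::min<size_t>(8, n_bytes - byte));
    uint64_t mask = (1ull << ex_bits) - 1;
    return static_cast<uint32_t>((window >> (bit & 7)) & mask);
}
```

The windowed version does a fixed amount of work per code value, whereas the baseline loops `ex_bits` times, which is consistent with the larger gains reported at higher bit widths.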

**2. SIMD bit-plane decomposition (AVX2 + BMI2):**
Instead of extracting per-element multi-bit codes, decomposes the inner product into `(1 + ex_bits)` bit-plane dot products. Each plane is a float × bit-vector dot product computed via bit→mask→float conversion. For `ex_bits == 1`, both sign and ex are 1-bit packed, enabling zero-extraction kernels (AVX-512 and AVX2). For `ex_bits` 2–7, BMI2 PEXT extracts each bit plane in one instruction per 8 dimensions.
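The identity behind the bit-plane form can be checked with a small scalar sketch. It assumes each code is an unsigned integer of `total_bits` bits stored one per byte (unpacked, for readability) and ignores any offsets or scaling factors that the real RaBitQ distance applies; `dot_per_element` and `dot_bit_planes` are hypothetical names, not Faiss functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference: multiply each query value by its full multi-bit code.
inline float dot_per_element(
        const std::vector<float>& q, const std::vector<uint8_t>& codes) {
    float acc = 0.0f;
    for (size_t i = 0; i < q.size(); ++i) {
        acc += q[i] * static_cast<float>(codes[i]);
    }
    return acc;
}

// Bit-plane form: for each bit position b, sum the query values whose code
// has bit b set, then weight that partial sum by 2^b.
inline float dot_bit_planes(
        const std::vector<float>& q,
        const std::vector<uint8_t>& codes,
        int total_bits) {
    float acc = 0.0f;
    for (int b = 0; b < total_bits; ++b) {
        float plane = 0.0f;
        for (size_t i = 0; i < q.size(); ++i) {
            // A SIMD kernel would expand a group of these bits into a lane
            // mask at once; here each bit is used as a 0/1 multiplier.
            plane += q[i] * static_cast<float>((codes[i] >> b) & 1u);
        }
        acc += plane * static_cast<float>(1u << b);
    }
    return acc;
}
```

Both functions compute the same sum in exact arithmetic, because a code value equals the sum of its bits weighted by powers of two; in floating point the results can differ slightly due to summation order.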

Also adds `-mbmi2` to the AVX2 compiler flags in `xplat.bzl`.

Recall@10 is identical before and after this change for every `nb_bits` value.

Differential Revision: D94587233
@meta-cla meta-cla Bot added the CLA Signed label Mar 2, 2026
@meta-codesync
Contributor

meta-codesync Bot commented Mar 2, 2026

@alibeklfc has exported this pull request. If you are a Meta employee, you can view the originating Diff in D94587233.

@alexanderguzhva
Contributor

@alibeklfc BMI2 is VERY slow on AMD Zen3 and below

@meta-codesync
Contributor

meta-codesync Bot commented Mar 3, 2026

This pull request has been merged in 8af77fe.

@AlSchlo
Contributor

AlSchlo commented Mar 15, 2026

Hi @alibeklfc and @alexanderguzhva,

Does Faiss support BMI2 in the open-source build? I noticed BMI2 checks in the code guards, but I do not see it enabled in the public compile targets. I imagine xplat.bzl is a file internal to Meta?

The reason I ask is that I am experimenting with a feature that relies on PEXT (specifically for PQ with Panorama). For now I have been adding the -mbmi2 flag manually to enable it.

Would it be acceptable to add this flag to the OSS build configuration as well, or is there a reason it is intentionally omitted? What would be the best workaround?

Thanks.
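
For reference, one common way to keep PEXT-based code buildable without `-mbmi2` is a compile-time guard with a portable fallback. The sketch below is not how Faiss dispatches; `pext_soft` and `pext64` are hypothetical names, and it assumes GCC or Clang, which define `__BMI2__` when the flag (or a `-march` that implies it) is set.

```cpp
#include <cstdint>
#if defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>
#endif

// Portable software equivalent of PEXT: gather the bits of src selected by
// mask into the low bits of the result, preserving their order.
inline uint64_t pext_soft(uint64_t src, uint64_t mask) {
    uint64_t out = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1); // isolate lowest set bit of mask
        if (src & lowest) {
            out |= bit;
        }
        mask &= mask - 1; // clear that bit
    }
    return out;
}

// Uses the hardware instruction when the compiler was told BMI2 is available,
// otherwise falls back to the software loop.
inline uint64_t pext64(uint64_t src, uint64_t mask) {
#if defined(__BMI2__) && defined(__x86_64__)
    return _pext_u64(src, mask);
#else
    return pext_soft(src, mask);
#endif
}
```

With a mask such as `0x0101010101010101`, one call gathers the same bit position from eight consecutive byte-aligned codes. A compile-time guard does not help on CPUs where PEXT exists but is slow; choosing the software path there would need a runtime check instead.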
