Optimize multibit sign-bit unpacking in RaBitQ FastScan handlers #5097

Open
alibeklfc wants to merge 2 commits into facebookresearch:main from alibeklfc:export-D100718832

@alibeklfc

Summary:
Replace `CodePackerRaBitQ::unpack_1()` with `rabitq_utils::unpack_sign_bits_from_packed()` in both `RaBitQHeapHandler` and `IVFRaBitQHeapHandler` multibit refinement paths.

The old path called `pq4_get_packed_element` twice per output byte, each call recomputing the vector's in-block position from scratch (division, modulo, branches). The new function precomputes the PQ4 address once and iterates with simple strided byte loads. It also skips the unnecessary auxiliary data copy that `unpack_1` performed.
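To make the address-hoisting idea concrete, here is a minimal sketch on a toy block-interleaved layout. The layout, the block size `kBlock`, and every function name below are assumptions made up for illustration. The real PQ4 FastScan layout interleaves 4-bit nibbles and is more involved, so this is not the code in the diff, only the same pattern in miniature: compute the per-vector base offset once, then walk it with a fixed stride.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy layout (hypothetical, NOT the real PQ4 layout): vectors are grouped in
// blocks of kBlock, and byte j of all vectors in a block is stored contiguously.
constexpr size_t kBlock = 32;

// Offset of byte j of vector `vec`, recomputed from scratch on every call.
inline size_t packed_offset(size_t vec, size_t j, size_t bytes_per_vec) {
    size_t block = vec / kBlock; // division on every call
    size_t lane = vec % kBlock;  // modulo on every call
    return block * kBlock * bytes_per_vec + j * kBlock + lane;
}

// "Old" style: one full address computation per output byte.
inline void unpack_per_element(
        const uint8_t* packed, size_t vec, size_t bytes_per_vec, uint8_t* out) {
    assert(packed != nullptr && out != nullptr);
    for (size_t j = 0; j < bytes_per_vec; ++j) {
        out[j] = packed[packed_offset(vec, j, bytes_per_vec)];
    }
}

// "New" style: resolve the vector's base address once, then strided loads.
inline void unpack_strided(
        const uint8_t* packed, size_t vec, size_t bytes_per_vec, uint8_t* out) {
    assert(packed != nullptr && out != nullptr);
    const uint8_t* src = packed + packed_offset(vec, 0, bytes_per_vec);
    for (size_t j = 0; j < bytes_per_vec; ++j) {
        out[j] = src[j * kBlock]; // consecutive bytes of one vector are kBlock apart
    }
}

// Checks that both paths return identical bytes for every vector.
inline bool paths_agree(size_t n_vec, size_t bytes_per_vec) {
    size_t n_blocks = (n_vec + kBlock - 1) / kBlock;
    std::vector<uint8_t> packed(n_blocks * kBlock * bytes_per_vec);
    for (size_t i = 0; i < packed.size(); ++i) {
        packed[i] = static_cast<uint8_t>(i * 131 + 7); // arbitrary fill pattern
    }
    std::vector<uint8_t> a(bytes_per_vec), b(bytes_per_vec);
    for (size_t v = 0; v < n_vec; ++v) {
        unpack_per_element(packed.data(), v, bytes_per_vec, a.data());
        unpack_strided(packed.data(), v, bytes_per_vec, b.data());
        if (a != b) {
            return false;
        }
    }
    return true;
}
```

Both functions read the same bytes. The second only moves the division and modulo out of the inner loop, which is the kind of saving the unpack-only benchmark measures.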

Micro-benchmark results (unpack-only, median ns/call):

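| d    | Old (ns) | New (ns) | Speedup |
|------|----------|----------|---------|
| 64   | 627      | 166      | 3.8x    |
| 128  | 1204     | 279      | 4.3x    |
| 256  | 2329     | 525      | 4.4x    |
| 512  | 4583     | 996      | 4.6x    |
| 768  | 6731     | 1376     | 4.9x    |
| 1024 | 9344     | 1819     | 5.1x    |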

End-to-end (unpack + SIMD distance) speedup is 1.4-1.6x.

Additional cleanup: removed CodePacker heap allocation and virtual dispatch from both handlers.

Differential Revision: D100718832

…cebookresearch#5095)

Summary:

D100399519 added IVFRaBitQSearchParameters support to the FastScan scanner
but only patched the distance_to_code fallback path. The main search path
(LUT construction and SIMD distance correction in handle()) still read
qb/centered from the index, ignoring the search params override.

This diff completes the fix by:
1. Adding qb/centered fields to FastScanDistancePostProcessing context
2. Threading them through compute_LUT → compute_residual_LUT
3. Reading them from context in the handler's handle() method
4. Extracting them from IVFRaBitQSearchParameters in search_preassigned
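
The four steps above amount to resolving the effective `qb`/`centered` values once and passing them in a per-query context that every later stage reads, instead of each stage consulting the index. A minimal sketch of that pattern follows. All struct and function names here are hypothetical stand-ins, not the actual Faiss types, and the "levels" computation is only a placeholder for whatever the LUT and handler stages really derive from `qb`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the index defaults and per-search overrides.
struct IndexDefaults {
    uint8_t qb = 4;
    bool centered = false;
};

struct SearchOverrides {
    uint8_t qb;
    bool centered;
};

// Per-query context that carries the resolved values to every stage.
struct QueryContext {
    uint8_t qb = 0;
    bool centered = false;
};

// Resolve once: overrides win when present, otherwise fall back to the index.
inline QueryContext make_context(
        const IndexDefaults& index, const SearchOverrides* params) {
    QueryContext ctx;
    ctx.qb = params ? params->qb : index.qb;
    ctx.centered = params ? params->centered : index.centered;
    return ctx;
}

// Placeholder for the LUT-construction stage: reads qb from the context only.
inline int lut_stage_levels(const QueryContext& ctx) {
    assert(ctx.qb >= 1 && ctx.qb <= 8);
    return 1 << ctx.qb;
}

// Placeholder for the handler stage: reads the same context, so it cannot
// disagree with the LUT stage about qb.
inline int handler_stage_levels(const QueryContext& ctx) {
    assert(ctx.qb >= 1 && ctx.qb <= 8);
    return 1 << ctx.qb;
}

// Small driver so the behaviour with and without overrides is easy to compare.
inline QueryContext demo_context(bool use_override) {
    IndexDefaults index;
    SearchOverrides overrides{8, true};
    return make_context(index, use_override ? &overrides : nullptr);
}
```

Because both stages take the context as their only source for these fields, a search-time override cannot be applied in one path and silently ignored in another, which is the inconsistency this commit describes.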

Differential Revision: D100674751
@meta-cla (bot) added the CLA Signed label on Apr 13, 2026
meta-codesync bot commented Apr 13, 2026

@alibeklfc has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100718832.
