Convert hamming.cpp public API to dynamic dispatch (#5070) #5070
Open
algoriddle wants to merge 1 commit into facebookresearch:main
@algoriddle has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99994593.
…5070)

Summary:
Convert all 8 public Hamming distance functions in `hamming.cpp` to dynamic dispatch so they use the correct SIMD implementation at runtime instead of always running at scalar speed in DD mode.

This is Phase 1 of the Hamming DD conversion. It converts the public API functions in `hamming.cpp`; Phase 2 (future diff) will migrate external callers that use `dispatch_HammingComputer` directly.

**Why Hamming is harder than previous DD conversions:**
The SIMD code lives inside 11 HammingComputer/GenHammingComputer structs with ISA-specific memory layouts (e.g., NEON `HammingComputer16` stores `uint8x16_t a0` vs x86 `uint64_t a0, a1`). The structs use raw intrinsics (not simdlib) and are template parameters in hot inner loops where `.hamming()` must be inlined.

**Approach:**
- Extract all template code from `hamming.cpp` into a shared header `hamming_impl.h`, compiled once per ISA TU (AVX2, NEON, NONE).
- The `hamdis-inl.h` ISA ladder (`#ifdef __AVX2__` etc.) selects the correct struct definitions based on per-TU compiler flags.
- All internal templates live in an anonymous namespace for ODR safety (different TUs see different struct layouts but never share instances).
- 8 entry point template specializations (`_dispatch<SL>`) provide the external linkage symbols that the dispatch wrappers call.
- `hamming.cpp` includes `hamming_impl.h` with `THE_SIMD_LEVEL=NONE` for the scalar fallback, then dispatches via `with_simd_level_256bit`.
- `hamming.h` keeps including `hamdis-inl.h` for backward compatibility with external callers (they continue to work with generic structs).
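The layout described in the approach can be illustrated with a minimal, single-file sketch. Everything named here (`SimdLevel`, `hamming_bytes_dispatch`, `hamming_bytes_dd`, `hamming_bytes_scalar`) is hypothetical and chosen for illustration; it is not the faiss API, and the "AVX2" specialization simply forwards to scalar code so the sketch stays portable. In the layout the summary describes, each specialization would instead live in its own translation unit compiled with the matching ISA flags.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a SIMD-level enum (not the faiss definition).
enum class SimdLevel { NONE, AVX2, NEON };

// Entry point template: declared once, specialized once per ISA.
// The specializations are the only symbols with external linkage.
template <SimdLevel SL>
size_t hamming_bytes_dispatch(const uint8_t* a, const uint8_t* b, size_t n);

namespace {
// Internal helper in an anonymous namespace: every translation unit that
// includes this code gets its own private copy, so per-ISA variants with
// different layouts cannot clash at link time.
size_t hamming_bytes_scalar(const uint8_t* a, const uint8_t* b, size_t n) {
    size_t dist = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t x = a[i] ^ b[i];
        while (x) { // clear the lowest set bit until none remain
            x &= static_cast<uint8_t>(x - 1);
            dist++;
        }
    }
    return dist;
}
} // namespace

// Scalar fallback specialization.
template <>
size_t hamming_bytes_dispatch<SimdLevel::NONE>(
        const uint8_t* a, const uint8_t* b, size_t n) {
    return hamming_bytes_scalar(a, b, n);
}

// Placeholder for an ISA-specific specialization. Assumption: a real build
// would define this in a separate TU using intrinsics.
template <>
size_t hamming_bytes_dispatch<SimdLevel::AVX2>(
        const uint8_t* a, const uint8_t* b, size_t n) {
    return hamming_bytes_scalar(a, b, n);
}

// Public wrapper: picks a specialization at runtime.
size_t hamming_bytes_dd(
        SimdLevel level, const uint8_t* a, const uint8_t* b, size_t n) {
    switch (level) {
        case SimdLevel::AVX2:
            return hamming_bytes_dispatch<SimdLevel::AVX2>(a, b, n);
        default:
            return hamming_bytes_dispatch<SimdLevel::NONE>(a, b, n);
    }
}
```

The point of the sketch is the linkage split: callers only ever see the wrapper and the specialized entry points, while the code that depends on ISA-specific struct layouts stays private to each translation unit.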
**Functions dispatched:**
- `hammings_knn_hc` (binary flat/IVF kNN search)
- `hammings_knn_mc` (max-count kNN)
- `hamming_range_search`
- `hammings` (all-pairs distance)
- `generalized_hammings_knn_hc` (byte-level Hamming)
- `hamming_count_thres`, `crosshamming_count_thres`, `match_hamming_thres`

**Performance impact:**
- NEON: HammingComputer16/20/32/64 gain real vectorization (`vcntq_u8`)
- AVX2: GenHammingComputer16/32/M8 gain SSE2/AVX2 byte comparison
- AVX2: All HammingComputers get hardware `popcnt` (implied by `-mpopcnt`)

Differential Revision: D99994593
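To make the distinction between bit-level and byte-level ("generalized") Hamming distance concrete, here is a small portable sketch for 16-byte codes. The function names `hamming16` and `gen_hamming16` are hypothetical, and the loops are plain scalar code; the SIMD paths mentioned above would replace the popcount loop with a hardware instruction and the byte comparison with a vector compare.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Portable population count for a 64-bit word.
inline int popcount64(uint64_t x) {
    int c = 0;
    while (x) {
        x &= x - 1; // clear the lowest set bit
        ++c;
    }
    return c;
}

// Bit-level Hamming distance between two 16-byte codes:
// the number of bit positions that differ.
inline int hamming16(const uint8_t* a, const uint8_t* b) {
    uint64_t a0, a1, b0, b1;
    // memcpy avoids unaligned-access and aliasing issues.
    std::memcpy(&a0, a, 8);
    std::memcpy(&a1, a + 8, 8);
    std::memcpy(&b0, b, 8);
    std::memcpy(&b1, b + 8, 8);
    return popcount64(a0 ^ b0) + popcount64(a1 ^ b1);
}

// Byte-level (generalized) Hamming distance between two 16-byte codes:
// the number of byte positions that differ.
inline int gen_hamming16(const uint8_t* a, const uint8_t* b) {
    int d = 0;
    for (size_t i = 0; i < 16; i++) {
        d += (a[i] != b[i]);
    }
    return d;
}
```

For two codes that differ only in byte 0 (all 8 bits) and byte 15 (1 bit), `hamming16` counts 9 differing bits while `gen_hamming16` counts 2 differing bytes.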
549c200 to 9e0b312