Speed up polysemous training with AVX-512. #4578
Open
mulugetam wants to merge 1 commit into
Contributor
|
@mnorris11 @subhadeepkaran Do you have enough context to review this? |
Yep, you can assign it to me. The change can be reviewed and merged after dynamic dispatch lands. |
Force-pushed from ab2eac6 to acf24df.
Contributor
Author
|
Refactored to use SIMD DD. Could you please review? @subhadeepkaran @mnorris11 |
Force-pushed from b11e6b6 to d47ba5a.
Force-pushed from 9e4f4e8 to dd8066e.
Force-pushed from 813f50e to 45a7921.
Add AVX-512 implementations of the compute_cost and cost_update hot loops for both ReproduceWithHammingObjective and ReproduceDistancesObjective.

The vectorized paths use 512-bit packed double FMA, masked blends for branchless swap handling, and a portable popcnt_512 helper that uses _mm512_popcnt_epi64 when AVX512VPOPCNTDQ is available or falls back to a nibble-lookup approach.

Dispatch is guarded by COMPILE_SIMD_AVX512 and the SIMD dynamic dispatch level, falling back to the existing scalar code with zero overhead on non-AVX-512 systems.

Benchmarks of the training phase on SIFT1M (bench_polysemous_sift1m.py) show ~1.09x speedup over the scalar path on Sapphire Rapids.

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
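For readers unfamiliar with the nibble-lookup fallback mentioned in the commit message, here is a minimal scalar sketch of the idea. It is not the PR's popcnt_512 helper: the names kNibblePopcount, popcount64_nibble, Lanes512 and popcount512_nibble are hypothetical, and a plain array of eight 64-bit lanes stands in for a 512-bit register so the example runs on any machine. A vectorized version would presumably do the 16-entry table lookup with a byte shuffle instead of a scalar array index.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// 16-entry table: kNibblePopcount[v] is the number of set bits in the
// 4-bit value v.
constexpr std::array<std::uint8_t, 16> kNibblePopcount = {
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

// Hypothetical helper (illustration only): popcount of one 64-bit lane,
// computed by looking up each of its 16 nibbles and summing the results.
inline int popcount64_nibble(std::uint64_t x) {
    int count = 0;
    for (int shift = 0; shift < 64; shift += 4) {
        count += kNibblePopcount[(x >> shift) & 0xF];
    }
    return count;
}

// Hypothetical stand-in for a 512-bit register: eight 64-bit lanes.
using Lanes512 = std::array<std::uint64_t, 8>;

// Per-lane popcount, the same shape of result that _mm512_popcnt_epi64
// produces (one count per 64-bit element).
inline Lanes512 popcount512_nibble(const Lanes512& v) {
    Lanes512 out{};
    for (std::size_t i = 0; i < v.size(); i++) {
        out[i] = static_cast<std::uint64_t>(popcount64_nibble(v[i]));
    }
    return out;
}
```

The table-based form only needs shifts, masks and small lookups, which is why it is a common fallback where a dedicated population-count instruction is not available.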
Contributor
Author
|
@mnorris11 Rebased with minor changes. |
Contributor
|
@mnorris11 has imported this pull request. If you are a Meta employee, you can view this in D106200518. |
This PR adds AVX-512 implementations of the four hot functions in polysemous training (compute_cost and cost_update for both ReproduceWithHammingObjective and ReproduceDistancesObjective), integrated via FAISS's SIMD dynamic dispatch framework. It speeds up the training phase by up to 1.09x.

Benchmarks

Measured on the training phase of benchs/bench_polysemous_sift1m.py on Sapphire Rapids (SPR). Search accuracy and latency are unchanged; the optimization only affects the training path.
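As background on what the two functions compute, the following is a simplified scalar sketch and not the FAISS code. It assumes the Hamming objective is a weighted squared error between a target distance and the Hamming distance of the permuted codes; the struct name HammingObjectiveSketch and the helper swapped_at are hypothetical. cost_update here loops over all pairs for clarity, which only pins down the meaning (the change in cost if two permutation entries were swapped) and says nothing about how the real loops are organized.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified objective: n codes, with row-major n x n
// target distances and weights.
struct HammingObjectiveSketch {
    int n;
    std::vector<double> target_dis;
    std::vector<double> weights;

    // Hamming distance between two non-negative integer codes.
    static int hamming(int a, int b) {
        unsigned x = static_cast<unsigned>(a ^ b);
        int c = 0;
        while (x != 0) {
            x &= x - 1; // clear the lowest set bit
            c++;
        }
        return c;
    }

    // Full cost of a permutation: sum of w * (wanted - actual)^2.
    double compute_cost(const std::vector<int>& perm) const {
        assert(perm.size() == static_cast<std::size_t>(n));
        double cost = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double diff = target_dis[i * n + j] - hamming(perm[i], perm[j]);
                cost += weights[i * n + j] * diff * diff;
            }
        }
        return cost;
    }

    // Value at position i as if perm[iw] and perm[jw] had been swapped,
    // without modifying perm. This per-element select is the kind of choice
    // a masked blend can express without branching in a vectorized loop.
    static int swapped_at(const std::vector<int>& perm, int i, int iw, int jw) {
        return i == iw ? perm[jw] : (i == jw ? perm[iw] : perm[i]);
    }

    // Change in cost caused by swapping perm[iw] and perm[jw].
    double cost_update(const std::vector<int>& perm, int iw, int jw) const {
        assert(perm.size() == static_cast<std::size_t>(n));
        double delta = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double wanted = target_dis[i * n + j];
                double d0 = wanted - hamming(perm[i], perm[j]);
                double d1 = wanted - hamming(swapped_at(perm, i, iw, jw),
                                             swapped_at(perm, j, iw, jw));
                delta += weights[i * n + j] * (d1 * d1 - d0 * d0);
            }
        }
        return delta;
    }
};
```

A useful sanity check for any optimized variant of such a pair of functions is that cost_update(perm, iw, jw) matches compute_cost after the swap minus compute_cost before it, up to floating-point rounding.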
cc: @mdouze @subhadeepkaran