Speed up polysemous training with AVX-512. #4578

Open: mulugetam wants to merge 1 commit into facebookresearch:main from mulugetam:polysemous-avx512

@mulugetam
Contributor

@mulugetam mulugetam commented Sep 10, 2025

This PR adds AVX-512 implementations of the four hot functions in polysemous training (compute_cost and cost_update for both ReproduceWithHammingObjective and ReproduceDistancesObjective), integrated via FAISS's SIMD dynamic dispatch framework. It speeds up the training phase by up to 1.09x.
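For readers unfamiliar with the objective, here is a minimal scalar sketch of the kind of loop being vectorized. It assumes the Hamming objective's cost is a weighted squared difference between a target distance table and the Hamming distances of the permuted codes; the names `hamming_dis` and `compute_cost_scalar`, and the row-major `n * n` layout, are illustrative assumptions and not the code in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hamming distance between two small codes: number of differing bits.
inline int hamming_dis(int a, int b) {
    unsigned x = static_cast<unsigned>(a ^ b);
    int count = 0;
    while (x != 0) {
        x &= x - 1; // clear the lowest set bit
        ++count;
    }
    return count;
}

// Hypothetical scalar reference (an assumption, not the PR's code):
// cost = sum over (i, j) of w[i][j] * (target[i][j] - hamming(perm[i], perm[j]))^2
// target_dis and weights are assumed to be row-major n * n tables.
inline double compute_cost_scalar(
        const std::vector<int>& perm,
        const std::vector<double>& target_dis,
        const std::vector<double>& weights) {
    const std::size_t n = perm.size();
    assert(target_dis.size() == n * n && weights.size() == n * n);
    double cost = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            const double wanted = target_dis[i * n + j];
            const double actual = hamming_dis(perm[i], perm[j]);
            const double diff = wanted - actual;
            cost += weights[i * n + j] * diff * diff;
        }
    }
    return cost;
}
```

The inner loop is a multiply-accumulate over doubles, which is the shape that maps naturally onto packed-double FMA with eight lanes per 512-bit register.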

Benchmarks

Training phase of benchs/bench_polysemous_sift1m.py on Sapphire Rapids (SPR):

$ numactl -m 0 -C 0-7 python benchs/bench_polysemous_sift1m.py
| Build | Median training time | Speedup |
|---|---|---|
| Scalar (baseline) | ~4.29 s | 1.00x |
| AVX-512 (this PR) | ~3.95 s | 1.09x |

Search accuracy and latency are unchanged — the optimization only affects the training path.

cc: @mdouze @subhadeepkaran

@meta-cla meta-cla Bot added the CLA Signed label Sep 10, 2025
@bshethmeta
Contributor

@mnorris11 @subhadeepkaran Do you have enough context to review this?

@subhadeepkaran

> @mnorris11 @subhadeepkaran Do you have enough context to review this?

Yep, you can assign it to me. The change can be reviewed and merged after dynamic dispatch lands.

@mulugetam
Contributor Author

Refactored to use SIMD dynamic dispatch (DD). Could you please review? @subhadeepkaran @mnorris11

@mulugetam mulugetam force-pushed the polysemous-avx512 branch 2 times, most recently from b11e6b6 to d47ba5a Compare February 19, 2026 00:33
@mulugetam mulugetam force-pushed the polysemous-avx512 branch from 9e4f4e8 to dd8066e Compare May 4, 2026 00:13
@mulugetam mulugetam force-pushed the polysemous-avx512 branch 2 times, most recently from 813f50e to 45a7921 Compare May 23, 2026 18:55
Add AVX-512 implementations of the compute_cost and cost_update hot
loops for both ReproduceWithHammingObjective and
ReproduceDistancesObjective. The vectorized paths use 512-bit packed
double FMA, masked blends for branchless swap handling, and a portable
popcnt_512 helper that uses _mm512_popcnt_epi64 when AVX512VPOPCNTDQ
is available or falls back to a nibble-lookup approach.
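
The nibble-lookup fallback mentioned above can be illustrated with a portable scalar analogue. This is only a sketch of the idea on single 64-bit lanes, not the `popcnt_512` helper itself: a vectorized version would apply the same 16-entry table to every byte of a 512-bit register at once. The names `popcount64_nibble` and `hamming_codes` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of set bits for every 4-bit value 0..15.
constexpr std::array<std::uint8_t, 16> kNibblePopcount = {
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

// Scalar analogue of a nibble-lookup popcount on one 64-bit lane:
// split the word into 16 nibbles and sum the table entries.
inline int popcount64_nibble(std::uint64_t x) {
    int total = 0;
    for (int i = 0; i < 16; ++i) {
        total += kNibblePopcount[x & 0xF];
        x >>= 4;
    }
    return total;
}

// Total Hamming distance over several 64-bit lanes (hypothetical helper).
inline int hamming_codes(
        const std::vector<std::uint64_t>& a,
        const std::vector<std::uint64_t>& b) {
    assert(a.size() == b.size());
    int total = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        total += popcount64_nibble(a[i] ^ b[i]);
    }
    return total;
}
```

When a hardware popcount instruction is available, the table lookup is unnecessary; the lookup path exists only so the same code runs on CPUs without it.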

Dispatch is guarded by COMPILE_SIMD_AVX512 and the SIMD dynamic
dispatch level, falling back to the existing scalar code with zero
overhead on non-AVX-512 systems.
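
As a rough illustration of the two-level guard described above (a compile-time switch plus a runtime level), here is a small sketch. Everything in it is hypothetical: the `Level` enum, the `Kernel` template, the `SKETCH_COMPILE_AVX512` macro, and `pick_kernel` are stand-ins chosen for this example and are not the FAISS dynamic dispatch API.

```cpp
#include <cassert>
#include <string>

// Hypothetical SIMD levels for this sketch; not the FAISS enum.
enum class Level { Scalar, AVX512 };

// One kernel per level, selected by template specialization.
template <Level L>
struct Kernel;

template <>
struct Kernel<Level::Scalar> {
    static const char* name() { return "scalar"; }
};

// The AVX-512 kernel exists only when the build enables it.
// SKETCH_COMPILE_AVX512 is a made-up macro standing in for a build flag.
#ifdef SKETCH_COMPILE_AVX512
template <>
struct Kernel<Level::AVX512> {
    static const char* name() { return "avx512"; }
};
#endif

// Pick a kernel from the runtime level; fall back to scalar when the
// AVX-512 kernel was not compiled in or the CPU level does not allow it.
inline std::string pick_kernel(Level runtime_level) {
#ifdef SKETCH_COMPILE_AVX512
    if (runtime_level == Level::AVX512) {
        return Kernel<Level::AVX512>::name();
    }
#endif
    (void)runtime_level; // unused when the AVX-512 kernel is compiled out
    return Kernel<Level::Scalar>::name();
}
```

In a build without the macro, the AVX-512 branch is removed by the preprocessor, so the scalar path is the only code that exists and no runtime check is paid for it.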

Benchmarks of the training phase on SIFT1M (bench_polysemous_sift1m.py)
show ~1.09x speedup over the scalar path on Sapphire Rapids.

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
@mulugetam
Contributor Author

@mnorris11 Rebased with minor changes.

@meta-codesync
Contributor

meta-codesync Bot commented May 23, 2026

@mnorris11 has imported this pull request. If you are a Meta employee, you can view this in D106200518.
