Speed up polysemous training with AVX-512. by mulugetam · Pull Request #4578 · facebookresearch/faiss

mulugetam · 2025-09-10T18:23:18Z

This PR adds AVX-512 implementations of the four hot functions in polysemous training (compute_cost and cost_update for both ReproduceWithHammingObjective and ReproduceDistancesObjective), integrated via FAISS's SIMD dynamic dispatch framework. It speeds up the training phase by up to 1.09x.
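For readers unfamiliar with the objective being vectorized, here is a minimal scalar sketch of what a Hamming-objective `compute_cost` loop can look like. It assumes the cost is a weighted squared difference between a target distance and the Hamming distance of the permuted codes. The function name `compute_cost_scalar`, its signature, and the row-major layout are hypothetical and are not taken from this PR's diff.

```cpp
#include <cstddef>
#include <vector>

// Hamming distance between two small integer codes (portable bit count).
inline int hamming_dis(unsigned a, unsigned b) {
    unsigned x = a ^ b;
    int count = 0;
    while (x) {
        x &= x - 1; // clear the lowest set bit
        ++count;
    }
    return count;
}

// Hypothetical stand-in for the scalar cost (an assumption, not the PR's code):
// weighted squared error between a target distance and the Hamming distance
// of the permuted codes, summed over all (i, j) pairs.
double compute_cost_scalar(
        const std::vector<int>& perm,
        const std::vector<double>& target_dis, // n * n, row-major
        const std::vector<double>& weights) {  // n * n, row-major
    const std::size_t n = perm.size();
    double cost = 0.0;
    for (std::size_t i = 0; i < n; i++) {
        for (std::size_t j = 0; j < n; j++) {
            const double wanted = target_dis[i * n + j];
            const double actual = hamming_dis(
                    static_cast<unsigned>(perm[i]),
                    static_cast<unsigned>(perm[j]));
            const double diff = wanted - actual;
            cost += weights[i * n + j] * diff * diff;
        }
    }
    return cost;
}
```

The inner loop is a multiply-accumulate over doubles with a popcount per element, which is the kind of pattern that maps onto packed FMA and a vector popcount.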

Benchmarks

Training phase of benchs/bench_polysemous_sift1m.py on Sapphire Rapids (SPR):

$ numactl -m 0 -C 0-7 python benchs/bench_polysemous_sift1m.py

Build	Median training time	Speedup
Scalar (baseline)	~4.29 s	1.00x
AVX-512 (this PR)	~3.95 s	1.09x

Search accuracy and latency are unchanged — the optimization only affects the training path.

cc: @mdouze @subhadeepkaran

bshethmeta · 2025-09-11T18:48:36Z

@mnorris11 @subhadeepkaran Do you have enough context to review this?

subhadeepkaran · 2025-09-15T07:28:23Z

> @mnorris11 @subhadeepkaran Do you have enough context to review this?

Yep, you can assign it to me. The change can be reviewed and merged after dynamic dispatch lands.

mulugetam · 2026-02-18T20:43:17Z

Refactored to use SIMD DD. Could you please review? @subhadeepkaran @mnorris11

Add AVX-512 implementations of the compute_cost and cost_update hot loops for both ReproduceWithHammingObjective and ReproduceDistancesObjective.

The vectorized paths use 512-bit packed double FMA, masked blends for branchless swap handling, and a portable popcnt_512 helper that uses _mm512_popcnt_epi64 when AVX512VPOPCNTDQ is available or falls back to a nibble-lookup approach.

Dispatch is guarded by COMPILE_SIMD_AVX512 and the SIMD dynamic dispatch level, falling back to the existing scalar code with zero overhead on non-AVX-512 systems.

Benchmarks of the training phase on SIFT1M (bench_polysemous_sift1m.py) show ~1.09x speedup over the scalar path on Sapphire Rapids.

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
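The nibble-lookup fallback mentioned in the commit message can be illustrated with a scalar, portable sketch: each 4-bit nibble indexes a 16-entry table of bit counts, and the per-nibble counts are summed. The names `kNibblePopcount` and `popcount64_nibble` are hypothetical. How the PR's popcnt_512 helper applies this idea across a 512-bit register is not shown on this page, so the code below illustrates only the lookup technique, not the PR's implementation.

```cpp
#include <cstdint>

// Bit counts of the 16 possible 4-bit values.
static const std::uint8_t kNibblePopcount[16] = {
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

// Count set bits in a 64-bit word by looking up each of its 16 nibbles.
// Hypothetical helper for illustration only.
inline int popcount64_nibble(std::uint64_t x) {
    int count = 0;
    for (int i = 0; i < 16; i++) {
        count += kNibblePopcount[x & 0xF]; // low nibble selects the table entry
        x >>= 4;                           // move on to the next nibble
    }
    return count;
}
```

A SIMD variant of this technique typically keeps the 16-entry table in a register and performs the lookups with a byte shuffle, so many nibbles are resolved per instruction.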

mulugetam · 2026-05-23T19:51:14Z

@mnorris11 Rebased with minor changes.

meta-codesync · 2026-05-23T21:32:02Z

@mnorris11 has imported this pull request. If you are a Meta employee, you can view this in D106200518.

meta-cla Bot added the CLA Signed label Sep 10, 2025

mnorris11 assigned subhadeepkaran Oct 6, 2025

mnorris11 added the Implementation label Oct 6, 2025

mnorris11 added the backlog label Dec 8, 2025

mulugetam force-pushed the polysemous-avx512 branch from ab2eac6 to acf24df Compare February 18, 2026 18:07

mulugetam force-pushed the polysemous-avx512 branch 2 times, most recently from b11e6b6 to d47ba5a Compare February 19, 2026 00:33

mulugetam force-pushed the polysemous-avx512 branch from 9e4f4e8 to dd8066e Compare May 4, 2026 00:13

mulugetam force-pushed the polysemous-avx512 branch 2 times, most recently from 813f50e to 45a7921 Compare May 23, 2026 18:55

mnorris11 added the to-benchmark label May 23, 2026

mulugetam wants to merge 1 commit into facebookresearch:main from mulugetam:polysemous-avx512
