Add Sapphire Rapids optimizations for ScalarQuantizer (L2, IP) by mulugetam · Pull Request #5173 · facebookresearch/faiss

mulugetam · 2026-05-01T23:50:18Z

This PR specializes the byte-vector distance path for AVX512_SPR on QT_8bit_direct and QT_8bit_direct_signed, achieving speedups of up to 2.3x on distance benchmarks and up to 1.21x on search (IndexIVFScalarQuantizer) compared to the existing SPR implementation.

Inner product: A 64-byte VNNI loop using _mm512_dpbusd_epi32 replaces the 16-byte cvtepu8_epi32 + mullo_epi32 path. For the unsigned×unsigned case, the operand is biased by −128; a closed-form correction using sum(a) and sum(b) (accumulated via _mm512_sad_epu8) restores the exact result. The signed variant applies the same correction terms.
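The bias trick for the unsigned×unsigned case can be checked with a scalar sketch. This is not the PR's kernel: `dpbusd_like` is a hypothetical stand-in for the unsigned-by-signed byte dot product that `_mm512_dpbusd_epi32` provides, and the correction term `128 * sum(code1)` follows the description given later in this thread.

```python
import random

def dpbusd_like(u8, s8):
    # Hypothetical scalar stand-in for an unsigned-by-signed byte dot product.
    assert all(0 <= a <= 255 for a in u8)
    assert all(-128 <= b <= 127 for b in s8)
    return sum(a * b for a, b in zip(u8, s8))

def ip_u8_u8(code1, code2):
    # Bias the second operand into the signed range, then undo the bias:
    # sum(a * b) == sum(a * (b - 128)) + 128 * sum(a)
    biased = [b - 128 for b in code2]
    return dpbusd_like(code1, biased) + 128 * sum(code1)

random.seed(0)
a = [random.randrange(256) for _ in range(200)]
b = [random.randrange(256) for _ in range(200)]
reference = sum(x * y for x, y in zip(a, b))
```

Because the correction is an algebraic identity over integers, `ip_u8_u8(a, b)` matches `reference` exactly rather than approximately.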

L2 distance: A 64-byte widened-multiply-add loop via _mm512_cvtepu8_epi16 + _mm512_madd_epi16 replaces the narrower path. The signed variant is bit-exact because the −128 bias cancels in the difference.
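A scalar sketch of why the signed variant needs no correction for L2, assuming codes are stored as value + 128 as described in this thread. The function names are hypothetical; the range assertion mirrors the requirement that each difference fits in a signed 16-bit lane before it is squared.

```python
import random

INT16_MIN, INT16_MAX = -32768, 32767

def l2_sq_widened(a, b):
    # Differences of two bytes lie in [-255, 255], so they fit in int16.
    diffs = [x - y for x, y in zip(a, b)]
    assert all(INT16_MIN <= t <= INT16_MAX for t in diffs)
    return sum(t * t for t in diffs)

def l2_sq_signed_decoded(a, b):
    # Decode stored bytes (value + 128) first, then take the distance.
    return sum(((x - 128) - (y - 128)) ** 2 for x, y in zip(a, b))

random.seed(1)
a = [random.randrange(256) for _ in range(200)]
b = [random.randrange(256) for _ in range(200)]
```

Both functions return the same integer for any pair of byte vectors, since the two −128 terms cancel inside each difference.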

Improves upon and supersedes #5067.

Distance benchmark

bench_scalar_quantizer_distance --d=<128|256|768> --n=2000 --iterations=20

| Quantizer | d | Baseline (ms) | This PR (ms) | Speedup |
|---|---|---|---|---|
| `QT_8bit_direct` | 128 | 40.8 | 22.9 | 1.78× |
| `QT_8bit_direct_signed` | 128 | 35.0 | 23.0 | 1.52× |
| `QT_8bit_direct` | 256 | 75.3 | 35.7 | 2.11× |
| `QT_8bit_direct_signed` | 256 | 69.1 | 35.7 | 1.94× |
| `QT_8bit_direct` | 768 | 215.5 | 92.3 | 2.33× |
| `QT_8bit_direct_signed` | 768 | 206.4 | 91.0 | 2.27× |

Raw performance data: https://gist.github.com/mulugetam/72c0960e47bc640f99aa346f363e56fe
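For readers who want to re-derive the speedup column, a small sketch under the assumption that speedup is baseline time divided by this PR's time, rounded to two decimals. The numbers are copied from the table above; the variable names are illustrative only.

```python
# (quantizer, d, baseline_ms, this_pr_ms, reported_speedup)
rows = [
    ("QT_8bit_direct", 128, 40.8, 22.9, 1.78),
    ("QT_8bit_direct_signed", 128, 35.0, 23.0, 1.52),
    ("QT_8bit_direct", 256, 75.3, 35.7, 2.11),
    ("QT_8bit_direct_signed", 256, 69.1, 35.7, 1.94),
    ("QT_8bit_direct", 768, 215.5, 92.3, 2.33),
    ("QT_8bit_direct_signed", 768, 206.4, 91.0, 2.27),
]

# Assumed definition: speedup = baseline / this PR.
computed = [round(base / new, 2) for _, _, base, new, _ in rows]
reported = [r[4] for r in rows]
```

Under that assumption the computed values agree with every reported entry.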

End-to-end search (`IndexIVFScalarQuantizer`)

python benchs/bench_scalar_quantizer.py

| Range stat | `QT_8bit_direct` | `QT_8bit_direct_signed` |
|---|---|---|
| RS_minmax | 1.10× | 1.12× |
| RS_minmax | 1.09× | 1.15× |
| RS_minmax | 1.14× | 1.13× |
| RS_minmax | 1.15× | 1.07× |
| RS_minmax | 1.06× | 1.17× |
| RS_minmax | 1.09× | 1.02× |
| RS_minmax | 1.09× | 1.11× |
| RS_meanstd | 1.21× | 1.13× |
| RS_meanstd | 1.14× | 1.06× |
| RS_meanstd | 1.11× | 1.07× |
| RS_meanstd | 1.12× | 1.07× |
| RS_meanstd | 1.15× | 1.10× |
| RS_meanstd | 1.13× | 1.04× |
| RS_meanstd | 1.04× | 1.07× |
| RS_quantiles | 1.11× | 1.03× |
| RS_quantiles | 1.12× | 1.11× |
| RS_quantiles | 1.11× | 0.98× |
| RS_quantiles | 1.14× | 1.17× |
| RS_optim | 1.06× | 1.07× |

Raw performance data: https://gist.github.com/mulugetam/632e2e08c9358b2184cbaa3397a6c73f

mdouze · 2026-05-04T07:26:45Z

Thanks for the PR.
Do I understand correctly that the results are not exactly the same since this is based on an integer-integer comparison instead of integer-float?
In fact this is a tradeoff that we may want to apply to other scalar quantizers as well, especially the QT_x_uniform ones.

mulugetam · 2026-05-04T16:33:32Z

@mdouze Thanks for the review.

The results are bit-exact for QT_8bit_direct: both the existing AVX512 code and this PR operate in the integer domain via DistanceComputerByte, which truncates the query to uint8 in set_query(). This PR simply widens the loop from 16 to 64 bytes per iteration using VNNI, with an algebraically exact bias correction for the unsigned IP case.

For QT_8bit_direct_signed, my reading of the code is that it has no DistanceComputerByte variant for the signed type and it just falls through to the float-domain DCTemplate<Quantizer8bitDirectSigned> path, which preserves full float precision in the query. The new DistanceComputerByteSigned truncates the query to integer (uint8_t(int(x[i]) + 128)), losing the fractional part. As stated above, this truncation loss already exists for QT_8bit_direct, but it's new for QT_8bit_direct_signed.
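The precision tradeoff can be illustrated with a toy example. The query encoding follows the expression quoted above, `uint8_t(int(x[i]) + 128)`; how the float-domain path decodes a stored byte (`code - 128` as a float) is an assumption made for this sketch, and all function names are hypothetical.

```python
def encode_query_signed(x):
    # int() truncates toward zero, like a C++ float-to-int cast.
    out = []
    for v in x:
        q = int(v) + 128
        assert 0 <= q <= 255  # sketch assumes the query is already in range
        out.append(q)
    return out

def ip_float_domain(x, codes):
    # Assumed float path: keep the query as float, decode codes as code - 128.
    return sum(v * float(c - 128) for v, c in zip(x, codes))

def ip_integer_domain(x, codes):
    # Integer path: truncate the query first, then work on bytes only.
    q = encode_query_signed(x)
    return sum((qi - 128) * (c - 128) for qi, c in zip(q, codes))

query = [1.75, -3.5, 100.25]
codes = [138, 108, 131]  # stored = value + 128, i.e. values 10, -20, 3

float_ip = ip_float_domain(query, codes)    # 388.25
int_ip = ip_integer_domain(query, codes)    # 370
```

The gap between the two results comes entirely from the fractional parts dropped by the query truncation, which is the loss that is new for QT_8bit_direct_signed.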

I can document this as a known tradeoff (speed vs. precision for QT_8bit_direct_signed). Alternatively, I could only enable it for the symmetric compute_code_distance path where both inputs are already integer (but this means losing the benefit of VNNI for query_to_code path).

Regarding QT_x_uniform: the VNNI byte-domain technique doesn't directly apply there since, from my reading of the code, those quantizers require scale/offset denormalization.

Quick question: are the DD changes complete, or still in progress? I'd like to re-review and rebase all my other pending PRs (I believe there are 9) once they land. Thanks!

mdouze · 2026-05-05T12:15:35Z

I think that QT_8bit_direct_signed can be performed in the integer domain; that's the initial purpose of 8bit_direct_x.
Yes, the DD has been considered complete since yesterday.
Thanks for all the SIMD optimization efforts! We will try to ingest them as soon as possible.

mulugetam · 2026-05-20T17:13:53Z

@mdouze Rebased with fixes.

Adds an AVX512_SPR specialization path for ScalarQuantizer that uses Sapphire Rapids-specific instructions for byte-code distance computation on QT_8bit_direct and QT_8bit_direct_signed.

Inner product (8-bit codes): Replaces the AVX512 path that processes 16 bytes per iteration via cvtepu8_epi32 + mullo_epi32 with a VNNI loop that processes 64 bytes per iteration using _mm512_dpbusd_epi32. VNNI computes unsigned*signed dot products, so the standard bias trick is used to bridge unsigned*unsigned: subtract 128 from code2, run dpbusd, then add the 128 * sum(code1) correction. A scalar tail handles d % 64.

For QT_8bit_direct_signed (storage = value + 128), the same VNNI loop runs and an additional closed-form correction is applied per element: (a-128) * (b-128) = a*b - 128*(a+b) + 16384. sum(a) and sum(b) are accumulated cheaply via _mm512_sad_epu8 (one PSADBW per 64-byte iteration).

L2 (8-bit codes): Replaces the 16-bytes-per-iter cvtepu8_epi32 + sub + mullo_epi32 path with a 16-bit pipeline: load 64 bytes, zero-extend to 16-bit lanes via _mm512_cvtepu8_epi16, subtract in 16-bit, square-and-accumulate to 32-bit with _mm512_madd_epi16. Differences of two uint8_t values lie in [-255, 255], so they fit in signed 16-bit lanes, and the squares (max 255^2 = 65025) are accumulated in 32-bit lanes, so the widened representation is safe. Falls through to a 32-byte step and a scalar tail for arbitrary d. The same kernel is bit-exact for the signed variant: (a - 128) - (b - 128) == a - b, so no correction is needed.

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
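The signed inner-product correction and the block-plus-tail loop structure can be sketched in scalar form. This is an illustration under stated assumptions, not the PR's kernel: the function name and the `block` parameter are hypothetical, and the per-element constant 16384 is summed over all d elements, giving 16384 * d in the final correction.

```python
import random

def ip_signed_from_stored(a, b, block=64):
    # a, b hold stored bytes (value + 128). Accumulate the raw dot product
    # and both byte sums, then apply one closed-form correction at the end.
    d = len(a)
    dot = sum_a = sum_b = 0
    full = d - d % block
    for i in range(0, full, block):        # stands in for the 64-byte loop
        ca, cb = a[i:i + block], b[i:i + block]
        dot += sum(x * y for x, y in zip(ca, cb))
        sum_a += sum(ca)
        sum_b += sum(cb)
    for i in range(full, d):               # scalar tail for d % block
        dot += a[i] * b[i]
        sum_a += a[i]
        sum_b += b[i]
    # sum((a-128)*(b-128)) == sum(a*b) - 128*(sum(a)+sum(b)) + 16384*d
    return dot - 128 * (sum_a + sum_b) + 16384 * d

random.seed(2)
a = [random.randrange(256) for _ in range(200)]   # 200 % 64 == 8 tail elements
b = [random.randrange(256) for _ in range(200)]
reference = sum((x - 128) * (y - 128) for x, y in zip(a, b))
```

Deferring the correction to the end keeps the inner loop free of per-element subtractions, which is presumably why only the two running sums are tracked alongside the dot product.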

meta-codesync · 2026-05-22T23:59:17Z

@mnorris11 has imported this pull request. If you are a Meta employee, you can view this in D106148661.

meta-cla Bot added the CLA Signed label May 1, 2026

mnorris11 added the to-benchmark label May 18, 2026

mulugetam force-pushed the sq-avx512-spr-opt branch from 09cf1fb to 0bd934e Compare May 20, 2026 17:12

mulugetam force-pushed the sq-avx512-spr-opt branch from 0bd934e to 9e5c346 Compare May 20, 2026 20:55

mulugetam and others added 2 commits May 20, 2026 20:56

Merge branch 'main' into sq-avx512-spr-opt

58428ed

mulugetam wants to merge 2 commits into facebookresearch:main from mulugetam:sq-avx512-spr-opt

3 participants
