Add Sapphire Rapids optimizations for ScalarQuantizer (L2, IP) #5173

Open
mulugetam wants to merge 2 commits into facebookresearch:main from mulugetam:sq-avx512-spr-opt

@mulugetam mulugetam commented May 1, 2026

This PR specializes the byte-vector distance path for AVX512_SPR on QT_8bit_direct and QT_8bit_direct_signed, achieving speedups of up to 2.3x on distance benchmarks and up to 1.21x on search (IndexIVFScalarQuantizer) compared to the existing SPR implementation.

Inner product: A 64-byte VNNI loop using _mm512_dpbusd_epi32 replaces the 16-byte cvtepu8_epi32 + mullo_epi32 path. Because dpbusd multiplies unsigned bytes by signed bytes, the unsigned×unsigned case biases the second operand by −128 and then adds back 128 · sum(a), which restores the exact result. The signed variant additionally applies a closed-form correction using sum(a) and sum(b), both accumulated via _mm512_sad_epu8.
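To make the bias trick concrete, here is a minimal scalar sketch of the arithmetic, not the PR's actual kernel. The helper names (`dot_u8_s8`, `ip_u8_reference`, `ip_u8_biased`) are hypothetical, and `dot_u8_s8` only models the unsigned-by-signed byte products that a dpbusd-style instruction accumulates; it uses no intrinsics so it runs on any CPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar stand-in for the unsigned * signed byte dot product
// that a dpbusd-style instruction accumulates into 32-bit lanes.
inline int32_t dot_u8_s8(const uint8_t* a, const int8_t* b, size_t d) {
    int32_t acc = 0;
    for (size_t i = 0; i < d; i++) {
        acc += int32_t(a[i]) * int32_t(b[i]);
    }
    return acc;
}

// Plain unsigned * unsigned inner product, used as the reference result.
inline int32_t ip_u8_reference(const uint8_t* a, const uint8_t* b, size_t d) {
    int32_t acc = 0;
    for (size_t i = 0; i < d; i++) {
        acc += int32_t(a[i]) * int32_t(b[i]);
    }
    return acc;
}

// Bias trick: shift b into the signed byte range, take the
// unsigned * signed dot product, then add back 128 * sum(a).
// sum(a[i] * (b[i] - 128)) + 128 * sum(a[i]) == sum(a[i] * b[i])
inline int32_t ip_u8_biased(const uint8_t* a, const uint8_t* b, size_t d) {
    std::vector<int8_t> b_biased(d);
    int32_t sum_a = 0;
    for (size_t i = 0; i < d; i++) {
        b_biased[i] = int8_t(int(b[i]) - 128); // always in [-128, 127]
        sum_a += a[i];
    }
    return dot_u8_s8(a, b_biased.data(), d) + 128 * sum_a;
}
```

Both functions should agree for any pair of byte vectors, since the correction is an algebraic identity rather than an approximation.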

L2 distance: A 64-byte widened-multiply-add loop via _mm512_cvtepu8_epi16 + _mm512_madd_epi16 replaces the narrower path. The signed variant is bit-exact because the −128 bias cancels in the difference.

Improves upon and supersedes #5067.

Distance benchmark

bench_scalar_quantizer_distance --d=<128|256|768> --n=2000 --iterations=20

| Quantizer | d | Baseline (ms) | This PR (ms) | Speedup |
|---|---|---|---|---|
| QT_8bit_direct | 128 | 40.8 | 22.9 | 1.78× |
| QT_8bit_direct_signed | 128 | 35.0 | 23.0 | 1.52× |
| QT_8bit_direct | 256 | 75.3 | 35.7 | 2.11× |
| QT_8bit_direct_signed | 256 | 69.1 | 35.7 | 1.94× |
| QT_8bit_direct | 768 | 215.5 | 92.3 | 2.33× |
| QT_8bit_direct_signed | 768 | 206.4 | 91.0 | 2.27× |

Raw performance data: https://gist.github.com/mulugetam/72c0960e47bc640f99aa346f363e56fe

End-to-end search (IndexIVFScalarQuantizer)

python benchs/bench_scalar_quantizer.py

| Range stat | QT_8bit_direct | QT_8bit_direct_signed |
|---|---|---|
| RS_minmax | 1.10× | 1.12× |
| RS_minmax | 1.09× | 1.15× |
| RS_minmax | 1.14× | 1.13× |
| RS_minmax | 1.15× | 1.07× |
| RS_minmax | 1.06× | 1.17× |
| RS_minmax | 1.09× | 1.02× |
| RS_minmax | 1.09× | 1.11× |
| RS_meanstd | 1.21× | 1.13× |
| RS_meanstd | 1.14× | 1.06× |
| RS_meanstd | 1.11× | 1.07× |
| RS_meanstd | 1.12× | 1.07× |
| RS_meanstd | 1.15× | 1.10× |
| RS_meanstd | 1.13× | 1.04× |
| RS_meanstd | 1.04× | 1.07× |
| RS_quantiles | 1.11× | 1.03× |
| RS_quantiles | 1.12× | 1.11× |
| RS_quantiles | 1.11× | 0.98× |
| RS_quantiles | 1.14× | 1.17× |
| RS_optim | 1.06× | 1.07× |

Raw performance data: https://gist.github.com/mulugetam/632e2e08c9358b2184cbaa3397a6c73f

@meta-cla meta-cla Bot added the CLA Signed label May 1, 2026

mdouze commented May 4, 2026

Thanks for the PR.
Do I understand correctly that the results are not exactly the same since this is based on an integer-integer comparison instead of integer-float?
In fact this is a tradeoff that we may want to apply to other scalar quantizers as well, especially the QT_x_uniform ones.

@mulugetam

@mdouze Thanks for the review.

The results are bit-exact for QT_8bit_direct: both the existing AVX512 code and this PR operate in the integer domain via DistanceComputerByte, which truncates the query to uint8 in set_query(). This PR simply widens the loop from 16 to 64 bytes per iteration using VNNI, with an algebraically exact bias correction for the unsigned IP case.

For QT_8bit_direct_signed, my reading of the code is that it has no DistanceComputerByte variant for the signed type and it just falls through to the float-domain DCTemplate<Quantizer8bitDirectSigned> path, which preserves full float precision in the query. The new DistanceComputerByteSigned truncates the query to integer (uint8_t(int(x[i]) + 128)), losing the fractional part. As stated above, this truncation loss already exists for QT_8bit_direct, but it's new for QT_8bit_direct_signed.
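As a rough illustration of the truncation loss described above, the sketch below compares a float-domain inner product with one computed after encoding the query as uint8_t(int(x[i]) + 128). The function names are hypothetical, the float path is an assumed model (query times decoded code value, with decode = code − 128) rather than the actual DCTemplate code, and query components are assumed to lie in [−128, 127].

```cpp
#include <cstddef>
#include <cstdint>

// Encode one query component as quoted in the discussion:
// uint8_t(int(x) + 128). int(x) truncates toward zero, so the
// fractional part is lost. Assumes x is within [-128, 127].
inline uint8_t encode_query_component(float x) {
    return uint8_t(int(x) + 128);
}

// Assumed float-domain model: the query keeps full precision and the
// stored code is decoded as (code - 128).
inline float ip_float_domain(const float* q, const uint8_t* code, size_t d) {
    float acc = 0.0f;
    for (size_t i = 0; i < d; i++) {
        acc += q[i] * (float(code[i]) - 128.0f);
    }
    return acc;
}

// Integer-domain model: the query is truncated to a byte first, then
// both operands are decoded and multiplied as integers.
inline int32_t ip_integer_domain(const float* q, const uint8_t* code, size_t d) {
    int32_t acc = 0;
    for (size_t i = 0; i < d; i++) {
        int32_t qi = int32_t(encode_query_component(q[i])) - 128;
        int32_t ci = int32_t(code[i]) - 128;
        acc += qi * ci;
    }
    return acc;
}
```

For a single component with query value 3.7 and stored code 138 (decoded value 10), the float-domain model gives roughly 37 while the integer-domain model gives 30, which is the precision gap at issue for QT_8bit_direct_signed.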

I can document this as a known tradeoff (speed vs. precision for QT_8bit_direct_signed). Alternatively, I could enable it only for the symmetric compute_code_distance path, where both inputs are already integers (though this would lose the VNNI benefit on the query_to_code path).

Regarding QT_x_uniform: the VNNI byte-domain technique doesn't directly apply there since, from my reading of the code, those quantizers require scale/offset denormalization.

Quick question: are the DD changes complete, or still in progress? I'd like to re-review and rebase all my other pending PRs (I believe there are 9) once they land. Thanks!


mdouze commented May 5, 2026

I think that QT_8bit_direct_signed distances can be computed in the integer domain; that was the initial purpose of 8bit_direct_x.
Yes, the DD has been considered complete since yesterday.
Thanks for all the SIMD optimization efforts! We will try to ingest them as soon as possible.


mulugetam commented May 20, 2026

@mdouze Rebased with fixes.

@mulugetam mulugetam force-pushed the sq-avx512-spr-opt branch from 0bd934e to 9e5c346 Compare May 20, 2026 20:55
mulugetam and others added 2 commits May 20, 2026 20:56
Adds an AVX512_SPR specialization path for ScalarQuantizer that uses
Sapphire Rapids-specific instructions for byte-code distance computation
on QT_8bit_direct and QT_8bit_direct_signed.

Inner product (8-bit codes):

  Replaces the AVX512 path that processes 16 bytes per iteration via
  cvtepu8_epi32 + mullo_epi32 with a VNNI loop that processes 64 bytes
  per iteration using _mm512_dpbusd_epi32. VNNI computes unsigned*signed
  dot products, so the standard bias trick is used to bridge
  unsigned*unsigned: subtract 128 from code2, run dpbusd, then add the
  128 * sum(code1) correction. A scalar tail handles d % 64.

  For QT_8bit_direct_signed (storage = value + 128), the same VNNI loop
  runs and an additional closed-form correction is applied:
      (a-128) * (b-128) = a*b - 128*(a+b) + 16384
  sum(a) and sum(b) are accumulated cheaply via _mm512_sad_epu8 (one
  PSADBW per 64-byte iteration).
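
The per-element identity above can be checked with a small scalar sketch. This is not the VNNI kernel: the function names are hypothetical, and the loop simply accumulates the three sums (a*b, a, b) that the commit message describes before applying the correction, which becomes 16384 * d once summed over d elements.

```cpp
#include <cstddef>
#include <cstdint>

// Reference: decode both stored bytes (storage = value + 128) and
// take the signed inner product directly.
inline int32_t ip_signed_reference(const uint8_t* a, const uint8_t* b, size_t d) {
    int32_t acc = 0;
    for (size_t i = 0; i < d; i++) {
        acc += (int32_t(a[i]) - 128) * (int32_t(b[i]) - 128);
    }
    return acc;
}

// Closed-form version: work on the stored unsigned bytes and correct
// afterwards, using
//   sum((a-128) * (b-128)) = sum(a*b) - 128 * (sum(a) + sum(b)) + 16384 * d
inline int32_t ip_signed_corrected(const uint8_t* a, const uint8_t* b, size_t d) {
    int32_t dot_ab = 0; // what the unsigned byte dot product yields
    int32_t sum_a = 0;  // modeled on the byte sums a SAD against zero gives
    int32_t sum_b = 0;
    for (size_t i = 0; i < d; i++) {
        dot_ab += int32_t(a[i]) * int32_t(b[i]);
        sum_a += a[i];
        sum_b += b[i];
    }
    return dot_ab - 128 * (sum_a + sum_b) + 16384 * int32_t(d);
}
```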

L2 (8-bit codes):

  Replaces the 16-bytes-per-iter cvtepu8_epi32 + sub + mullo_epi32 path
  with a 16-bit pipeline: load 64 bytes, zero-extend to 16-bit lanes via
  _mm512_cvtepu8_epi16, subtract in 16-bit, square-and-accumulate to
  32-bit with _mm512_madd_epi16. Differences of two uint8_t values lie
  in [-255, 255] and fit in signed 16-bit lanes, and each madd pair sum
  (at most 2 * 255^2 = 130050) fits in 32 bits, so the widened
  representation is safe. Falls through to a 32-byte step and a scalar
  tail for arbitrary d. The same kernel is bit-exact for the signed
  variant: (a - 128) - (b - 128) == a - b, so no correction is needed.
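
A scalar sketch of the L2 arithmetic, under the assumption that each lane holds a 16-bit difference that is squared and accumulated into 32 bits. The function names are hypothetical and no intrinsics are used; the point is only to show that the unsigned kernel and the decoded signed computation produce the same value.

```cpp
#include <cstddef>
#include <cstdint>

// Widened L2 on the stored unsigned bytes: the difference is held in a
// signed 16-bit value (range [-255, 255]) and its square is accumulated
// into a 32-bit sum, mirroring the 16-bit -> 32-bit pipeline.
inline int32_t l2_u8_widened(const uint8_t* a, const uint8_t* b, size_t d) {
    int32_t acc = 0;
    for (size_t i = 0; i < d; i++) {
        int16_t diff = int16_t(int(a[i]) - int(b[i]));
        acc += int32_t(diff) * int32_t(diff);
    }
    return acc;
}

// Signed reference: decode each byte (storage = value + 128) before
// subtracting. The two -128 terms cancel, so the result is identical.
inline int32_t l2_signed_reference(const uint8_t* a, const uint8_t* b, size_t d) {
    int32_t acc = 0;
    for (size_t i = 0; i < d; i++) {
        int32_t x = int32_t(a[i]) - 128;
        int32_t y = int32_t(b[i]) - 128;
        acc += (x - y) * (x - y);
    }
    return acc;
}
```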

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>

meta-codesync Bot commented May 22, 2026

@mnorris11 has imported this pull request. If you are a Meta employee, you can view this in D106148661.
