Add Sapphire Rapids optimizations for ScalarQuantizer (L2, IP)#5173
Add Sapphire Rapids optimizations for ScalarQuantizer (L2, IP)#5173mulugetam wants to merge 2 commits into
Conversation
|
Thanks for the PR. |
|
@mdouze Thanks for the review. The results are bit-exact for For I can document this as a known tradeoff (speed vs. precision for Regarding Quick question: are the DD changes complete, or still in progress? I'd like to re-review and rebase all my other pending PRs (I believe there are 9) once they land. Thanks! |
|
I think that the QT_8bit_direct_signed can be performed in the integer domain, that's the initial purpose of 8bit_direct_x |
09cf1fb to
0bd934e
Compare
|
@mdouze Rebased with fixes. |
0bd934e to
9e5c346
Compare
Adds an AVX512_SPR specialization path for ScalarQuantizer that uses
Sapphire Rapids-specific instructions for byte-code distance computation
on QT_8bit_direct and QT_8bit_direct_signed.
Inner product (8-bit codes):
Replaces the AVX512 path that processes 16 bytes per iteration via
cvtepu8_epi32 + mullo_epi32 with a VNNI loop that processes 64 bytes
per iteration using _mm512_dpbusd_epi32. VNNI computes unsigned*signed
dot products, so the standard bias trick is used to bridge
unsigned*unsigned: subtract 128 from code2, run dpbusd, then add the
128 * sum(code1) correction. A scalar tail handles d % 64.
For QT_8bit_direct_signed (storage = value + 128), the same VNNI loop
runs and an additional closed-form correction is applied:
(a-128) * (b-128) = a*b - 128*(a+b) + 16384
sum(a) and sum(b) are accumulated cheaply via _mm512_sad_epu8 (one
PSADBW per 64-byte iteration).
L2 (8-bit codes):
Replaces the 16-bytes-per-iter cvtepu8_epi32 + sub + mullo_epi32 path
with a 16-bit pipeline: load 64 bytes, zero-extend to 16-bit lanes via
_mm512_cvtepu8_epi16, subtract in 16-bit, square-and-accumulate to
32-bit with _mm512_madd_epi16. Squared differences of two uint8_t
values fit in 16 bits (max 255^2 = 65025), so the widened
representation is safe. Falls through to a 32-byte step and a scalar
tail for arbitrary d. The same kernel is bit-exact for the signed
variant: (a - 128) - (b - 128) == a - b, so no correction is needed.
Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
|
@mnorris11 has imported this pull request. If you are a Meta employee, you can view this in D106148661. |
This PR specializes the byte-vector distance path for
AVX512_SPRonQT_8bit_directandQT_8bit_direct_signed, achieving speedups of up to 2.3x on distance benchmarks and up to 1.21x on search (IndexIVFScalarQuantizer) compared to the existing SPR implementation.Inner product: A 64-byte VNNI loop using
_mm512_dpbusd_epi32replaces the 16-bytecvtepu8_epi32+mullo_epi32path. For the unsigned×unsigned case, the operand is biased by −128; a closed-form correction usingsum(a)andsum(b)(accumulated via_mm512_sad_epu8) restores the exact result. The signed variant applies the same correction terms.L2 distance: A 64-byte widened-multiply-add loop via
_mm512_cvtepu8_epi16+_mm512_madd_epi16replaces the narrower path. The signed variant is bit-exact because the −128 bias cancels in the difference.Improves upon and supersedes #5067.
Distance benchmark
QT_8bit_directQT_8bit_direct_signedQT_8bit_directQT_8bit_direct_signedQT_8bit_directQT_8bit_direct_signedRaw performance data: https://gist.github.com/mulugetam/72c0960e47bc640f99aa346f363e56fe
End-to-end search (
IndexIVFScalarQuantizer)QT_8bit_directQT_8bit_direct_signedRaw performance data: https://gist.github.com/mulugetam/632e2e08c9358b2184cbaa3397a6c73f