IVF-PQ: low-precision coarse search #715
base: branch-25.06
Conversation
Hi Artem, thanks for the PR! Could you add tests for the new options?

Sure, thanks for pointing this out! It's worth mentioning that the int8 coarse search often gives garbage recall, and that is rather unavoidable. The problem is that we keep cluster norms as part of the cluster vectors and compute a GEMM over the whole thing for the L2 case. But the norms are not normalized, so they grow very fast with the number of dimensions, which makes an int8 representation impossible. I slightly improved the situation by encoding the norms into several int8 slots, but even that didn't help in many cases.
Thank you Artem for the PR! It looks good overall, but I have a few questions.
```cpp
// 8-bit coarse search is experimental and there's no guarantee of any recall
// if the data is not normalized. Especially for L2, because we store vector
// norms alongside the cluster centers.
x.min_recall = 0.1;
```
- Is the normalization requirement documented elsewhere?
- Can't we use our quantization API to set proper normalization constants?
- Can we have a larger `min_recall` by increasing `nprobes`?
To make sure the `int8_t` coarse search works correctly, we need an even stricter requirement: that all elements are smaller than one. I'm also not sure it makes sense to require L2 normalization (that all norms are smaller than 2m), because that means reduced precision (if both the norm and the components are divided by the same constant).

I think increasing the `nprobes` won't help a lot, because if the norm is out of range we basically get a random selection.

All in all, I doubt the `int8_t` variant will be useful, but we may reuse the code later by changing it to fp8 (and we can estimate the performance by running `int8_t` now). Therefore I suppose there's value in having it as an experimental feature without investing too much in documentation and testing.
Can we add the half/int8 precision directly to balanced kmeans so that we can reuse the solution across other algorithms which use that?
Hi @cjnolet, the coarse search bits in IVF-PQ are not really portable, as they do cluster search and query type mapping at the same time and rely on an IVF-PQ-specific representation (cluster center norms stored alongside the vectors), so unfortunately there's no code to share between the two.
Thanks Artem for the update! The PR looks good to me.
I am not completely convinced that we need the int8 option, but you added a clear explanation of its limitations, so I am fine with it.
Enable a low-precision (half / int8) element type for the cuBLAS GEMM performed during the coarse search (selecting the clusters to probe). This makes cuBLAS use tensor cores and thus speeds up the coarse search.

Also promote the `kMaxQueries` compile-time constant to a runtime search parameter: this improves GPU utilization for extremely large batch sizes, such as when using IVF-PQ to construct a nearest-neighbor graph over the whole dataset.