@jinsolp (Contributor) commented Oct 8, 2025

Closes #1370
Closes #195

Based on heuristics, we chose dim=16 as the threshold for dispatching to an fp32 distance kernel.
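As a rough illustration of the dispatch, a minimal sketch (the constant and names here are mine, not the actual cuVS symbols):

```cpp
// Sketch of the dispatch heuristic described above; illustrative names only.
constexpr int kFp32DispatchDim = 16;  // threshold chosen from heuristics

enum class DistancePath { kFp16Wmma, kFp32Manual };

// Pick the distance path based on the dataset dimensionality.
inline DistancePath choose_distance_path(int dim)
{
  return dim <= kFp32DispatchDim ? DistancePath::kFp32Manual
                                 : DistancePath::kFp16Wmma;
}
```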

We no longer use wmma in the fp32 kernel. Originally, wmma was performed on matrices of shape [64 x 32] x [32 x 64] per block (with multiple iterations when the data dimension exceeds 32).
We now do the computation manually; since we only target small dimensions, fp32 dispatching ends up slightly faster end to end, with much better recall for small dimensions.
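For illustration, a minimal sketch of what a manual fp32 path can look like: a standalone all-pairs squared-L2 kernel reading straight from global memory. The actual cuVS kernel works on tiles inside the NN Descent local join, so treat this purely as a sketch of the idea.

```cuda
// Manual fp32 squared-L2 for small dims (dim <= 16). A simple per-thread
// loop is cheap here and keeps accumulation in fp32, avoiding the
// precision loss of the fp16 wmma path. Not the actual cuVS kernel.
__global__ void pairwise_l2_small_dim(const float* __restrict__ points,
                                      int n, int dim,
                                      float* __restrict__ dists)
{
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n || j >= n) return;

  float acc = 0.f;
  for (int d = 0; d < dim; ++d) {
    float diff = points[i * dim + d] - points[j * dim + d];
    acc = fmaf(diff, diff, acc);  // fused multiply-add, stays in fp32
  }
  dists[i * n + j] = acc;
}
```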

All numbers below were measured on an L40 GPU with an AMD EPYC CPU (128 cores). Perf and recall are averaged over 5 runs, and all times are in seconds. The baseline knn graph is computed with sklearn.neighbors.NearestNeighbors using the brute-force method.
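For clarity, recall here is the usual knn-graph recall against the brute-force baseline; in my notation (not spelled out in the PR), with $n$ points and $k$ neighbors per point:

$$\text{recall} = \frac{1}{n \cdot k} \sum_{i=1}^{n} \left| \mathcal{N}_k^{\text{approx}}(i) \cap \mathcal{N}_k^{\text{exact}}(i) \right|$$

where $\mathcal{N}_k^{\text{approx}}(i)$ and $\mathcal{N}_k^{\text{exact}}(i)$ are the approximate and brute-force $k$-neighbor sets of point $i$.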

Max iters=20

(Screenshot: perf/recall table for max iters=20)

For larger dimensions there is an inherent issue with the NN Descent algorithm itself that keeps recall low. This can be improved slightly with more iterations (see the sketch after this paragraph).
Also notice that the end-to-end time is similar or slightly lower when using fp32.
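A minimal sketch of where the iteration budget is set (field names follow the cuVS nn_descent parameter struct as I recall it; treat the exact names as assumptions and check the headers):

```cpp
#include <cuvs/neighbors/nn_descent.hpp>

int main()
{
  // Assumed field names; verify against the cuVS headers.
  cuvs::neighbors::nn_descent::index_params params;
  params.graph_degree   = 64;   // neighbors per node in the final graph
  params.max_iterations = 100;  // more iterations can claw back some recall
  return 0;
}
```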

Max iters=100

(Screenshot: perf/recall table for max iters=100)

Notice that for the blue part, recall does not improve relative to the table above even with more iterations (i.e., this is why we need the fp32 approach for this part).


Development

Successfully merging this pull request may close these issues:

- [BUG] cuVS Nearest Neighbors recall lower than expected for some datasets
- [FEA] Calculating distances with FP32 in NN Descent