Description
Using NN Descent to build the knn graph for UMAP and HDBSCAN is working okay, but a few improvements can be made.
Related PRs:
[Precision Issues with NN Descent]
NN Descent in RAFT uses fp16 to calculate distances. This makes it difficult to take the distances calculated by NN Descent and compare them against fp32-calculated distances for additional operations outside of NN Descent, such as making predictions. (also left as an issue here)
- Problem
- NN Descent calculates distances in fp16 (and only supports the L2Expanded metric)
- Current status
- Right now, we grab the indices only, and call `refine()` on those indices to calculate fp32 + L2SqrtExpanded metric distance
- Possible solutions
- Check why distance differences are causing dramatic drops in score
- Support fp32 distance calculation in NN Descent -> we can just grab the distances from NN Descent instead of doing `refine()`
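To make the precision problem concrete, here is a toy model in plain Python. It is not RAFT's kernel: it assumes every intermediate of the expanded form is rounded to fp16, which may be more pessimistic than what NN Descent actually does. The helper names `to_fp16`, `l2_expanded_fp16`, and `l2_sqrt_reference` are hypothetical and exist only for this sketch.

```python
import math
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def l2_expanded_fp16(x, y):
    """Squared L2 via ||x||^2 + ||y||^2 - 2<x, y>, rounding every step to fp16.

    Assumption of this sketch: accumulation also happens in fp16.
    """
    xx = yy = xy = 0.0
    for a, b in zip(x, y):
        a, b = to_fp16(a), to_fp16(b)
        xx = to_fp16(xx + to_fp16(a * a))
        yy = to_fp16(yy + to_fp16(b * b))
        xy = to_fp16(xy + to_fp16(a * b))
    return to_fp16(to_fp16(xx + yy) - to_fp16(2.0 * xy))


def l2_sqrt_reference(x, y):
    """Euclidean distance in full Python float precision (stand-in for fp32)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Two nearby points with large norms: a hard case for the expanded form.
x = [100.0, 50.0, 25.0]
y = [100.1, 50.1, 25.1]

approx_sq = l2_expanded_fp16(x, y)
exact = l2_sqrt_reference(x, y)
print("fp16 expanded (squared):", approx_sq)
print("reference sqrt distance:", exact, "squared:", exact ** 2)
```

In this model the two norm terms and the doubled dot product all land near 26000, where fp16 values are spaced 16 apart, so a true squared distance of about 0.03 cannot survive the final subtraction. The example also shows the second mismatch mentioned above: one side is a squared distance and the other is a square-rooted one.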
[NN Descent parameters not always resulting in a better score]
Since we are using `refine()`, increasing NN Descent parameters such as `max_iterations` or `graph_degree` should yield a better result (a larger graph degree gives the refinement step more neighbor candidates to consider, and NN Descent has a termination threshold, so increasing `max_iterations` shouldn't really matter). However, this is not always the case, and NN Descent is unstable for some specific test cases.
- Problem
- NN Descent does not work as well as brute force with the current test configurations of `test_membership_vector_circles` and `test_approximate_predict_blobs` in `cuml/python/cuml/tests/test_hdbscan.py`. Increasing NN Descent parameters does not solve this problem.
- Current status
- Separate tests for NN Descent for the two cases mentioned above.
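The expectation that more candidates should help refinement can be sketched in a few lines. This is a CPU illustration only, not the RAFT `refine()` API: the `refine`, `brute_force_knn`, and `recall` functions below are hypothetical, the candidate lists are made up, and the wide list is assumed to be a superset of the narrow one.

```python
import math


def refine(data, candidates, k):
    """Re-rank each point's candidate ids by exact Euclidean distance, keep the best k."""
    refined = []
    for point_id, cand_ids in enumerate(candidates):
        scored = sorted(
            (math.dist(data[point_id], data[c]), c)
            for c in set(cand_ids)
            if c != point_id  # illustration choice: a point is not its own neighbor
        )
        refined.append([c for _, c in scored[:k]])
    return refined


def brute_force_knn(data, k):
    """Exact kNN: every other point is a candidate."""
    everyone = list(range(len(data)))
    return refine(data, [everyone] * len(data), k)


def recall(approx, exact):
    """Fraction of exact neighbors that also appear in the approximate result."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx, exact))
    return hits / sum(len(e) for e in exact)


data = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (20.0,)]
k = 2

# Hypothetical approximate graphs with 2 and 4 candidates per point.
narrow = [[3, 1], [0, 4], [1, 3], [4, 0], [3, 0], [0, 4]]
wide = [
    [3, 1, 2, 4],
    [0, 4, 2, 3],
    [1, 3, 0, 4],
    [4, 0, 2, 1],
    [3, 0, 2, 1],
    [0, 4, 3, 1],
]

exact = brute_force_knn(data, k)
narrow_recall = recall(refine(data, narrow, k), exact)
wide_recall = recall(refine(data, wide, k), exact)
print("recall, narrow candidates:", narrow_recall)
print("recall, wide candidates:  ", wide_recall)
```

The improvement is only guaranteed here because each wide list contains the matching narrow list. Whether a larger `graph_degree` in NN Descent yields such a superset is not something this sketch shows.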
[Using NN Descent with HDBSCAN]
HDBSCAN builds the knn graph in two phases:
- Just build the knn graph for top k
- Rebuild the knn graph for top k, using information from step 1 (core distances) to post-process the distances as the graph is built.
Step 1 can be done using NN Descent + refine, but step 2 cannot be done using NN Descent.
- Problem
- Specifically, the core distances obtained after step 1 (which are fp32 L2SqrtExpanded distances from the refine function) cannot be compared against the distances computed within NN Descent in step 2 because of the difference in precision
- The refine function also does not support distance epilogues
- Current status
- Do the first knn with NN Descent, and the second knn with brute force knn
- Goal
- Use NN Descent for both phases of knn in HDBSCAN
- Possible solutions
- Add distance epilogue support for refine function so that we can refine based on the mutual reachability distance
- Support fp32 distance calculation in NN Descent
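For reference, the two phases and the epilogue can be sketched with a brute-force search on the CPU. This is a minimal illustration, not the cuML or RAFT code path. It uses the standard HDBSCAN definition of mutual reachability, max(core(a), core(b), d(a, b)), and it assumes the core distance is the distance to the k-th neighbor with the point itself excluded. The `knn` helper and its `dist_fn` callback are hypothetical names.

```python
import math


def knn(data, k, dist_fn):
    """Brute-force kNN under an arbitrary pairwise distance callback.

    Returns, for each point, a sorted list of (distance, neighbor_id) pairs.
    """
    result = []
    for i in range(len(data)):
        scored = sorted((dist_fn(i, j), j) for j in range(len(data)) if j != i)
        result.append(scored[:k])
    return result


data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (3.0, 3.0)]
k = 2


def euclidean(i, j):
    """Raw distance between points i and j."""
    return math.dist(data[i], data[j])


# Phase 1: plain kNN. The core distance is the distance to the k-th neighbor.
phase1 = knn(data, k, euclidean)
core_dists = [neighbors[-1][0] for neighbors in phase1]


def mutual_reachability(i, j):
    """Epilogue applied to each raw distance in phase 2."""
    return max(core_dists[i], core_dists[j], euclidean(i, j))


# Phase 2: the same search, ranked by mutual reachability distance instead.
phase2 = knn(data, k, mutual_reachability)

for i, neighbors in enumerate(phase2):
    print(i, [(j, round(d, 4)) for d, j in neighbors])
```

The epilogue compares core distances from phase 1 directly with raw distances from phase 2, so both must be on the same scale. That is why fp32 square-rooted core distances cannot simply be mixed with fp16 squared distances.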