[FEA] Better usage of NN Descent as knn build algo

Using NN Descent to build the knn graph for UMAP and HDBSCAN is working okay, but a few improvements can be made.
Related PRs:
- HDBSCAN + NN Descent [here](https://github.com/rapidsai/cuml/pull/5939)
- UMAP + NN Descent [here](https://github.com/rapidsai/cuml/pull/5910)

**[Precision Issues with NN Descent]**
NN Descent in RAFT uses fp16 to calculate distances. This makes it  difficult to grab the distances calculated by NN Descent and compare it against fp32-calculated distances for additional operations outside of NN Descent such as making predictions. (also left as an issue [here](https://github.com/rapidsai/cuvs/issues/195))
- Problem
  - NN Descent calculates distances in fp16 (also only support L2Expanded metric)
- Current status
  - Right now, we grab the indices only, and call `refine()` on those indices to calculate fp32 + L2SqrtExpanded metric distance
- Possible solutions
  - Check why distance differences are causing dramatic drops in score
  - Support fp32 distance calculation in NN Descent -> we can just grab the distances from NN Descent instead of doing `refine()`


**[NN Descent parameters not always resulting in a better score]**
Since we are using `refine()`, increasing NN Descent parameters such as `max_iterations` or `graph_degree` should yield a better result (because it takes more neighbor candidates into consideration for refinement with a large graph degree, and it has a termination threshold so increasing `max_iterations` shouldn't really matter). However, this is not always the case and NN Descent is unstable for some specific test cases.
- Problem
  - NN Descent does not work as well as the brute force with the current test configurations of `test_membership_vector_circles` and `test_approximate_predict_blobs` in `cuml/python/cuml/tests/test_hdbscan.py`. Increasing NN Descent parameters does not solve this problem
- Current status
  - Separate tests for NN Descent for the two cases mentioned above.

**[Using NN Descent with HDBSCAN]**
HDBSCAN has two phases of building knn graph
1. Just build the knn graph for top k
2. Re-build the knn graph for top k using information (core dists) from step 1 to post-process the distance as they build.
Step 1 can be done using NN Descent + refine, but step 2 cannot be done using NN Descent
- Problem
  - Specifically, the core distances obtained after step 1 (which are L2SqrtExpanded fp32 distances from refine function) cannot be used to compare distances within NN Descent in step 2 because of the distance precisions
  - And the refine function doesn't have support for distance epilogues
- Current status
  - Do the first knn with NN Descent, and the second NN Descent with brute force knn
- Goal
  - use NN Descent for both phases of knn in HDBSCAN
- Possible solutions
  - Add distance epilogue support for refine function so that we can refine based on the mutual reachability distance 
  - Support fp32 distance calculation in NN Descent

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[FEA] Better usage of NN Descent as knn build algo #5942

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[FEA] Better usage of NN Descent as knn build algo #5942

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions