[Fix] Various fixes for 25.02.01 point release #695

rhdong · 2025-02-14T02:27:32Z

Fixes #694
Fixes #626
Fixes #680
Closes #706

Would prefer devtech approval

achirkin

The error and the solution look puzzling to me. Do you have an explanation for why this happens? The static compiler errors like this are not even related to CUDA version at all.

achirkin · 2025-02-14T12:30:54Z

cpp/src/neighbors/detail/cagra/cagra_merge.cuh

@@ -50,8 +50,7 @@ index<T, IdxT> merge(raft::resources const& handle,
  for (auto index : indices) {
    RAFT_EXPECTS(index != nullptr,
                 "Null pointer detected in 'indices'. Ensure all elements are valid before usage.");
-    using ds_idx_type = decltype(index->data().n_rows());
-    if (auto* strided_dset = dynamic_cast<const strided_dataset<T, ds_idx_type>*>(&index->data());
+    if (auto* strided_dset = dynamic_cast<const strided_dataset<T, int64_t>*>(&index->data());


Hardcoding the dataset index type like this is prone to error: if we decide to change it (let say to IdxT) in the index header at some point, it will compile and break the code unnoticed - the cast will simply fail.
If you really can't use the decltype as above for some reason, perhaps a better way would be to define a new alias in the index definition (e.g. using dataset_index_type = int64_t) and then make the dataset member use that type in the template parameter.

You can also try to just rename the index variable here. Maybe it confuses the compiler because of the same template type name.

You can also try to just rename the index variable here. Maybe it confuses the compiler because of the same template type name.

Hi @achirkin @cjnolet, I've tried it, but it still fails.

/cuvs/cpp/src/neighbors/detail/cagra/cagra_merge.cuh:56:21: error: '__T23' does not name a type 56 | using ds_idx_type = decltype(l_index->data().n_rows());

9357f8b

Hardcoding the dataset index type like this is prone to error: if we decide to change it (let say to IdxT) in the index header at some point, it will compile and break the code unnoticed - the cast will simply fail. If you really can't use the decltype as above for some reason, perhaps a better way would be to define a new alias in the index definition (e.g. using dataset_index_type = int64_t) and then make the dataset member use that type in the template parameter.

Yh, this works: 70e3036
And, if you @achirkin @cjnolet all believe this is the satisfied solution, I will commit it on this branch(now it is in my personal branch.) Thanks!

Hi @cjnolet @achirkin , I committed the dataset_index_t solution for time-saving,

achirkin

LGTM

cjnolet · 2025-02-14T17:39:15Z

cpp/include/cuvs/neighbors/cagra.hpp

@@ -337,6 +337,8 @@ struct index : cuvs::neighbors::index {
  using search_params_type = cagra::search_params;
  using index_type         = IdxT;
  using value_type         = T;
+  using dataset_index_type = int64_t;


I think this is where @achirkin is suggesting not to hard code the type. Ideally we should use a template for this so that it can be propagated outside of this class (and not hardcoded within it).

I think in this case it's fine; this says "the dataset member of the cagra index uses int64_t as the indexing type", so one can argue it belongs to the index, and it's also an implementation detail of the cagra index. Before this, my problem was that it was hardcoded in two different places with no compile-time relation between those (inside the cagra index and in the merge function).

- fix issue: rapidsai#626

Using a pool memory manager was causing crash with different threads. Modified a test to run parallely sometimes. Co-authored-by: Vivek Narang <[email protected]>

rhdong · 2025-02-19T17:13:58Z

Hi @chatman, FYI, I just cherry-picked your comment in the PR #705. Thanks a lot!

copy-pr-bot · 2025-02-19T21:13:40Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

rhdong · 2025-02-19T21:15:31Z

/ok to test

rhdong · 2025-02-19T21:17:26Z

Hi @chatman , FYI, I cherry-picked your PR #706, and squashed all of the commits as required, many thanks! cc @cjnolet

rhdong · 2025-02-20T21:17:29Z

/ok to test

rhdong · 2025-02-22T00:28:37Z

java/internal/src/cuvs_java.c

@@ -272,6 +272,8 @@ void search_brute_force_index(cuvsBruteForceIndex_t index, float *queries, int t

  int64_t *neighbors;
  float *distances, *queries_d;
+  uint32_t *prefilter_d = NULL;


Hi @chatman @cjnolet, here is the pre-filter failure fix, please help to review it before merging. Thanks a lot!

rhdong · 2025-02-22T00:30:18Z

/ok to test

jakirkham · 2025-02-24T21:06:29Z

Seeing this test failure on CI:

__________________________ test_save_load_brute_force __________________________

    def test_save_load_brute_force():
>       run_save_load(brute_force, np.float32)

...
    
>       assert np.all(neighbors == neighbors2)
E       assert np.False_
E        +  where np.False_ = <function all at 0xfffeb6bcb970>(array([[7580,... 5962, 1326]]) == array([[7580,... 5962, 1326]])
E        +    where <function all at 0xfffeb6bcb970> = np.all
E           Full diff:
E             array([[7580, 8781, 6411, ..., 7207, 5741, 8819],
E                    [2243, 3069, 2205, ..., 4467, 1729, 5588],
E                    [7582,  760, 8989, ..., 7618, 9869,  267],
E                    ...,
E                    [6501,  887, 5725, ..., 9650, 2508, 8093],
E                    [2939, 5136, 6714, ..., 6589, 6463, 6416],
E                    [ 266, 2463, 4285, ..., 9844, 5962, 1326]],
E             ))

rhdong · 2025-02-24T21:16:11Z

Seeing this test failure on CI:

__________________________ test_save_load_brute_force __________________________

    def test_save_load_brute_force():
>       run_save_load(brute_force, np.float32)

...
    
>       assert np.all(neighbors == neighbors2)
E       assert np.False_
E        +  where np.False_ = <function all at 0xfffeb6bcb970>(array([[7580,... 5962, 1326]]) == array([[7580,... 5962, 1326]])
E        +    where <function all at 0xfffeb6bcb970> = np.all
E           Full diff:
E             array([[7580, 8781, 6411, ..., 7207, 5741, 8819],
E                    [2243, 3069, 2205, ..., 4467, 1729, 5588],
E                    [7582,  760, 8989, ..., 7618, 9869,  267],
E                    ...,
E                    [6501,  887, 5725, ..., 9650, 2508, 8093],
E                    [2939, 5136, 6714, ..., 6589, 6463, 6416],
E                    [ 266, 2463, 4285, ..., 9844, 5962, 1326]],
E             ))

Hi @jakirkham , please ignore it, and refer to this issue for more detail: #704

jakirkham · 2025-02-24T21:26:24Z

Thanks! 🙏

Corey mentioned the same thing offline and suggested to restart. Have done so

jakirkham · 2025-02-24T21:28:44Z

Have added "fixes"/"closes" notes in the OP. AIUI all of the listed issues/PRs are resolved by this one. That should ensure they close when this PR merges

Please feel free to revise further as needed

@rhdong

#695 introduced a conda dependency on `"sklearn"`. There is no `sklearn`... the package is called `scikit-learn`. This fixes that. ## Notes for Reviewers ### How this fixes CI On #738, @rhdong was facing the following problem. The `conda-cpp-build` jobs were producing packages with version `25.04.00a94`: ```text BUILD START: ['libcuvs-25.04.00a94-cuda11_250303_g2ffe160_94.conda', 'libcuvs-static-25.04.00a94-cuda11_250303_g2ffe160_94.conda', 'libcuvs-examples-25.04.00a94-cuda11_250303_g2ffe160_94.conda', 'libcuvs-tests-25.04.00a94-cuda11_250303_g2ffe160_94.conda'] ``` ([logs link](https://github.com/rapidsai/cuvs/actions/runs/13636442011/job/38116419908?pr=738#step:9:489)) But at test time, they were getting `25.05.00a84`: ```text ... libcuvs 25.04.00a84 cuda11_250224_ga1e0cc0_84 rapidsai-nightly ``` ([logs link](https://github.com/rapidsai/cuvs/actions/runs/13636442011/job/38125502858?pr=738#step:9:4583)) It looks like this `sklearn` dependency was the root cause, as explained in #743 (comment) Authors: - James Lamb (https://github.com/jameslamb) Approvers: - Ben Frederickson (https://github.com/benfred) - Bradley Dice (https://github.com/bdice) URL: #743

@rhdong

rapidsai#695 introduced a conda dependency on `"sklearn"`. There is no `sklearn`... the package is called `scikit-learn`. This fixes that. ## Notes for Reviewers ### How this fixes CI On rapidsai#738, @rhdong was facing the following problem. The `conda-cpp-build` jobs were producing packages with version `25.04.00a94`: ```text BUILD START: ['libcuvs-25.04.00a94-cuda11_250303_g2ffe160_94.conda', 'libcuvs-static-25.04.00a94-cuda11_250303_g2ffe160_94.conda', 'libcuvs-examples-25.04.00a94-cuda11_250303_g2ffe160_94.conda', 'libcuvs-tests-25.04.00a94-cuda11_250303_g2ffe160_94.conda'] ``` ([logs link](https://github.com/rapidsai/cuvs/actions/runs/13636442011/job/38116419908?pr=738#step:9:489)) But at test time, they were getting `25.05.00a84`: ```text ... libcuvs 25.04.00a84 cuda11_250224_ga1e0cc0_84 rapidsai-nightly ``` ([logs link](https://github.com/rapidsai/cuvs/actions/runs/13636442011/job/38125502858?pr=738#step:9:4583)) It looks like this `sklearn` dependency was the root cause, as explained in rapidsai#743 (comment) Authors: - James Lamb (https://github.com/jameslamb) Approvers: - Ben Frederickson (https://github.com/benfred) - Bradley Dice (https://github.com/bdice) URL: rapidsai#743

[Fix] CAGRA 'merge' API compilation error under CUDA 12.4

ac5cc15

- rapidsai#694

rhdong requested a review from cjnolet February 14, 2025 02:27

rhdong requested a review from a team as a code owner February 14, 2025 02:27

github-actions bot added the cpp label Feb 14, 2025

rhdong added breaking Introduces a breaking change bug Something isn't working labels Feb 14, 2025

cjnolet assigned rhdong Feb 14, 2025

cjnolet previously approved these changes Feb 14, 2025

View reviewed changes

achirkin requested changes Feb 14, 2025

View reviewed changes

rhdong requested review from achirkin and cjnolet February 14, 2025 16:35

fix by index::dataset_index_t

8447421

achirkin approved these changes Feb 14, 2025

View reviewed changes

cjnolet reviewed Feb 14, 2025

View reviewed changes

[Fix] Duplicated node issue on Cagra NN-descent

d4366ce

- fix issue: rapidsai#626

rhdong requested a review from cjnolet February 14, 2025 18:22

cjnolet approved these changes Feb 14, 2025

View reviewed changes

rhdong mentioned this pull request Feb 17, 2025

Java: Fix for a crash situation when using different threads (#680) #705

Closed

cjnolet changed the title ~~[Fix] CAGRA 'merge' API compilation error under CUDA 12.4~~ [Fix] Various fixes for 25.02.01 point release Feb 19, 2025

Fix for a crash situation when using different threads (rapidsai#680)

88b018d

Using a pool memory manager was causing crash with different threads. Modified a test to run parallely sometimes. Co-authored-by: Vivek Narang <[email protected]>

Java: Fixing the broken pre-filtering for Brute Force

1b36e8c

cjnolet approved these changes Feb 19, 2025

View reviewed changes

FIX add sklearn as dependency of cuvs-bench

85d8644

dantegd requested a review from a team as a code owner February 20, 2025 21:09

dantegd requested a review from bdice February 20, 2025 21:09

rhdong requested a review from dantegd February 20, 2025 21:17

dantegd mentioned this pull request Feb 21, 2025

Add sklearn as dependency of cuvs-bench #709

Open

[Fix] Brute force with prefilter java test failure

6a36092

rhdong commented Feb 22, 2025

View reviewed changes

raydouglass approved these changes Feb 24, 2025

View reviewed changes

raydouglass merged commit 1591029 into rapidsai:branch-25.02 Feb 24, 2025
64 of 66 checks passed

jameslamb mentioned this pull request Mar 3, 2025

rename 'sklearn' conda dependency to 'scikit-learn' #743

Merged

[Fix] Various fixes for 25.02.01 point release #695

[Fix] Various fixes for 25.02.01 point release #695

Uh oh!

Conversation

rhdong commented Feb 14, 2025 • edited by jakirkham Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

achirkin left a comment

Choose a reason for hiding this comment

Uh oh!

achirkin Feb 14, 2025

Choose a reason for hiding this comment

Uh oh!

achirkin Feb 14, 2025

Choose a reason for hiding this comment

Uh oh!

rhdong Feb 14, 2025

Choose a reason for hiding this comment

Uh oh!

rhdong Feb 14, 2025

Choose a reason for hiding this comment

Uh oh!

rhdong Feb 14, 2025

Choose a reason for hiding this comment

Uh oh!

achirkin left a comment

Choose a reason for hiding this comment

Uh oh!

cjnolet Feb 14, 2025

Choose a reason for hiding this comment

Uh oh!

achirkin Feb 14, 2025

Choose a reason for hiding this comment

Uh oh!

rhdong commented Feb 19, 2025

Uh oh!

copy-pr-bot bot commented Feb 19, 2025

Uh oh!

rhdong commented Feb 19, 2025

Uh oh!

rhdong commented Feb 19, 2025

Uh oh!

rhdong commented Feb 20, 2025

Uh oh!

rhdong Feb 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rhdong commented Feb 22, 2025

Uh oh!

jakirkham commented Feb 24, 2025

Uh oh!

rhdong commented Feb 24, 2025

Uh oh!

jakirkham commented Feb 24, 2025

Uh oh!

jakirkham commented Feb 24, 2025

Uh oh!

Uh oh!

Uh oh!

rhdong commented Feb 14, 2025 •

edited by jakirkham

Loading

rhdong Feb 22, 2025 •

edited

Loading