Skip to content

[Fix] Various fixes for 25.02.01 point release #695

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Feb 24, 2025

Conversation

@rhdong rhdong requested a review from cjnolet February 14, 2025 02:27
@rhdong rhdong requested a review from a team as a code owner February 14, 2025 02:27
@github-actions github-actions bot added the cpp label Feb 14, 2025
@rhdong rhdong added breaking Introduces a breaking change bug Something isn't working labels Feb 14, 2025
cjnolet
cjnolet previously approved these changes Feb 14, 2025
@cjnolet cjnolet dismissed their stale review February 14, 2025 12:17

Would prefer devtech approval

Copy link
Contributor

@achirkin achirkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error and the solution look puzzling to me. Do you have an explanation for why this happens? The static compiler errors like this are not even related to CUDA version at all.

@@ -50,8 +50,7 @@ index<T, IdxT> merge(raft::resources const& handle,
for (auto index : indices) {
RAFT_EXPECTS(index != nullptr,
"Null pointer detected in 'indices'. Ensure all elements are valid before usage.");
using ds_idx_type = decltype(index->data().n_rows());
if (auto* strided_dset = dynamic_cast<const strided_dataset<T, ds_idx_type>*>(&index->data());
if (auto* strided_dset = dynamic_cast<const strided_dataset<T, int64_t>*>(&index->data());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoding the dataset index type like this is prone to error: if we decide to change it (let say to IdxT) in the index header at some point, it will compile and break the code unnoticed - the cast will simply fail.
If you really can't use the decltype as above for some reason, perhaps a better way would be to define a new alias in the index definition (e.g. using dataset_index_type = int64_t) and then make the dataset member use that type in the template parameter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also try to just rename the index variable here. Maybe it confuses the compiler because of the same template type name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also try to just rename the index variable here. Maybe it confuses the compiler because of the same template type name.

Hi @achirkin @cjnolet, I've tried it, but it still fails.

/cuvs/cpp/src/neighbors/detail/cagra/cagra_merge.cuh:56:21: error: '__T23' does not name a type
   56 |     using ds_idx_type = decltype(l_index->data().n_rows());

9357f8b

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoding the dataset index type like this is prone to error: if we decide to change it (let say to IdxT) in the index header at some point, it will compile and break the code unnoticed - the cast will simply fail. If you really can't use the decltype as above for some reason, perhaps a better way would be to define a new alias in the index definition (e.g. using dataset_index_type = int64_t) and then make the dataset member use that type in the template parameter.

Yh, this works: 70e3036
And, if you @achirkin @cjnolet all believe this is the satisfied solution, I will commit it on this branch(now it is in my personal branch.) Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @cjnolet @achirkin , I committed the dataset_index_t solution for time-saving,

@rhdong rhdong requested review from achirkin and cjnolet February 14, 2025 16:35
Copy link
Contributor

@achirkin achirkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -337,6 +337,8 @@ struct index : cuvs::neighbors::index {
using search_params_type = cagra::search_params;
using index_type = IdxT;
using value_type = T;
using dataset_index_type = int64_t;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is where @achirkin is suggesting not to hard code the type. Ideally we should use a template for this so that it can be propagated outside of this class (and not hardcoded within it).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this case it's fine; this says "the dataset member of the cagra index uses int64_t as the indexing type", so one can argue it belongs to the index, and it's also an implementation detail of the cagra index. Before this, my problem was that it was hardcoded in two different places with no compile-time relation between those (inside the cagra index and in the merge function).

@rhdong rhdong requested a review from cjnolet February 14, 2025 18:22
@cjnolet cjnolet changed the title [Fix] CAGRA 'merge' API compilation error under CUDA 12.4 [Fix] Various fixes for 25.02.01 point release Feb 19, 2025
Using a pool memory manager was causing crash with different threads. Modified a test to run parallely sometimes.

Co-authored-by: Vivek Narang <[email protected]>
@rhdong
Copy link
Member Author

rhdong commented Feb 19, 2025

Hi @chatman, FYI, I just cherry-picked your comment in the PR #705. Thanks a lot!

Copy link

copy-pr-bot bot commented Feb 19, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@rhdong
Copy link
Member Author

rhdong commented Feb 19, 2025

/ok to test

@rhdong
Copy link
Member Author

rhdong commented Feb 19, 2025

Hi @chatman , FYI, I cherry-picked your PR #706, and squashed all of the commits as required, many thanks! cc @cjnolet

@dantegd dantegd requested a review from a team as a code owner February 20, 2025 21:09
@dantegd dantegd requested a review from bdice February 20, 2025 21:09
@rhdong rhdong requested a review from dantegd February 20, 2025 21:17
@rhdong
Copy link
Member Author

rhdong commented Feb 20, 2025

/ok to test

@@ -272,6 +272,8 @@ void search_brute_force_index(cuvsBruteForceIndex_t index, float *queries, int t

int64_t *neighbors;
float *distances, *queries_d;
uint32_t *prefilter_d = NULL;
Copy link
Member Author

@rhdong rhdong Feb 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @chatman @cjnolet, here is the pre-filter failure fix, please help to review it before merging. Thanks a lot!

@rhdong
Copy link
Member Author

rhdong commented Feb 22, 2025

/ok to test

@jakirkham
Copy link
Member

Seeing this test failure on CI:

__________________________ test_save_load_brute_force __________________________

    def test_save_load_brute_force():
>       run_save_load(brute_force, np.float32)

...
    
>       assert np.all(neighbors == neighbors2)
E       assert np.False_
E        +  where np.False_ = <function all at 0xfffeb6bcb970>(array([[7580,... 5962, 1326]]) == array([[7580,... 5962, 1326]])
E        +    where <function all at 0xfffeb6bcb970> = np.all
E           Full diff:
E             array([[7580, 8781, 6411, ..., 7207, 5741, 8819],
E                    [2243, 3069, 2205, ..., 4467, 1729, 5588],
E                    [7582,  760, 8989, ..., 7618, 9869,  267],
E                    ...,
E                    [6501,  887, 5725, ..., 9650, 2508, 8093],
E                    [2939, 5136, 6714, ..., 6589, 6463, 6416],
E                    [ 266, 2463, 4285, ..., 9844, 5962, 1326]],
E             ))

@rhdong
Copy link
Member Author

rhdong commented Feb 24, 2025

Seeing this test failure on CI:

__________________________ test_save_load_brute_force __________________________

    def test_save_load_brute_force():
>       run_save_load(brute_force, np.float32)

...
    
>       assert np.all(neighbors == neighbors2)
E       assert np.False_
E        +  where np.False_ = <function all at 0xfffeb6bcb970>(array([[7580,... 5962, 1326]]) == array([[7580,... 5962, 1326]])
E        +    where <function all at 0xfffeb6bcb970> = np.all
E           Full diff:
E             array([[7580, 8781, 6411, ..., 7207, 5741, 8819],
E                    [2243, 3069, 2205, ..., 4467, 1729, 5588],
E                    [7582,  760, 8989, ..., 7618, 9869,  267],
E                    ...,
E                    [6501,  887, 5725, ..., 9650, 2508, 8093],
E                    [2939, 5136, 6714, ..., 6589, 6463, 6416],
E                    [ 266, 2463, 4285, ..., 9844, 5962, 1326]],
E             ))

Hi @jakirkham , please ignore it, and refer to this issue for more detail: #704

@jakirkham
Copy link
Member

Thanks! 🙏

Corey mentioned the same thing offline and suggested to restart. Have done so

@jakirkham
Copy link
Member

Have added "fixes"/"closes" notes in the OP. AIUI all of the listed issues/PRs are resolved by this one. That should ensure they close when this PR merges

Please feel free to revise further as needed

@raydouglass raydouglass merged commit 1591029 into rapidsai:branch-25.02 Feb 24, 2025
64 of 66 checks passed
rapids-bot bot pushed a commit that referenced this pull request Mar 4, 2025
#695 introduced a conda dependency on `"sklearn"`. There is no `sklearn`... the package is called `scikit-learn`.

This fixes that.

## Notes for Reviewers

### How this fixes CI

On #738, @rhdong was facing the following problem. The `conda-cpp-build` jobs were producing packages with version `25.04.00a94`:

```text
BUILD START: ['libcuvs-25.04.00a94-cuda11_250303_g2ffe160_94.conda',
'libcuvs-static-25.04.00a94-cuda11_250303_g2ffe160_94.conda',
'libcuvs-examples-25.04.00a94-cuda11_250303_g2ffe160_94.conda',
'libcuvs-tests-25.04.00a94-cuda11_250303_g2ffe160_94.conda']
```

([logs link](https://github.com/rapidsai/cuvs/actions/runs/13636442011/job/38116419908?pr=738#step:9:489))

But at test time, they were getting `25.05.00a84`:

```text
...
libcuvs                   25.04.00a84     cuda11_250224_ga1e0cc0_84    rapidsai-nightly
```

([logs link](https://github.com/rapidsai/cuvs/actions/runs/13636442011/job/38125502858?pr=738#step:9:4583))

It looks like this `sklearn` dependency was the root cause, as explained in #743 (comment)

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Ben Frederickson (https://github.com/benfred)
  - Bradley Dice (https://github.com/bdice)

URL: #743
jiangyinzuo pushed a commit to jiangyinzuo/cuvs that referenced this pull request Mar 27, 2025
rapidsai#695 introduced a conda dependency on `"sklearn"`. There is no `sklearn`... the package is called `scikit-learn`.

This fixes that.

## Notes for Reviewers

### How this fixes CI

On rapidsai#738, @rhdong was facing the following problem. The `conda-cpp-build` jobs were producing packages with version `25.04.00a94`:

```text
BUILD START: ['libcuvs-25.04.00a94-cuda11_250303_g2ffe160_94.conda',
'libcuvs-static-25.04.00a94-cuda11_250303_g2ffe160_94.conda',
'libcuvs-examples-25.04.00a94-cuda11_250303_g2ffe160_94.conda',
'libcuvs-tests-25.04.00a94-cuda11_250303_g2ffe160_94.conda']
```

([logs link](https://github.com/rapidsai/cuvs/actions/runs/13636442011/job/38116419908?pr=738#step:9:489))

But at test time, they were getting `25.05.00a84`:

```text
...
libcuvs                   25.04.00a84     cuda11_250224_ga1e0cc0_84    rapidsai-nightly
```

([logs link](https://github.com/rapidsai/cuvs/actions/runs/13636442011/job/38125502858?pr=738#step:9:4583))

It looks like this `sklearn` dependency was the root cause, as explained in rapidsai#743 (comment)

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Ben Frederickson (https://github.com/benfred)
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#743
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Introduces a breaking change bug Something isn't working cpp
Projects
Development

Successfully merging this pull request may close these issues.

6 participants