[Fix] Various fixes for 25.02.01 point release #695
The error and the solution look puzzling to me. Do you have an explanation for why this happens? Static compiler errors like this are not related to the CUDA version at all.
@@ -50,8 +50,7 @@ index<T, IdxT> merge(raft::resources const& handle,
   for (auto index : indices) {
     RAFT_EXPECTS(index != nullptr,
                  "Null pointer detected in 'indices'. Ensure all elements are valid before usage.");
-    using ds_idx_type = decltype(index->data().n_rows());
-    if (auto* strided_dset = dynamic_cast<const strided_dataset<T, ds_idx_type>*>(&index->data());
+    if (auto* strided_dset = dynamic_cast<const strided_dataset<T, int64_t>*>(&index->data());
Hardcoding the dataset index type like this is error-prone: if we decide to change it (say, to `IdxT`) in the index header at some point, the code will still compile but break unnoticed, because the cast will simply fail.
If you really can't use the decltype as above for some reason, perhaps a better way would be to define a new alias in the index definition (e.g. `using dataset_index_type = int64_t`) and then make the dataset member use that type in the template parameter.
You can also try to just rename the `index` variable here. Maybe it confuses the compiler because of the same template type name.
Hi @achirkin @cjnolet, I've tried it, but it still fails.
/cuvs/cpp/src/neighbors/detail/cagra/cagra_merge.cuh:56:21: error: '__T23' does not name a type
56 | using ds_idx_type = decltype(l_index->data().n_rows());
> Hardcoding the dataset index type like this is prone to error: if we decide to change it (let's say to `IdxT`) in the index header at some point, it will compile and break the code unnoticed - the cast will simply fail. If you really can't use the decltype as above for some reason, perhaps a better way would be to define a new alias in the index definition (e.g. `using dataset_index_type = int64_t`) and then make the dataset member use that type in the template parameter.
Yes, this works: 70e3036
If you both (@achirkin @cjnolet) think this is a satisfactory solution, I will commit it to this branch (it is currently on my personal branch). Thanks!
LGTM
@@ -337,6 +337,8 @@ struct index : cuvs::neighbors::index {
   using search_params_type = cagra::search_params;
   using index_type = IdxT;
   using value_type = T;
+  using dataset_index_type = int64_t;
I think this is where @achirkin is suggesting not to hard code the type. Ideally we should use a template for this so that it can be propagated outside of this class (and not hardcoded within it).
I think in this case it's fine; this says "the dataset member of the cagra index uses int64_t as the indexing type", so one can argue it belongs to the index, and it's also an implementation detail of the cagra index. Before this, my problem was that it was hardcoded in two different places with no compile-time relation between those (inside the cagra index and in the merge function).
Using a pool memory manager was causing a crash with different threads. Modified a test to sometimes run in parallel. Co-authored-by: Vivek Narang <[email protected]>
/ok to test
/ok to test
@@ -272,6 +272,8 @@ void search_brute_force_index(cuvsBruteForceIndex_t index, float *queries, int t
   int64_t *neighbors;
   float *distances, *queries_d;
+  uint32_t *prefilter_d = NULL;
/ok to test
Seeing this test failure on CI:
__________________________ test_save_load_brute_force __________________________
def test_save_load_brute_force():
> run_save_load(brute_force, np.float32)
...
> assert np.all(neighbors == neighbors2)
E assert np.False_
E + where np.False_ = <function all at 0xfffeb6bcb970>(array([[7580,... 5962, 1326]]) == array([[7580,... 5962, 1326]])
E + where <function all at 0xfffeb6bcb970> = np.all
E Full diff:
E array([[7580, 8781, 6411, ..., 7207, 5741, 8819],
E [2243, 3069, 2205, ..., 4467, 1729, 5588],
E [7582, 760, 8989, ..., 7618, 9869, 267],
E ...,
E [6501, 887, 5725, ..., 9650, 2508, 8093],
E [2939, 5136, 6714, ..., 6589, 6463, 6416],
E [ 266, 2463, 4285, ..., 9844, 5962, 1326]],
E ))
Hi @jakirkham, please ignore it, and refer to this issue for more detail: #704
Thanks! 🙏 Corey mentioned the same thing offline and suggested restarting. Have done so.
Have added "fixes"/"closes" notes in the OP. AIUI all of the listed issues/PRs are resolved by this one. That should ensure they close when this PR merges. Please feel free to revise further as needed.
#695 introduced a conda dependency on `"sklearn"`. There is no `sklearn`... the package is called `scikit-learn`. This fixes that.

## Notes for Reviewers

### How this fixes CI

On #738, @rhdong was facing the following problem.

The `conda-cpp-build` jobs were producing packages with version `25.04.00a94`:

```text
BUILD START: ['libcuvs-25.04.00a94-cuda11_250303_g2ffe160_94.conda', 'libcuvs-static-25.04.00a94-cuda11_250303_g2ffe160_94.conda', 'libcuvs-examples-25.04.00a94-cuda11_250303_g2ffe160_94.conda', 'libcuvs-tests-25.04.00a94-cuda11_250303_g2ffe160_94.conda']
```

([logs link](https://github.com/rapidsai/cuvs/actions/runs/13636442011/job/38116419908?pr=738#step:9:489))

But at test time, they were getting `25.04.00a84`:

```text
...
libcuvs 25.04.00a84 cuda11_250224_ga1e0cc0_84 rapidsai-nightly
```

([logs link](https://github.com/rapidsai/cuvs/actions/runs/13636442011/job/38125502858?pr=738#step:9:4583))

It looks like this `sklearn` dependency was the root cause, as explained in #743 (comment)

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Ben Frederickson (https://github.com/benfred)
- Bradley Dice (https://github.com/bdice)

URL: #743
Fixes #694
Fixes #626
Fixes #680
Closes #706