Add Keep Option Parameter to Distinct #18237
Signed-off-by: Warrick He <[email protected]>
Signed-off-by: Warrick He <[email protected]>
Signed-off-by: Warrick He <[email protected]>
I'll try to review this early tomorrow. I took a glance through the change. I've added the
/ok to test
I'm not sure how to deal with the deprecation here. The updated version should have a default value for every parameter, but this leads to ambiguous function calls (if we only provide the first parameter). As a result, I have removed the defaults for the first couple of parameters in distinct() (the idea being that we can add these defaults back after we remove the deprecated function), but that means the old functions are calling the deprecated version instead. Should I just rename the newer distinct()? That would probably work better, but it doesn't seem like good practice to me.
What old functions? We should change them to call the new function.
@davidwendt So just remove the deprecated function entirely and replace it with the newer version? I'm worried that someone might be using libcudf distinct and doing something like distinct(input,null_equal,nan_equal,stream,mr), which would break with the new set of parameters distinct(input,null_equal,nan_equal,duplicate_keep_option,stream,mr). I suppose I could reorder the parameters so that duplicate_keep_option is at the end, but that doesn't seem like best practice.
Leave the old function, just remove most of the defaults from it. And add defaults to the new function.
Here are the scenarios I see:
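The suggestion above (keep the old overload but strip its defaults, and give the new overload defaults) can be sketched with stand-in types. Everything below is hypothetical: `lists_view`, `stream_view`, and the string return values are placeholders rather than libcudf's real types or signatures, and the sketch removes all defaults from the old overload, whereas the comment says only "most". It shows only why the overload set stops being ambiguous.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the libcudf types; only the overload shapes matter.
enum class null_equality { EQUAL, UNEQUAL };
enum class nan_equality { ALL_EQUAL, UNEQUAL };
enum class duplicate_keep_option { KEEP_ANY, KEEP_FIRST, KEEP_LAST, KEEP_NONE };
struct stream_view {};
struct lists_view {};

// Old overload: kept so existing four-argument calls still compile (with a
// deprecation warning), but with no default arguments, so it can never be
// selected for a shorter call.
[[deprecated("use the overload taking duplicate_keep_option")]]
std::string distinct(lists_view const&, null_equality, nan_equality, stream_view)
{
  return "deprecated";
}

// New overload: every parameter after the input has a default, so calls with
// one to three arguments resolve here unambiguously.
std::string distinct(lists_view const&,
                     null_equality         = null_equality::EQUAL,
                     nan_equality          = nan_equality::ALL_EQUAL,
                     duplicate_keep_option = duplicate_keep_option::KEEP_ANY,
                     stream_view           = {})
{
  return "new";
}
```

With this shape, `distinct(input)` and `distinct(input, nulls, nans)` pick the new overload, while a legacy call that passes a stream as the fourth argument still matches the deprecated one, because a stream cannot convert to `duplicate_keep_option`.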
Signed-off-by: Warrick He <[email protected]>
Signed-off-by: Warrick He <[email protected]>
/ok to test
You'll need to go through the code here and update all the calls to
 * @copydoc cudf::lists::distinct(lists_column_view const&, null_equality, nan_equality,
 * rmm::cuda_stream_view stream, rmm::device_async_resource_ref)
 */
[[deprecated]] std::unique_ptr<column> distinct(lists_column_view const& input,
Good job on changing the call to this function in ColumnViewJni.cpp to get around the deprecation warning/error.
This looks like the build error in rapidsai/rmm#1861 (comment); I'm not sure whether it's related.
/ok to test
Signed-off-by: Warrick He <[email protected]>
Hmm, not sure how that error got past; it built fine and worked on Spark. Updated it.
/ok to test
/ok to test
Approving C++ changes.
Co-authored-by: David Wendt <[email protected]>
Signed-off-by: Warrick He <[email protected]>
/ok to test
Not sure why this failure is occurring. It seems random to me; I don't see how my changes would break anything on that side.
This failure should not block this PR. It looks like you just need one more C++ review, @mythrocks @lamarrr.
/ok to test
/merge
68c0fa4 merged into rapidsai:branch-25.06
Closes #5221

This change will add ArrayDistinct. The implementation uses CUDF's dropListDuplicates. We treat -0.0 and +0.0 as the same. This behavior can differ from Spark: Spark is inconsistent, sometimes treating -0.0 and +0.0 as the same and sometimes as different. See https://issues.apache.org/jira/browse/SPARK-51475 for more info on this.

This change is dependent on a change in cudf: rapidsai/cudf#18237. Without exposing a new duplicate keep option parameter, this code will not work.

- [x] Add GpuArrayDistinct
- [x] Add test cases
- [ ] Performance benchmarking
- [x] Update documentation

---------

Signed-off-by: Warrick He <[email protected]>
Co-authored-by: Gera Shegalov <[email protected]>
Description
cudf::lists::distinct currently uses KEEP_ANY. I would like to expose this duplicate keep option as a parameter, so that I can use another keep option. This is helpful for NVIDIA/spark-rapids#5221, as implementing array_distinct correctly would use KEEP_FIRST. This can also be helpful in the future.
Closes #18238
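To make the request concrete, here is a minimal host-side sketch of what KEEP_FIRST means for a single list row: each value's first occurrence survives, in its original position. This is an illustration only, not the libcudf implementation (which runs on the GPU); `distinct_keep_first` is a hypothetical helper name. As I understand it, KEEP_ANY is free to retain whichever duplicate is convenient, which is presumably why it is not enough for array_distinct.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical host-side helper: returns the row with duplicates removed,
// keeping the first occurrence of each value and preserving input order.
std::vector<int> distinct_keep_first(std::vector<int> const& row)
{
  std::unordered_set<int> seen;
  std::vector<int> out;
  for (int v : row) {
    // insert() reports whether v was newly added; only then is it a first occurrence.
    if (seen.insert(v).second) { out.push_back(v); }
  }
  return out;
}
```

For the row `{3, 1, 3, 2, 1}` this yields `{3, 1, 2}`: the second `3` and second `1` are dropped, and the survivors stay in the order in which they first appeared.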
Checklist