Skip to content

Add Keep Option Parameter to Distinct #18237

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Mar 25, 2025

Conversation

warrickhe
Copy link
Contributor

@warrickhe warrickhe commented Mar 11, 2025

Description

cudf::lists::distinct currently uses KEEP_ANY. I would like to expose this duplicate keep option as a parameter, so that I can use another keep option. This is helpful for NVIDIA/spark-rapids#5221, as implementing array_distinct correctly would use KEEP_FIRST. This can also be helpful in the future.

Closes #18238

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link

copy-pr-bot bot commented Mar 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Java Affects Java cuDF API. labels Mar 11, 2025
Signed-off-by: Warrick He <[email protected]>
Signed-off-by: Warrick He <[email protected]>
@warrickhe warrickhe marked this pull request as ready for review March 12, 2025 22:18
@warrickhe warrickhe requested review from a team as code owners March 12, 2025 22:18
@warrickhe warrickhe requested review from lamarrr and davidwendt March 12, 2025 22:18
@mythrocks mythrocks self-requested a review March 12, 2025 22:31
@mythrocks mythrocks added feature request New feature or request non-breaking Non-breaking change labels Mar 12, 2025
@mythrocks
Copy link
Contributor

I'll try review this early tomorrow. I took a glance through the change.

I've added the Feature and Non-breaking labels to it, so as to pass CI checks.

@mythrocks
Copy link
Contributor

/ok to test

@warrickhe
Copy link
Contributor Author

Not sure how to deal with the deprecation here. The updated version should have everything with a default value, but this leads to ambiguous function calls (if we only provide first parameter). As a result, I have removed defaults for the first couple parameters in distinct() (idea that we can add these defaults back after we remove the deprecated function), but that means that old functions are calling the deprecated version instead. Should I just rename the the newer distinct()? This would probably work better, but doesn't seem like good practice to me.

@davidwendt
Copy link
Contributor

What old functions? We should change them to call the new function.
I think we can move the defaults to the new function and remove them from the old.
Would that work?

@warrickhe
Copy link
Contributor Author

@davidwendt So just remove the deprecated function entirely and replace it with the newer version? I'm worried that someone might be using libcudf distinct and doing something like distinct(input,null_equal,nan_equal,stream,mr) which would break with the new set of parameters distinct(input,null_equal,nan_equal,stream,mr). I suppose I can reorder the parameters so as to have duplicate_keep_option at the end but that doesn't seem like best practice.

@davidwendt
Copy link
Contributor

Leave the old function, just remove most of the defaults from it. And add defaults to the new function.

column distinct(input, null_eq, nan_eq, stream, mr=default);
column distinct(input, null_eq=default, nan_eq=default, keep=default, stream=default, mr=default)

Here are the scenarios I see:

  1. Calls distinct(input); automatically gets the new function
  2. Calls distinct(input, null_eq, nan_eq); automatically gets the new function
  3. Calls distinct(input, null_eq, nan_eq, stream); gets the old function
  4. Calls distinct(input, null_eq, nan_eq, stream, mr); gets the old function

Signed-off-by: Warrick He <[email protected]>
Signed-off-by: Warrick He <[email protected]>
@davidwendt
Copy link
Contributor

/ok to test

@davidwendt
Copy link
Contributor

You'll need to go through the code here and update all the calls to cudf::lists::detail::distinct to call the new API otherwise the compiler will issue deprecation warnings and warnings are turned into errors. Any calls to cudf::lists::distinct will likely need to be updated as well.

* @copydoc cudf::lists::distinct(lists_column_view const&, null_equality, nan_equality,
* rmm::cuda_stream_view stream, rmm::device_async_resource_ref)
*/
[[deprecated]] std::unique_ptr<column> distinct(lists_column_view const& input,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job on changing the call to this function in ColumnViewJni.cpp, to get around the deprecation warning/error.

@warrickhe
Copy link
Contributor Author

warrickhe commented Mar 17, 2025

Looks like this build error: rapidsai/rmm#1861 (comment), not too sure if it's related.
Could I get a rerun on this? Thanks!

@davidwendt
Copy link
Contributor

/ok to test

@davidwendt
Copy link
Contributor

@davidwendt davidwendt removed their assignment Mar 17, 2025
@warrickhe
Copy link
Contributor Author

Hmm, not sure how that error got past, it built fine and worked on spark. Updated it.

@davidwendt
Copy link
Contributor

/ok to test

@davidwendt
Copy link
Contributor

/ok to test

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving C++ changes.

Co-authored-by: David Wendt <[email protected]>
@davidwendt davidwendt removed their assignment Mar 19, 2025
@warrickhe warrickhe changed the base branch from branch-25.04 to branch-25.06 March 20, 2025 17:27
@davidwendt
Copy link
Contributor

/ok to test

@warrickhe
Copy link
Contributor Author

Not sure why this failure is occurring? Seems kind of random to me, not sure how my changes would break anything on that side of things
https://github.com/rapidsai/cudf/actions/runs/13976729741/job/39134173941?pr=18237

@davidwendt
Copy link
Contributor

Not sure why this failure is occurring? Seems kind of random to me, not sure how my changes would break anything on that side of things https://github.com/rapidsai/cudf/actions/runs/13976729741/job/39134173941?pr=18237

This failure should not block this PR. Looks like you just need one more C++ review @mythrocks @lamarrr

@davidwendt
Copy link
Contributor

/ok to test

@davidwendt
Copy link
Contributor

/merge

@rapids-bot rapids-bot bot merged commit 68c0fa4 into rapidsai:branch-25.06 Mar 25, 2025
111 of 112 checks passed
sameerz pushed a commit to NVIDIA/spark-rapids that referenced this pull request Apr 10, 2025
Closes #5221 
This change will add ArrayDistinct. The implementation uses CUDF's
dropListDuplicates. We treat -0.0 and +0.0 as the same. This behavior
can differ from Spark. Spark is inconsistent in behavior, and sometime
treats -0.0 and +0.0 the same, and sometime differently. See
https://issues.apache.org/jira/browse/SPARK-51475 for more info on this.

This change is dependent on a change in cudf:
rapidsai/cudf#18237
Without exposing a new duplicate keep option parameter, this code will
not work.

- [x] Add GpuArrayDistinct
- [x] Add test cases
- [ ] Performance benchmarking
- [x] Update documentation

---------

Signed-off-by: Warrick He <[email protected]>
Co-authored-by: Gera Shegalov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Would like to expose Duplicate Keep Option in Distinct
6 participants