
Expose PSTL algorithms through <cuda/std/algorithm> and <cuda/std/numeric> #7931

Open
miscco wants to merge 3 commits into NVIDIA:main from miscco:expose_pstl

Conversation

@miscco (Contributor) commented Mar 9, 2026

We discussed this internally and are happy with the results of the parallel CUDA backend, so we want to expose it now rather than wait for all algorithms to be implemented.

There are certain caveats:

  • We require random access iterators for the CUDA backend

  • We currently expose only a CUDA backend, through cuda::execution::gpu. The standard execution policies will currently static_assert that their backend is missing

  • We do not provide a fallback serial implementation. That would be dangerous, because the serial implementation would naively run on the host rather than the device.
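The caveats above can be illustrated with a usage sketch. The policy construction follows the snippet under review in this PR (cuda::execution::gpu, .with_memory_resource, .with_stream); the cuda::std::transform call taking that policy, the header names, and the extended-lambda use are assumptions here, not confirmed API:

```cuda
// Sketch only: policy construction is taken from this PR's diff; the
// algorithm invocation and headers are hypothetical and may differ
// in the merged version.
#include <cuda/std/algorithm>

void double_values(int* first, int* last, int* out)
{
  cuda::stream stream{cuda::device_ref{0}};
  cuda::device_memory_pool_ref mr = cuda::device_default_memory_pool(stream.device());
  const auto policy = cuda::execution::gpu.with_memory_resource(mr).with_stream(stream);

  // Pointers satisfy the random-access-iterator requirement of the
  // CUDA backend noted in the caveats above.
  cuda::std::transform(policy, first, last, out,
                       [] __device__(int x) { return 2 * x; });
}
```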

@miscco miscco requested review from a team as code owners March 9, 2026 10:49
@miscco miscco requested review from jrhemstad and shwina March 9, 2026 10:49
@github-project-automation github-project-automation bot moved this to Todo in CCCL Mar 9, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 9, 2026

cuda::stream stream{cuda::device_ref{0}};
cuda::device_memory_pool_ref device_resource = cuda::device_default_memory_pool(stream.device());
const auto policy = cuda::execution::__cub_par_unseq.with_memory_resource(device_resource).with_stream(stream);
const auto policy = cuda::execution::gpu.with_memory_resource(device_resource).with_stream(stream);
Contributor
Remark: looking at this line, it might read a bit better as:

Suggested change:
- const auto policy = cuda::execution::gpu.with_memory_resource(device_resource).with_stream(stream);
+ const auto policy = cuda::execution::gpu.with_memory_resource(device_resource).on_stream(stream);

But I guess nobody wants to do another bulk rename?

Contributor Author

There might be another rename coming, but not that one.


Comment on lines +134 to +141
// parallel algorithms
#if _CCCL_HAS_PSTL_BACKEND()
# include <cuda/std/__pstl/adjacent_find.h>
# include <cuda/std/__pstl/all_of.h>
# include <cuda/std/__pstl/any_of.h>
# include <cuda/std/__pstl/copy.h>
# include <cuda/std/__pstl/copy_if.h>
# include <cuda/std/__pstl/copy_n.h>
Contributor

Q: I thought many standard libraries expose the PSTL algorithms through the <execution> header rather than <algorithm>, which keeps the inclusion of <algorithm> cheaper.

Contributor

Discussed this with @miscco offline; it seems the C++ standard requires the overloads to be declared in <algorithm>. However, this may not be observable to the common user, since they need to include <execution> anyway to get an execution policy.

Collaborator

If it's not observable, then I would prefer to expose it in the <execution> header to avoid bloating <algorithm>.

Contributor Author

I do not believe that is a correct statement.

<execution> can include it all and be fine, but then <algorithm> would not have it.

The point is that the PSTL headers effectively pull in all of <algorithm>.

Contributor

> <execution> can include it all and be fine, but then <algorithm> would not have it.

What is the advantage of <algorithm> having an overload that cannot be called unless the user also includes <execution>?

> The point is that the PSTL headers effectively pull in all of <algorithm>.

This is fine IMO; including a PSTL header is allowed to be more expensive.
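For reference, the standard C++ arrangement being debated above: the parallel overloads are declared in <algorithm>, but the policy objects needed to call them live in <execution>, so a caller ends up including both. A minimal standard (non-CUDA) illustration; note that building this with libstdc++ may additionally require TBB:

```cpp
#include <algorithm>  // declares the ExecutionPolicy overloads of std::sort
#include <execution>  // defines std::execution::par and friends
#include <vector>

int main()
{
  std::vector<int> v{3, 1, 2};
  // Without <execution> there is no std::execution::par to pass,
  // even though the overload itself is declared in <algorithm>.
  std::sort(std::execution::par, v.begin(), v.end());
  return v.front() == 1 ? 0 : 1;
}
```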


@bernhardmgruber (Contributor) commented:

@miscco could you please measure the compile-time of

#include <cuda/std/algorithm>
int main() {
  return cuda::std::min(0, 2);
}

before and after this PR? I would be curious how much of an impact pulling in most of CUB has ;)
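One way the requested measurement could be carried out (a sketch; nvcc on PATH and the CCCL include paths are assumed, and the exact flags may need adjusting):

```shell
# Write the translation unit from the comment above.
cat > min_bench.cpp <<'EOF'
#include <cuda/std/algorithm>
int main() {
  return cuda::std::min(0, 2);
}
EOF

# Time one compilation before and after checking out this PR.
time nvcc -std=c++17 -c min_bench.cpp -o /dev/null

# Rough proxy for header weight: count preprocessed lines.
nvcc -std=c++17 -E min_bench.cpp | wc -l
```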

@github-actions commented:

🥳 CI Workflow Results

🟩 Finished in 4h 24m: Pass: 100%/156 | Total: 7d 07h | Max: 4h 23m | Hits: 62%/369717

See results here.


Labels

None yet

Projects

Status: In Review


4 participants