Expose PSTL algorithms through `<cuda/std/algorithm>` and `<cuda/std/numeric>` #7931

miscco wants to merge 3 commits into NVIDIA:main

Conversation
```diff
  cuda::stream stream{cuda::device_ref{0}};
  cuda::device_memory_pool_ref device_resource = cuda::device_default_memory_pool(stream.device());
- const auto policy = cuda::execution::__cub_par_unseq.with_memory_resource(device_resource).with_stream(stream);
+ const auto policy = cuda::execution::gpu.with_memory_resource(device_resource).with_stream(stream);
```
Remark: looking at this line, it may sound a bit better as:
```diff
- const auto policy = cuda::execution::gpu.with_memory_resource(device_resource).with_stream(stream);
+ const auto policy = cuda::execution::gpu.with_memory_resource(device_resource).on_stream(stream);
```
But I guess nobody wants to do another bulk rename?
There might be another rename coming, but not that one.
```cpp
// parallel algorithms
#if _CCCL_HAS_PSTL_BACKEND()
#  include <cuda/std/__pstl/adjacent_find.h>
#  include <cuda/std/__pstl/all_of.h>
#  include <cuda/std/__pstl/any_of.h>
#  include <cuda/std/__pstl/copy.h>
#  include <cuda/std/__pstl/copy_if.h>
#  include <cuda/std/__pstl/copy_n.h>
```
Q: I thought many standard libraries would expose PSTL algorithms through the <execution> header and not <algorithm>. This would make the inclusion of <algorithm> cheaper.
Discussed this with @miscco offline and it seems the C++ standard requires the overloads to be in <algorithm>. However, it may not be observable to the common user, since they need to include <execution> anyway to supply an execution policy.
If it's not observable, then I would prefer exposing it in the <execution> header to avoid bloating <algorithm>.
I do not believe that is a correct statement.
<execution> can include it all and be fine, but then <algorithm> would not have it.
The point is that the pstl headers effectively pull in all of <algorithm>.
> <execution> can include it all and be fine, but then <algorithm> would not have it.

What is the advantage of <algorithm> having an overload that cannot be called if a user does not also include <execution>?

> The point is that the pstl headers effectively pull in all of <algorithm>.

This is fine IMO; including a PSTL header can be more expensive.
@miscco could you please measure the compile time of

```cpp
#include <cuda/std/algorithm>

int main() {
  return cuda::std::min(0, 2);
}
```

before and after this PR? I would be curious how much of an impact pulling in most of CUB has ;)
🥳 CI Workflow Results 🟩 Finished in 4h 24m: Pass: 100%/156 | Total: 7d 07h | Max: 4h 23m | Hits: 62%/369717
We discussed this internally and we are happy with the results of the parallel CUDA backend. So we want to expose this now rather than waiting for all algorithms to be implemented.
There are certain caveats:

- We require random access iterators for the CUDA backend.
- We only expose a CUDA backend, through `cuda::execution::gpu`. Standard execution policies will currently `static_assert` that there is a missing backend.
- We do not provide any fallback serial implementation. This would be dangerous, because the serial implementation would naively run on host and not on device.