Conversation
This comment has been minimized.
This comment has been minimized.
I think that's the wrong approach. We should emit an NVTX range at the first API entry into CUB. Later CUB algorithms should not emit NVTX ranges. |
cub/cub/device/device_copy.cuh
Outdated
| typename EnvT = ::cuda::std::execution::env<>, | ||
| ::cuda::std::enable_if_t<!::cuda::std::is_same_v<InputIt, void*>, int> = 0> |
There was a problem hiding this comment.
Suggestion: We do not need a constraint here, since Batched here has 4-5 parameters, and the old overload has to 6-7, so they can never be ambiguous.
| typename EnvT = ::cuda::std::execution::env<>, | |
| ::cuda::std::enable_if_t<!::cuda::std::is_same_v<InputIt, void*>, int> = 0> | |
| typename EnvT = ::cuda::std::execution::env<>> |
| auto in_end = in_start + mdspan_in.mapping().required_span_size(); | ||
| auto out_start = mdspan_out.data_handle(); | ||
| auto out_end = out_start + mdspan_out.mapping().required_span_size(); | ||
| _CCCL_ASSERT(!(in_end >= out_start && out_end >= in_start), "mdspan memory ranges must not overlap"); |
There was a problem hiding this comment.
Please retain the comment
| _CCCL_ASSERT(!(in_end >= out_start && out_end >= in_start), "mdspan memory ranges must not overlap"); | |
| // TODO(fbusato): replace with __are_ptrs_overlapping | |
| _CCCL_ASSERT(!(in_end >= out_start && out_end >= in_start), "mdspan memory ranges must not overlap"); |
cub/cub/device/device_copy.cuh
Outdated
| { | ||
| // no nvtx range because Copy delegates to Transform/ForEachInExtents which emit their own NVTX ranges |
There was a problem hiding this comment.
Critical: Please emit an NVTX range here.
There was a problem hiding this comment.
how? comment explains why i can't you explained it right underneath
| mdspan_in.size(), | ||
| ::cuda::proclaim_copyable_arguments(::cuda::std::identity{}), | ||
| stream); | ||
| env); |
There was a problem hiding this comment.
Suggestion: to avoid cub::DeviceTransform emitting an NVTX range here, maybe just call TransformInternal instead. If you rename TransformInternal yo __trasnform_internal you can make it public.
| @@ -66,11 +67,11 @@ copy(::cuda::std::mdspan<T_In, E_In, L_In, A_In> mdspan_in, | |||
| mdspan_out.data_handle(), | |||
| mdspan_in.size(), | |||
| ::cuda::proclaim_copyable_arguments(::cuda::std::identity{}), | |||
There was a problem hiding this comment.
Suggestion: You can remove proclaim_copyable_arguments as well, since it's not needed.
| // TODO (fbusato): add ForEachInLayout when mdspan_in and mdspan_out have compatible layouts | ||
| // Compatible layouts could use more efficient iteration patterns | ||
| return cub::DeviceFor::ForEachInExtents(mdspan_in.extents(), copy_mdspan_t{mdspan_in, mdspan_out}, stream); | ||
| return cub::DeviceFor::ForEachInExtents(mdspan_in.extents(), copy_mdspan_t{mdspan_in, mdspan_out}, env); |
There was a problem hiding this comment.
Suggestion: to avoid this API to emit an NVTX range, we have two options:
- We add an internal version of
ForEachInExtentsthat does not emit the range - We add a secret parameter to the
envthat is checked inForEachInExtentsand suppresses the NVTX range.
- Rename TransformInternal to __transform_internal and make it public - Add DeviceFor::__for_each_in_extents_internal (no NVTX, bypasses Bulk) - Update dispatch_copy_mdspan to call internal APIs instead of public ones - Remove unnecessary proclaim_copyable_arguments wrapper - Add _CCCL_NVTX_RANGE_SCOPE to DeviceCopy::Copy
😬 CI Workflow Results🟥 Finished in 1h 27m: Pass: 14%/249 | Total: 3d 07h | Max: 1h 10m | Hits: 86%/42649See results here. |
fixes #7539
This one is a bit peculiar.
DeviceCopy::Batchedusesdispatch_with_envlike other CUB env overloads — it has its own kernel dispatch viaDispatchBatchMemcpy.DeviceCopy::Copydoes not launch its own kernels, but it rather delegates entirely toDeviceTransform::TransformandDeviceFor::ForEachInExtents, both of which already accept an env. The env is forwarded directly throughdetail::copy_mdspan::copy, which is generalized fromcudaStream_tto a templatedEnvTparameter.Because Copy delegates to APIs that already emit NVTX ranges, the Copy env overload does not emit its own
_CCCL_NVTX_RANGE_SCOPEto avoid nested NVTX range violations.mdspan Copy env tests are only in
_env_api.cu, notin _env.cu. The_env.cufile definesstream_registry_factory_tas theCUB_DETAIL_DEFAULT_KERNEL_LAUNCHER_FACTORY, which intercepts kernel launches to verify env usage. Since mdspan Copy does not launch its own kernels, the factory intercepts at the wrong level (insideTransform/ForEachInExtents), causing nvcc compilation failures. The _env_api.cu tests (which don't use the factory) provide sufficient coverage for the mdspan Copy env path.