Implement cccl-rt kernel launch patterns example #5892
davebayer wants to merge 1 commit into NVIDIA:main
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
0883fc1 to 3ce97db
```cmake
target_compile_options(${example_target} PRIVATE
  $<$<COMPILE_LANG_AND_ID:CUDA,NVIDIA>:--extended-lambda>
)
target_compile_definitions(${example_target} PRIVATE
  "LIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE"
)
```
Shouldn't these already be part of the cudax global target?
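A rough sketch of what this comment is pointing at: if every cudax example needs these flags, they could be declared once on a shared interface target instead of being repeated per example. The target name `cudax.examples.flags` below is invented for illustration; the repository's actual global target may be named and wired differently.

```cmake
# Hypothetical sketch: declare the common flags once on an interface target
# (cudax.examples.flags is a made-up name for illustration) ...
add_library(cudax.examples.flags INTERFACE)
target_compile_options(cudax.examples.flags INTERFACE
  $<$<COMPILE_LANG_AND_ID:CUDA,NVIDIA>:--extended-lambda>
)
target_compile_definitions(cudax.examples.flags INTERFACE
  LIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE
)

# ... so each example only needs to link the shared target:
target_link_libraries(${example_target} PRIVATE cudax.examples.flags)
```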
```cpp
template <cuda::std::size_t N>
name_buffer(const char (&str)[N])
```
Nitpick: Should this rather be a constructor from a cuda::std::span?
```cpp
};

#if defined(__CUDACC_EXTENDED_LAMBDA__)
// Kernel lambda is another form of the kernel functor. It can optionally take the kernel_config as the first argument.
```
Unfortunately, due to extended lambda restrictions, it has to take the configuration as the first argument. We have no way to inspect the lambda signature to check whether it does, so I decided to require the configuration to be passed.
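An untested sketch of what that requirement looks like at a call site, using the `cudax::make_config` and `cudax::launch` names that appear elsewhere in this PR; the hierarchy construction is an assumption for illustration.

```cuda
#include <cuda/experimental/launch.cuh>

namespace cudax = cuda::experimental;

// Because cudax::launch cannot inspect an extended lambda's signature, the
// lambda must always declare the configuration as its first parameter, even
// when its body never touches it.
void launch_increment(cudax::stream_ref stream, int* data)
{
  // Hierarchy/config construction assumed, following the rest of this PR.
  auto config = cudax::make_config(
    cudax::make_hierarchy(cudax::grid_dims(1), cudax::block_dims<256>()));

  cudax::launch(stream,
                config,
                [] __device__(auto config, int* d) {
                  d[threadIdx.x] += 1; // config is required even if unused
                },
                data);
}
```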
```cpp
// The main advantage of using kernel config instead of the ordinary kernel parameters is that the config can carry
// statically defined extents. That means it is easier to generate kernels specialized for certain block sizes.
static_assert(
```
suggestion: I'd love to see an example that uses C++20 concepts/requires clause to statically constrain the operator() to a specific block size. I think this is where the ability to statically specify and query grid extents really shines.
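One possible shape for that suggestion, sketched with an invented compile-time query (`Config::static_block_size` is a made-up name standing in for whatever mechanism the real `kernel_config` provides for reading static extents):

```cuda
// Hypothetical sketch: a requires clause so the functor can only be launched
// with a configuration whose block size is statically known to be 256.
struct tile_kernel
{
  template <typename Config>
    requires (Config::static_block_size == 256) // invented query, see above
  __device__ void operator()(Config config, const int* in, int* out) const
  {
    // Safe because the constraint guarantees the block size at compile time.
    __shared__ int tile[256];
    tile[threadIdx.x] = in[threadIdx.x];
    __syncthreads();
    out[threadIdx.x] = tile[255 - threadIdx.x];
  }
};
```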
```cpp
// Launch a kernel functor that takes a cudax::kernel_config. Note that the kernel config is passed automatically as
// the first argument by the cudax::launch function.
const auto config =
  cudax::make_config(hierarchy, cudax::dynamic_shared_memory<kernel_functor_with_config::dynamic_smem_layout>());
cudax::launch(stream, config, kernel_functor_with_config{}, name_buffer{"kernel functor with config"});
```
suggestion: It's not clear what the purpose of the cudax::dynamic_shared_memory config option is. I think it needs more explanation both here and in the kernel itself.
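For reference, a rough, untested sketch of how the option and the `cudax::dynamic_smem_ref` call mentioned later in this PR appear intended to pair up; the layout type and kernel body are assumptions for illustration only.

```cuda
// Sketch: dynamic_shared_memory<T> records the element type in the config's
// type, so device code can recover a correctly typed view of the dynamic
// shared memory window instead of writing an extern __shared__ declaration.
struct kernel_functor_with_config
{
  using dynamic_smem_layout = int[256]; // assumed layout for illustration

  template <typename Config>
  __device__ void operator()(Config config) const
  {
    // Typed reference into dynamic shared memory, derived from the config.
    auto& smem = cudax::dynamic_smem_ref(config);
    smem[threadIdx.x] = static_cast<int>(threadIdx.x);
  }
};
```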
```cpp
#include <stdexcept>

#include <cuda.h>
```
suggestion: Similar comment to the other examples, add a block summarizing what this example is for and what it does.
```cpp
// Launch a kernel functor that takes a cudax::kernel_config. Note that the kernel config is passed automatically as
// the first argument by the cudax::launch function.
//
// The kernel functor requires dynamic memory, so we need to create the kernel configuration with dynamic shared
// memory option. The config remembers the type passed inside the option and makes it the return type of the
// cudax::dynamic_smem_ref(config) call inside the device code. See demo_dynamic_shared_memory for more information.
```
suggestion: Let's fork off the dynamic_shared_memory to a separate example where we can dive into the nuance required here. To properly motivate this, we first need to explain why you can't just write code like this:

```cuda
template <typename T>
__global__ void foo(T i)
{
  extern __shared__ T dyn_shmem[];
}
```
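For context on why that snippet fails: every `extern __shared__` declaration in a translation unit refers to the same buffer and must agree, so instantiating the template above for two different `T`s redeclares `dyn_shmem` with conflicting types. The well-known workaround (independent of cudax) is one untyped buffer plus a cast:

```cuda
// Common workaround: declare the extern array once with a byte type that is
// identical across all instantiations, then reinterpret it per element type.
// (For over-aligned T, the buffer's alignment would also need attention.)
template <typename T>
__global__ void foo(T i)
{
  extern __shared__ unsigned char smem_raw[];
  T* dyn_shmem = reinterpret_cast<T*>(smem_raw);
  dyn_shmem[threadIdx.x] = i;
}
```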
```cpp
@@ -0,0 +1,254 @@
//===----------------------------------------------------------------------===//
```
suggestion: On second thought, to make these examples easier to digest, I would suggest breaking this up into separate examples, one for each type of kernel launch. Otherwise, for first-time readers, it can be overwhelming and confusing to understand which parts are relevant.
I'd suggest adding a cccl-rt/kernel_launch/ directory and adding the separate example .cu files there.
```cpp
namespace cudax = cuda::experimental;

// A helper type for storing the kernel launch pattern name.
struct name_buffer
```
suggestion: I get the motivation for this, but I believe that in an example, any extra noise should be eliminated as much as possible to avoid distracting from the thing we are trying to teach. I would suggest getting rid of the name_buffer thing.
I extracted this to a common.cuh header.
suggestion: For the launch pattern examples, I think users will appreciate opinionated guidance we can give them on which option is "best". For example, we'd want to explain to people that writing their kernel using the …
I would add two more cases, one that shows how to use the …
😬 CI Workflow Results
🟥 Finished in 20m 09s: Pass: 11%/9 | Total: 27m 56s | Max: 4m 15s
```
@@ -0,0 +1,3 @@
# Kernel Launch Patterns

This example showcases how kernels and kernel functors can be launched using the `cuda::launch` function.
```
suggestion: It would help to enumerate and link to the different examples and provide a succinct summary of each example.
Closes #5707.