
Implement cccl-rt kernel launch patterns example#5892

Open
davebayer wants to merge 1 commit into NVIDIA:main from davebayer:cudax_ex1

Conversation

@davebayer
Contributor

@davebayer davebayer commented Sep 16, 2025

Closes #5707.

@copy-pr-bot
Contributor

copy-pr-bot bot commented Sep 16, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Sep 16, 2025
@davebayer davebayer force-pushed the cudax_ex1 branch 3 times, most recently from 0883fc1 to 3ce97db on September 19, 2025 16:21
Contributor

@miscco miscco left a comment


Looks great 🎉

Comment on lines +23 to +28
target_compile_options(${example_target} PRIVATE
$<$<COMPILE_LANG_AND_ID:CUDA,NVIDIA>:--extended-lambda>
)
target_compile_definitions(${example_target} PRIVATE
"LIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE"
)
Contributor


Shouldn't those already be part of the cudax global target?

Comment on lines +39 to +40
template <cuda::std::size_t N>
name_buffer(const char (&str)[N])
Contributor


Nitpick: Should this rather be a constructor from a cuda::std::span?

@github-project-automation github-project-automation bot moved this from In Progress to In Review in CCCL Sep 22, 2025
};

#if defined(__CUDACC_EXTENDED_LAMBDA__)
// Kernel lambda is another form of the kernel functor. It can optionally take the kernel_config as the first argument.
Contributor


Unfortunately, due to extended lambda restrictions, it has to take the configuration as the first argument. We have no way to inspect the lambda signature to check whether it does, so I decided to require the configuration to be passed.
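For illustration, the restriction described here might look like the following. This is a sketch, not code from the PR: it assumes `cudax::make_config` and `cudax::launch` as used elsewhere in this example, plus `make_hierarchy`/`grid_dims`/`block_dims` and `dims.rank` from the cudax hierarchy utilities:

```cuda
#include <cstdio>

#include <cuda/experimental/launch.cuh>

namespace cudax = cuda::experimental;

// Sketch of the restriction: an extended __device__ lambda must list the
// config as its first parameter, because cudax::launch cannot inspect a
// lambda's signature to decide whether to prepend it.
void launch_lambda_sketch(cuda::stream_ref stream)
{
  auto dims   = cudax::make_hierarchy(cudax::grid_dims(4), cudax::block_dims<256>());
  auto config = cudax::make_config(dims);

  cudax::launch(stream, config, [] __device__ (auto config, int value) {
    // config carries the (possibly statically known) extents of the launch
    if (config.dims.rank(cudax::thread, cudax::grid) == 0)
    {
      printf("value = %d\n", value);
    }
  }, 42);
}
```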


// The main advantage of using kernel config instead of the ordinary kernel parameters is that the config can carry
// statically defined extents. That means it is easier to generate kernels specialized for certain block sizes.
static_assert(
Collaborator


suggestion: I'd love to see an example that uses C++20 concepts/requires clause to statically constrain the operator() to a specific block size. I think this is where the ability to statically specify and query grid extents really shines.

Comment on lines +183 to +187
// Launch a kernel functor that takes a cudax::kernel_config. Note that the kernel config is passed automatically as
// the first argument by the cudax::launch function.
const auto config =
cudax::make_config(hierarchy, cudax::dynamic_shared_memory<kernel_functor_with_config::dynamic_smem_layout>());
cudax::launch(stream, config, kernel_functor_with_config{}, name_buffer{"kernel functor with config"});
Collaborator


suggestion: It's not clear what the purpose of the cudax::dynamic_shared_memory config option is. I think it needs more explanation both here and in the kernel itself.

#include <stdexcept>

#include <cuda.h>

Collaborator


suggestion: Similar comment to the other examples, add a block summarizing what this example is for and what it does.

Comment on lines +208 to +213
// Launch a kernel functor that takes a cudax::kernel_config. Note that the kernel config is passed automatically as
// the first argument by the cudax::launch function.
//
// The kernel functor requires dynamic memory, so we need to create the kernel configuration with dynamic shared
// memory option. The config remembers the type passed inside the option and makes it the return type of the
// cudax::dynamic_smem_ref(config) call inside the device code. See demo_dynamic_shared_memory for more information.
Collaborator


suggestion: Let's fork off the dynamic_shared_memory to a separate example where we can dive into the nuance required here. To properly motivate this, we first need to explain why you can't just write code like this:

    template <typename T>
    __global__ void foo(T i)
    {
        extern __shared__ T dyn_shmem[];
    }
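For context on why the naive version is rejected: all extern __shared__ declarations refer to the same underlying storage, so instantiating foo<int> and foo<double> would redeclare the same symbol with conflicting types and nvcc refuses to compile it. The classic workaround (a sketch, not the cudax mechanism) funnels every instantiation through a single raw byte buffer:

```cuda
// Hypothetical helper (not part of cudax): hand back dynamic shared memory
// as a T*. The extern declaration always uses unsigned char, so every
// instantiation refers to the same consistently typed symbol; the start of
// dynamic shared memory is suitably aligned for any built-in type.
template <typename T>
__device__ T* dynamic_smem()
{
  extern __shared__ unsigned char raw_smem[];
  return reinterpret_cast<T*>(raw_smem);
}

template <typename T>
__global__ void foo(T value)
{
  T* dyn_shmem = dynamic_smem<T>();
  dyn_shmem[threadIdx.x] = value;
}
```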

@@ -0,0 +1,254 @@
//===----------------------------------------------------------------------===//
Collaborator


suggestion: On second thought, to make these examples easier to digest, I would suggest breaking this up into separate examples, one for each type of kernel launch. Otherwise for first time readers, it can be overwhelming and confusing to understand what parts are relevant.

Collaborator

@jrhemstad jrhemstad Oct 29, 2025


I'd suggest adding a cccl-rt/kernel_launch/ directory and add the separate example .cu files there.

namespace cudax = cuda::experimental;

// A helper type for storing the kernel launch pattern name.
struct name_buffer
Collaborator


suggestion: I get the motivation for this, but I believe in an example any extra noise should be eliminated as much as possible to avoid distracting from the thing we are trying to teach. I would suggest getting rid of the name_buffer thing.

Contributor Author


I extracted this to a common.cuh header.

@jrhemstad
Collaborator

suggestion: For the launch pattern examples, I think users will appreciate opinionated guidance on which option is "best". For example, we'd want to explain that writing the kernel to take the config as its first argument is the best option, and what benefits that gives them.

@pciolkosz
Contributor

I would add two more cases, one that shows how to use the dynamic_shared_memory() option and one with default config attached to the kernel functor

@davebayer davebayer marked this pull request as ready for review March 5, 2026 18:30
@davebayer davebayer requested a review from a team as a code owner March 5, 2026 18:30
@github-actions
Contributor

github-actions bot commented Mar 5, 2026

😬 CI Workflow Results

🟥 Finished in 20m 09s: Pass: 11%/9 | Total: 27m 56s | Max: 4m 15s

See results here.

@@ -0,0 +1,3 @@
# Kernel Launch Patterns

This example showcases how kernels and kernel functors can be launched using the `cuda::launch` function.
Collaborator


suggestion: It would help to enumerate and link to the different examples and provide a succinct summary of each example.

@bernhardmgruber bernhardmgruber changed the title Implement cccl-rt kernel launch paterns example Implement cccl-rt kernel launch patterns example Mar 10, 2026


Development

Successfully merging this pull request may close these issues.

[FEA]: Examples of Different Kernel Launch Patterns in cccl-runtime
