[CIR][CUDA] Generate device stubs #1332
Conversation
clang/lib/CIR/CodeGen/CIRGenCUDA.cpp (outdated)
// Now emit the call to cudaLaunchKernel
// cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim, dim3 blockDim,
//                              void **args, size_t sharedMem,
//                              cudaStream_t stream);
Seems like we could have a `... = cir.cuda.setup_device_stub <name>, args, dim3_ty = <some_type>, stream_ty = <some_type2> ...` that would hide both the `__cudaPopCallConfiguration` and `cudaLaunchKernel` calls. It would then be expanded into those calls in LoweringPrepare (so we don't have to postpone this to LLVMLowering). However, I'd rather see you add this as-is first (after the other comment about OG is addressed); in a follow-up PR we can raise the representation and move it to LoweringPrepare.
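For reference, a rough host-side sketch of the pair of calls such an op would hide — a minimal sketch assuming the usual CUDA runtime signatures; `device_stub_sketch` and `kernelFn` are illustrative names, not anything in the patch:

```cpp
#include <cuda_runtime.h>

// Declaration of the internal runtime entry point; the exact signature
// is an assumption based on the CUDA runtime headers.
extern "C" cudaError_t __cudaPopCallConfiguration(dim3 *gridDim, dim3 *blockDim,
                                                  size_t *sharedMem, void *stream);

// What a device stub for `__global__ void kernel(int a, int b)` boils
// down to: pop the configuration pushed at the launch site, then forward
// everything to cudaLaunchKernel. `kernelFn` stands for the address the
// runtime associates with the kernel.
void device_stub_sketch(const void *kernelFn, int a, int b) {
  void *args[2] = {&a, &b};  // argument array handed to the launch call
  dim3 grid, block;
  size_t shmem;
  cudaStream_t stream;
  __cudaPopCallConfiguration(&grid, &block, &shmem, &stream);
  cudaLaunchKernel(kernelFn, grid, block, args, shmem, stream);
}
```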
Similar things also happen at the call site. Shall we also generate a `cir.cuda.call_kernel` for that and expand it in LoweringPrepare?
Which call site do you mean? Not the call happening in the device stub? If not in the device stub, doesn't it call `__cudaPopCallConfiguration` to retrieve the dims? Perhaps we should do a bit more direct CIRGen first, to get a better grasp of how these internal functions are used before we raise them.
I mean the place where we invoke the kernel: for example, in `main` we can write `global_fn<<<1, 1>>>(a, b, c)`. This is where we call `__cudaPushCallConfiguration` for the device stub to pop. I guess I'll generate those calls directly for now and adjust according to review.
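Roughly, the host side of that launch expands as below — a hedged sketch assuming the usual runtime signatures; `global_fn_stub` and `launch_site_sketch` are made-up names standing in for the generated stub and the enclosing function:

```cpp
#include <cuda_runtime.h>

// Internal entry point pushed at the launch site; treat the exact
// signature as an assumption based on the CUDA runtime headers.
extern "C" unsigned __cudaPushCallConfiguration(dim3 gridDim, dim3 blockDim,
                                                size_t sharedMem,
                                                cudaStream_t stream);

void global_fn_stub(int a, int b, int c);  // the device stub for global_fn

// What `global_fn<<<1, 1>>>(a, b, c)` becomes on the host: push the
// configuration, and call the stub only if the push succeeded (returned
// zero). The stub then pops this configuration and launches the kernel.
void launch_site_sketch(int a, int b, int c) {
  if (__cudaPushCallConfiguration(dim3(1), dim3(1), 0, nullptr) == 0)
    global_fn_stub(a, b, c);
}
```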
✅ With the latest revision this PR passed the C/C++ code formatter.
This PR deals with several issues currently present in CUDA CodeGen. Each of them requires only a few lines to fix, so they're combined in a single PR.

**Bug 1.** Suppose we write

```cpp
__global__ void kernel(int a, int b);
```

Then when we call this kernel with `cudaLaunchKernel`, the 4th argument to that function is something of the form `void *kernel_args[2] = {&a, &b}`. OG allocates the space for it with `alloca ptr, i32 2`, but that doesn't seem to be feasible in CIR, so we allocate `alloca [2 x ptr], i32 1` instead. This means there must be an extra GEP compared to OG. In CIR, it means we must add an `array_to_ptrdecay` cast before trying to access the array elements. I missed that in #1332.

**Bug 2.** We missed a load instruction for the 6th argument to `cudaLaunchKernel`. It's added back in this PR.

**Bug 3.** When we launch a kernel, we first retrieve the return value of `__cudaPopCallConfiguration`. If it's zero, the call succeeded and we should proceed to call the device stub. In #1348 we did exactly the opposite, calling the device stub only if it's not zero. That's fixed here.

**Issue 4.** CallConvLowering is required to make `cudaLaunchKernel` correct. The codepath is unblocked by adding a `getIndirectResult` at the same place as OG does -- the function is already implemented, so we can just call it.

After this (and other pending PRs), CIR is now able to compile real CUDA programs. There are still missing features, which will be followed up later.
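To make Bug 1 concrete, here is a C++-level analogy of the two allocation strategies — illustrative only; `build_args_sketch` is a made-up name and this is source-level pseudocode for the IR, not the emitted code:

```cpp
// OG's `alloca ptr, i32 2` already yields a pointer to the first slot,
// so slot i is reached directly. CIR's `alloca [2 x ptr]` yields an
// array object, which must first decay to a pointer to its first
// element (the extra array_to_ptrdecay cast, i.e. one more GEP) before
// the elements can be addressed.
void build_args_sketch(int a, int b) {
  void *kernel_args[2];        // CIR: a single alloca of type [2 x ptr]
  void **slots = kernel_args;  // the array_to_ptrdecay step
  slots[0] = &a;               // element accesses now match OG's layout
  slots[1] = &b;
}
```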
Now we're able to generate device stubs. A simple explanation:

1. We first store the function arguments inside a `void *args[]`, which shall be passed into `cudaLaunchKernel`.
2. Then we retrieve the configuration using `__cudaPopCallConfiguration`, popping the config pushed by the call site. (We can't generate calls to kernels currently.)
3. Now we have enough arguments. Invoke `cudaLaunchKernel` and we're OK.
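For completeness, a minimal CUDA program that exercises all three steps end to end; this is a hypothetical example, not a test from the PR:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// The <<<...>>> launch below pushes a configuration; the device stub
// generated for `add` pops it and forwards the packed argument array
// to cudaLaunchKernel.
__global__ void add(int *out, int a, int b) { *out = a + b; }

int main() {
  int *out = nullptr;
  cudaMallocManaged(&out, sizeof(int));
  add<<<1, 1>>>(out, 1, 2);  // lowered through the device stub
  cudaDeviceSynchronize();
  printf("%d\n", *out);      // expect 3
  cudaFree(out);
  return 0;
}
```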