Device function calls: implement BIR_CALL for user __device__ functions

Right now there's no codegen for calls to user-defined `__device__` functions. BIR_CALL exists in the IR but isn't lowered for user functions, so anything non-trivial has to be inlined by hand into one big `__global__`. Moa's transport kernel is currently around 640 lines for exactly this reason. This issue tracks fixing that.

The thing that's actually missing is small. A call site doesn't need the callee's body, it just needs its contract: the param types, the return type, and the calling convention. We have nowhere to put that contract today, which is the whole reason we can't emit a call. The model I want to copy is JOVIAL's COMPOOL, which Skyhawk already implements over in `src/cpl/`. COMPOOL decouples a symbol's declaration from its definition through a symbol pool, so a call site resolves against the declaration while the body lives elsewhere. That decoupling is the entire game here.

The first step is an in-memory function declaration pool. We intern every `__device__` function as a declaration record holding its name, param types, return type and ABI flags, independent of whether its body has been lowered yet, and call sites reference the callee by index. That's the half of COMPOOL we actually need right now. The other half, the on-disk format that lets two separately compiled modules agree on every byte, only matters once we want multiple translation units, which is still a known limitation, so it can wait. Same design, we just add the file format later when separate compilation lands.

From there the lowering splits two ways but shares the one pool. On the GPU targets the simplest correct thing is to inline at the BIR level before isel, which matches how GPUs treat device functions anyway since calls cost real registers and scratch, and it just automates the hand-inlining we do today. On CPU and RV64 we can emit real calls instead, which is nearly free because those backends already have a SysV call ABI. Because BIR is target-independent the declaration pool lives at the BIR level and every backend benefits, and starting with GPU inlining means we get correctness without having to touch the GPU call ABI (s_swappc_b64 on AMD, call/ret in PTX) until it's actually needed.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Device function calls: implement BIR_CALL for user device functions #101

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Device function calls: implement BIR_CALL for user __device__ functions #101

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Device function calls: implement BIR_CALL for user device functions #101