Skip to content

Device function calls: implement BIR_CALL for user __device__ functions #101

@Zaneham

Description

@Zaneham

Right now there's no codegen for calls to user-defined __device__ functions. BIR_CALL exists in the IR but isn't lowered for user functions, so anything non-trivial has to be inlined by hand into one big __global__. Moa's transport kernel is currently around 640 lines for exactly this reason. This issue tracks fixing that.

The thing that's actually missing is small. A call site doesn't need the callee's body, it just needs its contract: the param types, the return type, and the calling convention. We have nowhere to put that contract today, which is the whole reason we can't emit a call. The model I want to copy is JOVIAL's COMPOOL, which Skyhawk already implements over in src/cpl/. COMPOOL decouples a symbol's declaration from its definition through a symbol pool, so a call site resolves against the declaration while the body lives elsewhere. That decoupling is the entire game here.

The first step is an in-memory function declaration pool. We intern every __device__ function as a declaration record holding its name, param types, return type and ABI flags, independent of whether its body has been lowered yet, and call sites reference the callee by index. That's the half of COMPOOL we actually need right now. The other half, the on-disk format that lets two separately compiled modules agree on every byte, only matters once we want multiple translation units, which is still a known limitation, so it can wait. Same design, we just add the file format later when separate compilation lands.

From there the lowering splits two ways but shares the one pool. On the GPU targets the simplest correct thing is to inline at the BIR level before isel, which matches how GPUs treat device functions anyway since calls cost real registers and scratch, and it just automates the hand-inlining we do today. On CPU and RV64 we can emit real calls instead, which is nearly free because those backends already have a SysV call ABI. Because BIR is target-independent the declaration pool lives at the BIR level and every backend benefits, and starting with GPU inlining means we get correctness without having to touch the GPU call ABI (s_swappc_b64 on AMD, call/ret in PTX) until it's actually needed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions