Move kernel compilation to build time #1001

@mkeshavaNV

Summary

slangpy generates and compiles a wrapper kernel module per call site at
runtime, in every process, with no first-class way to precompile that
work and ship it as a build artifact.

For any application with more than a handful of distinct call sites, each
new Python process pays seconds-to-minutes of slang→PTX codegen + CUDA
driver JIT before the first dispatch returns. Subsequent dispatches in
the same process are fast (in-memory pipeline_cache hit), but every new
process restarts cold.

slangtorch — the older C++-extension-based binding — solves the same
problem by moving all compilation to build time. There is no equivalent
flow in slangpy.

slangtorch's build pipeline:

   .slang source
       │
       ▼ slangc -target torch-binding   (build time)
       │
       ▼ generates a torch C++ extension source (.cpp / .cu)
       │
       ▼ nvcc + g++                     (build time)
       │
       ▼
   lib<name>_slang_cc.so   (built once, on disk, shipped as build artifact)

At runtime:

import importlib
m = importlib.import_module("path.to.libfoo_slang_cc")        # dlopen + symbol lookup
m.my_kernel(...).launchRaw(blockSize=..., gridSize=...)       # straight CUDA dispatch

First-call cost: a fraction of a millisecond — just dlopen and the
CUDA driver loading PTX into a module. The driver's PTX→SASS JIT is
itself cached in $CUDA_CACHE_PATH (default $HOME/.nv/ComputeCache),
so it's amortised across processes.

The Python slangtorch package additionally provides
slangtorch.loadModule(...), which runs slangc + nvcc + ninja once
and caches the resulting .so on disk; subsequent loadModule calls
reuse the existing build. Same model — compile once, persist, reload
fast.
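
The "compile once, persist, reload fast" model can be sketched in a few lines of plain Python. This is only an illustration of the caching pattern, not slangtorch's actual implementation: `cached_build`, `fake_compile`, and the `.bin` artifact name are hypothetical, and the compiler is replaced by a stand-in function so the sketch runs without a GPU toolchain.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_build(source: str, cache_dir: Path, compile_fn) -> Path:
    """Return the artifact path for `source`, compiling only on a cache miss."""
    # Key the artifact on the source text alone, so the key is identical
    # in every process that sees the same source.
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    artifact = cache_dir / f"{key}.bin"
    if not artifact.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        # Write under a temporary name first so an interrupted build never
        # leaves a half-written file under the final name.
        tmp = artifact.with_suffix(".tmp")
        tmp.write_bytes(compile_fn(source))
        tmp.replace(artifact)
    return artifact


calls = []


def fake_compile(source: str) -> bytes:
    # Hypothetical stand-in for the expensive slangc + nvcc step.
    calls.append(source)
    return source.upper().encode("utf-8")


cache_root = Path(tempfile.mkdtemp())
first = cached_build("kernel body", cache_root, fake_compile)
second = cached_build("kernel body", cache_root, fake_compile)
print(len(calls))  # the stand-in compiler ran once for two requests
```

The second request finds the artifact on disk and skips the compile step entirely, which is the property the slangpy flow described below lacks across processes.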

What slangpy does today

slangpy's dispatch model:

   module.my_kernel(args)
       │
       ▼ slangpy Python layer synthesises a wrapper slang source per call site
       │
       ▼ session.load_module_from_source(hash, source)   (slang frontend)
       │
       ▼ session.link_program([wrapper, kernel_module])  (generic specialization)
       │
       ▼ device.create_compute_pipeline(program)         (slang → PTX)
       │
       ▼ cuModuleLoadData(ptx)                           (driver PTX → SASS JIT)
       │
       ▼ kernel dispatch

Every step from "wrapper source synthesis" through "PTX codegen" runs
on the first dispatch, in every process. None of it is currently
cacheable across processes through any first-class API:

  • The Python wrapper synthesis runs every process (cheap, ~ms).
  • The slang frontend (load_module_from_source) runs every process — the
    input is the freshly-generated wrapper string, which slang has to parse
    from scratch.
  • link_program runs every process.
  • device.create_compute_pipeline → slang → PTX runs every process
    unless Device.shader_cache_path is set and the path strings slang
    stores in IR are stable across processes — typically not the case under
    sandboxed tests, build systems with per-invocation paths, or any setup
    that doesn't pin slang sources at a fixed absolute path.
  • cuModuleLoadData → driver PTX→SASS JIT runs every process unless
    CUDA_CACHE_PATH points at a writable directory.
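
The two opt-in caches named in the last two bullets can be set up along the following lines. This is a minimal sketch under stated assumptions: `configure_persistent_caches` and both directory names are hypothetical, and how the shader cache directory is handed to slangpy (the issue only names `Device.shader_cache_path`) is left as a comment rather than guessed.

```python
import os
import tempfile
from pathlib import Path


def configure_persistent_caches(root: Path) -> dict:
    """Create cache directories under `root` and export CUDA_CACHE_PATH."""
    shader_cache = root / "slang-shader-cache"  # hypothetical directory name
    cuda_cache = root / "cuda-compute-cache"    # hypothetical directory name
    shader_cache.mkdir(parents=True, exist_ok=True)
    cuda_cache.mkdir(parents=True, exist_ok=True)
    # The CUDA driver reads CUDA_CACHE_PATH from the environment, so it
    # should be set before anything in this process initialises CUDA.
    os.environ["CUDA_CACHE_PATH"] = str(cuda_cache)
    return {"shader_cache_path": shader_cache, "cuda_cache_path": cuda_cache}


# Demo root only: a real setup would use a fixed, writable location that
# survives across processes, not a fresh temporary directory.
paths = configure_persistent_caches(Path(tempfile.mkdtemp()))
# Assumption: paths["shader_cache_path"] is then supplied as the device's
# shader_cache_path when the slangpy device is created.
print(os.environ["CUDA_CACHE_PATH"])
```

Even with both caches enabled, wrapper synthesis, the slang frontend, and link_program still run in every process, and the shader cache only hits when the source paths are stable.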

The dominant cost for typical workloads is the slang→PTX codegen for
specialised generic templates. Indicative numbers from a workload with
~30 distinct call sites (mix of plain entry points and heavy generic
specialisations) on an A40:

   call site type                                       per-wrapper first-call cost (cold cache)
   simple kernel (no generics)                          ~3 s
   medium generic                                       ~5 s
   heavy generic specialisation (e.g. Foo<3, 45, 5>)    25–75 s

Aggregate first-process startup cost in this workload: ~7 minutes
before the first iteration of training begins. The same workload with
the equivalent kernels built via slangtorch_library pays near-zero
startup cost.

Solution

A first-class build-time compilation flow analogous to
slangtorch_library. Any of the following shapes would be sufficient:
