Summary
slangpy generates and compiles a wrapper kernel module per call site at
runtime, in every process, with no first-class way to precompile that
work and ship it as a build artifact.
For any application with more than a handful of distinct call sites, each
new Python process pays seconds-to-minutes of slang→PTX codegen + CUDA
driver JIT before the first dispatch returns. Subsequent dispatches in
the same process are fast (in-memory pipeline_cache hit), but every new
process restarts cold.
slangtorch — the older C++-extension-based binding — solves the same
problem by moving all compilation to build time. There is no equivalent
flow in slangpy.
slangtorch's build pipeline:
.slang source
│
▼ slangc -target torch-binding (build time)
│
▼ generates a torch C++ extension source (.cpp / .cu)
│
▼ nvcc + g++ (build time)
│
▼
lib<name>_slang_cc.so (built once, on disk, shipped as build artifact)
At runtime:
import importlib
m = importlib.import_module("path.to.libfoo_slang_cc") # dlopen + symbol lookup
m.my_kernel(...).launchRaw(blockSize=..., gridSize=...) # straight CUDA dispatch
First-call cost: a fraction of a millisecond — just dlopen and the
CUDA driver loading PTX into a module. The driver's PTX→SASS JIT is
itself cached in $CUDA_CACHE_PATH (default $HOME/.nv/ComputeCache),
so it's amortised across processes.
The Python slangtorch package additionally provides
slangtorch.loadModule(...), which runs slangc + nvcc + ninja once
and caches the resulting .so on disk; subsequent loadModule calls
reuse the existing build. Same model — compile once, persist, reload
fast.
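The "compile once, persist, reload fast" model can be sketched as a content-addressed artifact cache. This is only an illustration of the idea, not slangtorch's actual implementation: `cached_build`, `fake_build` and the `.bin` suffix are hypothetical names, and the real build step (slangc + nvcc + ninja) is replaced by a stand-in.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_build(source: str, cache_dir: Path, build: Callable[[str], bytes]) -> Path:
    """Return the artifact path for `source`, building it only on a cache miss."""
    # Key the artifact on the source text so any edit triggers a rebuild.
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    artifact = cache_dir / f"{key}.bin"
    if not artifact.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        # Write under a temporary name, then rename, so a reader never
        # observes a half-written artifact.
        tmp = artifact.with_name(artifact.name + ".tmp")
        tmp.write_bytes(build(source))
        tmp.replace(artifact)
    return artifact


# Hypothetical stand-in for the expensive build step; records each invocation.
build_calls = []


def fake_build(source: str) -> bytes:
    build_calls.append(source)
    return b"compiled:" + source.encode("utf-8")


cache_root = Path(tempfile.mkdtemp())
first = cached_build("kernel source", cache_root, fake_build)
second = cached_build("kernel source", cache_root, fake_build)  # cache hit, no rebuild
```

Because the key depends only on the source text, a second process pointed at the same cache directory would also skip the build, which is the property slangpy's per-call-site wrappers currently lack.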
What slangpy does today
slangpy's dispatch model:
module.my_kernel(args)
│
▼ slangpy Python layer synthesises a wrapper slang source per call site
│
▼ session.load_module_from_source(hash, source) (slang frontend)
│
▼ session.link_program([wrapper, kernel_module]) (generic specialization)
│
▼ device.create_compute_pipeline(program) (slang → PTX)
│
▼ cuModuleLoadData(ptx) (driver PTX → SASS JIT)
│
▼ kernel dispatch
Every step from "wrapper source synthesis" through "PTX codegen" runs
on the first dispatch, in every process. None of it is currently
cacheable across processes through any first-class API:
- The Python wrapper synthesis runs every process (cheap, ~ms).
- The slang frontend (load_module_from_source) runs every process: the
  input is the freshly generated wrapper string, which slang has to parse
  from scratch.
- link_program runs every process.
- device.create_compute_pipeline → slang → PTX runs every process
  unless Device.shader_cache_path is set and the path strings slang
  stores in IR are stable across processes, which is typically not the
  case under sandboxed tests, build systems with per-invocation paths, or
  any setup that doesn't pin slang sources at a fixed absolute path.
- cuModuleLoadData → driver PTX→SASS JIT runs every process unless
  CUDA_CACHE_PATH points at a writable directory.
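For the last item, a minimal sketch of pointing CUDA_CACHE_PATH at a writable directory from Python. The helper name `use_cuda_cache_dir` is hypothetical, and the sketch assumes the variable is set before anything in the process initialises the CUDA driver. It only addresses the driver PTX→SASS step; the slang→PTX codegen above it is unaffected.

```python
import os
import tempfile
from pathlib import Path


def use_cuda_cache_dir(path: Path) -> Path:
    """Create `path` if needed, check it is writable, and export CUDA_CACHE_PATH."""
    path.mkdir(parents=True, exist_ok=True)
    if not os.access(path, os.W_OK):
        raise PermissionError(f"CUDA cache directory is not writable: {path}")
    # Assumption: the driver reads this variable when it is initialised, so
    # this must run before the first CUDA call in the process.
    os.environ["CUDA_CACHE_PATH"] = str(path)
    return path


# A temporary directory is used here only so the example runs anywhere;
# a real setup would use a directory that persists across processes.
cache_dir = use_cuda_cache_dir(Path(tempfile.mkdtemp()) / "ComputeCache")
```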
The dominant cost for typical workloads is the slang→PTX codegen for
specialised generic templates. Indicative numbers from a workload with
~30 distinct call sites (mix of plain entry points and heavy generic
specialisations) on an A40:
| call site type | per-wrapper first-call cost (cold cache) |
|---|---|
| simple kernel (no generics) | ~3 s |
| medium generic | ~5 s |
| heavy generic specialisation (e.g. Foo<3, 45, 5>) | 25–75 s |
Aggregate first-process startup cost in this workload: ~7 minutes
before the first iteration of training begins. The same workload with
the equivalent kernels built via slangtorch_library pays near-zero
startup cost.
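Per-call-site numbers of this kind can be collected with a small harness that times the first and second invocation of a call site. The sketch below is not how the figures above were measured; `dispatch` is a hypothetical stand-in that sleeps once to imitate codegen and then hits an in-memory cache. With a real GPU call site, the callable would also need to synchronise the device before returning, or the timings would only cover the launch.

```python
import time
from typing import Callable, Dict, Tuple


def time_first_and_second_call(fn: Callable[[], object]) -> Tuple[float, float]:
    """Return (first_call_seconds, second_call_seconds) for `fn`."""
    t0 = time.perf_counter()
    fn()
    t1 = time.perf_counter()
    fn()
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1


# Hypothetical stand-in for one call site: "compiles" on the first call,
# then reuses the in-memory entry on later calls.
_pipeline_cache: Dict[str, str] = {}
compile_count = 0


def dispatch() -> str:
    global compile_count
    if "my_kernel" not in _pipeline_cache:
        compile_count += 1
        time.sleep(0.05)  # stands in for slang -> PTX codegen
        _pipeline_cache["my_kernel"] = "pipeline"
    return _pipeline_cache["my_kernel"]


cold_s, warm_s = time_first_and_second_call(dispatch)
print(f"first call: {cold_s:.3f} s, second call: {warm_s:.6f} s")
```

Running the same harness in a fresh process shows the cold cost again, since the in-memory cache does not outlive the process.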
Solution
A first-class build-time compilation flow analogous to
slangtorch_library. Any of the following shapes would be sufficient: