Move kernel compilation to build time #1001

## Summary

slangpy generates and compiles a wrapper kernel module per call site **at
runtime, in every process**, with no first-class way to precompile that
work and ship it as a build artifact.

For any application with more than a handful of distinct call sites, each
new Python process pays seconds-to-minutes of slang→PTX codegen + CUDA
driver JIT before the first dispatch returns. Subsequent dispatches in
the same process are fast (in-memory `pipeline_cache` hit), but every new
process restarts cold.

slangtorch — the older C++-extension-based binding — solves the same
problem by moving all compilation to build time. There is no equivalent
flow in slangpy.


slangtorch's build pipeline:

```
   .slang source
       │
       ▼ slangc -target torch-binding   (build time)
       │
       ▼ generates a torch C++ extension source (.cpp / .cu)
       │
       ▼ nvcc + g++                     (build time)
       │
       ▼
   lib<name>_slang_cc.so   (built once, on disk, shipped as build artifact)
```

At runtime:

```python
import importlib
m = importlib.import_module("path.to.libfoo_slang_cc")        # dlopen + symbol lookup
m.my_kernel(...).launchRaw(blockSize=..., gridSize=...)       # straight CUDA dispatch
```

First-call cost: **a fraction of a millisecond** — just `dlopen` and the
CUDA driver loading PTX into a module. The driver's PTX→SASS JIT is
itself cached in `$CUDA_CACHE_PATH` (default `$HOME/.nv/ComputeCache`),
so it's amortised across processes.
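Whether the driver's JIT cache is actually usable in a given environment can be checked with a small heuristic like the one below. It is a sketch that relies only on the `CUDA_CACHE_PATH` and `CUDA_CACHE_DISABLE` environment variables and the default location quoted above. The helper name `cuda_cache_status` is hypothetical, and the check inspects the filesystem rather than querying the driver.

```python
import os
from pathlib import Path


def cuda_cache_status(env=None):
    """Report where the CUDA driver JIT cache would live and whether it looks usable.

    Heuristic only: inspects environment variables and filesystem permissions,
    it does not talk to the CUDA driver.
    """
    env = os.environ if env is None else env
    # CUDA_CACHE_DISABLE=1 turns the driver's JIT cache off entirely.
    disabled = env.get("CUDA_CACHE_DISABLE", "0") == "1"
    # Fall back to the default location when CUDA_CACHE_PATH is unset or empty.
    default = Path(env.get("HOME", str(Path.home()))) / ".nv" / "ComputeCache"
    path = Path(env.get("CUDA_CACHE_PATH") or default)
    # The cache directory may not exist yet, so test the nearest existing
    # ancestor for write permission instead.
    probe = path
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    writable = os.access(probe, os.W_OK)
    return {"path": str(path), "disabled": disabled, "usable": writable and not disabled}
```

Calling `cuda_cache_status()` at process start and logging the result makes it obvious when a sandbox or read-only home directory forces the PTX→SASS JIT to rerun in every process.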

The Python `slangtorch` package additionally provides
`slangtorch.loadModule(...)`, which runs `slangc + nvcc + ninja` once
and caches the resulting `.so` on disk; subsequent `loadModule` calls
reuse the existing build. Same model — compile once, persist, reload
fast.
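The compile-once, persist, reload model can be sketched with the standard library alone. This is not slangtorch's implementation: `compile_fn` is a hypothetical stand-in for the `slangc + nvcc` step, and the cache layout (one file per SHA-256 of source plus options) is an assumption made for illustration.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def cached_build(source: str, options: tuple, cache_dir: Path, compile_fn) -> Path:
    """Return a path to the compiled artifact, compiling only on a cache miss."""
    # Key on content and options, not on file paths, so the key is stable
    # across processes and machines.
    key = hashlib.sha256(repr((source, options)).encode("utf-8")).hexdigest()
    artifact = cache_dir / f"{key}.bin"
    if artifact.exists():
        return artifact  # warm path: no compiler invocation
    cache_dir.mkdir(parents=True, exist_ok=True)
    blob = compile_fn(source, options)  # slow path: runs once per distinct key
    # Write to a temporary file and rename it into place so that concurrent
    # processes never observe a half-written artifact.
    fd, tmp = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(blob)
    os.replace(tmp, artifact)
    return artifact
```

The second call with the same source and options returns the existing file without invoking `compile_fn`, which is the behaviour the issue asks slangpy to offer for its wrapper kernels.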

## What slangpy does today

slangpy's dispatch model:

```
   module.my_kernel(args)
       │
       ▼ slangpy Python layer synthesises a wrapper slang source per call site
       │
       ▼ session.load_module_from_source(hash, source)   (slang frontend)
       │
       ▼ session.link_program([wrapper, kernel_module])  (generic specialization)
       │
       ▼ device.create_compute_pipeline(program)         (slang → PTX)
       │
       ▼ cuModuleLoadData(ptx)                           (driver PTX → SASS JIT)
       │
       ▼ kernel dispatch
```

Every step from wrapper source synthesis through PTX codegen runs
**on the first dispatch, in every process**. No first-class API makes
this work reliably reusable across processes:

- The Python wrapper synthesis runs every process (cheap, ~ms).
- The slang frontend (`load_module_from_source`) runs every process — the
  input is the freshly-generated wrapper string, which slang has to parse
  from scratch.
- `link_program` runs every process.
- `device.create_compute_pipeline` → slang → PTX runs every process
  unless `Device.shader_cache_path` is set **and** the path strings slang
  stores in IR are stable across processes — typically not the case under
  sandboxed tests, build systems with per-invocation paths, or any setup
  that doesn't pin slang sources at a fixed absolute path.
- `cuModuleLoadData` → driver PTX→SASS JIT runs every process unless the
  driver's compute cache (`CUDA_CACHE_PATH`, default
  `$HOME/.nv/ComputeCache`) is writable.
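The path-stability caveat in the list above can be illustrated with a toy cache key. The sketch does not reproduce how slang or slangpy derive their keys; it only assumes, as stated above, that absolute path strings end up in the hashed data. The function names and example paths are hypothetical.

```python
import hashlib


def key_with_path(abs_path: str, source: str) -> str:
    """Toy cache key that mixes the absolute source path into the hash."""
    return hashlib.sha256(f"{abs_path}\0{source}".encode("utf-8")).hexdigest()


def key_content_only(source: str) -> str:
    """Toy cache key derived from the source text alone."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


source = "float square(float x) { return x * x; }"
# Two sandboxed runs that see the same file under different per-invocation roots.
run_a = "/tmp/sandbox-1234/kernels/square.slang"
run_b = "/tmp/sandbox-5678/kernels/square.slang"

# Path-dependent keys differ, so the second run misses the on-disk cache.
assert key_with_path(run_a, source) != key_with_path(run_b, source)
# Content-only keys are identical regardless of where the file lives.
assert key_content_only(source) == key_content_only(source)
```

Any scheme in which the key depends on where the sources happen to be mounted behaves like `key_with_path`: identical kernels recompile whenever the sandbox root changes.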

The dominant cost for typical workloads is the slang→PTX codegen for
specialised generic templates. Indicative numbers from a workload with
~30 distinct call sites (mix of plain entry points and heavy generic
specialisations) on an A40:

| call site type | per-wrapper first-call cost (cold cache) |
| --- | --- |
| simple kernel (no generics) | ~3 s |
| medium generic | ~5 s |
| heavy generic specialisation (e.g. `Foo<3, 45, 5>`) | 25–75 s |

Aggregate first-process startup cost in this workload: **~7 minutes**
before the first iteration of training begins. The same workload with
the equivalent kernels built via `slangtorch_library` pays **near-zero
startup cost**.
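Per-call-site numbers like those in the table can be gathered with a harness that times the first call separately from later ones. The sketch below uses only the standard library. `dispatch` is a hypothetical zero-argument callable that wraps one call site and synchronises the device before returning; without that synchronisation, asynchronous launches would make the timings meaningless.

```python
import time


def time_cold_and_warm(dispatch, warm_runs: int = 5):
    """Return (first_call_seconds, mean_warm_seconds) for a zero-argument callable."""
    start = time.perf_counter()
    dispatch()  # first call: includes codegen, linking and driver JIT
    cold = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(warm_runs):
        dispatch()  # later calls: expected to hit the in-process cache
    warm = (time.perf_counter() - start) / warm_runs
    return cold, warm
```

Running this in a fresh process for each call site, with the on-disk caches cleared, gives the cold-cache figure. The difference between the two returned values approximates the one-off compilation cost.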

## Solution

A first-class build-time compilation flow analogous to
`slangtorch_library`. Any of the following shapes would be sufficient:

