diff --git a/.github/prompts/plan-extractCodegenToGenerator.prompt.md b/.github/prompts/plan-extractCodegenToGenerator.prompt.md new file mode 100644 index 000000000..e7473d62e --- /dev/null +++ b/.github/prompts/plan-extractCodegenToGenerator.prompt.md @@ -0,0 +1,208 @@ +## Extract codegen into generator.py + +**Goal**: Extract the code-emission logic from [callsignature.py](slangpy/core/callsignature.py) (`generate_code`, `generate_constants`, `KernelGenException`, helpers) and `BoundVariable.gen_call_data_code` from [boundvariable.py](slangpy/bindings/boundvariable.py) into a new [generator.py](slangpy/core/generator.py) file. The new file decomposes the monolithic `generate_code` (332 lines) into clearly-named sub-functions with doc comments showing what Slang code each one emits. `callsignature.py` retains the binding-pipeline functions (`specialize`, `bind`, `calculate_*`, etc.). Each step is a pure move/rename with no behavioral changes, verifiable by the existing test suites. + +**Parent plan**: [plan-simplifyKernelGenPhase2-cleanup.prompt.md](plan-simplifyKernelGenPhase2-cleanup.prompt.md) + +--- + +### Step 1: Create `slangpy/core/generator.py` with `generate_constants` and `KernelGenException` + +Move these small, self-contained pieces first: + +- **Move** `KernelGenException` (lines 40–43) from [callsignature.py](slangpy/core/callsignature.py#L40-L43). +- **Move** `is_slangpy_vector` (lines 240–247) from [callsignature.py](slangpy/core/callsignature.py#L240-L247) — private helper, prefix with `_`. +- **Move** `generate_constants` (lines 250–268) from [callsignature.py](slangpy/core/callsignature.py#L250-L268). +- **In [callsignature.py](slangpy/core/callsignature.py)**: Add `from slangpy.core.generator import KernelGenException, generate_constants` and delete the moved code. Keep a re-export of `KernelGenException` so any external consumer of the wildcard import from [calldata.py](slangpy/core/calldata.py#L8) continues to work. 
+- **In [dispatchdata.py](slangpy/core/dispatchdata.py#L7)**: Change `from slangpy.core.callsignature import generate_constants` → `from slangpy.core.generator import generate_constants`. + +**Verify**: `pytest slangpy/tests/slangpy_tests -v` — all tests pass, no import errors. + +**DONE**: Created `slangpy/core/generator.py` with `KernelGenException`, `_is_slangpy_vector`, `generate_constants`. Replaced definitions in `callsignature.py` with re-exports. Updated `dispatchdata.py` import. 4999 passed, 5 pre-existing failures (raytrace d3d12, type conformance cache). + +--- + +### Step 2: Extract `gen_call_data_code` as a free function + +Move `BoundVariable.gen_call_data_code` (lines 604–693 of [boundvariable.py](slangpy/bindings/boundvariable.py#L604-L693)) into `generator.py` as a free function, along with the related `gen_calldata_type_name` helper (lines 258–272 of [boundvariable.py](slangpy/bindings/boundvariable.py#L258-L272)). + +- **In `generator.py`**: Create two free functions: + - `gen_calldata_type_name(binding: BoundVariable, cgb: CodeGenBlock, type_name: str) -> None` — same logic, takes `binding` as first arg instead of `self`. + - `gen_call_data_code(binding: BoundVariable, cg: CodeGen, context: BindContext, depth: int = 0) -> None` — same logic, recursive calls use the free function. References to `self` become `binding`. Internal calls to `self.gen_calldata_type_name(...)` become `gen_calldata_type_name(binding, ...)`. Recursive calls on children become `gen_call_data_code(child, cg, context, depth + 1)`. 
+- **In [boundvariable.py](slangpy/bindings/boundvariable.py)**: Replace the method bodies with thin delegations: + ```python + def gen_calldata_type_name(self, cgb, type_name): + from slangpy.core.generator import gen_calldata_type_name + gen_calldata_type_name(self, cgb, type_name) + + def gen_call_data_code(self, cg, context, depth=0): + from slangpy.core.generator import gen_call_data_code + gen_call_data_code(self, cg, context, depth) + ``` + This preserves the existing call interface (`node.gen_call_data_code(cg, context)` in [callsignature.py line 406](slangpy/core/callsignature.py#L406)) and any marshall subclass code that calls `self.gen_calldata_type_name`. The `MAX_INLINE_TYPE_LEN` constant moves to `generator.py`. +- **Move** the import of `CodeGen` and `CodeGenBlock` into `generator.py` (already needed for Step 1). + +**Verify**: `pytest slangpy/tests/slangpy_tests -v` — all tests pass. + +**DONE**: Moved `gen_call_data_code` and `gen_calldata_type_name` to `generator.py` as free functions. `MAX_INLINE_TYPE_LEN` moved to `generator.py`, re-exported from `boundvariable.py`. Method bodies replaced with thin delegation stubs. 3294 passed, 285 kernel gen tests passed. + +--- + +### Step 3a: Extract pure-computation helpers in-place in `callsignature.py` + +Extract the two helpers that do **no codegen** — pure calculation/validation only: + +- **Extract** `_validate_and_compute_group_shape(build_info, call_data_len) -> tuple[int, list[int], list[int]]` from lines [293–340](slangpy/core/callsignature.py#L293-L340). Returns `(call_group_size, call_group_strides, call_group_shape_vector)`. +- **Extract** `_data_name(x, use_entrypoint_args) -> str` — deduplicate the two inline occurrences at lines [449](slangpy/core/callsignature.py#L449) and [497](slangpy/core/callsignature.py#L497) into a single helper. Returns `__in_{name}`, `call_data.{name}`, or `_param_{name}`. + +Leave both in `callsignature.py` as module-private functions. `generate_code` calls them. 
+ +**Verify**: `pytest slangpy/tests/slangpy_tests -v` — all tests pass. + +--- + +### Step 3b: Extract "setup" emission functions in-place in `callsignature.py` + +Extract the three functions that emit the top section of the generated kernel: + +- **Extract** `_emit_link_time_constants(cg, build_info, call_data_len, call_group_size, call_group_strides, call_group_shape_vector)` from lines [342–371](slangpy/core/callsignature.py#L342-L371). Emits `export static const int call_data_len = ...`, group stride/shape arrays; calls `generate_constants()`. +- **Extract** `_emit_shape_and_metadata_params(cg, call_data_len, use_entrypoint_args)` from lines [373–403](slangpy/core/callsignature.py#L373-L403). Emits `_grid_stride`, `_grid_dim`, `_call_dim`, `_thread_count` — as entry-point params (fast path) or `CallData` fields (fallback). +- **Extract** `_emit_call_data_definitions(cg, context, signature)` from lines [405–406](slangpy/core/callsignature.py#L405-L406). Emits per-variable call data (wrapper structs, type aliases, mapping constants) by calling `gen_call_data_code` on each node. + +Leave all three in `callsignature.py`. `generate_code` calls them. + +**Verify**: `pytest slangpy/tests/slangpy_tests -v` — all tests pass. Run `$env:SLANGPY_PRINT_GENERATED_SHADERS="1"; pytest slangpy/tests/slangpy_tests/test_code_gen.py -v` and capture output as the baseline for Step 3c and 3d. + +--- + +### Step 3c: Extract "body" emission functions in-place in `callsignature.py` + +Extract the remaining three functions that emit the entry point and kernel body: + +- **Extract** `_emit_trampoline(cg, context, build_info, root_params, use_entrypoint_args)` from lines [408–500](slangpy/core/callsignature.py#L408-L500). Emits `[Differentiable] void _trampoline(...)` — param declarations, loads, function call, stores. 
+- **Extract** `_emit_entry_point_signature(cg, build_info, call_data_len, call_group_size, use_entrypoint_args)` from lines [503–541](slangpy/core/callsignature.py#L503-L541). Emits `[shader("compute")] [numthreads(...)] void compute_main(...)` or `[shader("raygen")] void raygen_main(...)`. +- **Extract** `_emit_kernel_body(cg, context, build_info, root_params, call_data_len, use_entrypoint_args)` from lines [543–603](slangpy/core/callsignature.py#L543-L603). Emits bounds check, `init_thread_local_call_shape_info`, Context construction, trampoline call. + +At this point `generate_code` is reduced to the ~30-line orchestrator below. Still in `callsignature.py`. + +```python +def generate_code(context, build_info, signature, cg): + use_entrypoint_args = context.use_entrypoint_args + cg.add_import("slangpy") + call_data_len = context.call_dimensionality + + call_group_size, strides, shape = _validate_and_compute_group_shape(build_info, call_data_len) + + cg.add_import(build_info.module.name) + if use_entrypoint_args: + cg.skip_call_data = True + + _emit_link_time_constants(cg, build_info, call_data_len, call_group_size, strides, shape) + _emit_shape_and_metadata_params(cg, call_data_len, use_entrypoint_args) + _emit_call_data_definitions(cg, context, signature) + + root_params = sorted(signature.values(), key=lambda x: x.param_index) + + _emit_trampoline(cg, context, build_info, root_params, use_entrypoint_args) + _emit_entry_point_signature(cg, build_info, call_data_len, call_group_size, use_entrypoint_args) + cg.kernel.begin_block() + _emit_kernel_body(cg, context, build_info, root_params, call_data_len, use_entrypoint_args) + cg.kernel.end_block() +``` + +**Verify**: `pytest slangpy/tests/slangpy_tests -v` — all tests pass. Re-run `$env:SLANGPY_PRINT_GENERATED_SHADERS="1"; pytest slangpy/tests/slangpy_tests/test_code_gen.py -v` and confirm output is byte-identical to the Step 3b baseline. 
+ +--- + +### Step 3d: Move all codegen symbols from `callsignature.py` to `generator.py` and fix imports + +Now that everything is neatly decomposed, do the pure mechanical move: + +- **Move** all eight `_emit_*`/`_validate_*`/`_data_name` private helpers (two from Step 3a, three from Step 3b, three from Step 3c) and the `generate_code` orchestrator from `callsignature.py` into `generator.py`. +- **In [callsignature.py](slangpy/core/callsignature.py)**: Delete the moved code; add `from slangpy.core.generator import generate_code` re-export so any consumer that imports `generate_code` from `callsignature` continues to work. +- **Update [calldata.py](slangpy/core/calldata.py#L8)**: Replace `from slangpy.core.callsignature import *` with explicit imports — binding-pipeline functions from `callsignature`, and `generate_code`, `KernelGenException` from `generator`. This eliminates the wildcard import, making dependencies explicit. + +**Verify**: `pytest slangpy/tests/slangpy_tests -v` — all tests pass. Re-run `$env:SLANGPY_PRINT_GENERATED_SHADERS="1"; pytest slangpy/tests/slangpy_tests/test_code_gen.py -v` — output byte-identical to Step 3b baseline. + +--- + +### Step 4: Clean up `callsignature.py` + +After Step 3d, `callsignature.py` no longer has any codegen functions. Clean up: + +- Remove unused imports that were only needed by codegen (`CodeGen`, `PipelineType`, `AccessType`, `NoneMarshall`, `BoundVariableException` if no longer referenced). +- Remove re-exports of moved symbols once [calldata.py](slangpy/core/calldata.py) uses direct imports from `generator`. +- Add `from slangpy.core.generator import KernelGenException, ResolveException` re-exports **only if** external consumers import them from `callsignature` (check via grep). If only `calldata.py` uses them, the explicit import is sufficient. + +**Verify**: `pytest slangpy/tests/slangpy_tests -v`. `pre-commit run --all-files`. 
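The "check via grep" step can also be approximated in Python when a scripted answer is preferred. This is a rough sketch under stated assumptions: the function names are hypothetical, only single-line `from slangpy.core.callsignature import ...` statements are recognised, and parenthesised multi-line imports are deliberately not handled.

```python
import re
from pathlib import Path
from typing import Dict, Set

# Matches single-line imports from the callsignature module only.
_IMPORT_RE = re.compile(
    r"^\s*from\s+slangpy\.core\.callsignature\s+import\s+(.+)$", re.MULTILINE
)


def symbols_imported_from_callsignature(source: str) -> Set[str]:
    """Collect names imported from callsignature in one source string."""
    found: Set[str] = set()
    for match in _IMPORT_RE.finditer(source):
        # Drop trailing comments and surrounding parentheses.
        names = match.group(1).split("#")[0].strip().strip("()")
        for name in names.split(","):
            name = name.split(" as ")[0].strip()
            if name:
                found.add(name)
    return found


def find_consumers(root: Path, wanted: Set[str]) -> Dict[str, Set[str]]:
    """Map each .py file under root to the wanted symbols it imports.

    A wildcard import is reported as "*" because it may pull in any symbol.
    """
    hits: Dict[str, Set[str]] = {}
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        symbols = symbols_imported_from_callsignature(text)
        matched = symbols & (wanted | {"*"})
        if matched:
            hits[str(path)] = matched
    return hits
```

Running `find_consumers(Path("."), {"KernelGenException", "ResolveException"})` from the repository root would list the files that still depend on the re-exports, which is the information the decision above needs.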
+ +--- + +### Step 5: Add comments to `generator.py` sub-functions + +Enrich each sub-function's docstring with an example of the Slang code it generates, for both the fast path and fallback path. For example: + +```python +def _emit_shape_and_metadata_params( + cg: CodeGen, + call_data_len: int, + use_entrypoint_args: bool, +) -> None: + """Emit shape arrays and _thread_count. + + Fast path (entry-point params):: + + uniform int[2] _grid_stride + uniform int[2] _grid_dim + uniform int[2] _call_dim + uniform uint3 _thread_count + + Fallback (CallData struct fields):: + + int[2] _grid_stride; + int[2] _grid_dim; + int[2] _call_dim; + uint3 _thread_count; + """ +``` + +This is documentation-only, no functional changes. + +**Verify**: `pre-commit run --all-files` (formatting check). + +--- + +### Verification + +At each step: +```bash +cmake --build --preset windows-msvc-debug +pytest slangpy/tests/slangpy_tests -v +pre-commit run --all-files +``` + +After Step 3b specifically, capture generated shader output as a baseline; re-run after 3c and 3d to confirm byte-identical output: +```powershell +$env:SLANGPY_PRINT_GENERATED_SHADERS="1"; pytest slangpy/tests/slangpy_tests/test_code_gen.py -v +``` + +--- + +### Decisions + +- `gen_call_data_code` extracted as free function in `generator.py`; thin delegation stub kept on `BoundVariable` to preserve the method-call interface (`node.gen_call_data_code(cg, context)`) used in `generate_code` and potentially in external/user code. +- `generator.py` lives at `slangpy/core/generator.py` alongside `callsignature.py` and `calldata.py`. +- Wildcard import `from slangpy.core.callsignature import *` in `calldata.py` replaced with explicit imports to make dependencies clear. +- Sub-function names prefixed with `_` (private to the module); only `generate_code`, `generate_constants`, `gen_call_data_code`, `gen_calldata_type_name`, `KernelGenException` are public. 
+ +--- + +### Key Files + +| File | Changes | +|------|---------| +| [slangpy/core/generator.py](slangpy/core/generator.py) | **NEW** — `generate_code`, `generate_constants`, `gen_call_data_code`, `gen_calldata_type_name`, `KernelGenException`, private helpers | +| [slangpy/core/callsignature.py](slangpy/core/callsignature.py) | Remove `generate_code`, `generate_constants`, `KernelGenException`, `is_slangpy_vector`; add re-exports from `generator` | +| [slangpy/bindings/boundvariable.py](slangpy/bindings/boundvariable.py) | `gen_call_data_code` and `gen_calldata_type_name` become thin delegation stubs; `MAX_INLINE_TYPE_LEN` moves out | +| [slangpy/core/calldata.py](slangpy/core/calldata.py) | Replace `from slangpy.core.callsignature import *` with explicit imports from `callsignature` and `generator` | +| [slangpy/core/dispatchdata.py](slangpy/core/dispatchdata.py) | Import `generate_constants` from `generator` instead of `callsignature` | diff --git a/.github/prompts/plan-simplifyKernelGen-phase1.prompt.md b/.github/prompts/plan-simplifyKernelGen-phase1.prompt.md new file mode 100644 index 000000000..2778f6071 --- /dev/null +++ b/.github/prompts/plan-simplifyKernelGen-phase1.prompt.md @@ -0,0 +1,204 @@ +## Phase 1: Direct Type Marshalling + +**Status**: Prim-mode complete (Steps 1.1–1.7, 1.9). Step 1.8 (autodiff derivative fields) deferred. + +**Goal**: For dim-0, non-composite arguments, emit the raw Slang type in CallData and use direct assignment in the trampoline — eliminating `ValueType` wrappers, `__slangpy_load`/`__slangpy_store` indirection, mapping constants, and `Context.map()` calls. + +**Parent plan**: [plan-simplifyKernelGen.prompt.md](plan-simplifyKernelGen.prompt.md) + +--- + +### Architecture + +Direct binding eligibility is determined by a **marshall-driven `can_direct_bind` property** combined with a **single depth-first `calculate_direct_bind` pass** on the `BoundVariable` tree. This follows the same pattern as `calculate_differentiability`. 
+ +#### Key components + +| Component | Location | Role | +|-----------|----------|------| +| `Marshall.can_direct_bind(binding)` | `slangpy/bindings/marshall.py` | Virtual method (default `False`). Marshalls override to opt in. | +| `can_direct_bind_common(binding)` | `slangpy/bindings/boundvariable.py` | Shared eligibility checks (dim-0, no children, no param block). Marshalls call this then add type-specific logic. | +| `BoundVariable.direct_bind` | `slangpy/bindings/boundvariable.py` | Boolean attribute set by `calculate_direct_bind()`. Consumed by `gen_call_data_code`, `gen_calldata`, `gen_trampoline_load/store`, `create_calldata`. | +| `BoundVariable.calculate_direct_bind()` | `slangpy/bindings/boundvariable.py` | Depth-first tree pass. Leaves delegate to `marshall.can_direct_bind()`. Composites require all children to be direct-bind AND dim-0 with a concrete vector type. Children retain their individual `direct_bind` status regardless of the parent's eligibility. | +| `calculate_direct_binding(call)` | `slangpy/core/callsignature.py` | Top-level function iterating `call.args` + `call.kwargs.values()`, calling `arg.calculate_direct_bind()`. | +| `NativeBoundVariableRuntime.direct_bind` | `slangpy.h` / `boundvariableruntime.py` | C++ member + Python propagation. Read by `NativeValueMarshall::ensure_cached` to gate `["value"]` sub-field navigation. | + +#### Control flow + +``` +CallData.build() + → calculate_differentiability(context, bindings) + → calculate_direct_binding(bindings) ← NEW + → generate_code(...) 
+ → gen_call_data_code() — reads binding.direct_bind + → gen_trampoline() — reads binding.direct_bind + → BoundCallRuntime(bindings) — propagates binding.direct_bind to C++ runtime +``` + +At dispatch time, `NativeValueMarshall::ensure_cached()` reads `binding->direct_bind()` to decide cursor navigation: +- `direct_bind == false`: `cursor[variable_name]["value"]` (wrapper path) +- `direct_bind == true`: `cursor[variable_name]` (raw type path) + +#### Composite (struct/dict) handling + +When `calculate_direct_bind()` visits a composite node: +1. Recurse children first (depth-first) +2. If all children have `direct_bind == True` AND the composite is dim-0 with a concrete vector type → set `self.direct_bind = True` +3. Otherwise → the composite is NOT direct-bind, but children **retain** their individual `direct_bind` status. Inside the parent's generated `__slangpy_load`/`__slangpy_store`, `gen_call_data_code` delegates to each child's `gen_trampoline_load`/`gen_trampoline_store` — direct-bind children get direct assignment (e.g., `value.y = y;`) while non-direct-bind children use the standard `__slangpy_load(context.map(...))` path. This allows mixed direct-bind / non-direct-bind children within the same struct. + +--- + +### Step 1.1: Define eligibility predicate + +**Implemented.** A `can_direct_bind(binding)` virtual method on `Marshall` (default `False`) replaces the original `is_direct_bind_eligible` / `is_direct_bind_recursive` global functions. Each marshall subclass overrides `can_direct_bind` to opt in. + +A shared helper `can_direct_bind_common(binding)` in `boundvariable.py` provides the common checks: +- `binding.call_dimensionality is not None and binding.call_dimensionality == 0` +- `not binding.children` (not composite/dict) +- `not getattr(binding, "create_param_block", False)` (excludes `PackedArg`) + +Marshall subclasses call `can_direct_bind_common(binding)` and optionally add type-specific logic. 
`StructMarshall` has its own implementation: if it has children, all children must have `direct_bind == True`; otherwise it delegates to `can_direct_bind_common`. `ValueRefMarshall` additionally requires `binding.access[0] == AccessType.read` — writable value refs need buffer read/write logic that is incompatible with direct binding. + +--- + +### Step 1.2: Implement for `ValueMarshall` (scalars/matrices) + +**Implemented.** In [slangpy/builtin/value.py](slangpy/builtin/value.py): + +- `can_direct_bind(binding)`: calls `can_direct_bind_common(binding)` +- `gen_calldata`: when `binding.direct_bind`, emits `typealias _t_{name} = {raw_slang_type}` instead of `ValueType<{type}>` +- `gen_trampoline_load`: when `binding.direct_bind`, emits `{name} = {data_name}` and returns `True` +- `gen_trampoline_store`: when `binding.direct_bind` (read-only), returns `True` (suppress default store) +- `create_calldata`: when `binding.direct_bind`, returns raw value instead of `{"value": data}` + +#### Step 1.2a: C++ fast path + +**Implemented.** `NativeValueMarshall::ensure_cached` in [slangpyvalue.cpp](src/slangpy_ext/utils/slangpyvalue.cpp) reads `binding->direct_bind()` from the `NativeBoundVariableRuntime`: + +```cpp +ShaderCursor field = binding->direct_bind() + ? cursor[binding->variable_name()] + : cursor[binding->variable_name()]["value"]; +``` + +The `direct_bind` flag is a `bool` member on `NativeBoundVariableRuntime` (declared in [slangpy.h](src/slangpy_ext/utils/slangpy.h)), exposed via nanobind property in [slangpy.cpp](src/slangpy_ext/utils/slangpy.cpp), and propagated from `BoundVariable.direct_bind` via [boundvariableruntime.py](slangpy/bindings/boundvariableruntime.py). + +The `m_direct_bind` / `direct_bind` / `set_direct_bind` members were **removed** from `NativeValueMarshall` — the flag lives exclusively on `NativeBoundVariableRuntime`. 
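The eligibility checks from Step 1.1 and the `create_calldata` behaviour from Step 1.2 can be sketched on a stand-in binding object. `ToyBinding` is hypothetical and carries only the attributes named in this plan; the real `BoundVariable` and `ValueMarshall` have many more responsibilities, and `create_calldata` is shown here as a free function purely for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToyBinding:
    """Hypothetical stand-in for BoundVariable with only the relevant fields."""

    call_dimensionality: Optional[int] = 0
    children: Optional[dict] = None
    create_param_block: bool = False
    direct_bind: bool = False


def can_direct_bind_common(binding: Any) -> bool:
    """Shared checks: dim-0, no children, no param block."""
    return (
        binding.call_dimensionality is not None
        and binding.call_dimensionality == 0
        and not binding.children
        and not getattr(binding, "create_param_block", False)
    )


def create_calldata(binding: Any, data: Any) -> Any:
    """Return the raw value when direct-bound, else the wrapper dict."""
    return data if binding.direct_bind else {"value": data}


scalar = ToyBinding()
scalar.direct_bind = can_direct_bind_common(scalar)

vectorized = ToyBinding(call_dimensionality=1)
vectorized.direct_bind = can_direct_bind_common(vectorized)
```

With these definitions, `create_calldata(scalar, 3)` yields the raw value and `create_calldata(vectorized, 3)` yields the `{"value": ...}` wrapper, which mirrors the two cursor paths read by `ensure_cached`.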
+ +--- + +### Step 1.3: Implement for `VectorMarshall`, `MatrixMarshall`, and `ArrayMarshall` + +**Implemented.** All inherit `can_direct_bind` and `gen_trampoline_load`/`gen_trampoline_store` from `ValueMarshall`. `VectorMarshall` overrides `gen_calldata` to emit the raw vector type (e.g., `vector`) instead of `VectorValueType` when `binding.direct_bind`. `MatrixMarshall` and `ArrayMarshall` (at dim-0) inherit `ValueMarshall.gen_calldata`. + +--- + +### Step 1.4: Implement for `StructMarshall` (dict → struct) + +**Implemented.** In [slangpy/builtin/struct.py](slangpy/builtin/struct.py): + +- `can_direct_bind(binding)`: if `binding.children is not None`, returns `True` only if all children have `direct_bind == True`. Otherwise delegates to `can_direct_bind_common(binding)`. +- `gen_trampoline_load`: when `binding.direct_bind`, delegates to `ValueMarshall.gen_trampoline_load` (emits `{name} = {data_name}`) and returns `True`. Direct-bind structs are read-only, like other value types. +- `gen_trampoline_store`: when `binding.direct_bind`, delegates to `ValueMarshall.gen_trampoline_store` (suppresses store for read-only). Returns `True`. + +In [slangpy/bindings/boundvariable.py](slangpy/bindings/boundvariable.py), `gen_call_data_code`: +- When `self.direct_bind`, emits `typealias _t_{name} = {vector_type.full_name}` (raw struct type) — skipping inline struct generation, `__slangpy_load`/`__slangpy_store`, and child type aliases. +- When NOT `self.direct_bind`, uses the standard children path with inline struct. Children **retain** their individual `direct_bind` status — `gen_call_data_code` calls each child's `gen_trampoline_load`/`gen_trampoline_store`, which emit direct assignment for direct-bind children and fall through to `__slangpy_load`/`__slangpy_store` for non-direct-bind children. 
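The composite rule can be illustrated with a small depth-first pass over a toy tree. This is a sketch, not the actual `BoundVariable.calculate_direct_bind()`: `ToyNode` is hypothetical, `leaf_eligible` stands in for the result of `marshall.can_direct_bind(binding)`, and `has_concrete_vector_type` stands in for the concrete-vector-type check.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyNode:
    """Hypothetical stand-in for a BoundVariable tree node."""

    name: str
    call_dimensionality: int = 0
    leaf_eligible: bool = True  # stand-in for marshall.can_direct_bind()
    has_concrete_vector_type: bool = True
    children: Dict[str, "ToyNode"] = field(default_factory=dict)
    direct_bind: bool = False


def calculate_direct_bind(node: ToyNode) -> None:
    """Depth-first pass: children are resolved before their parent."""
    if node.children:
        for child in node.children.values():
            calculate_direct_bind(child)
        # A composite is direct-bind only if every child is, and it is
        # itself dim-0 with a concrete vector type.
        node.direct_bind = (
            all(c.direct_bind for c in node.children.values())
            and node.call_dimensionality == 0
            and node.has_concrete_vector_type
        )
    else:
        node.direct_bind = node.leaf_eligible


# Mixed struct: x is a dim-0 scalar, y is vectorized and not eligible.
mixed = ToyNode(
    "s",
    call_dimensionality=1,
    children={
        "x": ToyNode("x"),
        "y": ToyNode("y", call_dimensionality=1, leaf_eligible=False),
    },
)
calculate_direct_bind(mixed)
```

After the pass, `mixed.direct_bind` is `False`, yet `mixed.children["x"].direct_bind` stays `True`: the parent never clears a child's flag, which is what lets the parent's generated load use direct assignment for `x` and the wrapper path for `y`.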
+ +--- + +### Step 1.5: Implement for `ValueRefMarshall` + +**Implemented.** In [slangpy/builtin/valueref.py](slangpy/builtin/valueref.py): + +- `can_direct_bind(binding)`: calls `can_direct_bind_common(binding)` AND requires `binding.access[0] == AccessType.read`. Writable value refs are NOT direct-bind eligible because they need buffer allocation and readback logic that requires the wrapper path. +- `gen_calldata`: when `binding.direct_bind`, emits raw type alias (read-only only). Non-direct-bind uses `ValueRef` / `RWValueRef` as before. +- `gen_trampoline_load`: when `binding.direct_bind`, emits direct assignment. Non-direct-bind falls through. +- `gen_trampoline_store`: when `binding.direct_bind`, returns `True` (suppress store for read-only). Non-direct-bind falls through. +- `create_calldata` / `read_calldata`: when `binding.direct_bind` AND read-only, returns raw value / skips readback. + +The old `self._direct_bind` attribute was **removed** — all checks now use `binding.direct_bind`. + +**Implication for `_result`:** Auto-created return values are writable `ValueRef` instances. Since writable value refs are not direct-bind eligible, `_result` uses `RWValueRef` with `__slangpy_store`, mapping constants, and the standard wrapper path. This is a deliberate constraint — writable value refs inside structs would prevent the struct from being direct-bind eligible, which is the correct behavior since the struct's `__slangpy_load`/`__slangpy_store` must exist to handle the buffer operations. + +--- + +### Step 1.6: Implement for tensor marshalls + +**Implemented.** In [slangpy/builtin/tensorcommon.py](slangpy/builtin/tensorcommon.py): + +`gen_trampoline_load/store` extended for `ITensorType` at dim-0 (direct struct assignment). Tensor marshalls do NOT implement `can_direct_bind` — tensor dim-0 handling is done via trampoline-level checks on `binding.call_dimensionality` and `binding.vector_type` type, independent of the `direct_bind` flag. 
+ +--- + +### Step 1.7: Eliminate unused boilerplate in code generation + +**Implemented.** In [slangpy/bindings/boundvariable.py](slangpy/bindings/boundvariable.py), `gen_call_data_code` skips emitting `static const int _m_{name} = 0` mapping constants when `self.direct_bind` is `True`. + +--- + +### Step 1.8: Handle autodiff (bwds mode) + +⬜ **Deferred.** Prim-mode direct binding applies to bwds primals (code gen verified), but derivative fields still use the old `ValueType` wrapper path. + +--- + +### Step 1.9: Tests + +**Implemented.** 21 tests × 3 device types = 63 cases. All pass on d3d12/vulkan/cuda. + +--- + +### Files Modified + +| File | Changes | +|------|---------| +| `src/slangpy_ext/utils/slangpy.h` | `m_direct_bind` member, `direct_bind()` getter, `set_direct_bind()` setter on `NativeBoundVariableRuntime` | +| `src/slangpy_ext/utils/slangpy.cpp` | Nanobind `direct_bind` property on `NativeBoundVariableRuntime` | +| `src/slangpy_ext/utils/slangpyvalue.h` | `m_direct_bind`, `direct_bind()`, `set_direct_bind()` **removed** from `NativeValueMarshall` | +| `src/slangpy_ext/utils/slangpyvalue.cpp` | `ensure_cached` reads `binding->direct_bind()` instead of `m_direct_bind`; nanobind `direct_bind` property **removed** from `NativeValueMarshall` | +| `slangpy/bindings/marshall.py` | `can_direct_bind(binding)` virtual method (default `False`) | +| `slangpy/bindings/boundvariable.py` | `can_direct_bind_common()`, `BoundVariable.direct_bind` attribute, `BoundVariable.calculate_direct_bind()`. Old functions removed: `is_direct_bind_eligible`, `is_direct_bind_recursive`, `_set_direct_bind_on_children`, `_force_no_direct_bind`, `_DIRECT_BIND_TYPES`, `_clear_direct_bind()`. 
| +| `slangpy/bindings/boundvariableruntime.py` | `self.direct_bind = source.direct_bind` propagation | +| `slangpy/bindings/__init__.py` | Exports `can_direct_bind_common` (removed `is_direct_bind_eligible`, `is_direct_bind_recursive`) | +| `slangpy/core/callsignature.py` | `calculate_direct_binding(call)` function | +| `slangpy/core/calldata.py` | `calculate_direct_binding(bindings)` call after `calculate_differentiability` | +| `slangpy/builtin/value.py` | `can_direct_bind`, `gen_calldata`, `gen_trampoline_load`, `gen_trampoline_store`, `create_calldata` use `binding.direct_bind`. Removed `self.direct_bind` on marshall. | +| `slangpy/builtin/valueref.py` | `can_direct_bind`, `gen_calldata`, `gen_trampoline_load`, `gen_trampoline_store`, `create_calldata`, `read_calldata` use `binding.direct_bind`. Removed `self._direct_bind`. | +| `slangpy/builtin/struct.py` | `can_direct_bind`, `gen_trampoline_load`, `gen_trampoline_store` use `binding.direct_bind` | +| `slangpy/builtin/tensorcommon.py` | `gen_trampoline_load`, `gen_trampoline_store` extended for `ITensorType` dim-0 (unchanged in refactor) | +| `slangpy/tests/slangpy_tests/test_kernel_gen.py` | All Phase 1 tests | + +### Test Results + +2952 passed / 0 failed in `slangpy/tests/slangpy_tests`. 6 pre-existing failures in `slangpy/tests/device/` (raytracing pipeline, type conformance cache — unrelated). + +### Review Notes + +**Issues to address before merge:** + +1. **`StructMarshall.can_direct_bind` children branch is dead code.** `calculate_direct_bind()` handles composites directly (when `self.children is not None`) and never calls the marshall's `can_direct_bind`. The `if binding.children is not None:` branch in `StructMarshall.can_direct_bind` is unreachable. Fix: remove the children branch or have `calculate_direct_bind` delegate to the marshall for composites. + +2. 
**Composite direct-bind should gate on read-only access.** Add `and self.access[0] == AccessType.read` to the composite branch in `calculate_direct_bind()` (matching `ValueRefMarshall` pattern). Without this, a writable dim-0 composite would be incorrectly marked direct-bind. + +3. **Dead `binding.direct_bind` checks in writable ValueRef paths** ([valueref.py](slangpy/builtin/valueref.py) lines ~215, ~230, ~248). Since `can_direct_bind` rejects non-read access, these branches are unreachable. Remove or add `assert not binding.direct_bind` to make the invariant explicit. + +4. **Overly defensive `hasattr` guard** in `calculate_direct_bind()` — `hasattr(self.python, "can_direct_bind")` is unnecessary since `Marshall` base class always defines this method. + +5. **Benchmark file** — `test_benchmark_autograd.py` has accidental local changes that should be reverted. + +6. **C++ improvements** — Add debug assertion in `NativeValueMarshall::ensure_cached` verifying cached `direct_bind` matches binding's; consider making `NativeBoundVariableRuntime.direct_bind` read-only in nanobind. + +**Missing tests to add:** Writable ValueRef inout, `_result` binding flag, all-scalar struct binding flag, struct+WangHashArg child, WangHashArg binding flag, functional read-only ValueRef, bwds binding flags. See parent plan for full table. + +### Design Decisions + +**`direct_bind` lives on `NativeBoundVariableRuntime`, not `NativeValueMarshall`.** The original implementation stored `m_direct_bind` on the marshall itself (`NativeValueMarshall`), but marshalls are shared across calls while bindings are per-call. Moving the flag to the binding makes it immutable per-call and eliminates mutable state on shared marshall instances. + +**Marshall-driven `can_direct_bind` replaces hardcoded type list.** The original `is_direct_bind_eligible` used a lazily-populated `_DIRECT_BIND_TYPES` tuple to check marshall type. The new design uses a virtual method — each marshall opts in explicitly. 
Adding a new direct-bind-eligible type requires only overriding `can_direct_bind` on the new class. + +**Single `calculate_direct_bind` pass replaces repeated predicate calls.** The original `is_direct_bind_eligible` / `is_direct_bind_recursive` were called multiple times per variable during code gen. The new design computes `direct_bind` once in a single tree pass after `calculate_differentiability`, and consumers simply read the boolean. + +**Children retain `direct_bind` in non-direct-bind composites.** When a composite struct is NOT direct-bind-eligible (e.g., has vectorized children), children **retain** their individual `direct_bind` status. The parent's `gen_call_data_code` delegates to each child's `gen_trampoline_load`/`gen_trampoline_store` — direct-bind children emit direct assignment (e.g., `value.y = y;`) within the parent's `__slangpy_load`, while non-direct-bind children use the standard `__slangpy_load(context.map(...))` path. The old `_clear_direct_bind()` / `_force_no_direct_bind` approach was removed. + +**Writable ValueRef excluded from direct binding.** Writable value refs require buffer allocation, GPU readback, and `__slangpy_store` indirection. Only read-only value refs (`access[0] == AccessType.read`) are direct-bind eligible. This means auto-created `_result` (which is writable) always uses the `RWValueRef` wrapper path. diff --git a/.github/prompts/plan-simplifyKernelGen-phase2.prompt.md b/.github/prompts/plan-simplifyKernelGen-phase2.prompt.md new file mode 100644 index 000000000..61f4f08ee --- /dev/null +++ b/.github/prompts/plan-simplifyKernelGen-phase2.prompt.md @@ -0,0 +1,442 @@ +## Phase 2: Eliminate CallData Struct + +**Goal**: Move kernel uniforms out of the `CallData` struct into individual entry-point parameters. Eliminate the trampoline in forward (prim) mode. Fall back to `ParameterBlock` when total inline-uniform size exceeds a runtime per-device threshold. 
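The threshold rule in the goal can be pictured as a single whole-call decision. The sketch below is hypothetical: the function name, the per-argument size mapping, and the limit value are illustrative assumptions, and it does not reproduce the signature of any real slangpy function.

```python
from typing import Mapping


def choose_uniform_path(arg_sizes: Mapping[str, int], inline_limit_bytes: int) -> str:
    """Pick one path for the whole call, never per argument.

    arg_sizes maps argument names to their estimated uniform size in bytes;
    inline_limit_bytes is the (assumed) per-device threshold.
    """
    total = sum(arg_sizes.values())
    # The fallback applies only when the total exceeds the threshold.
    if total > inline_limit_bytes:
        return "parameter_block"
    return "entry_point_params"
```

Because the decision is made on the total, a single oversized argument moves every argument, including shape arrays and `_thread_count`, onto the fallback path.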
+ +**Parent plan**: [plan-simplifyKernelGen.prompt.md](plan-simplifyKernelGen.prompt.md) + +**Status**: Steps 2.0–2.2 and 2.4 complete. Step 2.3 (trampoline elimination for prim mode) not started. Code generation logic has been extracted from `callsignature.py` into [generator.py](slangpy/core/generator.py) (see [plan-extractCodegenToGenerator.prompt.md](plan-extractCodegenToGenerator.prompt.md)). + +--- + +### Key Architectural Decisions + +These decisions correct several assumptions in the original plan: + +1. **Entry-point param placement is orthogonal to `direct_bind`.** Any type — wrapped or raw — can be an entry-point parameter (e.g., `uniform ValueType a` or `uniform int a` or `uniform Tensor t`). `direct_bind` governs whether `__slangpy_load`/`__slangpy_store` is needed inside the kernel; entry-point placement governs where the uniform lives in the shader layout. + +2. **Trampoline elimination is independent of `direct_bind`.** The current trampoline body is: declare locals → load (direct assignment or `__slangpy_load`) → call function → store (`__slangpy_store`). All of that can appear directly in `compute_main`. The trampoline only exists because bwds mode needs a `[Differentiable]` wrapper for `bwd_diff()`. In prim mode, it is eliminated regardless of whether args use wrappers. + +3. **All-or-nothing fallback.** When total inline-uniform size exceeds the platform threshold, ALL args go back into `ParameterBlock` (the current path). No hybrid mixing of entry-point params and CallData. + +4. **Shape arrays and `_thread_count` obey the same rules** as user args — they become entry-point params by default, and go into `CallData` on fallback. Phase 2 is NOT scoped only to `call_data_len == 0`. + +5. **Two code paths based on where data lives:** + - **Fast path** (entry-point params): In Slang, uniforms are entry-point parameters and can be used directly (in forward) or passed directly to the trampoline (in backward). 
+ - **Fallback path** (`ParameterBlock`): In Slang, uniforms live in a `CallData` struct. They must be read into local variables before being used (in forward) or passed to the trampoline (in backward). This is the current behavior. + +6. **C++ dispatch changes are isolated to `NativeCallData::exec`.** Marshalls receive a `ShaderCursor` pointing to wherever their data lives — they don't care whether it's inside a `CallData` struct or an entry-point param. In the fast path, `m_runtime->write_shader_cursor_pre_dispatch()` receives the entry-point cursor directly. No marshall code changes needed. + +7. **`CallDataMode` is eliminated.** The `global_data` vs `entry_point` distinction is removed entirely. On the fast path, all backends use entry-point params uniformly. On the fallback path, all backends use `ParameterBlock` — CUDA supports `ParameterBlock` and in practice will never hit the fallback (CUDA's inline-uniform limit is ~4KB). This removes the `CallDataMode` enum, the CUDA-specific `is_entry_point` codegen branch in `callsignature.py`/`generator.py`, and the corresponding C++ branch in `slangpy.cpp`. + +8. **`PackedArg` / param-block types are unchanged.** They stay as `ParameterBlock` at module scope, orthogonal to Phase 2. + +--- + +### Code Organization (post-extraction) + +All code generation logic now lives in [generator.py](slangpy/core/generator.py). [callsignature.py](slangpy/core/callsignature.py) retains binding-pipeline functions (`specialize`, `bind`, `calculate_*`, `estimate_entrypoint_arguments_size`, etc.) and re-exports `generate_code`, `generate_constants`, `KernelGenException` from `generator.py` for backward compatibility. 
+ +| File | Role | +|------|------| +| [generator.py](slangpy/core/generator.py) | All code emission: `generate_code()`, `_emit_trampoline()`, `_emit_entry_point_signature()`, `_emit_kernel_body()`, `_emit_shape_and_metadata_params()`, `_emit_link_time_constants()`, `_emit_call_data_definitions()`, `_emit_trampoline_loads/stores()`, `_data_name()`, `_validate_and_compute_group_shape()`, `gen_call_data_code()`, `gen_calldata_type_name()`, `KernelGenException` | +| [callsignature.py](slangpy/core/callsignature.py) | Binding pipeline: `specialize()`, `bind()`, `apply_explicit_vectorization()`, `apply_implicit_vectorization()`, `finalize_mappings()`, `calculate_differentiability()`, `calculate_direct_binding()`, `estimate_entrypoint_arguments_size()`, `calculate_call_dimensionality()`, `create_return_value_binding()` | +| [calldata.py](slangpy/core/calldata.py) | `CallData` class orchestrating build pipeline; wildcard-imports from `callsignature.py` | +| [codegen.py](slangpy/bindings/codegen.py) | `CodeGen` class with `skip_call_data`, `entry_point_params` attributes | +| [boundvariable.py](slangpy/bindings/boundvariable.py) | `BoundVariable` methods delegate to `gen_call_data_code()` and `gen_calldata_type_name()` in `generator.py` | + +--- + +### Current Kernel Structure (post-Phase 1) + +For `int add(int a, int b)` with scalar args `(1, 2)`: + +```slang +import "module"; +import "slangpy"; + +typealias _t_a = int; // Phase 1: raw type (was ValueType) +typealias _t__result = RWValueRef; // writable _result still wrapped +static const int _m__result = 0; // mapping constant only for _result + +struct CallData { + _t_a a; + _t_a b; + _t__result _result; + uint3 _thread_count; +}; + +void _trampoline(Context __slangpy_context__, CallData __calldata__) { + int a; + a = __calldata__.a; // Phase 1: direct assignment + int b; + b = __calldata__.b; // Phase 1: direct assignment + int _result; + _result = add(a, b); + 
__calldata__._result.__slangpy_store(__slangpy_context__.map(_m__result), _result); +} + +[shader("compute")] [numthreads(32,1,1)] +void compute_main(int3 flat_call_thread_id: SV_DispatchThreadID, ..., uniform CallData call_data) { + if (any(flat_call_thread_id >= call_data._thread_count)) return; + Context __slangpy_context__ = {flat_call_thread_id}; + _trampoline(__slangpy_context__, call_data); +} +``` + +### Target Kernel (Phase 2 fast path, prim mode, all direct-bind) + +```slang +import "module"; + +[shader("compute")] +[numthreads(32, 1, 1)] +void compute_main(int3 tid: SV_DispatchThreadID, + uniform uint3 _thread_count, + uniform int a, + uniform int b, + uniform RWStructuredBuffer _result) +{ + if (any(tid >= _thread_count)) return; + _result[0] = add(a, b); +} +``` + +### Target Kernel (Phase 2 fast path, prim mode, mixed direct/non-direct-bind) + +When some args are not direct-bind (e.g., WangHashArg needs per-thread `thread_id` via `__slangpy_load`), the non-direct-bind args still use their wrapper types as entry-point params. Context is needed: + +```slang +import "module"; +import "slangpy"; + +typealias _t_rng = WangHashArgType; // non-direct-bind wrapper type +static const int _m_rng = 0; + +[shader("compute")] +[numthreads(32, 1, 1)] +void compute_main(int3 flat_call_thread_id: SV_DispatchThreadID, + uniform uint3 _thread_count, + uniform _t_rng rng, + uniform int x, + uniform RWStructuredBuffer _result) +{ + if (any(flat_call_thread_id >= _thread_count)) return; + Context __slangpy_context__ = {flat_call_thread_id}; + int _rng_val; + rng.__slangpy_load(__slangpy_context__.map(_m_rng), _rng_val); + int _x_val; + _x_val = x; + int _result_val; + _result_val = func(_rng_val, _x_val); + _result[0] = _result_val; +} +``` + +### Target Kernel (Phase 2 fallback path, prim mode) + +When entry-point param size exceeds the platform limit, all args go into `ParameterBlock`. 
The trampoline is still eliminated in prim mode — the load/call/store is inlined into `compute_main`, reading from `call_data`: + +```slang +import "module"; +import "slangpy"; + +typealias _t_a = int; +typealias _t__result = RWValueRef; +static const int _m__result = 0; + +struct CallData { + _t_a a; + _t_a b; + _t__result _result; + uint3 _thread_count; +}; +ParameterBlock call_data; + +[shader("compute")] +[numthreads(32, 1, 1)] +void compute_main(int3 flat_call_thread_id: SV_DispatchThreadID, ...) { + if (any(flat_call_thread_id >= call_data._thread_count)) return; + Context __slangpy_context__ = {flat_call_thread_id}; + int a; + a = call_data.a; + int b; + b = call_data.b; + int _result; + _result = add(a, b); + call_data._result.__slangpy_store(__slangpy_context__.map(_m__result), _result); +} +``` + +--- + +### Step 2.0: Gating tests ✅ + +**Status: DONE** + +Tests added to [slangpy/tests/slangpy_tests/test_kernel_gen.py](slangpy/tests/slangpy_tests/test_kernel_gen.py). All 21 parametrized cases (7 tests × 3 device types) pass. 
+ +| Test | Source | Args | Original assertion | Status | +|------|--------|------|--------------------|--------| +| `test_gate_p2_calldata_struct_present` | `int add(int a, int b)` | `(1, 2)` | `struct CallData` in code | ✅ Flipped — now asserts `struct CallData` ABSENT (Step 2.2 done) | +| `test_gate_p2_calldata_uniform_param` | same | same | `uniform CallData call_data` or `ParameterBlock` | ✅ Flipped — now asserts both ABSENT (Step 2.2 done) | +| `test_gate_p2_thread_count_in_calldata` | same | same | `call_data._thread_count` | ✅ Flipped — now asserts ABSENT (Step 2.2 done) | +| `test_gate_p2_trampoline_present_for_prim` | same | same | `void _trampoline(` present | Still asserts present (Step 2.3 pending) | +| `test_gate_p2_kernel_calls_trampoline` | same | same | `_trampoline(` in `compute_main` body | Still asserts present (Step 2.3 pending) | +| `test_gate_p2_sv_group_id_present` | same | same | `SV_GroupID` in `compute_main` signature | ✅ Flipped — now asserts ABSENT for dim-0 calls (Step 2.2 done) | + +Negative gates (must stay passing after Phase 2): + +| Test | Asserts | +|------|---------| +| `test_gate_p2_wanghasharg_keeps_load` | Non-direct-bind arg still uses `__slangpy_load` | + +Bwds gates: + +| Test | Status | +|------|--------| +| `test_gate_scalar_uses_valuetype` | ✅ Passing — asserts fast-path trampoline with `__in_` prefix params | +| `test_gate_bwds_scalar_uses_valuetype` | ✅ Passing — bwds trampoline has `no_diff` on all params (Step 2.4 done) | + +--- + +### Step 2.1: Determine fast vs fallback path ✅ + +**Status: DONE** + +In [slangpy/core/calldata.py](slangpy/core/calldata.py), after `calculate_direct_binding(bindings)`: + +1. **Query a runtime per-device threshold** for max entry-point parameter inline-uniform size. This is a property of the device/backend: large on CUDA (thousands of bytes), but as low as 128–256 bytes on Vulkan and D3D12 (see the per-backend defaults under "Implementation details"). +2. 
**Accumulate inline-uniform byte size** of each bound variable's `calldata_type_name`, plus `_thread_count` (12 bytes) and shape arrays (`call_data_len * 3 * sizeof(int)` for `_grid_stride`, `_grid_dim`, `_call_dim`). **Resource types** (`RWStructuredBuffer`, `Texture2D`, `TensorView`, etc.) don't count — they are bound as descriptors, not inline data. +3. **Decision**: If total size ≤ threshold → `self.use_entrypoint_args = True` (fast path). Otherwise → `self.use_entrypoint_args = False` (fallback path — current behavior). +4. **Store** `use_entrypoint_args` on the `CallData` instance and propagate to C++ `NativeCallData`. + +`PackedArg` / param-block types are excluded from this accounting — they stay as `ParameterBlock` regardless. + +**Implementation details:** + +- `DeviceLimits.max_entry_point_uniform_size` added to C++ struct ([device.h](src/sgl/device/device.h)) with per-backend defaults: Vulkan=128, D3D12=256, CUDA=4096 bytes ([device.cpp](src/sgl/device/device.cpp)). +- `estimate_entrypoint_arguments_size()` in [callsignature.py](slangpy/core/callsignature.py) — sums `vector_type.uniform_layout.size` for each depth-0 bound variable (skipping `PackedArg`), plus 12 bytes for `_thread_count` and `call_dimensionality * 4 * 3` for shape arrays. +- `use_entrypoint_args` property added to `NativeCallData` C++ class ([slangpy.h](src/slangpy_ext/utils/slangpy.h)) with Python binding. +- `CallData.__init__()` in [calldata.py](slangpy/core/calldata.py) sets `self.use_entrypoint_args = inline_size <= threshold` after `calculate_direct_binding()`. 
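
The size accounting and threshold decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the real `estimate_entrypoint_arguments_size()`: the `Arg` dataclass and its `inline_size` / `is_packed` fields are hypothetical stand-ins for bound variables, while the 12-byte `_thread_count` term, the `call_dimensionality * 4 * 3` shape-array term, and the per-backend thresholds are the values quoted in this step.

```python
from dataclasses import dataclass

# Per-backend defaults quoted in this plan (bytes).
THRESHOLDS = {"vulkan": 128, "d3d12": 256, "cuda": 4096}


@dataclass
class Arg:
    """Hypothetical stand-in for a depth-0 bound variable."""
    name: str
    inline_size: int         # uniform layout size in bytes; 0 for descriptor-bound resources
    is_packed: bool = False  # PackedArg / param-block types are excluded from the total


def estimate_inline_size(args, call_dimensionality):
    """Sum the inline-uniform bytes that would sit on the entry point."""
    size = 12                            # _thread_count is a uint3
    size += call_dimensionality * 4 * 3  # _grid_stride, _grid_dim, _call_dim
    size += sum(a.inline_size for a in args if not a.is_packed)
    return size


def use_entrypoint_args(args, call_dimensionality, backend):
    """True selects the fast path, False the ParameterBlock fallback."""
    return estimate_inline_size(args, call_dimensionality) <= THRESHOLDS[backend]
```

With eight `float4x4` arguments (64 bytes each) and no shape arrays the total is 512 + 12 = 524 bytes, which is the figure used by `test_step21_many_float4x4_may_exceed_vulkan`.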
+ +**Tests** (7 tests × 3 device types = 21 parametrized cases, all pass): + +| Test | Asserts | +|------|---------| +| `test_step21_scalar_uses_entrypoint_args` | Simple `int add(int,int)` with `(1,2)` → `use_entrypoint_args=True` | +| `test_step21_threshold_property_positive` | `device.info.limits.max_entry_point_uniform_size > 0` | +| `test_step21_vector_uses_entrypoint_args` | `float3` args → `use_entrypoint_args=True` | +| `test_step21_struct_uses_entrypoint_args` | All-scalar struct dict → `use_entrypoint_args=True` | +| `test_step21_tensor_uses_entrypoint_args` | Tensor (descriptor-only, 0 inline bytes) → `use_entrypoint_args=True` | +| `test_step21_many_float4x4_may_exceed_vulkan` | 8×float4x4 (512 bytes + 12 bytes for `_thread_count` = 524 bytes) exceeds Vulkan/D3D12 thresholds, not CUDA | +| `test_step21_wanghasharg_uses_entrypoint_args` | Non-direct-bind WangHashArg with small inline size → `use_entrypoint_args=True` | + +--- + +### Step 2.2: Code generation — entry-point params (fast path) ✅ + +**Status: DONE** + +In [generator.py](slangpy/core/generator.py) `generate_code()` (line 778), the following changes apply when `use_entrypoint_args == True`: + +**CodeGen changes** in [slangpy/bindings/codegen.py](slangpy/bindings/codegen.py): +- `self.skip_call_data: bool = False` — when `True`, don't emit `struct CallData` / `begin_block()` and gate `end_block()` in `finish()`. +- `self.entry_point_params: list[str] = []` — collects individual uniform param declarations. +- `finish()` ignores the `call_data` block and `use_param_block_for_call_data` when `skip_call_data` is set. + +**CallData struct elimination**: `generate_code()` sets `cg.skip_call_data = True` when `use_entrypoint_args`. No `struct CallData` emitted. + +**`gen_call_data_code`** in [generator.py](slangpy/core/generator.py#L345): At `depth == 0`, when `use_entrypoint_args`, appends to `cg.entry_point_params` instead of `cg.call_data.declare(...)`. 
The `call_data_structs` block (type aliases, wrapper structs, mapping constants) still gets emitted at module scope. + +**`_thread_count` and shape arrays**: `_emit_shape_and_metadata_params()` ([generator.py](slangpy/core/generator.py#L466)) appends to `cg.entry_point_params` instead of `cg.call_data`. Same for `_grid_stride`, `_grid_dim`, `_call_dim` when `call_data_len > 0`. + +**Entry-point signature**: `_emit_entry_point_signature()` ([generator.py](slangpy/core/generator.py#L652)) emits `compute_main` signature as: +```slang +void compute_main( + int3 flat_call_thread_id: SV_DispatchThreadID, + [int3 flat_call_group_id: SV_GroupID,] // only when call_data_len > 0 + [int flat_call_group_thread_id: SV_GroupIndex,] // only when call_data_len > 0 + uniform uint3 _thread_count, + [uniform int[N] _grid_stride, ...] // only when call_data_len > 0 + uniform _t_a a, + uniform _t_b b, + uniform _t__result _result +) +``` + +Drop `SV_GroupID` and `SV_GroupIndex` when `call_data_len == 0` — they feed `init_thread_local_call_shape_info` which isn't called when there are no shape arrays. + +**Bounds check**: Changes from `call_data._thread_count` to just `_thread_count`. + +**Shape info init**: Changes from `call_data._grid_stride` etc. to just `_grid_stride`, `_grid_dim`, `_call_dim`. + +**Fallback path** (`use_entrypoint_args == False`): `struct CallData` is emitted with `ParameterBlock call_data` at module scope on ALL backends (including CUDA). The old `CallDataMode` distinction between `entry_point` (CUDA) and `global_data` (non-CUDA) is removed — `ParameterBlock` works on CUDA, and in practice CUDA will never hit the fallback due to its large (~4KB) inline-uniform limit. + +See [slangpy/tests/device/test_pipeline_utils.slang](slangpy/tests/device/test_pipeline_utils.slang) for examples of manually-written compute shaders that use entry point parameters on all backends (CUDA, Vulkan, D3D12). 
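
The signature layout described in this step can be sketched as a small string builder. This is a hedged illustration, not the actual `_emit_entry_point_signature()`: the function name `build_entry_point_signature` is hypothetical, and sizing the shape arrays as `int[call_data_len]` is an assumption read off the `int[N]` placeholder in the listing above.

```python
def build_entry_point_signature(params, call_data_len):
    """Build a compute_main signature for the fast path.

    params: (slang_type, name) pairs for the user arguments, in order.
    call_data_len: number of call dimensions; 0 means a dim-0 call.
    """
    lines = ["int3 flat_call_thread_id: SV_DispatchThreadID"]
    if call_data_len > 0:
        # Group semantics only feed the shape-info init, so dim-0 calls drop them.
        lines.append("int3 flat_call_group_id: SV_GroupID")
        lines.append("int flat_call_group_thread_id: SV_GroupIndex")
    lines.append("uniform uint3 _thread_count")
    if call_data_len > 0:
        for name in ("_grid_stride", "_grid_dim", "_call_dim"):
            lines.append(f"uniform int[{call_data_len}] {name}")
    lines.extend(f"uniform {slang_type} {name}" for slang_type, name in params)
    body = ",\n    ".join(lines)
    return f"void compute_main(\n    {body}\n)"


if __name__ == "__main__":
    print(build_entry_point_signature([("int", "a"), ("int", "b")], 0))
```

Keeping the metadata params ahead of the user params mirrors the order shown in the listing above.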
+ +--- + +### Step 2.3: Trampoline elimination for prim mode + +**Status: NOT STARTED** — Trampoline is still generated for prim mode on both paths. The load/call/store sequence needs to be inlined into `compute_main`. + +When `call_mode == prim` — on **both** fast and fallback paths: + +- Don't generate the `_trampoline` function. +- Inline the load/call/store sequence directly into `compute_main` after the bounds check and (if needed) Context construction. +- The load/call/store codegen reuses the same logic currently in `_emit_trampoline_loads()` ([generator.py](slangpy/core/generator.py#L528)) and `_emit_trampoline_stores()` ([generator.py](slangpy/core/generator.py#L551)), but emitted into `cg.kernel` instead of `cg.trampoline` with adjusted `data_name` from `_data_name()` ([generator.py](slangpy/core/generator.py#L513)): + +| Path | `data_name` for non-param-block args | +|------|-------------------------------------| +| Fast | `x.variable_name` (entry-point param name directly) | +| Fallback | `call_data.{x.variable_name}` (global `ParameterBlock`, all backends) | +| Param blocks | `_param_{x.variable_name}` (unchanged) | + +**Context construction**: Needed only when any arg is non-direct-bind (i.e., calls `__slangpy_load`/`__slangpy_store`). When all args satisfy `direct_bind == True`, skip Context construction entirely — no `Context __slangpy_context__` declaration, no `import "slangpy"`. + +**Key functions to modify in [generator.py](slangpy/core/generator.py)**: +- `_emit_trampoline()` (line 578): gate on `call_mode != prim` — only emit for bwds mode. +- `_emit_kernel_body()` (line 708): when prim mode, inline the load/call/store sequence directly instead of calling `_trampoline()`. +- `generate_code()` (line 778): skip `_emit_trampoline()` call when prim mode. + +**Note**: The trampoline elimination does NOT depend on `direct_bind`. 
Even non-direct-bind args with `__slangpy_load` work inline in `compute_main` — the `__slangpy_load` call just needs the data reference and a `Context` value, both available in `compute_main`. + +--- + +### Step 2.4: Trampoline with individual params for bwds mode ✅ + +**Status: DONE** — Fast-path trampoline takes individual params with `no_diff` on all params. All 3 device types pass. + +When `call_mode == bwds`: + +- Still generate a `[Differentiable]` trampoline function via `_emit_trampoline()` ([generator.py](slangpy/core/generator.py#L578)). +- **Fast path**: Trampoline takes individual params instead of a struct. All params get `no_diff` — entry-point uniforms are never differentiable. Differentiation happens through local variable assignments inside the trampoline body, matching the struct-based approach where `CallData` was implicitly non-differentiable. No `in`/`out`/`inout` modifiers are added — `compute_main` passes its uniforms straight through: + ```slang + [Differentiable] + void _trampoline(Context __slangpy_context__, no_diff float __in_a, no_diff float __in_b, no_diff NoneType __in__result) + ``` + `compute_main` calls `bwd_diff(_trampoline)(__slangpy_context__, a, b, _result)` passing entry-point param names directly. +- **Fallback path**: Trampoline reads from global `ParameterBlock call_data` as it does today (on all backends). `compute_main` calls `bwd_diff(_trampoline)(__slangpy_context__, call_data)`. +- `_gen_trampoline_argument()` in `boundvariable.py` remains unused dead code — the inline generation in `_emit_trampoline()` ([generator.py](slangpy/core/generator.py#L578)) is simpler and avoids the `in`/`out`/`inout` modifiers that caused Slang autodiff errors. + +**Key insight**: Adding `in`/`out`/`inout` modifiers to trampoline params caused Slang autodiff issues (e.g., `out` params get reversed to `in` by `bwd_diff`, changing arity). 
The trampoline params are just pass-through uniforms — all data flow logic (loads, stores, differentiation) is handled internally via local variables. + +--- + +### Step 2.5: C++ dispatch changes ✅ + +**Status: DONE** — `CallDataMode` enum fully removed. Fast path uses `find_entry_point(0)` on all backends. Fallback path uses global `ParameterBlock` on all backends. + +In [src/slangpy_ext/utils/slangpy.cpp](src/slangpy_ext/utils/slangpy.cpp), store `m_use_entrypoint_args` on `NativeCallData` (received from Python `CallData`). Also add to [slangpy.h](src/slangpy_ext/utils/slangpy.h). + +Modify `bind_call_data` lambda in `exec()`: + +**Fast path** (`m_use_entrypoint_args == true`): +- All backends: Navigate via `cursor.find_entry_point(0)`. This is the entry-point cursor. +- Write `_thread_count` as an entry-point param: `entry_point_cursor["_thread_count"]`. +- Write shape arrays as entry-point params: `entry_point_cursor["_grid_stride"]`, etc. +- Pass `entry_point_cursor` as the `call_data_cursor` argument to `m_runtime->write_shader_cursor_pre_dispatch()`. Each `NativeBoundVariableRuntime` already navigates `cursor[m_variable_name]`, so it finds the entry-point param by name automatically. **No marshall code changes needed.** +- Cache entry-point param field indices on first call (analogous to existing `m_cached_call_data_offsets`). +- The `reserve_data` + raw-pointer optimization for `_thread_count` and shape arrays may not work for individual entry-point params at disjoint offsets. Use cursor-based writes for these metadata fields (they're small, performance impact minimal), or check if `reserve_data` still works across the entry-point shader object. + +**Fallback path** (`m_use_entrypoint_args == false`): +- All backends: Navigate to global `call_data` field via `cursor.find_field("call_data")`, dereference (it's a `ParameterBlock`), write struct data. The old `CallDataMode` branch (CUDA using `find_entry_point(0)` for call_data) is removed. 
Remove `m_call_data_mode`, `CallDataMode` enum, and all associated branches from `slangpy.h`, `slangpy.cpp`, `calldata.py`, and `callsignature.py`. + +--- + +### Step 2.6: `_result` handling + +**Status: NOT STARTED** + +Auto-created `_result` is a writable `ValueRef`, currently NOT direct-bind eligible (needs `RWValueRef` wrapper with buffer logic). Phase 2 handles this differently on the two paths: + +**Fast path**: `_result` is emitted as `uniform RWValueRef _result` on the entry point. In prim mode, the inlined code stores via `_result.__slangpy_store(...)`. In the all-direct-bind case where Context is omitted, add a new code path: emit `uniform RWStructuredBuffer _result` with `_result[0] = value` for the store. This requires `ValueRefMarshall` to support writable direct-bind for the entry-point-param case specifically, using `RWStructuredBuffer` instead of `RWValueRef`. + +**Fallback path**: `_result` stays as `RWValueRef` inside `CallData`, same as current behavior. + +**Implementation note**: The `RWStructuredBuffer` approach for `_result` is only used when `use_entrypoint_args == True` AND all other args are direct-bind (so Context can be omitted). When non-direct-bind args are present, Context exists and `_result` can continue to use `RWValueRef.__slangpy_store(context, value)`. 
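
The three `_result` cases in Step 2.6 can be summarised as a small decision function. This is a sketch under stated assumptions rather than planned code: `ResultBinding` and `result_binding` are hypothetical names, and the explicit `<int>` element type on `RWStructuredBuffer` is an assumption, since the listings in this plan show the type names without generic arguments.

```python
from typing import NamedTuple


class ResultBinding(NamedTuple):
    declaration: str  # how `_result` is declared
    store: str        # statement that writes the computed value


def result_binding(use_entrypoint_args, all_direct_bind, elem_type="int"):
    """Pick the `_result` declaration and store statement for a prim-mode kernel."""
    wrapper_store = "_result.__slangpy_store(__slangpy_context__.map(_m__result), value);"
    if not use_entrypoint_args:
        # Fallback path: `_result` stays a wrapper field inside struct CallData.
        return ResultBinding("_t__result _result;", "call_data." + wrapper_store)
    if all_direct_bind:
        # Fast path with no Context: raw buffer and a direct element write.
        return ResultBinding(
            f"uniform RWStructuredBuffer<{elem_type}> _result",
            "_result[0] = value;",
        )
    # Fast path with Context: keep the wrapper type as an entry-point param.
    return ResultBinding("uniform _t__result _result", wrapper_store)
```

Only the fast path with all-direct-bind arguments changes the store; the other two cases keep the existing `__slangpy_store` call and differ only in where `_result` lives.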
+ +--- + +### Step 2.7: Tests + +**Status: NOT STARTED** + +**Post-implementation tests** — should pass AFTER Phase 2 is complete: + +| Test | Verifies | +|------|----------| +| `test_phase2_no_calldata_struct` | `struct CallData` absent for eligible call | +| `test_phase2_uniform_params_on_entry` | Individual `uniform` params on `compute_main` | +| `test_phase2_no_trampoline_prim` | No `void _trampoline(` for prim-mode calls | +| `test_phase2_inline_call` | Function call inlined directly in `compute_main` | +| `test_phase2_thread_count_as_uniform` | `uniform uint3 _thread_count` as entry-point param | +| `test_phase2_no_context_all_direct` | No `Context __slangpy_context__` when all args direct-bind | +| `test_phase2_context_kept_non_direct` | `Context` present when some args use `__slangpy_load` | +| `test_phase2_bwds_trampoline_individual` | Bwds trampoline has individual params with `no_diff` | +| `test_phase2_bwds_bwd_diff_call` | `bwd_diff(_trampoline)(ctx, a, b, ...)` in kernel | +| `test_phase2_no_sv_group_when_dim0` | No `SV_GroupID`/`SV_GroupIndex` when `call_data_len == 0` | +| `test_phase2_sv_group_when_vectorized` | `SV_GroupID`/`SV_GroupIndex` present when `call_data_len > 0` | +| `test_phase2_fallback_keeps_calldata` | Force fallback → `struct CallData` still emitted | +| `test_phase2_fallback_no_trampoline_prim` | Even fallback path eliminates trampoline in prim mode | +| `test_phase2_functional_scalar_add` | `add(1, 2) == 3` end-to-end dispatch | +| `test_phase2_functional_bwds` | Backward pass correct gradients | +| `test_phase2_functional_vectorized` | Vectorized call (shapes) with entry-point params | +| `test_phase2_functional_mixed_direct` | Mix of direct-bind + non-direct-bind args | + +--- + +### Implementation Order + +1. **Step 2.0** ✅ — Gating tests (baseline documentation) +2. **Step 2.1** ✅ — Fast/fallback determination + size query +3. 
**Step 2.2 + 2.5** ✅ — Code gen + C++ dispatch for entry-point params + `CallDataMode` removal (landed together) +4. **Step 2.4** ✅ — Bwds trampoline with individual params (fast path) — `no_diff` on all params +5. **Step 2.3** — Trampoline elimination for prim mode (both paths) +6. **Step 2.6** — `_result` as `RWStructuredBuffer` for all-direct-bind case +7. **Step 2.7** — Post-implementation tests + functional tests + +**Note:** Implementation order deviated from original plan — Steps 2.2 + 2.5 were done before 2.3 (trampoline elimination), combined with `CallDataMode` removal. Step 2.4 done — all trampoline params use `no_diff` without IO modifiers. + +--- + +### Key Files + +| File | Changes | +|------|---------| +| [slangpy/core/generator.py](slangpy/core/generator.py) | ✅ All code generation logic extracted here from `callsignature.py`. `generate_code()` orchestrator (line 778), `_emit_trampoline()` (line 578), `_emit_entry_point_signature()` (line 652), `_emit_kernel_body()` (line 708), `_emit_shape_and_metadata_params()` (line 466), `_emit_link_time_constants()` (line 424), `_emit_call_data_definitions()` (line 503), `_emit_trampoline_loads/stores()`, `_data_name()`, `_validate_and_compute_group_shape()`, `gen_call_data_code()`, `gen_calldata_type_name()`, `KernelGenException`. Entry-point params fast/fallback code paths. Bwds `no_diff` on all trampoline params (Step 2.4). Trampoline still generated for prim mode (Step 2.3 pending). | +| [slangpy/core/calldata.py](slangpy/core/calldata.py) | ✅ `use_entrypoint_args` flag, size threshold check, `CallDataMode` removed | +| [slangpy/core/callsignature.py](slangpy/core/callsignature.py) | ✅ Binding-pipeline functions only (`specialize`, `bind`, `calculate_*`, `estimate_entrypoint_arguments_size`). Re-exports `generate_code`, `generate_constants`, `KernelGenException` from `generator.py`. 
| +| [slangpy/bindings/codegen.py](slangpy/bindings/codegen.py) | ✅ `skip_call_data` flag, `entry_point_params` list | +| [slangpy/bindings/boundvariable.py](slangpy/bindings/boundvariable.py) | ✅ `gen_call_data_code` and `gen_calldata_type_name` delegate to `generator.py`. `_gen_trampoline_argument()` unused dead code. | +| [slangpy/bindings/marshall.py](slangpy/bindings/marshall.py) | ✅ `use_entrypoint_args` field on `BindContext`, `CallDataMode` removed | +| [src/slangpy_ext/utils/slangpy.cpp](src/slangpy_ext/utils/slangpy.cpp) | ✅ `use_entrypoint_args` binding; `bind_call_data` fast path via `find_entry_point(0)`, `CallDataMode` branches removed | +| [src/slangpy_ext/utils/slangpy.h](src/slangpy_ext/utils/slangpy.h) | ✅ `m_use_entrypoint_args` on `NativeCallData`; `m_call_data_mode` removed | +| [src/sgl/device/device.h](src/sgl/device/device.h) | ✅ `max_entry_point_uniform_size` on `DeviceLimits` | +| [src/sgl/device/device.cpp](src/sgl/device/device.cpp) | ✅ Per-backend defaults for `max_entry_point_uniform_size` | +| [src/slangpy_ext/device/device.cpp](src/slangpy_ext/device/device.cpp) | ✅ Python binding for `max_entry_point_uniform_size` | +| [src/sgl/utils/slangpy.h](src/sgl/utils/slangpy.h) | ✅ `CallDataMode` enum removed | +| [slangpy/core/dispatchdata.py](slangpy/core/dispatchdata.py) | ✅ `CallDataMode` removed; imports `generate_constants` from `generator.py` | +| [slangpy/core/packedarg.py](slangpy/core/packedarg.py) | ✅ `CallDataMode` removed | +| [slangpy/core/function.py](slangpy/core/function.py) | ✅ `CallDataMode` removed from imports | +| [slangpy/slangpy/__init__.pyi](slangpy/slangpy/__init__.pyi) | ✅ `CallDataMode` class and `call_data_mode` property removed | +| [slangpy/tests/slangpy_tests/test_type_resolution.py](slangpy/tests/slangpy_tests/test_type_resolution.py) | ✅ `CallDataMode` removed from `BindContext` creation | +| [slangpy/tests/slangpy_tests/test_kernel_gen.py](slangpy/tests/slangpy_tests/test_kernel_gen.py) | ✅ Gating tests + 
Step 2.1 tests updated for new behavior; post-implementation tests (Step 2.7) pending | + +--- + +### Verification + +```powershell +# Build first (required) +cmake --build --preset windows-msvc-debug + +# Run kernel gen tests +$env:PRINT_TEST_KERNEL_GEN="1"; pytest slangpy/tests/slangpy_tests/test_kernel_gen.py -v + +# Run full test suite +pytest slangpy/tests -v + +# Run pre-commit +pre-commit run --all-files +``` diff --git a/.github/prompts/plan-simplifyKernelGen-phase3.prompt.md b/.github/prompts/plan-simplifyKernelGen-phase3.prompt.md new file mode 100644 index 000000000..8736dca4a --- /dev/null +++ b/.github/prompts/plan-simplifyKernelGen-phase3.prompt.md @@ -0,0 +1,48 @@ +## Phase 3: Direct Compute Kernel Invocation + +**Goal**: When the user's Slang function is ALREADY a `[shader("compute")]` entry point (or can trivially be one), skip kernel generation entirely and dispatch the pre-written shader directly. + +**Parent plan**: [plan-simplifyKernelGen.prompt.md](plan-simplifyKernelGen.prompt.md) + +--- + +### Step 3.1: Detection + +In the function resolution phase, detect when the target Slang function: +- Has `[shader("compute")]` attribute +- Has parameter types that SlangPy can bind directly (uniforms, buffers, textures) +- Has explicit thread count specified by the user (already supported via `function.set_thread_count()`) + +--- + +### Step 3.2: Direct dispatch path + +When eligible: +- Skip kernel generation entirely +- Create a `ComputePipeline` directly from the user's shader +- Map Python arguments to entry point parameters using the type marshalling but without code generation +- Dispatch directly + +--- + +### Step 3.3: Argument binding + +Leverage Phase 2's per-argument binding infrastructure — the same cursor write logic that writes individual uniform params would write to the pre-written shader's entry point params. 
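
The three detection conditions from Step 3.1 amount to a simple predicate. The sketch below is illustrative only: `EntryPointInfo` is a hypothetical summary of what reflection would report, not an existing SlangPy type, and the set of bindable parameter kinds is taken from the bullet list in Step 3.1.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

# Parameter kinds that Step 3.1 lists as directly bindable.
BINDABLE_KINDS = {"uniform", "buffer", "texture"}


@dataclass
class EntryPointInfo:
    """Hypothetical reflection summary for a user-written Slang function."""
    attributes: set = field(default_factory=set)
    param_kinds: list = field(default_factory=list)
    thread_count: Optional[Tuple[int, int, int]] = None


def can_dispatch_directly(fn):
    """True when the function can be dispatched without a generated wrapper."""
    is_compute = 'shader("compute")' in fn.attributes
    all_bindable = all(kind in BINDABLE_KINDS for kind in fn.param_kinds)
    has_thread_count = fn.thread_count is not None
    return is_compute and all_bindable and has_thread_count
```

A function that fails any of the three checks would keep using the generated `compute_main` wrapper.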
+ +--- + +### Step 3.4: Tests + +**Gating test** — assert CURRENT behavior so it breaks when Phase 3 is implemented: + +| Test | Slang Source | Args | Asserts (current behavior) | Breaks when | +|------|-------------|------|---------------------------|-------------| +| `test_gate_compute_shader_generates_wrapper` | Source with `[shader("compute")] void my_kernel(...)` function, test calls a helper function in the same module | N/A | SlangPy generates its own `compute_main` wrapper; user's `[shader("compute")]` is ignored | Step 3.1 | + +**Post-implementation tests** — should pass AFTER Phase 3 is complete: + +- `test_phase3_direct_dispatch`: dispatch a pre-written `[shader("compute")]` kernel directly, verify no wrapper generated +- `test_phase3_requires_thread_count`: verify error when thread count not specified +- `test_phase3_scalar_params`: verify scalar uniform params bind correctly +- `test_phase3_buffer_params`: verify `RWStructuredBuffer` params bind correctly +- `test_phase3_texture_params`: verify texture params bind correctly diff --git a/.github/prompts/plan-simplifyKernelGen.prompt.md b/.github/prompts/plan-simplifyKernelGen.prompt.md new file mode 100644 index 000000000..a21cee212 --- /dev/null +++ b/.github/prompts/plan-simplifyKernelGen.prompt.md @@ -0,0 +1,253 @@ +## Plan: Simplify Generated SlangPy Kernels + +**TL;DR**: A three-phase effort to make generated kernels resemble hand-written GPU code. Phase 1 adds direct type marshalling (bypassing `ValueType` wrappers and `__slangpy_load`/`__slangpy_store`) for dim-0 non-composite types, following the pattern already used by `TensorView`. Phase 2 eliminates the `CallData` struct when all arguments are direct-eligible, passing them as individual uniforms/globals. Phase 3 enables calling pre-written compute kernels directly without generating wrapper shaders. 
+ +**Target example** — `add(int a, int b) -> int` with scalar args should go from 40+ lines of boilerplate to approximately: + +```slang +import "module"; +[shader("compute")] +[numthreads(32, 1, 1)] +void compute_main(int3 tid: SV_DispatchThreadID, uniform uint3 _thread_count, uniform int a, uniform int b, uniform RWStructuredBuffer _result) +{ + if (any(tid >= _thread_count)) return; + _result[0] = add(a, b); +} +``` + +--- + +### Phase Plans + +- [Phase 1: Direct Type Marshalling](plan-simplifyKernelGen-phase1.prompt.md) — **✅ merged (PR #863)** +- [Phase 2: Eliminate CallData Struct](plan-simplifyKernelGen-phase2.prompt.md) — **in progress** (Steps 2.0–2.2, 2.4, and 2.5 done; Steps 2.3, 2.6, and 2.7 not started) +- [Phase 3: Direct Compute Kernel Invocation](plan-simplifyKernelGen-phase3.prompt.md) — **not started** + +--- + +### Phase 1 Summary (Complete — PR #863) + +Phase 1 introduced **direct binding**: eligible dim-0 arguments are bound using raw Slang types instead of `ValueType` wrappers, eliminating `__slangpy_load`/`__slangpy_store` indirection, `Context.map()` calls, and mapping constants for those arguments. PR #863 was merged to `main` on 2026-03-11 (+2,044 / −122 lines, 18 files changed, squash-merged). + +#### What Phase 1 Changed + +**Architecture**: A marshall-driven `can_direct_bind(binding)` virtual method (default `False`) combined with a single depth-first `calculate_direct_bind()` pass on the `BoundVariable` tree. This follows the same pattern as `calculate_differentiability`. The `direct_bind` boolean is stored on `BoundVariable` (Python) and propagated to `NativeBoundVariableRuntime` (C++). 
+ +**Eligibility**: A variable is direct-bind eligible if: +- `call_dimensionality == 0` (not vectorized) +- Not composite with children (unless all children are also direct-bind AND the composite is dim-0 with a concrete Slang struct type and read-only access) +- Not a param block (`PackedArg`) +- The marshall opts in via `can_direct_bind()` override +- For `ValueRefMarshall`: `access[0] == AccessType.read` (writable value refs need buffer logic) + +**Code generation effects** — when `binding.direct_bind == True`: +- `gen_calldata` emits `typealias _t_{name} = {raw_slang_type}` instead of `ValueType` / `VectorValueType` / `ValueRef` +- `gen_trampoline_load` emits `{value_name} = {data_name};` (direct assignment) instead of `{data_name}.__slangpy_load(context.map(_m_{name}), {name})` +- `gen_trampoline_store` returns `True` (suppresses store for read-only types) +- Mapping constants (`static const int _m_{name} = 0`) are skipped +- `create_calldata` returns the raw value instead of `{"value": data}` + +**C++ fast path**: `NativeValueMarshall::ensure_cached` reads `binding->direct_bind()` to decide cursor navigation — `cursor[variable_name]` for direct-bind vs `cursor[variable_name]["value"]` for wrapper path. + +**Composite (struct/dict) handling**: When `calculate_direct_bind()` visits a composite, it recurses children first. If all children are direct-bind AND the composite is dim-0 with a concrete vector type and read-only access → the composite itself is direct-bind (emits raw `typealias`). Otherwise the composite is NOT direct-bind, but children **retain** their individual `direct_bind` status — the parent's `__slangpy_load`/`__slangpy_store` body uses `gen_trampoline_load`/`gen_trampoline_store` for each child, so direct-bind children get direct assignment (e.g., `value.y = y;`) while non-direct-bind children use `__slangpy_load(context.map(...))`. 
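
The eligibility rules and the composite behaviour above can be illustrated with a small stand-alone sketch. `Node` is a hypothetical stand-in for `BoundVariable`, and `marshall_opts_in` stands in for the result of `Marshall.can_direct_bind()` (including the `ValueRefMarshall` read-only gate). The "concrete Slang struct type" condition is omitted for brevity. This is an assumption-laden illustration, not the code merged in PR #863.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for BoundVariable (not the real class)."""
    name: str
    call_dimensionality: int = 0
    read_only: bool = True
    marshall_opts_in: bool = True   # stands in for Marshall.can_direct_bind()
    is_param_block: bool = False    # stands in for the PackedArg case
    children: List["Node"] = field(default_factory=list)
    direct_bind: bool = False


def calculate_direct_bind(node: Node) -> None:
    """Single depth-first pass: children are resolved before their parent."""
    for child in node.children:
        calculate_direct_bind(child)

    eligible = (
        node.call_dimensionality == 0
        and not node.is_param_block
        and node.marshall_opts_in
    )
    if node.children:
        # A composite is direct-bind only if every child is, and it is read-only.
        eligible = (
            eligible
            and node.read_only
            and all(c.direct_bind for c in node.children)
        )
    # Children keep their own flag even when the parent ends up False.
    node.direct_bind = eligible


# Mixed struct: one dim-0 field and one vectorized field.
x = Node("x")
y = Node("y", call_dimensionality=1)
s = Node("s", children=[x, y])
calculate_direct_bind(s)
```

After the pass, `x` stays direct-bind while `y` and the parent `s` do not, which corresponds to the case where the parent's `__slangpy_load` body mixes direct assignment with wrapper loads.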
+ +**API changes to `gen_trampoline_load`/`gen_trampoline_store`**: Signature changed from `(cgb, binding, is_entry_point)` → `(cgb, binding, data_name, value_name)`. The caller now computes `data_name` (e.g., `__calldata__.x` or `call_data.x`) and `value_name` (e.g., `x` or `value.x`), allowing these methods to work both at the root trampoline level and inside composite `__slangpy_load`/`__slangpy_store` bodies. + +**`read_output` fix** (C++): `NativeBoundVariableRuntime::read_output` was simplified — composites no longer attempt to read output directly (it is handled by their children). The old composite branch had a logic error (checking `res.contains(name)` before insertion). + +#### Control Flow (post-Phase 1) + +``` +CallData.build() + → calculate_differentiability(context, bindings) + → calculate_direct_binding(bindings) ← Phase 1 + → generate_code(...) + → gen_call_data_code() — reads binding.direct_bind + → gen_trampoline() — reads binding.direct_bind + → BoundCallRuntime(bindings) — propagates binding.direct_bind to C++ runtime +``` + +At dispatch time, `NativeValueMarshall::ensure_cached()` reads `binding->direct_bind()` to decide cursor navigation: +- `direct_bind == false`: `cursor[variable_name]["value"]` (wrapper path) +- `direct_bind == true`: `cursor[variable_name]` (raw type path) + +#### Implemented Steps + +| Step | Status | Summary | +|------|--------|---------| +| 1.1 | ✅ Done | `Marshall.can_direct_bind(binding)` virtual method. `can_direct_bind_common(binding)` helper. `BoundVariable.calculate_direct_bind()` depth-first tree pass. `calculate_direct_binding(call)` in `callsignature.py`. | +| 1.2 | ✅ Done | `ValueMarshall`: `can_direct_bind`, `gen_calldata`, `gen_trampoline_load/store` read `binding.direct_bind`. | +| 1.2a | ✅ Done | C++ fast path: `NativeValueMarshall::ensure_cached` reads `binding->direct_bind()` from `NativeBoundVariableRuntime`. `m_direct_bind` **removed** from `NativeValueMarshall`. 
| +| 1.3 | ✅ Done | `VectorMarshall`/`MatrixMarshall`/`ArrayMarshall`: inherit from `ValueMarshall`. `VectorMarshall.gen_calldata` emits raw vector type (e.g., `vector`). | +| 1.4 | ✅ Done | `StructMarshall`: `can_direct_bind` checks all children. `BoundVariable.gen_call_data_code` uses `self.direct_bind`. Non-direct-bind composites delegate to children's `gen_trampoline_load/store`. | +| 1.5 | ✅ Done | `ValueRefMarshall`: `can_direct_bind` requires `access[0] == AccessType.read`. Writable value refs (including auto-created `_result`) use `RWValueRef`. | +| 1.6 | ✅ Done | Tensor dim-0: `can_direct_bind` added to `tensorcommon.py`. `gen_trampoline_load/store` extended for dim-0 tensors (`ITensorType`, `TensorViewType`, `DiffTensorViewType`). | +| 1.7 | ✅ Done | Mapping constants (`static const int _m_{name}`) skipped when `self.direct_bind`. | +| 1.8 | ⬜ Deferred | Autodiff derivative fields still use `ValueType` wrappers. Bwds primals use direct bind. | +| 1.9 | ✅ Done | 77 tests (×3 device types = 231 cases). All pass on d3d12/vulkan/cuda. | + +#### Files Modified (PR #863) + +| File | Changes | +|------|---------| +| `src/slangpy_ext/utils/slangpy.h` | `m_direct_bind` member, `direct_bind()`, `set_direct_bind()` on `NativeBoundVariableRuntime` | +| `src/slangpy_ext/utils/slangpy.cpp` | Nanobind `direct_bind` r/w property on `NativeBoundVariableRuntime`. `read_output` composite branch simplified. | +| `src/slangpy_ext/utils/slangpyvalue.h` | `CachedValueWrite.direct_bind` field added. `m_direct_bind`/`direct_bind()`/`set_direct_bind()` **removed** from `NativeValueMarshall`. | +| `src/slangpy_ext/utils/slangpyvalue.cpp` | `ensure_cached` reads `binding->direct_bind()` for cursor path; caches `direct_bind` value. | +| `slangpy/bindings/marshall.py` | `can_direct_bind(binding)` virtual method (default `False`). `gen_trampoline_load/store` signature changed to `(cgb, binding, data_name, value_name)`. 
| +| `slangpy/bindings/boundvariable.py` | `can_direct_bind_common()` helper. `BoundVariable.direct_bind` attribute. `BoundVariable.calculate_direct_bind()` method. `gen_call_data_code` handles direct-bind composites (raw typealias) and delegates to children's `gen_trampoline_load/store`. Mapping constant emission gated on `not self.direct_bind`. | +| `slangpy/bindings/boundvariableruntime.py` | `self.direct_bind = source.direct_bind` propagation to C++ runtime. | +| `slangpy/bindings/__init__.py` | Exports `can_direct_bind_common`. | +| `slangpy/core/callsignature.py` | `calculate_direct_binding(call)` function. Trampoline code gen refactored: `data_name` computed before `gen_trampoline_load` call. Store path moved after `data_name` computation. | +| `slangpy/core/calldata.py` | `calculate_direct_binding(bindings)` call after `calculate_differentiability`. `self.code = code` stored for debugging. | +| `slangpy/builtin/value.py` | `can_direct_bind`, `gen_trampoline_load`, `gen_trampoline_store` added. `gen_calldata` gates on `binding.direct_bind`. | +| `slangpy/builtin/valueref.py` | `can_direct_bind` (read-only gate), `gen_trampoline_load`, `gen_trampoline_store` added. `gen_calldata`, `create_calldata`, `read_calldata` gate on `binding.direct_bind`. `self._direct_bind` removed. | +| `slangpy/builtin/struct.py` | `can_direct_bind` (children check + `AccessType.read` gate). `gen_trampoline_load`, `gen_trampoline_store` delegate to `ValueMarshall` when direct-bind. | +| `slangpy/builtin/tensor.py` | `can_direct_bind` delegates to `tensorcommon`. `gen_trampoline_load/store` signature updated. | +| `slangpy/builtin/tensorcommon.py` | `can_direct_bind()` function added. `gen_trampoline_load/store` signature changed, condition changed from `isinstance(vector_type, TensorViewType)` to `binding.direct_bind`. | +| `slangpy/torchintegration/torchtensormarshall.py` | `can_direct_bind` delegates to `tensorcommon`. `gen_trampoline_load/store` signature updated. 
| +| `slangpy/benchmarks/test_benchmark_autograd.py` | Removed accidental blank line (1-line whitespace change). | +| `slangpy/tests/slangpy_tests/test_kernel_gen.py` | New file: 77 tests covering all Phase 1 scenarios. | + +#### Test Coverage Summary + +The test file (`test_kernel_gen.py`) provides 77 test functions × 3 device types = 231 parametrized cases covering: + +**Code-gen assertion tests** (`test_gate_*`): Verify generated Slang code patterns — type aliases, trampoline load/store statements, mapping constants, wrapper types, `__slangpy_load`/`__slangpy_store` presence/absence. + +**Binding flag tests**: Verify `direct_bind`, `call_dimensionality`, and `vector_type` on `BoundVariable` instances for: scalars, vectors, tensors (dim-0 and vectorized), structs (all-scalar, mixed, nested, deeply nested), writable ValueRef, auto-created `_result`, WangHashArg, bwds primal args. + +**Functional GPU dispatch tests** (`test_phase1_functional_*`): End-to-end dispatch verifying correct GPU results for: scalar add/mul, vector scale, struct sum, ValueRef write, mixed scalar+tensor, mixed struct fields, tensor dim-0, 2D/3D tensor→vector, 2D tensor→scalar, 2D tensor→array, nested/deeply-nested structs, struct with matrix/vector/array fields, struct return types, struct with vectorized 2D tensor child. + +**Negative gates** (`test_gate_*_keeps_*`): Verify types that are NOT direct-bind eligible remain using wrappers: WangHashArg, vectorized scalar (dim > 0), vectorized dict. + +**Helper infrastructure**: `assert_contains`, `assert_not_contains`, `assert_trampoline_has`, `generate_code`, `generate_bwds_code`. + +#### Known Issues (from review, not yet addressed) + +1. **`set_direct_bind` exposed as read-write nanobind property** — After first dispatch, mutating `direct_bind` would invalidate the cached cursor offset. Consider making it read-only. + +2. 
**C++ cache safety** — `NativeValueMarshall::ensure_cached` caches `direct_bind` but has no debug assertion verifying it matches on subsequent calls. + +3. **Dead `binding.direct_bind` checks in writable ValueRef paths** — `create_calldata` and `read_calldata` in `valueref.py` have `assert not binding.direct_bind` in writable code paths (reachable only as assertions, since `can_direct_bind` rejects non-read access). + +--- + +### What Phase 2 Needs to Know + +Phase 2 builds on Phase 1's `direct_bind` infrastructure. Key context for implementation: + +**Current kernel structure** (post-Phase 1, for `int add(int a, int b)` with args `(1, 2)`): +```slang +import "module"; +import "slangpy"; +// CallData struct with per-arg type aliases and mapping constants +struct CallData { + typealias _t_a = int; // Phase 1: raw type (was ValueType) + _t_a a; + typealias _t_b = int; // Phase 1: raw type (was ValueType) + _t_b b; + typealias _t__result = RWValueRef; // writable _result still wrapped + _t__result _result; + static const int _m__result = 0; // mapping constant only for _result + uint3 _thread_count; + // ... shape arrays if call_data_len > 0 ... +}; +void _trampoline(CallData call_data /*or __calldata__ on CUDA*/) { + int a; + a = call_data.a; // Phase 1: direct assignment (was __slangpy_load) + int b; + b = call_data.b; // Phase 1: direct assignment + int _result; + _result = add(a, b); + call_data._result.__slangpy_store(__slangpy_context__.map(_m__result), _result); +} +[shader("compute")] [numthreads(32,1,1)] +void compute_main(..., uniform CallData call_data) { + // thread bounds check, context construction + _trampoline(call_data); +} +``` + +**Phase 2 goal**: Eliminate the `CallData` struct entirely when ALL args are direct-bind eligible. Pass args as individual `uniform` parameters on the entry point. Inline the function call into `compute_main` (skip trampoline for prim mode). 
+ +**Blocking issue for Phase 2**: Auto-created `_result` is a writable `ValueRef` → NOT direct-bind (needs `RWValueRef` wrapper with buffer). Phase 2 must either: +- Accept that `_result` prevents full CallData elimination for functions with return values, and use a hybrid approach (direct args + `_result` in CallData or as a separate `RWStructuredBuffer` entry point param), OR +- Add a new code path for `_result` that emits `uniform RWStructuredBuffer _result` as an entry point param with `_result[0] = ...` for the store + +**Key files for Phase 2**: +- `slangpy/core/callsignature.py` — `generate_code()` builds the trampoline and compute_main +- `slangpy/core/calldata.py` — `CallData.build()` orchestrates the pipeline +- `slangpy/bindings/codegen.py` — `CodeGen` class manages `call_data_structs` block +- `src/slangpy_ext/utils/slangpy.cpp` — `NativeCallData::exec()` dispatches; cursor navigation for uniforms + +**`BoundVariable.direct_bind`** is already computed for all args by Phase 1. Phase 2 can check `all(arg.direct_bind for arg in all_args)` to decide whether to use the direct-args path. + +**Entry point parameter precedent**: See `slangpy/tests/device/test_pipeline_utils.slang` — manually written compute shaders already use individual `uniform` entry point params on all backends (CUDA, Vulkan, D3D12). + +**Design decisions deferred to Phase 2**: +- Whether to support hybrid kernels (some args as entry-point params, some in CallData) or only all-or-nothing +- Handle entry-point parameter size limits (CUDA ~4KB root constants, D3D12 64 DWORD root signature limit) +- Whether to inline the function call directly in compute_main for prim mode, or keep a simplified trampoline + +--- + +### Gating Tests — Pre-Implementation Checklist + +Before implementing any phase, add **gating tests** to [slangpy/tests/slangpy_tests/test_kernel_gen.py](slangpy/tests/slangpy_tests/test_kernel_gen.py) that assert the CURRENT generated kernel patterns. 
These tests document the baseline and will intentionally break as each simplification step is implemented. + +**Design principles:** +- All gating tests are code-generation-only (no GPU dispatch) — fast and deterministic +- All tests use the existing `generate_code()` helper → `func.debug_build_call_data()` → `cd.code` +- Tests are parametrized across `helpers.DEFAULT_DEVICE_TYPES` +- String matching (substring checks) rather than regex or golden files +- Named `test_gate_*` for easy identification +- WangHashArg and dict/composite tests serve as "negative gates" — they remain passing after simplification + +**Test infrastructure** (already present in `test_kernel_gen.py`): +```python +def assert_contains(code: str, *patterns: str) -> None +def assert_not_contains(code: str, *patterns: str) -> None +def assert_trampoline_has(code: str, *stmts: str) -> None +def generate_code(device, func_name, module_source, *args, **kwargs) -> str +def generate_bwds_code(device, func_name, module_source, *args, **kwargs) -> str +``` + +**Phase 2 gating tests to add** (assert CURRENT behavior, will break on implementation): + +| Test | Asserts (current behavior) | Breaks when | +|------|---------------------------|-------------| +| `test_gate_calldata_struct_present` | `struct CallData` present | Step 2.1 | +| `test_gate_calldata_uniform_param` | `uniform CallData call_data` in `compute_main` | Step 2.2 | +| `test_gate_thread_count_in_calldata` | `call_data._thread_count` in kernel body | Step 2.4 | +| `test_gate_context_from_calldata` | `Context __slangpy_context__` present | Step 2.4 | +| `test_gate_trampoline_present_for_prim` | `void _trampoline(` present | Step 2.5 | +| `test_gate_trampoline_calls_function` | `_result = add(a, b)` inside trampoline | Step 2.5 | +| `test_gate_kernel_calls_trampoline` | `_trampoline(` inside `compute_main` | Step 2.5 | +| `test_gate_wanghasharg_forces_calldata` (negative) | `struct CallData` present with non-eligible arg | Must stay passing | + 
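
As a rough illustration of how such a gate works, the sketch below gives plausible bodies for two of the helpers listed above and runs them against a hand-written kernel excerpt. The helper bodies and `SAMPLE_KERNEL` are assumptions for illustration only. The real helpers live in `test_kernel_gen.py`, and the real tests obtain the code via `generate_code()`, which needs a device.

```python
def assert_contains(code: str, *patterns: str) -> None:
    """Plain substring check: every pattern must appear in the code."""
    missing = [p for p in patterns if p not in code]
    assert not missing, f"expected patterns not found: {missing}"


def assert_not_contains(code: str, *patterns: str) -> None:
    """Plain substring check: no pattern may appear in the code."""
    present = [p for p in patterns if p in code]
    assert not present, f"unexpected patterns found: {present}"


# Abbreviated, hand-written stand-in for generated post-Phase 1 code.
SAMPLE_KERNEL = """
struct CallData {
    uint3 _thread_count;
};
void _trampoline(CallData call_data) {
    int _result;
    _result = add(a, b);
}
void compute_main(uniform CallData call_data) {
    _trampoline(call_data);
}
"""


def check_current_baseline(code: str) -> None:
    """Asserts the CURRENT patterns; expected to break once Phase 2 lands."""
    assert_contains(
        code,
        "struct CallData",
        "uniform CallData call_data",
        "void _trampoline(",
        "_result = add(a, b)",
    )


check_current_baseline(SAMPLE_KERNEL)
```

Once a step is implemented, the corresponding gate is flipped by swapping `assert_contains` for `assert_not_contains` on the same pattern.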
+--- + +### Verification (all phases) + +```bash +# Build first (required) +cmake --build --preset windows-msvc-debug + +# Run kernel gen tests +$env:PRINT_TEST_KERNEL_GEN="1"; pytest slangpy/tests/slangpy_tests/test_kernel_gen.py -v + +# Run full test suite +pytest slangpy/tests -v + +# Run pre-commit +pre-commit run --all-files +``` + +### Key Decisions + +- Phase 1 changes both `gen_calldata` and trampoline load/store (TensorView-complete pattern, not partial) +- All dim-0 non-composite types are eligible, excluding writable value refs (which need buffer logic) +- Phase 2 targets both `entry_point` (CUDA) and `global_data` (Vulkan/D3D12) modes +- Autograd (bwds mode) is included in simplification, but implemented after prim mode within each phase +- WangHashArg explicitly excluded from direct binding (needs per-thread `thread_id` computation) diff --git a/.github/prompts/plan-simplifyKernelGenPhase2-cleanup.prompt.md b/.github/prompts/plan-simplifyKernelGenPhase2-cleanup.prompt.md new file mode 100644 index 000000000..9aedfefcd --- /dev/null +++ b/.github/prompts/plan-simplifyKernelGenPhase2-cleanup.prompt.md @@ -0,0 +1,611 @@ +## Phase 2: Eliminate CallData Struct + +**Goal**: Move kernel uniforms out of the `CallData` struct into individual entry-point parameters. Eliminate the trampoline in forward (prim) mode. Fall back to `ParameterBlock` when total inline-uniform size exceeds a runtime per-device threshold. + +**Parent plan**: [plan-simplifyKernelGen.prompt.md](plan-simplifyKernelGen.prompt.md) + +--- + +### Key Architectural Decisions + +These decisions correct several assumptions in the original plan: + +1. **Entry-point param placement is orthogonal to `direct_bind`.** Any type — wrapped or raw — can be an entry-point parameter (e.g., `uniform ValueType a` or `uniform int a` or `uniform Tensor t`). 
`direct_bind` governs whether `__slangpy_load`/`__slangpy_store` is needed inside the kernel; entry-point placement governs where the uniform lives in the shader layout. + +2. **Trampoline elimination is independent of `direct_bind`.** The current trampoline body is: declare locals → load (direct assignment or `__slangpy_load`) → call function → store (`__slangpy_store`). All of that can appear directly in `compute_main`. The trampoline only exists because bwds mode needs a `[Differentiable]` wrapper for `bwd_diff()`. In prim mode, it is eliminated regardless of whether args use wrappers. + +3. **All-or-nothing fallback.** When total inline-uniform size exceeds the platform threshold, ALL args go back into `ParameterBlock` (the current path). No hybrid mixing of entry-point params and CallData. + +4. **Shape arrays and `_thread_count` obey the same rules** as user args — they become entry-point params by default, and go into `CallData` on fallback. Phase 2 is NOT scoped only to `call_data_len == 0`. + +5. **Two code paths based on where data lives:** + - **Fast path** (entry-point params): In Slang, uniforms are entry-point parameters and can be used directly (in forward) or passed directly to the trampoline (in backward). + - **Fallback path** (`ParameterBlock`): In Slang, uniforms live in a `CallData` struct. They must be read into local variables before being used (in forward) or passed to the trampoline (in backward). This is the current behavior. + +6. **C++ dispatch changes are isolated to `NativeCallData::exec`.** Marshalls receive a `ShaderCursor` pointing to wherever their data lives — they don't care whether it's inside a `CallData` struct or an entry-point param. In the fast path, `m_runtime->write_shader_cursor_pre_dispatch()` receives the entry-point cursor directly. No marshall code changes needed. + +7. **`CallDataMode` is eliminated.** The `global_data` vs `entry_point` distinction is removed entirely. 
On the fast path, all backends use entry-point params uniformly. On the fallback path, all backends use `ParameterBlock` — CUDA supports `ParameterBlock` and in practice will never hit the fallback (CUDA's inline-uniform limit is ~4KB). This removes the `CallDataMode` enum, the CUDA-specific `is_entry_point` codegen branch in `callsignature.py`, and the corresponding C++ branch in `slangpy.cpp`. + +8. **`PackedArg` / param-block types are unchanged.** They stay as `ParameterBlock` at module scope, orthogonal to Phase 2. + +--- + +### Current Kernel Structure (post-Phase 1) + +For `int add(int a, int b)` with scalar args `(1, 2)`: + +```slang +import "module"; +import "slangpy"; + +typealias _t_a = int; // Phase 1: raw type (was ValueType) +typealias _t__result = RWValueRef; // writable _result still wrapped +static const int _m__result = 0; // mapping constant only for _result + +struct CallData { + _t_a a; + _t_a b; + _t__result _result; + uint3 _thread_count; +}; + +void _trampoline(Context __slangpy_context__, CallData __calldata__) { + int a; + a = __calldata__.a; // Phase 1: direct assignment + int b; + b = __calldata__.b; // Phase 1: direct assignment + int _result; + _result = add(a, b); + __calldata__._result.__slangpy_store(__slangpy_context__.map(_m__result), _result); +} + +[shader("compute")] [numthreads(32,1,1)] +void compute_main(int3 flat_call_thread_id: SV_DispatchThreadID, ..., uniform CallData call_data) { + if (any(flat_call_thread_id >= call_data._thread_count)) return; + Context __slangpy_context__ = {flat_call_thread_id}; + _trampoline(__slangpy_context__, call_data); +} +``` + +### Target Kernel (Phase 2 fast path, prim mode, all direct-bind) + +```slang +import "module"; + +[shader("compute")] +[numthreads(32, 1, 1)] +void compute_main(int3 tid: SV_DispatchThreadID, + uniform uint3 _thread_count, + uniform int a, + uniform int b, + uniform RWStructuredBuffer _result) +{ + if (any(tid >= _thread_count)) return; + _result[0] = add(a, b); +} 
+``` + +### Target Kernel (Phase 2 fast path, prim mode, mixed direct/non-direct-bind) + +When some args are not direct-bind (e.g., WangHashArg needs per-thread `thread_id` via `__slangpy_load`), the non-direct-bind args still use their wrapper types as entry-point params. Context is needed: + +```slang +import "module"; +import "slangpy"; + +typealias _t_rng = WangHashArgType; // non-direct-bind wrapper type +static const int _m_rng = 0; + +[shader("compute")] +[numthreads(32, 1, 1)] +void compute_main(int3 flat_call_thread_id: SV_DispatchThreadID, + uniform uint3 _thread_count, + uniform _t_rng rng, + uniform int x, + uniform RWStructuredBuffer _result) +{ + if (any(flat_call_thread_id >= _thread_count)) return; + Context __slangpy_context__ = {flat_call_thread_id}; + int _rng_val; + rng.__slangpy_load(__slangpy_context__.map(_m_rng), _rng_val); + int _x_val; + _x_val = x; + int _result_val; + _result_val = func(_rng_val, _x_val); + _result[0] = _result_val; +} +``` + +### Target Kernel (Phase 2 fallback path, prim mode) + +When entry-point param size exceeds the platform limit, all args go into `ParameterBlock`. The trampoline is still eliminated in prim mode — the load/call/store is inlined into `compute_main`, reading from `call_data`: + +```slang +import "module"; +import "slangpy"; + +typealias _t_a = int; +typealias _t__result = RWValueRef; +static const int _m__result = 0; + +struct CallData { + _t_a a; + _t_a b; + _t__result _result; + uint3 _thread_count; +}; +ParameterBlock call_data; + +[shader("compute")] +[numthreads(32, 1, 1)] +void compute_main(int3 flat_call_thread_id: SV_DispatchThreadID, ...) 
{ + if (any(flat_call_thread_id >= call_data._thread_count)) return; + Context __slangpy_context__ = {flat_call_thread_id}; + int a; + a = call_data.a; + int b; + b = call_data.b; + int _result; + _result = add(a, b); + call_data._result.__slangpy_store(__slangpy_context__.map(_m__result), _result); +} +``` + +--- + +### Step 2.0: Gating tests ✅ + +**Status: DONE** + +Tests added to [slangpy/tests/slangpy_tests/test_kernel_gen.py](slangpy/tests/slangpy_tests/test_kernel_gen.py). All 21 parametrized cases (7 tests × 3 device types) pass. + +| Test | Source | Args | Original assertion | Status | +|------|--------|------|--------------------|--------| +| `test_gate_p2_calldata_struct_present` | `int add(int a, int b)` | `(1, 2)` | `struct CallData` in code | ✅ Flipped — now asserts `struct CallData` ABSENT (Step 2.2 done) | +| `test_gate_p2_calldata_uniform_param` | same | same | `uniform CallData call_data` or `ParameterBlock` | ✅ Flipped — now asserts both ABSENT (Step 2.2 done) | +| `test_gate_p2_thread_count_in_calldata` | same | same | `call_data._thread_count` | ✅ Flipped — now asserts ABSENT (Step 2.2 done) | +| `test_gate_p2_trampoline_present_for_prim` | same | same | `void _trampoline(` present | Still asserts present (Step 2.3 pending) | +| `test_gate_p2_kernel_calls_trampoline` | same | same | `_trampoline(` in `compute_main` body | Still asserts present (Step 2.3 pending) | +| `test_gate_p2_sv_group_id_present` | same | same | `SV_GroupID` in `compute_main` signature | ✅ Flipped — now asserts ABSENT for dim-0 calls (Step 2.2 done) | + +Negative gates (must stay passing after Phase 2): + +| Test | Asserts | +|------|---------| +| `test_gate_p2_wanghasharg_keeps_load` | Non-direct-bind arg still uses `__slangpy_load` | + +Bwds gates: + +| Test | Status | +|------|--------| +| `test_gate_scalar_uses_valuetype` | ✅ Passing — asserts fast-path trampoline with `__in_` prefix params | +| `test_gate_bwds_scalar_uses_valuetype` | ✅ Passing — bwds trampoline has 
`no_diff` on all params (Step 2.4 done) | + +--- + +### Step 2.1: Determine fast vs fallback path ✅ + +**Status: DONE** + +In [slangpy/core/calldata.py](slangpy/core/calldata.py), after `calculate_direct_binding(bindings)`: + +1. **Query a runtime per-device threshold** for max entry-point parameter inline-uniform size. This is a property of the device/backend — large for D3D12/CUDA (thousands of bytes), potentially as low as 128–256 bytes on Vulkan. +2. **Accumulate inline-uniform byte size** of each bound variable's `calldata_type_name`, plus `_thread_count` (12 bytes) and shape arrays (`call_data_len * 3 * sizeof(int)` for `_grid_stride`, `_grid_dim`, `_call_dim`). **Resource types** (`RWStructuredBuffer`, `Texture2D`, `TensorView`, etc.) don't count — they are bound as descriptors, not inline data. +3. **Decision**: If total size ≤ threshold → `self.use_entrypoint_args = True` (fast path). Otherwise → `self.use_entrypoint_args = False` (fallback path — current behavior). +4. **Store** `use_entrypoint_args` on the `CallData` instance and propagate to C++ `NativeCallData`. + +`PackedArg` / param-block types are excluded from this accounting — they stay as `ParameterBlock` regardless. + +**Implementation details:** + +- `DeviceLimits.max_entry_point_uniform_size` added to C++ struct ([device.h](src/sgl/device/device.h)) with per-backend defaults: Vulkan=128, D3D12=256, CUDA=4096 bytes ([device.cpp](src/sgl/device/device.cpp)). +- `calculate_inline_uniform_size()` added to [callsignature.py](slangpy/core/callsignature.py) — sums `vector_type.uniform_layout.size` for each depth-0 bound variable (skipping `PackedArg`), plus 12 bytes for `_thread_count` and `call_dimensionality * 4 * 3` for shape arrays. +- `use_entrypoint_args` property added to `NativeCallData` C++ class ([slangpy.h](src/slangpy_ext/utils/slangpy.h)) with Python binding. 
+- `CallData.__init__()` in [calldata.py](slangpy/core/calldata.py) sets `self.use_entrypoint_args = inline_size <= threshold` after `calculate_direct_binding()`. + +**Tests** (7 tests × 3 device types = 21 parametrized cases, all pass): + +| Test | Asserts | +|------|---------| +| `test_step21_scalar_uses_entrypoint_args` | Simple `int add(int,int)` with `(1,2)` → `use_entrypoint_args=True` | +| `test_step21_threshold_property_positive` | `device.info.limits.max_entry_point_uniform_size > 0` | +| `test_step21_vector_uses_entrypoint_args` | `float3` args → `use_entrypoint_args=True` | +| `test_step21_struct_uses_entrypoint_args` | All-scalar struct dict → `use_entrypoint_args=True` | +| `test_step21_tensor_uses_entrypoint_args` | Tensor (descriptor-only, 0 inline bytes) → `use_entrypoint_args=True` | +| `test_step21_many_float4x4_may_exceed_vulkan` | 8×float4x4 (524 bytes) exceeds Vulkan/D3D12 thresholds, not CUDA | +| `test_step21_wanghasharg_uses_entrypoint_args` | Non-direct-bind WangHashArg with small inline size → `use_entrypoint_args=True` | + +--- + +### Step 2.2: Code generation — entry-point params (fast path) ✅ + +**Status: DONE** + +In [slangpy/core/callsignature.py](slangpy/core/callsignature.py) `generate_code()`, when `use_entrypoint_args == True`: + +**CodeGen changes** in [slangpy/bindings/codegen.py](slangpy/bindings/codegen.py): +- Add a `skip_call_data` flag to `CodeGen.__init__`. When `True`, don't emit `struct CallData` / `begin_block()` and gate `end_block()` in `finish()`. +- Add `self.entry_point_params: list[str] = []` to collect individual uniform param declarations. +- `finish()` ignores the `call_data` block and `use_param_block_for_call_data` when `skip_call_data` is set. + +**CallData struct elimination**: Set `cg.skip_call_data = True` when `use_entrypoint_args`. No `struct CallData` emitted. 
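
The `use_entrypoint_args` flag that drives `skip_call_data` comes from the Step 2.1 size accounting. A minimal sketch of that arithmetic follows, using the per-backend defaults quoted above. The function names are hypothetical, and the per-argument sizes are supplied by hand rather than read from `vector_type.uniform_layout.size`.

```python
from typing import Dict, Iterable

THREAD_COUNT_BYTES = 12  # uint3 _thread_count

# Per-backend defaults quoted in the plan, in bytes.
DEFAULT_THRESHOLDS: Dict[str, int] = {"vulkan": 128, "d3d12": 256, "cuda": 4096}


def inline_uniform_size(arg_sizes: Iterable[int], call_dimensionality: int) -> int:
    """Sum inline bytes; resource types should be passed as 0 (descriptor-only)."""
    shape_bytes = call_dimensionality * 4 * 3  # _grid_stride, _grid_dim, _call_dim
    return sum(arg_sizes) + THREAD_COUNT_BYTES + shape_bytes


def use_entrypoint_args(total_bytes: int, threshold: int) -> bool:
    """Fast path when the total fits; otherwise all args fall back together."""
    return total_bytes <= threshold


# Eight float4x4 arguments (64 bytes each), dim-0 call: 8 * 64 + 12 = 524 bytes.
big = inline_uniform_size([64] * 8, call_dimensionality=0)
decisions = {name: use_entrypoint_args(big, t) for name, t in DEFAULT_THRESHOLDS.items()}

# Two int arguments: 4 + 4 + 12 = 20 bytes, which fits every default threshold.
small = inline_uniform_size([4, 4], call_dimensionality=0)
```

With these defaults the 524-byte case selects the fallback on Vulkan and D3D12 and the fast path on CUDA, matching `test_step21_many_float4x4_may_exceed_vulkan`.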
+ +**`gen_call_data_code` change** in [slangpy/bindings/boundvariable.py](slangpy/bindings/boundvariable.py): At `depth == 0`, when `use_entrypoint_args`, append to `cg.entry_point_params` instead of `cg.call_data.declare(...)`. The `call_data_structs` block (type aliases, wrapper structs, mapping constants) still gets emitted at module scope. + +**`_thread_count` and shape arrays**: Instead of `cg.call_data.append_statement("uint3 _thread_count")`, append to `cg.entry_point_params`. Same for `_grid_stride`, `_grid_dim`, `_call_dim` when `call_data_len > 0`. + +**Entry-point signature**: `compute_main` signature becomes: +```slang +void compute_main( + int3 flat_call_thread_id: SV_DispatchThreadID, + [int3 flat_call_group_id: SV_GroupID,] // only when call_data_len > 0 + [int flat_call_group_thread_id: SV_GroupIndex,] // only when call_data_len > 0 + uniform uint3 _thread_count, + [uniform int[N] _grid_stride, ...] // only when call_data_len > 0 + uniform _t_a a, + uniform _t_b b, + uniform _t__result _result +) +``` + +Drop `SV_GroupID` and `SV_GroupIndex` when `call_data_len == 0` — they feed `init_thread_local_call_shape_info` which isn't called when there are no shape arrays. + +**Bounds check**: Changes from `call_data._thread_count` to just `_thread_count`. + +**Shape info init**: Changes from `call_data._grid_stride` etc. to just `_grid_stride`, `_grid_dim`, `_call_dim`. + +**Fallback path** (`use_entrypoint_args == False`): `struct CallData` is emitted with `ParameterBlock call_data` at module scope on ALL backends (including CUDA). The old `CallDataMode` distinction between `entry_point` (CUDA) and `global_data` (non-CUDA) is removed — `ParameterBlock` works on CUDA, and in practice CUDA will never hit the fallback due to its large (~4KB) inline-uniform limit. 
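
The fast-path signature rules above (system-value params, then `_thread_count`, then shape arrays, then user args) can be sketched as a small string builder. `build_compute_main_signature` is a hypothetical helper, and sizing the shape arrays as `int[call_data_len]` is an assumption based on the `int[N]` placeholder in the template. The actual emission happens inside `generate_code()`.

```python
from typing import List, Sequence, Tuple


def build_compute_main_signature(
    user_params: Sequence[Tuple[str, str]],  # (type alias, variable name) pairs
    call_data_len: int,
) -> str:
    """Assemble the fast-path compute_main signature as Slang source text."""
    params: List[str] = ["int3 flat_call_thread_id: SV_DispatchThreadID"]

    # Group IDs only feed the shape-info init, so drop them for dim-0 calls.
    if call_data_len > 0:
        params.append("int3 flat_call_group_id: SV_GroupID")
        params.append("int flat_call_group_thread_id: SV_GroupIndex")

    params.append("uniform uint3 _thread_count")

    if call_data_len > 0:
        for name in ("_grid_stride", "_grid_dim", "_call_dim"):
            params.append(f"uniform int[{call_data_len}] {name}")

    params.extend(f"uniform {type_name} {name}" for type_name, name in user_params)
    return "void compute_main(\n    " + ",\n    ".join(params) + "\n)"


args = [("_t_a", "a"), ("_t_b", "b"), ("_t__result", "_result")]
dim0_signature = build_compute_main_signature(args, call_data_len=0)
dim2_signature = build_compute_main_signature(args, call_data_len=2)
```

Keeping the parameter list as data until the final join mirrors the `cg.entry_point_params` list described above, so metadata fields and user args go through the same path.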
+ +See [slangpy/tests/device/test_pipeline_utils.slang](slangpy/tests/device/test_pipeline_utils.slang) for examples of manually-written compute shaders that use entry point parameters on all backends (CUDA, Vulkan, D3D12). + +--- + +### Step 2.3: Trampoline elimination for prim mode + +**Status: NOT STARTED** — Trampoline is still generated for prim mode on both paths. The load/call/store sequence needs to be inlined into `compute_main`. + +When `call_mode == prim` — on **both** fast and fallback paths: + +- Don't generate the `_trampoline` function. +- Inline the load/call/store sequence directly into `compute_main` after the bounds check and (if needed) Context construction. +- The load/call/store codegen reuses the same logic currently in [callsignature.py lines 378–449](slangpy/core/callsignature.py#L378-L449), but emitted into `cg.kernel` instead of `cg.trampoline` with adjusted `data_name`: + +| Path | `data_name` for non-param-block args | +|------|-------------------------------------| +| Fast | `x.variable_name` (entry-point param name directly) | +| Fallback | `call_data.{x.variable_name}` (global `ParameterBlock`, all backends) | +| Param blocks | `_param_{x.variable_name}` (unchanged) | + +**Context construction**: Needed only when any arg is non-direct-bind (i.e., calls `__slangpy_load`/`__slangpy_store`). When all args satisfy `direct_bind == True`, skip Context construction entirely — no `Context __slangpy_context__` declaration, no `import "slangpy"`. + +**Note**: The trampoline elimination does NOT depend on `direct_bind`. Even non-direct-bind args with `__slangpy_load` work inline in `compute_main` — the `__slangpy_load` call just needs the data reference and a `Context` value, both available in `compute_main`. + +--- + +### Step 2.4: Trampoline with individual params for bwds mode ✅ + +**Status: DONE** — Fast-path trampoline takes individual params with `no_diff` on all params. All 3 device types pass. 
+ +When `call_mode == bwds`: + +- Still generate a `[Differentiable]` trampoline function. +- **Fast path**: Trampoline takes individual params instead of a struct. All params get `no_diff` — entry-point uniforms are never differentiable. Differentiation happens through local variable assignments inside the trampoline body, matching the struct-based approach where `CallData` was implicitly non-differentiable. No `in`/`out`/`inout` modifiers are added — `compute_main` passes its uniforms straight through: + ```slang + [Differentiable] + void _trampoline(Context __slangpy_context__, no_diff float __in_a, no_diff float __in_b, no_diff NoneType __in__result) + ``` + `compute_main` calls `bwd_diff(_trampoline)(__slangpy_context__, a, b, _result)` passing entry-point param names directly. +- **Fallback path**: Trampoline reads from global `ParameterBlock call_data` as it does today (on all backends). `compute_main` calls `bwd_diff(_trampoline)(__slangpy_context__, call_data)`. +- `_gen_trampoline_argument()` in `boundvariable.py` remains unused dead code — the inline generation in `callsignature.py` is simpler and avoids the `in`/`out`/`inout` modifiers that caused Slang autodiff errors. + +**Key insight**: Adding `in`/`out`/`inout` modifiers to trampoline params caused Slang autodiff issues (e.g., `out` params get reversed to `in` by `bwd_diff`, changing arity). The trampoline params are just pass-through uniforms — all data flow logic (loads, stores, differentiation) is handled internally via local variables. + +--- + +### Step 2.5: C++ dispatch changes ✅ + +**Status: DONE** — `CallDataMode` enum fully removed. Fast path uses `find_entry_point(0)` on all backends. Fallback path uses global `ParameterBlock` on all backends. + +In [src/slangpy_ext/utils/slangpy.cpp](src/slangpy_ext/utils/slangpy.cpp), store `m_use_entrypoint_args` on `NativeCallData` (received from Python `CallData`). Also add to [slangpy.h](src/slangpy_ext/utils/slangpy.h). 
+ +Modify `bind_call_data` lambda in `exec()`: + +**Fast path** (`m_use_entrypoint_args == true`): +- All backends: Navigate via `cursor.find_entry_point(0)`. This is the entry-point cursor. +- Write `_thread_count` as an entry-point param: `entry_point_cursor["_thread_count"]`. +- Write shape arrays as entry-point params: `entry_point_cursor["_grid_stride"]`, etc. +- Pass `entry_point_cursor` as the `call_data_cursor` argument to `m_runtime->write_shader_cursor_pre_dispatch()`. Each `NativeBoundVariableRuntime` already navigates `cursor[m_variable_name]`, so it finds the entry-point param by name automatically. **No marshall code changes needed.** +- Cache entry-point param field indices on first call (analogous to existing `m_cached_call_data_offsets`). +- The `reserve_data` + raw-pointer optimization for `_thread_count` and shape arrays may not work for individual entry-point params at disjoint offsets. Use cursor-based writes for these metadata fields (they're small, performance impact minimal), or check if `reserve_data` still works across the entry-point shader object. + +**Fallback path** (`m_use_entrypoint_args == false`): +- All backends: Navigate to global `call_data` field via `cursor.find_field("call_data")`, dereference (it's a `ParameterBlock`), write struct data. The old `CallDataMode` branch (CUDA using `find_entry_point(0)` for call_data) is removed. Remove `m_call_data_mode`, `CallDataMode` enum, and all associated branches from `slangpy.h`, `slangpy.cpp`, `calldata.py`, and `callsignature.py`. + +--- + +### Step 2.6: `_result` handling + +**Status: NOT STARTED** + +Auto-created `_result` is a writable `ValueRef`, currently NOT direct-bind eligible (needs `RWValueRef` wrapper with buffer logic). Phase 2 handles this differently on the two paths: + +**Fast path**: `_result` is emitted as `uniform RWValueRef _result` on the entry point. In prim mode, the inlined code stores via `_result.__slangpy_store(...)`. 
In the all-direct-bind case where Context is omitted, add a new code path: emit `uniform RWStructuredBuffer _result` with `_result[0] = value` for the store. This requires `ValueRefMarshall` to support writable direct-bind for the entry-point-param case specifically, using `RWStructuredBuffer` instead of `RWValueRef`. + +**Fallback path**: `_result` stays as `RWValueRef` inside `CallData`, same as current behavior. + +**Implementation note**: The `RWStructuredBuffer` approach for `_result` is only used when `use_entrypoint_args == True` AND all other args are direct-bind (so Context can be omitted). When non-direct-bind args are present, Context exists and `_result` can continue to use `RWValueRef.__slangpy_store(context, value)`. + +--- + +### Step 2.7: Tests + +**Status: PARTIAL** — Tests for completed Phase 2 steps added to [test_code_gen.py](slangpy/tests/slangpy_tests/test_code_gen.py). Remaining tests for Step 2.3 (trampoline elimination) and Step 2.6 (`_result` as `RWStructuredBuffer`) will be added when those steps are implemented. 
+ +**Tests added** (in [test_code_gen.py](slangpy/tests/slangpy_tests/test_code_gen.py), tests 35–38, 40): + +| Test | Verifies | Merges from test_kernel_gen.py | +|------|----------|-------------------------------| +| `test_entrypoint_params_scalar_dim0` (#35) | Fast path: no `struct CallData`, individual `uniform` params, `_thread_count` direct, `SV_GroupID` absent at dim-0, `use_entrypoint_args=True` | `test_gate_p2_calldata_struct_absent_fast_path`, `test_gate_p2_individual_uniform_params`, `test_gate_p2_thread_count_direct`, `test_gate_p2_sv_group_id_absent_dim0`, `test_step21_scalar_uses_entrypoint_args` | +| `test_entrypoint_params_vectorized` (#36) | Vectorized fast path: shape arrays as entry-point params, `SV_GroupID`/`SV_GroupIndex` present, no `struct CallData` | (new — covers vectorized entry-point param path) | +| `test_entrypoint_params_non_direct_bind` (#37) | Non-direct-bind arg (WangHashArg) on fast path: no `struct CallData`, wrapper type used, `__slangpy_load`/`Context` present | `test_gate_p2_wanghasharg_keeps_load`, `test_step21_wanghasharg_uses_entrypoint_args` | +| `test_bwds_entrypoint_no_diff_params` (#38) | Bwds fast path: trampoline params have `no_diff` and `__in_` prefix, `bwd_diff(_trampoline)` passes individual args, `[Differentiable]` before trampoline | (new — covers Step 2.4 bwds trampoline) | +| `test_fallback_calldata_large_params` (#40) | Fallback path: 8×float4x4 exceeds threshold → `ParameterBlock`, `call_data._thread_count`; CUDA stays fast path | `test_step21_many_float4x4_may_exceed_vulkan` (adds codegen assertions) | + +**Post-implementation tests** — to be added when remaining steps are complete: + +| Test | Verifies | Blocked on | +|------|----------|------------| +| `test_phase2_no_trampoline_prim` | No `void _trampoline(` for prim-mode calls | Step 2.3 | +| `test_phase2_inline_call` | Function call inlined directly in `compute_main` | Step 2.3 | +| `test_phase2_no_context_all_direct` | No `Context __slangpy_context__` 
when all args direct-bind | Step 2.3 | +| `test_phase2_fallback_no_trampoline_prim` | Even fallback path eliminates trampoline in prim mode | Step 2.3 | + +--- + +### Implementation Order + +1. **Step 2.0** ✅ — Gating tests (baseline documentation) +2. **Step 2.1** ✅ — Fast/fallback determination + size query +3. **Step 2.2 + 2.5** ✅ — Code gen + C++ dispatch for entry-point params + `CallDataMode` removal (landed together) +4. **Step 2.4** ✅ — Bwds trampoline with individual params (fast path) — `no_diff` on all params +5. **Step 2.3** — Trampoline elimination for prim mode (both paths) +6. **Step 2.6** — `_result` as `RWStructuredBuffer` for all-direct-bind case +7. **Step 2.7** — Post-implementation tests + functional tests + +**Note:** Implementation order deviated from original plan — Steps 2.2 + 2.5 were done before 2.3 (trampoline elimination), combined with `CallDataMode` removal. Step 2.4 done — all trampoline params use `no_diff` without IO modifiers. + +--- + +### Key Files + +| File | Changes | +|------|---------| +| [slangpy/core/calldata.py](slangpy/core/calldata.py) | ✅ `use_entrypoint_args` flag, size threshold check, `CallDataMode` removed | +| [slangpy/core/callsignature.py](slangpy/core/callsignature.py) | ✅ Entry-point params, fast/fallback code paths, `is_entry_point` branch removed. Trampoline still generated (Step 2.3 pending). Bwds `no_diff` on all trampoline params (Step 2.4 done). | +| [slangpy/bindings/codegen.py](slangpy/bindings/codegen.py) | ✅ `skip_call_data` flag, `entry_point_params` list | +| [slangpy/bindings/boundvariable.py](slangpy/bindings/boundvariable.py) | ✅ `gen_call_data_code` depth-0 entry-point path. `_gen_trampoline_argument()` unused — inline generation in `callsignature.py` used instead. 
| +| [slangpy/bindings/marshall.py](slangpy/bindings/marshall.py) | ✅ `use_entrypoint_args` field on `BindContext`, `CallDataMode` removed | +| [src/slangpy_ext/utils/slangpy.cpp](src/slangpy_ext/utils/slangpy.cpp) | ✅ `use_entrypoint_args` binding; `bind_call_data` fast path via `find_entry_point(0)`, `CallDataMode` branches removed | +| [src/slangpy_ext/utils/slangpy.h](src/slangpy_ext/utils/slangpy.h) | ✅ `m_use_entrypoint_args` on `NativeCallData`; `m_call_data_mode` removed | +| [src/sgl/device/device.h](src/sgl/device/device.h) | ✅ `max_entry_point_uniform_size` on `DeviceLimits` | +| [src/sgl/device/device.cpp](src/sgl/device/device.cpp) | ✅ Per-backend defaults for `max_entry_point_uniform_size` | +| [src/slangpy_ext/device/device.cpp](src/slangpy_ext/device/device.cpp) | ✅ Python binding for `max_entry_point_uniform_size` | +| [src/sgl/utils/slangpy.h](src/sgl/utils/slangpy.h) | ✅ `CallDataMode` enum removed | +| [slangpy/core/dispatchdata.py](slangpy/core/dispatchdata.py) | ✅ `CallDataMode` removed | +| [slangpy/core/packedarg.py](slangpy/core/packedarg.py) | ✅ `CallDataMode` removed | +| [slangpy/core/function.py](slangpy/core/function.py) | ✅ `CallDataMode` removed from imports | +| [slangpy/slangpy/__init__.pyi](slangpy/slangpy/__init__.pyi) | ✅ `CallDataMode` class and `call_data_mode` property removed | +| [slangpy/tests/slangpy_tests/test_type_resolution.py](slangpy/tests/slangpy_tests/test_type_resolution.py) | ✅ `CallDataMode` removed from `BindContext` creation | +| [slangpy/tests/slangpy_tests/test_kernel_gen.py](slangpy/tests/slangpy_tests/test_kernel_gen.py) | ✅ Gating tests + Step 2.1 tests updated for new behavior | +| [slangpy/tests/slangpy_tests/test_code_gen.py](slangpy/tests/slangpy_tests/test_code_gen.py) | ✅ Phase 2 tests 35–38, 40 added (Step 2.7 partial) | + +--- + +### Verification + +```bash +# Build first (required) +cmake --build --preset windows-msvc-debug + +# Run kernel gen tests +$env:PRINT_TEST_KERNEL_GEN="1"; pytest 
slangpy/tests/slangpy_tests/test_kernel_gen.py -v + +# Run full test suite +pytest slangpy/tests -v + +# Run pre-commit +pre-commit run --all-files +``` + +--- + +### PR #862 Code Review — Proposed Improvements + +#### High Severity + +**1. Potential correctness bug — fast-path shape offset caching guarded by runtime data** + +In [slangpy.cpp](src/slangpy_ext/utils/slangpy.cpp) `bind_call_data`, the fast-path caching block guards shape offset caching with `call_shape.size() > 0`. If the *first* call to a multi-dimensional `NativeCallData` uses `has_thread_count=true` (which returns empty `call_shape`), shape offsets won't be cached. A subsequent normal call would find `is_valid == true` but shape offsets would be uninitialized, leading to writes at garbage offsets. The fallback path is more robust, using `call_dim.is_valid()` instead. + +**DO NOT FIX**: Reason: The '_thread_count' is written to the call signature, so by definition a given call data would never be used in both situations. + +**2. Benchmark changes are debugging artifacts** + +[test_benchmark_autograd.py](slangpy/benchmarks/test_benchmark_autograd.py) changes `ITERATIONS` 10→100, `WARMUPS` 10→1000, `RUN_SLANGTORCH_BENCHMARK` False→True. This will make CI benchmarks 10–100× slower. Revert to original values. + +**FIXED**: Restored `ITERATIONS=10`, `WARMUPS=10`, `RUN_SLANGTORCH_BENCHMARK=False`. + +**3. Overly broad `except Exception` in calldata.py fallback** + +[calldata.py](slangpy/core/calldata.py): The fallback from fast path to `ParameterBlock` catches `except Exception`, which swallows `TypeError`, `KeyError`, `AttributeError`, etc. The caught exception `e` is never logged. + +**FIXED**: Narrowed to `except RuntimeError as e` and included `str(e)` in the debug message. + +--- + +#### Medium Severity — Structural + +**4. 
`generate_code()` in callsignature.py is too long (~334 lines)** + +Extract into sub-functions: + +| Lines | Extract to | Purpose | +|-------|-----------|---------| +| ~L294–L339 | `_validate_and_compute_group_shape()` | Group shape validation & stride computation | +| ~L341–L388 | `_generate_link_time_constants()` | Link-time constants (group shape/stride arrays) | +| ~L390–L409 | `_generate_shape_params()` | Shape array & `_thread_count` param gen (fast/fallback) | +| ~L415–L517 | `_generate_trampoline()` | Trampoline function (signature, loads, call, stores) | +| ~L520–L565 | `_generate_entry_point_signature()` | Compute/ray-tracing entry-point signature | +| ~L567–L604 | `_generate_kernel_body()` | Kernel body (bounds check, shape init, dispatch) | + +Additionally, the duplicated `data_name` computation at ~L449 and ~L497 should be extracted: +```python +def _data_name(x: BoundVariable, use_entrypoint_args: bool) -> str: + if x.create_param_block: + return f"_param_{x.variable_name}" + return f"__in_{x.variable_name}" if use_entrypoint_args else f"call_data.{x.variable_name}" +``` + +**DO NOT FIX** Reason: This is a complex change and will be deferred to a later step. + +**5. `bind_call_data` in slangpy.cpp has ~70 lines of duplicated write logic** + +The `reserve_data` + `write_strided_array_helper` ×3 + `write_value_helper` + `write_shader_cursor_pre_dispatch` sequence is identical between fast and fallback paths. Extract a helper that takes a `ShaderCursor`: + +```cpp +auto write_uniforms = [&](ShaderCursor target) { + ShaderObject* so = target.shader_object(); + void* base = so->reserve_data(offsets.field_offset, offsets.field_size); + // ... write shape arrays, thread_count ... + m_runtime->write_shader_cursor_pre_dispatch(context, cursor, target, ...); +}; +``` + +Fast path → `write_uniforms(ep)`, fallback → `write_uniforms(call_data_cursor)`. + +**FIXED**: Extracted `write_uniforms` lambda taking `(ShaderCursor target, ShaderCursor root_cursor)`. 
Fast path calls `write_uniforms(ep, cursor)`, fallback calls `write_uniforms(call_data_cursor, cursor)`. + +**6. `_try_build_shader` parameter pattern in calldata.py** + +Takes `use_entrypoint_args` parameter then immediately sets `self.use_entrypoint_args` and `context.use_entrypoint_args`. The method never reads the flag except to store it. + +**FIXED**: Caller sets `self.use_entrypoint_args` before calling; `_try_build_shader` reads `self.use_entrypoint_args` and sets `context.use_entrypoint_args`. Parameter removed. + +--- + +#### Low Severity + +**7. Unconditional `print(code)` in test_kernel_gen.py L107** — should be guarded by `PRINT_TEST_KERNEL_GEN` env var. + +**FIXED**: Guarded with `if PRINT_TEST_KERNEL_GEN:` (existing module-level flag). + +**8. Test duplication** — ~30 tests near-identical between test_kernel_gen.py and test_code_gen.py. The merged tests in test_code_gen.py should replace the originals. + +**DO NOT FIX**: Reason: The kernel gen tests are temporary, designed for gating, and will be deleted once phases are complete. + +**9. Unused `nodes` variable** — [callsignature.py L278](slangpy/core/callsignature.py): `nodes: list[BoundVariable] = []` declared but never used. + +**FIXED**: Deleted unused variable. + +**10. Stale docstring** — [callsignature.py L275](slangpy/core/callsignature.py): Says "Generate a list of call data nodes" — doesn't match what the function does. + +**FIXED**: Updated to "Generate Slang kernel code for the given function call signature." + +**11. Missing return type annotations** — `generate_code()`, `generate_constants()`, `CallData.build()` all need `-> None`. + +**FIXED**: Added `-> None` to `generate_code()`, `generate_constants()`, `CallData.build()`, and `_try_build_shader()`. + +**12. `type_conformances: Any`** — [calldata.py](slangpy/core/calldata.py) should be `list[TypeConformance]`. + +**FIXED**: Changed to `list["TypeConformance"]` and added `TypeConformance` to the `from slangpy import (...)` block. 
+ +**13. Bare `except:`** — [callsignature.py L59](slangpy/core/callsignature.py): `is_generic_vector` catches all exceptions including `SystemExit`. Use `except Exception:`. + +**FIXED**: Changed to `except Exception:`. + +**14. Typo: `santized_module`** — [calldata.py](slangpy/core/calldata.py): Missing 'i'. Pre-existing. + +**DO NOT FIX**: Reason: Cosmetic typo in a variable name that's used in multiple places. Fixing would require renaming across the file, which is low value and risks introducing bugs. + +**15. D3D12 `max_entry_point_uniform_size = 256` may be optimistic** — root descriptors consume some of the 64-DWORD root signature budget. Comment should note shared budget; consider smaller default. + +**DO NOT FIX**: Reason: More complex logic is actually needed and can be addressed later. + +**16. Fallback path always includes `SV_GroupID`/`SV_GroupIndex`** — even when `call_data_len == 0`. Asymmetric with fast path. + +**DO NOT FIX**: Reason: Can be addressed later. + +**17. Hash salt `"[CallData]\n"`** — emitted even when CallData struct is absent. Cosmetic. + +**FIXED**: Removed `"[CallData]\n"` prefix from hash salt. + +**18. `Tuple` import in test_code_gen.py** — should use lowercase `tuple[...]` for consistency. + +**FIXED**: Changed to `tuple[...]` and removed `Tuple` from typing import. + +--- + +#### Additional Findings (subagent review, March 2026) + +**19. Latent correctness bug — `can_direct_bind_common()` missing write-access guard** + +[boundvariable.py](slangpy/bindings/boundvariable.py) `can_direct_bind_common()` does not check whether the binding has write access. 
This creates an inconsistency: + +- `ValueRefMarshall.can_direct_bind()` explicitly rejects writable bindings — correct +- `StructMarshall.can_direct_bind()` with children checks `access[0] == AccessType.read` — correct +- `StructMarshall.can_direct_bind()` without children falls through to `can_direct_bind_common()` — **missing access check** +- `ValueMarshall.can_direct_bind()` delegates entirely to `can_direct_bind_common()` — safe in practice (`ValueMarshall.is_writable = False`) but fragile + +If a writable dim-0 leaf binding gets `direct_bind=True`, `ValueMarshall.gen_trampoline_store()` returns `True` without emitting store code, silently dropping writes. + +**DO NOT FIX**: Reason: This logic is subtle but correct, based on the desired behaviour. + +**20. Dead `_gen_trampoline_argument()` method** + +[boundvariable.py](slangpy/bindings/boundvariable.py) `_gen_trampoline_argument()` is never called anywhere in the codebase. The inline generation in [callsignature.py](slangpy/core/callsignature.py) replaced it. + +**FIXED**: Deleted the method. + +**21. Redundant `hasattr` guard in `calculate_direct_bind()`** + +[boundvariable.py](slangpy/bindings/boundvariable.py) `calculate_direct_bind()` uses `hasattr(self.python, "can_direct_bind")`, which is always `True` because `Marshall` base class defines `can_direct_bind()`. Simplify to `if self.python is not None:`. + +**DO NOT FIX**: Reason: For marshalls that inherit directly from NativeMarshall, this is not necessarily true. + +**22. Unnecessary `getattr` in `can_direct_bind_common()`** + +[boundvariable.py](slangpy/bindings/boundvariable.py) `can_direct_bind_common()` uses `getattr(binding, "create_param_block", False)`. `BoundVariable.__init__()` always sets `create_param_block`, so `binding.create_param_block` suffices. + +**FIXED**: Replaced `getattr(binding, "create_param_block", False)` with `binding.create_param_block`. + +**23.
Wasteful `CodeGen.call_data` initialization when `skip_call_data=True`** + +[codegen.py](slangpy/bindings/codegen.py) `__init__` unconditionally calls `self.call_data.append_line("struct CallData")` and `begin_block()`, even when `skip_call_data=True`. The block is never serialized so there's no output impact, but it allocates a dangling block object. + +**DO NOT FIX**: Reason: Harmless — the block is never emitted. Restructuring `__init__` to conditionally skip initialization adds complexity for no functional benefit. + +**24. `entry_point_params` ownership pattern undocumented** + +[codegen.py](slangpy/bindings/codegen.py) collects `entry_point_params` via `boundvariable.py`, but [callsignature.py](slangpy/core/callsignature.py) reads and emits them. This cross-module ownership pattern is unconventional and lacks a comment explaining the flow. + +**DO NOT FIX**: Reason: `CodeGen` is already a shared state bag consumed by multiple modules. Adding a comment is fine but not blocking. + +**25. `direct_bind` and `use_entrypoint_args` exposed as read-write in `.pyi` stubs** + +[__init__.pyi](slangpy/slangpy/__init__.pyi) exposes `direct_bind` on `NativeBoundVariableRuntime` and `use_entrypoint_args` on `NativeCallData` with setters. Mutating these after first dispatch could invalidate cached cursor offsets in `NativeValueMarshall::ensure_cached`. + +**DO NOT FIX**: Reason: These are set during `CallData` construction before first dispatch. The cached `NativeCallData` is per-signature, so a new signature gets a fresh instance. Post-construction mutation would require going through `debug_build_call_data` which rebuilds everything. Not a practical concern. + +**26. No fallback-path codegen test in `test_code_gen.py`** + +[test_code_gen.py](slangpy/tests/slangpy_tests/test_code_gen.py) has no test that forces `use_entrypoint_args=False` (e.g., by exceeding `max_entry_point_uniform_size`) and asserts the `ParameterBlock` codegen. 
The `test_step21_many_float4x4_may_exceed_vulkan` test in `test_kernel_gen.py` checks the flag but not the generated code. + +**FIXED**: Added `test_fallback_calldata_large_params` (#40) in `test_code_gen.py` — asserts `ParameterBlock` codegen on Vulkan/D3D12 and fast-path codegen on CUDA. + +**27. No test for writable `inout` struct at dim-0** + +No test verifies the behavior of a writable (inout) dim-0 struct with all-scalar fields. This is the scenario where Fix 19 would prevent silent write loss. + +**Fix**: After Fix 19 is applied, add a test that binds a writable dim-0 struct dict and confirms `direct_bind=False`. + +**Status: NOT FIXED** — blocked on Fix 19.
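To make the intent of item 27 concrete, here is a toy model of the check such a test would exercise. It is a sketch only: `Access`, `FakeBinding`, and `can_direct_bind_guarded` are hypothetical stand-ins invented for this note and are not slangpy types, and the dim-0 and param-block conditions are assumptions about what a guarded `can_direct_bind_common()` might look like if Fix 19 were ever applied.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    """Stand-in for the binding access type (hypothetical)."""
    read = "read"
    readwrite = "readwrite"


@dataclass
class FakeBinding:
    """Minimal stand-in for a bound variable (hypothetical fields)."""
    access: Access
    call_dimensionality: int
    create_param_block: bool = False


def can_direct_bind_guarded(binding: FakeBinding) -> bool:
    """Assumed shape of the guarded check: only read-only dim-0 leaves qualify."""
    if binding.create_param_block:
        return False
    if binding.call_dimensionality != 0:
        return False
    # The write-access guard that item 19 proposes.
    return binding.access is Access.read


def test_writable_dim0_struct_is_not_direct_bind() -> None:
    writable = FakeBinding(access=Access.readwrite, call_dimensionality=0)
    readonly = FakeBinding(access=Access.read, call_dimensionality=0)
    assert can_direct_bind_guarded(writable) is False
    assert can_direct_bind_guarded(readonly) is True


if __name__ == "__main__":
    test_writable_dim0_struct_is_not_direct_bind()
```

A real test would build the call through slangpy and inspect the resulting binding; the model above only pins down the expected outcome.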