diff --git a/.github/prompts/plan-extractCodegenToGenerator.prompt.md b/.github/prompts/plan-extractCodegenToGenerator.prompt.md
new file mode 100644
index 000000000..e7473d62e
--- /dev/null
+++ b/.github/prompts/plan-extractCodegenToGenerator.prompt.md
@@ -0,0 +1,208 @@
+## Extract codegen into generator.py
+
+**Goal**: Extract the code-emission logic from [callsignature.py](slangpy/core/callsignature.py) (`generate_code`, `generate_constants`, `KernelGenException`, helpers) and `BoundVariable.gen_call_data_code` from [boundvariable.py](slangpy/bindings/boundvariable.py) into a new [generator.py](slangpy/core/generator.py) file. The new file decomposes the monolithic `generate_code` (332 lines) into clearly-named sub-functions with doc comments showing what Slang code each one emits. `callsignature.py` retains the binding-pipeline functions (`specialize`, `bind`, `calculate_*`, etc.). Each step is a pure move/rename with no behavioral changes, verifiable by the existing test suites.
+
+**Parent plan**: [plan-simplifyKernelGenPhase2-cleanup.prompt.md](plan-simplifyKernelGenPhase2-cleanup.prompt.md)
+
+---
+
+### Step 1: Create `slangpy/core/generator.py` with `generate_constants` and `KernelGenException`
+
+Move these small, self-contained pieces first:
+
+- **Move** `KernelGenException` (lines 40–43) from [callsignature.py](slangpy/core/callsignature.py#L40-L43).
+- **Move** `is_slangpy_vector` (lines 240–247) from [callsignature.py](slangpy/core/callsignature.py#L240-L247) — private helper, prefix with `_`.
+- **Move** `generate_constants` (lines 250–268) from [callsignature.py](slangpy/core/callsignature.py#L250-L268).
+- **In [callsignature.py](slangpy/core/callsignature.py)**: Add `from slangpy.core.generator import KernelGenException, generate_constants` and delete the moved code. Keep a re-export of `KernelGenException` so any external consumer of the wildcard import from [calldata.py](slangpy/core/calldata.py#L8) continues to work.
+- **In [dispatchdata.py](slangpy/core/dispatchdata.py#L7)**: Change `from slangpy.core.callsignature import generate_constants` → `from slangpy.core.generator import generate_constants`.
+
+**Verify**: `pytest slangpy/tests/slangpy_tests -v` — all tests pass, no import errors.
+
+**DONE**: Created `slangpy/core/generator.py` with `KernelGenException`, `_is_slangpy_vector`, `generate_constants`. Replaced definitions in `callsignature.py` with re-exports. Updated `dispatchdata.py` import. 4999 passed, 5 pre-existing failures (raytrace d3d12, type conformance cache).
+
+---
+
+### Step 2: Extract `gen_call_data_code` as a free function
+
+Move `BoundVariable.gen_call_data_code` (lines 604–693 of [boundvariable.py](slangpy/bindings/boundvariable.py#L604-L693)) into `generator.py` as a free function, along with the related `gen_calldata_type_name` helper (lines 258–272 of [boundvariable.py](slangpy/bindings/boundvariable.py#L258-L272)).
+
+- **In `generator.py`**: Create two free functions:
+  - `gen_calldata_type_name(binding: BoundVariable, cgb: CodeGenBlock, type_name: str) -> None` — same logic, takes `binding` as first arg instead of `self`.
+  - `gen_call_data_code(binding: BoundVariable, cg: CodeGen, context: BindContext, depth: int = 0) -> None` — same logic, recursive calls use the free function. References to `self` become `binding`. Internal calls to `self.gen_calldata_type_name(...)` become `gen_calldata_type_name(binding, ...)`. Recursive calls on children become `gen_call_data_code(child, cg, context, depth + 1)`.
+- **In [boundvariable.py](slangpy/bindings/boundvariable.py)**: Replace the method bodies with thin delegations:
+  ```python
+  def gen_calldata_type_name(self, cgb, type_name):
+      from slangpy.core.generator import gen_calldata_type_name
+      gen_calldata_type_name(self, cgb, type_name)
+
+  def gen_call_data_code(self, cg, context, depth=0):
+      from slangpy.core.generator import gen_call_data_code
+      gen_call_data_code(self, cg, context, depth)
+  ```
+  This preserves the existing call interface (`node.gen_call_data_code(cg, context)` in [callsignature.py line 406](slangpy/core/callsignature.py#L406)) and any marshall subclass code that calls `self.gen_calldata_type_name`. The `MAX_INLINE_TYPE_LEN` constant moves to `generator.py`.
+- **Move** the import of `CodeGen` and `CodeGenBlock` into `generator.py` (already needed for Step 1).
+
+**Verify**: `pytest slangpy/tests/slangpy_tests -v` — all tests pass.
+
+**DONE**: Moved `gen_call_data_code` and `gen_calldata_type_name` to `generator.py` as free functions. `MAX_INLINE_TYPE_LEN` moved to `generator.py`, re-exported from `boundvariable.py`. Method bodies replaced with thin delegation stubs. 3294 passed, 285 kernel gen tests passed.
+
+---
+
+### Step 3a: Extract pure-computation helpers in-place in `callsignature.py`
+
+Extract the two helpers that do **no codegen** — pure calculation/validation only:
+
+- **Extract** `_validate_and_compute_group_shape(build_info, call_data_len) -> tuple[int, list[int], list[int]]` from lines [293–340](slangpy/core/callsignature.py#L293-L340). Returns `(call_group_size, call_group_strides, call_group_shape_vector)`.
+- **Extract** `_data_name(x, use_entrypoint_args) -> str` — deduplicate the two inline occurrences at lines [449](slangpy/core/callsignature.py#L449) and [497](slangpy/core/callsignature.py#L497) into a single helper. Returns `__in_{name}`, `call_data.{name}`, or `_param_{name}`.
+
+Leave both in `callsignature.py` as module-private functions. `generate_code` calls them.
+
+**Verify**: `pytest slangpy/tests/slangpy_tests -v` — all tests pass.
+
+---
+
+### Step 3b: Extract "setup" emission functions in-place in `callsignature.py`
+
+Extract the three functions that emit the top section of the generated kernel:
+
+- **Extract** `_emit_link_time_constants(cg, build_info, call_data_len, call_group_size, call_group_strides, call_group_shape_vector)` from lines [342–371](slangpy/core/callsignature.py#L342-L371). Emits `export static const int call_data_len = ...`, group stride/shape arrays; calls `generate_constants()`.
+- **Extract** `_emit_shape_and_metadata_params(cg, call_data_len, use_entrypoint_args)` from lines [373–403](slangpy/core/callsignature.py#L373-L403). Emits `_grid_stride`, `_grid_dim`, `_call_dim`, `_thread_count` — as entry-point params (fast path) or `CallData` fields (fallback).
+- **Extract** `_emit_call_data_definitions(cg, context, signature)` from lines [405–406](slangpy/core/callsignature.py#L405-L406). Emits per-variable call data (wrapper structs, type aliases, mapping constants) by calling `gen_call_data_code` on each node.
+
+Leave all three in `callsignature.py`. `generate_code` calls them.
+
+**Verify**: `pytest slangpy/tests/slangpy_tests -v` — all tests pass. Run `$env:SLANGPY_PRINT_GENERATED_SHADERS="1"; pytest slangpy/tests/slangpy_tests/test_code_gen.py -v` and capture output as the baseline for Step 3c and 3d.
+
+---
+
+### Step 3c: Extract "body" emission functions in-place in `callsignature.py`
+
+Extract the remaining three functions that emit the entry point and kernel body:
+
+- **Extract** `_emit_trampoline(cg, context, build_info, root_params, use_entrypoint_args)` from lines [408–500](slangpy/core/callsignature.py#L408-L500). Emits `[Differentiable] void _trampoline(...)` — param declarations, loads, function call, stores.
+- **Extract** `_emit_entry_point_signature(cg, build_info, call_data_len, call_group_size, use_entrypoint_args)` from lines [503–541](slangpy/core/callsignature.py#L503-L541). Emits `[shader("compute")] [numthreads(...)] void compute_main(...)` or `[shader("raygen")] void raygen_main(...)`.
+- **Extract** `_emit_kernel_body(cg, context, build_info, root_params, call_data_len, use_entrypoint_args)` from lines [543–603](slangpy/core/callsignature.py#L543-L603). Emits bounds check, `init_thread_local_call_shape_info`, Context construction, trampoline call.
+
+At this point `generate_code` is reduced to the ~30-line orchestrator below. Still in `callsignature.py`.
+
+```python
+def generate_code(context, build_info, signature, cg):
+    use_entrypoint_args = context.use_entrypoint_args
+    cg.add_import("slangpy")
+    call_data_len = context.call_dimensionality
+
+    call_group_size, strides, shape = _validate_and_compute_group_shape(build_info, call_data_len)
+
+    cg.add_import(build_info.module.name)
+    if use_entrypoint_args:
+        cg.skip_call_data = True
+
+    _emit_link_time_constants(cg, build_info, call_data_len, call_group_size, strides, shape)
+    _emit_shape_and_metadata_params(cg, call_data_len, use_entrypoint_args)
+    _emit_call_data_definitions(cg, context, signature)
+
+    root_params = sorted(signature.values(), key=lambda x: x.param_index)
+
+    _emit_trampoline(cg, context, build_info, root_params, use_entrypoint_args)
+    _emit_entry_point_signature(cg, build_info, call_data_len, call_group_size, use_entrypoint_args)
+    cg.kernel.begin_block()
+    _emit_kernel_body(cg, context, build_info, root_params, call_data_len, use_entrypoint_args)
+    cg.kernel.end_block()
+```
+
+**Verify**: `pytest slangpy/tests/slangpy_tests -v` — all tests pass. Re-run `$env:SLANGPY_PRINT_GENERATED_SHADERS="1"; pytest slangpy/tests/slangpy_tests/test_code_gen.py -v` and confirm output is byte-identical to the Step 3b baseline.
+
+---
+
+### Step 3d: Move all codegen symbols from `callsignature.py` to `generator.py` and fix imports
+
+Now that everything is neatly decomposed, do the pure mechanical move:
+
+- **Move** all eight `_emit_*`/`_validate_*`/`_data_name` private helpers (two from Step 3a, three from Step 3b, three from Step 3c) and the `generate_code` orchestrator from `callsignature.py` into `generator.py`.
+- **In [callsignature.py](slangpy/core/callsignature.py)**: Delete the moved code; add `from slangpy.core.generator import generate_code` re-export so any consumer that imports `generate_code` from `callsignature` continues to work.
+- **Update [calldata.py](slangpy/core/calldata.py#L8)**: Replace `from slangpy.core.callsignature import *` with explicit imports — binding-pipeline functions from `callsignature`, and `generate_code`, `KernelGenException` from `generator`. This eliminates the wildcard import, making dependencies explicit.
+
+**Verify**: `pytest slangpy/tests/slangpy_tests -v` — all tests pass. Re-run `$env:SLANGPY_PRINT_GENERATED_SHADERS="1"; pytest slangpy/tests/slangpy_tests/test_code_gen.py -v` — output byte-identical to Step 3b baseline.
+
+---
+
+### Step 4: Clean up `callsignature.py`
+
+After Step 3d, `callsignature.py` no longer has any codegen functions. Clean up:
+
+- Remove unused imports that were only needed by codegen (`CodeGen`, `PipelineType`, `AccessType`, `NoneMarshall`, `BoundVariableException` if no longer referenced).
+- Remove re-exports of moved symbols once [calldata.py](slangpy/core/calldata.py) uses direct imports from `generator`.
+- Add `from slangpy.core.generator import KernelGenException, ResolveException` re-exports **only if** external consumers import them from `callsignature` (check via grep). If only `calldata.py` uses them, the explicit import is sufficient.
+
+**Verify**: `pytest slangpy/tests/slangpy_tests -v`. `pre-commit run --all-files`.
+
+---
+
+### Step 5: Add comments to `generator.py` sub-functions
+
+Enrich each sub-function's docstring with an example of the Slang code it generates, for both the fast path and fallback path. For example:
+
+```python
+def _emit_shape_and_metadata_params(
+    cg: CodeGen,
+    call_data_len: int,
+    use_entrypoint_args: bool,
+) -> None:
+    """Emit shape arrays and _thread_count.
+
+    Fast path (entry-point params)::
+
+        uniform int[2] _grid_stride
+        uniform int[2] _grid_dim
+        uniform int[2] _call_dim
+        uniform uint3 _thread_count
+
+    Fallback (CallData struct fields)::
+
+        int[2] _grid_stride;
+        int[2] _grid_dim;
+        int[2] _call_dim;
+        uint3 _thread_count;
+    """
+```
+
+This is documentation-only, no functional changes.
+
+**Verify**: `pre-commit run --all-files` (formatting check).
+
+---
+
+### Verification
+
+At each step:
+```bash
+cmake --build --preset windows-msvc-debug
+pytest slangpy/tests/slangpy_tests -v
+pre-commit run --all-files
+```
+
+After Step 3b specifically, capture generated shader output as a baseline; re-run after 3c and 3d to confirm byte-identical output:
+```powershell
+$env:SLANGPY_PRINT_GENERATED_SHADERS="1"; pytest slangpy/tests/slangpy_tests/test_code_gen.py -v
+```
+
+---
+
+### Decisions
+
+- `gen_call_data_code` extracted as free function in `generator.py`; thin delegation stub kept on `BoundVariable` to preserve the method-call interface (`node.gen_call_data_code(cg, context)`) used in `generate_code` and potentially in external/user code.
+- `generator.py` lives at `slangpy/core/generator.py` alongside `callsignature.py` and `calldata.py`.
+- Wildcard import `from slangpy.core.callsignature import *` in `calldata.py` replaced with explicit imports to make dependencies clear.
+- Sub-function names prefixed with `_` (private to the module); only `generate_code`, `generate_constants`, `gen_call_data_code`, `gen_calldata_type_name`, `KernelGenException` are public.
+
+---
+
+### Key Files
+
+| File | Changes |
+|------|---------|
+| [slangpy/core/generator.py](slangpy/core/generator.py) | **NEW** — `generate_code`, `generate_constants`, `gen_call_data_code`, `gen_calldata_type_name`, `KernelGenException`, private helpers |
+| [slangpy/core/callsignature.py](slangpy/core/callsignature.py) | Remove `generate_code`, `generate_constants`, `KernelGenException`, `is_slangpy_vector`; add re-exports from `generator` |
+| [slangpy/bindings/boundvariable.py](slangpy/bindings/boundvariable.py) | `gen_call_data_code` and `gen_calldata_type_name` become thin delegation stubs; `MAX_INLINE_TYPE_LEN` moves to `generator.py` and is re-exported from `boundvariable.py` |
+| [slangpy/core/calldata.py](slangpy/core/calldata.py) | Replace `from slangpy.core.callsignature import *` with explicit imports from `callsignature` and `generator` |
+| [slangpy/core/dispatchdata.py](slangpy/core/dispatchdata.py) | Import `generate_constants` from `generator` instead of `callsignature` |
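Step 2 of the plan above turns a recursive method into a free function and leaves a thin delegating method behind. The following is a minimal sketch of that pattern only, assuming a hypothetical `Node` class in place of `BoundVariable` and a plain `list[str]` in place of `CodeGen`; neither is the real slangpy API. In the plan the stub imports the free function inside the method body, presumably to avoid a circular import between `boundvariable.py` and `generator.py`; in a single file the module-level name is simply resolved when the method is called.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for BoundVariable: a name plus optional children."""

    name: str
    children: list[Node] = field(default_factory=list)

    def gen_call_data_code(self, out: list[str], depth: int = 0) -> None:
        # Thin delegation stub: existing call sites keep using the method,
        # while the logic lives in the module-level function defined below.
        gen_call_data_code(self, out, depth)


def gen_call_data_code(binding: Node, out: list[str], depth: int = 0) -> None:
    """Free-function form: `self` becomes `binding`; recursion calls the free function."""
    out.append("  " * depth + binding.name)
    for child in binding.children:
        gen_call_data_code(child, out, depth + 1)


root = Node("params", [Node("a"), Node("b", [Node("x")])])
lines: list[str] = []
root.gen_call_data_code(lines)  # method-call interface is unchanged
print("\n".join(lines))
```

Calling the method and calling the free function directly produce the same list, which is the property the "pure move, no behavioral change" steps rely on.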
diff --git a/.github/prompts/plan-simplifyKernelGen-phase1.prompt.md b/.github/prompts/plan-simplifyKernelGen-phase1.prompt.md
new file mode 100644
index 000000000..2778f6071
--- /dev/null
+++ b/.github/prompts/plan-simplifyKernelGen-phase1.prompt.md
@@ -0,0 +1,204 @@
+## Phase 1: Direct Type Marshalling
+
+**Status**: Prim-mode complete (Steps 1.1–1.7, 1.9). Step 1.8 (autodiff derivative fields) deferred.
+
+**Goal**: For dim-0, non-composite arguments, emit the raw Slang type in CallData and use direct assignment in the trampoline — eliminating `ValueType<T>` wrappers, `__slangpy_load`/`__slangpy_store` indirection, mapping constants, and `Context.map()` calls.
+
+**Parent plan**: [plan-simplifyKernelGen.prompt.md](plan-simplifyKernelGen.prompt.md)
+
+---
+
+### Architecture
+
+Direct binding eligibility is determined by a **marshall-driven `can_direct_bind` property** combined with a **single depth-first `calculate_direct_bind` pass** on the `BoundVariable` tree. This follows the same pattern as `calculate_differentiability`.
+
+#### Key components
+
+| Component | Location | Role |
+|-----------|----------|------|
+| `Marshall.can_direct_bind(binding)` | `slangpy/bindings/marshall.py` | Virtual method (default `False`). Marshalls override to opt in. |
+| `can_direct_bind_common(binding)` | `slangpy/bindings/boundvariable.py` | Shared eligibility checks (dim-0, no children, no param block). Marshalls call this then add type-specific logic. |
+| `BoundVariable.direct_bind` | `slangpy/bindings/boundvariable.py` | Boolean attribute set by `calculate_direct_bind()`. Consumed by `gen_call_data_code`, `gen_calldata`, `gen_trampoline_load/store`, `create_calldata`. |
+| `BoundVariable.calculate_direct_bind()` | `slangpy/bindings/boundvariable.py` | Depth-first tree pass. Leaves delegate to `marshall.can_direct_bind()`. Composites require all children to be direct-bind AND dim-0 with a concrete vector type. Children retain their individual `direct_bind` status regardless of the parent's eligibility. |
+| `calculate_direct_binding(call)` | `slangpy/core/callsignature.py` | Top-level function iterating `call.args` + `call.kwargs.values()`, calling `arg.calculate_direct_bind()`. |
+| `NativeBoundVariableRuntime.direct_bind` | `slangpy.h` / `boundvariableruntime.py` | C++ member + Python propagation. Read by `NativeValueMarshall::ensure_cached` to gate `["value"]` sub-field navigation. |
+
+#### Control flow
+
+```
+CallData.build()
+  → calculate_differentiability(context, bindings)
+  → calculate_direct_binding(bindings)           ← NEW
+  → generate_code(...)
+      → gen_call_data_code()    — reads binding.direct_bind
+      → gen_trampoline()        — reads binding.direct_bind
+  → BoundCallRuntime(bindings)  — propagates binding.direct_bind to C++ runtime
+```
+
+At dispatch time, `NativeValueMarshall::ensure_cached()` reads `binding->direct_bind()` to decide cursor navigation:
+- `direct_bind == false`: `cursor[variable_name]["value"]` (wrapper path)
+- `direct_bind == true`: `cursor[variable_name]` (raw type path)
+
+#### Composite (struct/dict) handling
+
+When `calculate_direct_bind()` visits a composite node:
+1. Recurse children first (depth-first)
+2. If all children have `direct_bind == True` AND the composite is dim-0 with a concrete vector type → set `self.direct_bind = True`
+3. Otherwise → the composite is NOT direct-bind, but children **retain** their individual `direct_bind` status. Inside the parent's generated `__slangpy_load`/`__slangpy_store`, `gen_call_data_code` delegates to each child's `gen_trampoline_load`/`gen_trampoline_store` — direct-bind children get direct assignment (e.g., `value.y = y;`) while non-direct-bind children use the standard `__slangpy_load(context.map(...))` path. This allows mixed direct-bind / non-direct-bind children within the same struct.
+
+---
+
+### Step 1.1: Define eligibility predicate
+
+**Implemented.** A `can_direct_bind(binding)` virtual method on `Marshall` (default `False`) replaces the original `is_direct_bind_eligible` / `is_direct_bind_recursive` global functions. Each marshall subclass overrides `can_direct_bind` to opt in.
+
+A shared helper `can_direct_bind_common(binding)` in `boundvariable.py` provides the common checks:
+- `binding.call_dimensionality is not None and binding.call_dimensionality == 0`
+- `not binding.children` (not composite/dict)
+- `not getattr(binding, "create_param_block", False)` (excludes `PackedArg`)
+
+Marshall subclasses call `can_direct_bind_common(binding)` and optionally add type-specific logic. `StructMarshall` has its own implementation: if it has children, all children must have `direct_bind == True`; otherwise it delegates to `can_direct_bind_common`. `ValueRefMarshall` additionally requires `binding.access[0] == AccessType.read` — writable value refs need buffer read/write logic that is incompatible with direct binding.
+
+---
+
+### Step 1.2: Implement for `ValueMarshall` (scalars/matrices)
+
+**Implemented.** In [slangpy/builtin/value.py](slangpy/builtin/value.py):
+
+- `can_direct_bind(binding)`: calls `can_direct_bind_common(binding)`
+- `gen_calldata`: when `binding.direct_bind`, emits `typealias _t_{name} = {raw_slang_type}` instead of `ValueType<{type}>`
+- `gen_trampoline_load`: when `binding.direct_bind`, emits `{name} = {data_name}` and returns `True`
+- `gen_trampoline_store`: when `binding.direct_bind` (read-only), returns `True` (suppress default store)
+- `create_calldata`: when `binding.direct_bind`, returns raw value instead of `{"value": data}`
+
+#### Step 1.2a: C++ fast path
+
+**Implemented.** `NativeValueMarshall::ensure_cached` in [slangpyvalue.cpp](src/slangpy_ext/utils/slangpyvalue.cpp) reads `binding->direct_bind()` from the `NativeBoundVariableRuntime`:
+
+```cpp
+ShaderCursor field = binding->direct_bind()
+    ? cursor[binding->variable_name()]
+    : cursor[binding->variable_name()]["value"];
+```
+
+The `direct_bind` flag is a `bool` member on `NativeBoundVariableRuntime` (declared in [slangpy.h](src/slangpy_ext/utils/slangpy.h)), exposed via nanobind property in [slangpy.cpp](src/slangpy_ext/utils/slangpy.cpp), and propagated from `BoundVariable.direct_bind` via [boundvariableruntime.py](slangpy/bindings/boundvariableruntime.py).
+
+The `m_direct_bind` / `direct_bind` / `set_direct_bind` members were **removed** from `NativeValueMarshall` — the flag lives exclusively on `NativeBoundVariableRuntime`.
+
+---
+
+### Step 1.3: Implement for `VectorMarshall`, `MatrixMarshall`, and `ArrayMarshall`
+
+**Implemented.** All inherit `can_direct_bind` and `gen_trampoline_load`/`gen_trampoline_store` from `ValueMarshall`. `VectorMarshall` overrides `gen_calldata` to emit the raw vector type (e.g., `vector<float,3>`) instead of `VectorValueType<float,3>` when `binding.direct_bind`. `MatrixMarshall` and `ArrayMarshall` (at dim-0) inherit `ValueMarshall.gen_calldata`.
+
+---
+
+### Step 1.4: Implement for `StructMarshall` (dict → struct)
+
+**Implemented.** In [slangpy/builtin/struct.py](slangpy/builtin/struct.py):
+
+- `can_direct_bind(binding)`: if `binding.children is not None`, returns `True` only if all children have `direct_bind == True`. Otherwise delegates to `can_direct_bind_common(binding)`.
+- `gen_trampoline_load`: when `binding.direct_bind`, delegates to `ValueMarshall.gen_trampoline_load` (emits `{name} = {data_name}`) and returns `True`. Direct-bind structs are read-only, like other value types.
+- `gen_trampoline_store`: when `binding.direct_bind`, delegates to `ValueMarshall.gen_trampoline_store` (suppresses store for read-only). Returns `True`.
+
+In [slangpy/bindings/boundvariable.py](slangpy/bindings/boundvariable.py), `gen_call_data_code`:
+- When `self.direct_bind`, emits `typealias _t_{name} = {vector_type.full_name}` (raw struct type) — skipping inline struct generation, `__slangpy_load`/`__slangpy_store`, and child type aliases.
+- When NOT `self.direct_bind`, uses the standard children path with inline struct. Children **retain** their individual `direct_bind` status — `gen_call_data_code` calls each child's `gen_trampoline_load`/`gen_trampoline_store`, which emit direct assignment for direct-bind children and fall through to `__slangpy_load`/`__slangpy_store` for non-direct-bind children.
+
+---
+
+### Step 1.5: Implement for `ValueRefMarshall`
+
+**Implemented.** In [slangpy/builtin/valueref.py](slangpy/builtin/valueref.py):
+
+- `can_direct_bind(binding)`: calls `can_direct_bind_common(binding)` AND requires `binding.access[0] == AccessType.read`. Writable value refs are NOT direct-bind eligible because they need buffer allocation and readback logic that requires the wrapper path.
+- `gen_calldata`: when `binding.direct_bind`, emits raw type alias (read-only only). Non-direct-bind uses `ValueRef<T>` / `RWValueRef<T>` as before.
+- `gen_trampoline_load`: when `binding.direct_bind`, emits direct assignment. Non-direct-bind falls through.
+- `gen_trampoline_store`: when `binding.direct_bind`, returns `True` (suppress store for read-only). Non-direct-bind falls through.
+- `create_calldata` / `read_calldata`: when `binding.direct_bind` AND read-only, returns raw value / skips readback.
+
+The old `self._direct_bind` attribute was **removed** — all checks now use `binding.direct_bind`.
+
+**Implication for `_result`:** Auto-created return values are writable `ValueRef` instances. Since writable value refs are not direct-bind eligible, `_result` uses `RWValueRef<T>` with `__slangpy_store`, mapping constants, and the standard wrapper path. This is a deliberate constraint — writable value refs inside structs would prevent the struct from being direct-bind eligible, which is the correct behavior since the struct's `__slangpy_load`/`__slangpy_store` must exist to handle the buffer operations.
+
+---
+
+### Step 1.6: Implement for tensor marshalls
+
+**Implemented.** In [slangpy/builtin/tensorcommon.py](slangpy/builtin/tensorcommon.py):
+
+`gen_trampoline_load`/`gen_trampoline_store` are extended for `ITensorType` at dim-0 (direct struct assignment). Tensor marshalls do NOT implement `can_direct_bind`: tensor dim-0 handling is done via trampoline-level checks on `binding.call_dimensionality` and on the type of `binding.vector_type`, independent of the `direct_bind` flag.
+
+---
+
+### Step 1.7: Eliminate unused boilerplate in code generation
+
+**Implemented.** In [slangpy/bindings/boundvariable.py](slangpy/bindings/boundvariable.py), `gen_call_data_code` skips emitting `static const int _m_{name} = 0` mapping constants when `self.direct_bind` is `True`.
+
+---
+
+### Step 1.8: Handle autodiff (bwds mode)
+
+⬜ **Deferred.** Prim-mode direct binding applies to bwds primals (code gen verified), but derivative fields still use the old `ValueType` wrapper path.
+
+---
+
+### Step 1.9: Tests
+
+**Implemented.** 21 tests × 3 device types = 63 cases. All pass on d3d12/vulkan/cuda.
+
+---
+
+### Files Modified
+
+| File | Changes |
+|------|---------|
+| `src/slangpy_ext/utils/slangpy.h` | `m_direct_bind` member, `direct_bind()` getter, `set_direct_bind()` setter on `NativeBoundVariableRuntime` |
+| `src/slangpy_ext/utils/slangpy.cpp` | Nanobind `direct_bind` property on `NativeBoundVariableRuntime` |
+| `src/slangpy_ext/utils/slangpyvalue.h` | `m_direct_bind`, `direct_bind()`, `set_direct_bind()` **removed** from `NativeValueMarshall` |
+| `src/slangpy_ext/utils/slangpyvalue.cpp` | `ensure_cached` reads `binding->direct_bind()` instead of `m_direct_bind`; nanobind `direct_bind` property **removed** from `NativeValueMarshall` |
+| `slangpy/bindings/marshall.py` | `can_direct_bind(binding)` virtual method (default `False`) |
+| `slangpy/bindings/boundvariable.py` | `can_direct_bind_common()`, `BoundVariable.direct_bind` attribute, `BoundVariable.calculate_direct_bind()`. Old functions removed: `is_direct_bind_eligible`, `is_direct_bind_recursive`, `_set_direct_bind_on_children`, `_force_no_direct_bind`, `_DIRECT_BIND_TYPES`, `_clear_direct_bind()`. |
+| `slangpy/bindings/boundvariableruntime.py` | `self.direct_bind = source.direct_bind` propagation |
+| `slangpy/bindings/__init__.py` | Exports `can_direct_bind_common` (removed `is_direct_bind_eligible`, `is_direct_bind_recursive`) |
+| `slangpy/core/callsignature.py` | `calculate_direct_binding(call)` function |
+| `slangpy/core/calldata.py` | `calculate_direct_binding(bindings)` call after `calculate_differentiability` |
+| `slangpy/builtin/value.py` | `can_direct_bind`, `gen_calldata`, `gen_trampoline_load`, `gen_trampoline_store`, `create_calldata` use `binding.direct_bind`. Removed `self.direct_bind` on marshall. |
+| `slangpy/builtin/valueref.py` | `can_direct_bind`, `gen_calldata`, `gen_trampoline_load`, `gen_trampoline_store`, `create_calldata`, `read_calldata` use `binding.direct_bind`. Removed `self._direct_bind`. |
+| `slangpy/builtin/struct.py` | `can_direct_bind`, `gen_trampoline_load`, `gen_trampoline_store` use `binding.direct_bind` |
+| `slangpy/builtin/tensorcommon.py` | `gen_trampoline_load`, `gen_trampoline_store` extended for `ITensorType` dim-0 (unchanged in refactor) |
+| `slangpy/tests/slangpy_tests/test_kernel_gen.py` | All Phase 1 tests |
+
+### Test Results
+
+2952 passed / 0 failed in `slangpy/tests/slangpy_tests`. 6 pre-existing failures in `slangpy/tests/device/` (raytracing pipeline, type conformance cache — unrelated).
+
+### Review Notes
+
+**Issues to address before merge:**
+
+1. **`StructMarshall.can_direct_bind` children branch is dead code.** `calculate_direct_bind()` handles composites directly (when `self.children is not None`) and never calls the marshall's `can_direct_bind`. The `if binding.children is not None:` branch in `StructMarshall.can_direct_bind` is unreachable. Fix: remove the children branch or have `calculate_direct_bind` delegate to the marshall for composites.
+
+2. **Composite direct-bind should gate on read-only access.** Add `and self.access[0] == AccessType.read` to the composite branch in `calculate_direct_bind()` (matching the `ValueRefMarshall` pattern). Without this, a writable dim-0 composite would be incorrectly marked direct-bind.
+
+3. **Dead `binding.direct_bind` checks in writable ValueRef paths** ([valueref.py](slangpy/builtin/valueref.py) lines ~215, ~230, ~248). Since `can_direct_bind` rejects non-read access, these branches are unreachable. Remove or add `assert not binding.direct_bind` to make the invariant explicit.
+
+4. **Overly defensive `hasattr` guard** in `calculate_direct_bind()`: `hasattr(self.python, "can_direct_bind")` is unnecessary because the `Marshall` base class always defines this method.
+
+5. **Benchmark file** — `test_benchmark_autograd.py` has accidental local changes that should be reverted.
+
+6. **C++ improvements** — Add debug assertion in `NativeValueMarshall::ensure_cached` verifying cached `direct_bind` matches binding's; consider making `NativeBoundVariableRuntime.direct_bind` read-only in nanobind.
+
+**Missing tests to add:** Writable ValueRef inout, `_result` binding flag, all-scalar struct binding flag, struct+WangHashArg child, WangHashArg binding flag, functional read-only ValueRef, bwds binding flags. See parent plan for full table.
+
+### Design Decisions
+
+**`direct_bind` lives on `NativeBoundVariableRuntime`, not `NativeValueMarshall`.** The original implementation stored `m_direct_bind` on the marshall itself (`NativeValueMarshall`), but marshalls are shared across calls while bindings are per-call. Moving the flag to the binding makes it immutable per-call and eliminates mutable state on shared marshall instances.
+
+**Marshall-driven `can_direct_bind` replaces hardcoded type list.** The original `is_direct_bind_eligible` used a lazily-populated `_DIRECT_BIND_TYPES` tuple to check marshall type. The new design uses a virtual method — each marshall opts in explicitly. Adding a new direct-bind-eligible type requires only overriding `can_direct_bind` on the new class.
+
+**Single `calculate_direct_bind` pass replaces repeated predicate calls.** The original `is_direct_bind_eligible` / `is_direct_bind_recursive` were called multiple times per variable during code gen. The new design computes `direct_bind` once in a single tree pass after `calculate_differentiability`, and consumers simply read the boolean.
+
+**Children retain `direct_bind` in non-direct-bind composites.** When a composite struct is NOT direct-bind-eligible (e.g., has vectorized children), children **retain** their individual `direct_bind` status. The parent's `gen_call_data_code` delegates to each child's `gen_trampoline_load`/`gen_trampoline_store` — direct-bind children emit direct assignment (e.g., `value.y = y;`) within the parent's `__slangpy_load`, while non-direct-bind children use the standard `__slangpy_load(context.map(...))` path. The old `_clear_direct_bind()` / `_force_no_direct_bind` approach was removed.
+
+**Writable ValueRef excluded from direct binding.** Writable value refs require buffer allocation, GPU readback, and `__slangpy_store` indirection. Only read-only value refs (`access[0] == AccessType.read`) are direct-bind eligible. This means auto-created `_result` (which is writable) always uses the `RWValueRef<T>` wrapper path.
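The single depth-first `calculate_direct_bind` pass described in the Phase 1 plan above can be sketched as follows. This is an illustration under stated assumptions, not the slangpy implementation: `Binding` is a hypothetical stand-in for `BoundVariable`, `leaf_can_direct_bind` stands in for the result of `marshall.can_direct_bind(binding)`, and `has_concrete_vector_type` stands in for the "concrete vector type" check.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Binding:
    """Hypothetical stand-in for BoundVariable with only the fields the pass needs."""

    name: str
    call_dimensionality: int = 0
    leaf_can_direct_bind: bool = True  # assumed result of marshall.can_direct_bind()
    has_concrete_vector_type: bool = True
    children: Optional[dict[str, Binding]] = None
    direct_bind: bool = False

    def calculate_direct_bind(self) -> None:
        if self.children is None:
            # Leaf: the marshall decides.
            self.direct_bind = self.leaf_can_direct_bind
            return
        # Composite: visit children first (depth-first).
        for child in self.children.values():
            child.calculate_direct_bind()
        # Then require every child to be direct-bind, plus the composite's own checks.
        self.direct_bind = (
            all(c.direct_bind for c in self.children.values())
            and self.call_dimensionality == 0
            and self.has_concrete_vector_type
        )
        # Children keep their own flag even when the parent ends up False.


mixed = Binding(
    "s",
    children={
        "x": Binding("x"),
        "y": Binding("y", call_dimensionality=1, leaf_can_direct_bind=False),
    },
)
mixed.calculate_direct_bind()
print(mixed.direct_bind, mixed.children["x"].direct_bind, mixed.children["y"].direct_bind)
# prints: False True False
```

The mixed case shows the "children retain `direct_bind`" decision: the parent falls back to the wrapper path while child `x` can still be loaded by direct assignment.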
diff --git a/.github/prompts/plan-simplifyKernelGen-phase2.prompt.md b/.github/prompts/plan-simplifyKernelGen-phase2.prompt.md
new file mode 100644
index 000000000..61f4f08ee
--- /dev/null
+++ b/.github/prompts/plan-simplifyKernelGen-phase2.prompt.md
@@ -0,0 +1,442 @@
+## Phase 2: Eliminate CallData Struct
+
+**Goal**: Move kernel uniforms out of the `CallData` struct into individual entry-point parameters. Eliminate the trampoline in forward (prim) mode. Fall back to `ParameterBlock<CallData>` when total inline-uniform size exceeds a runtime per-device threshold.
+
+**Parent plan**: [plan-simplifyKernelGen.prompt.md](plan-simplifyKernelGen.prompt.md)
+
+**Status**: Steps 2.0–2.2, 2.4, and 2.5 complete. Steps 2.3 (trampoline elimination for prim mode), 2.6 (`_result` handling), and 2.7 (post-implementation tests) not started. Code generation logic has been extracted from `callsignature.py` into [generator.py](slangpy/core/generator.py) (see [plan-extractCodegenToGenerator.prompt.md](plan-extractCodegenToGenerator.prompt.md)).
+
+---
+
+### Key Architectural Decisions
+
+These decisions correct several assumptions in the original plan:
+
+1. **Entry-point param placement is orthogonal to `direct_bind`.** Any type — wrapped or raw — can be an entry-point parameter (e.g., `uniform ValueType<int> a` or `uniform int a` or `uniform Tensor<float,2> t`). `direct_bind` governs whether `__slangpy_load`/`__slangpy_store` is needed inside the kernel; entry-point placement governs where the uniform lives in the shader layout.
+
+2. **Trampoline elimination is independent of `direct_bind`.** The current trampoline body is: declare locals → load (direct assignment or `__slangpy_load`) → call function → store (`__slangpy_store`). All of that can appear directly in `compute_main`. The trampoline only exists because bwds mode needs a `[Differentiable]` wrapper for `bwd_diff()`. In prim mode, it is eliminated regardless of whether args use wrappers.
+
+3. **All-or-nothing fallback.** When total inline-uniform size exceeds the platform threshold, ALL args go back into `ParameterBlock<CallData>` (the current path). No hybrid mixing of entry-point params and CallData.
+
+4. **Shape arrays and `_thread_count` obey the same rules** as user args — they become entry-point params by default, and go into `CallData` on fallback. Phase 2 is NOT scoped only to `call_data_len == 0`.
+
+5. **Two code paths based on where data lives:**
+   - **Fast path** (entry-point params): In Slang, uniforms are entry-point parameters and can be used directly (in forward) or passed directly to the trampoline (in backward).
+   - **Fallback path** (`ParameterBlock<CallData>`): In Slang, uniforms live in a `CallData` struct. They must be read into local variables before being used (in forward) or passed to the trampoline (in backward). This is the current behavior.
+
+6. **C++ dispatch changes are isolated to `NativeCallData::exec`.** Marshalls receive a `ShaderCursor` pointing to wherever their data lives — they don't care whether it's inside a `CallData` struct or an entry-point param. In the fast path, `m_runtime->write_shader_cursor_pre_dispatch()` receives the entry-point cursor directly. No marshall code changes needed.
+
+7. **`CallDataMode` is eliminated.** The `global_data` vs `entry_point` distinction is removed entirely. On the fast path, all backends use entry-point params uniformly. On the fallback path, all backends use `ParameterBlock<CallData>` — CUDA supports `ParameterBlock` and in practice will never hit the fallback (CUDA's inline-uniform limit is ~4KB). This removes the `CallDataMode` enum, the CUDA-specific `is_entry_point` codegen branch in `callsignature.py`/`generator.py`, and the corresponding C++ branch in `slangpy.cpp`.
+
+8. **`PackedArg` / param-block types are unchanged.** They stay as `ParameterBlock<T>` at module scope, orthogonal to Phase 2.
+
+---
+
+### Code Organization (post-extraction)
+
+All code generation logic now lives in [generator.py](slangpy/core/generator.py). [callsignature.py](slangpy/core/callsignature.py) retains binding-pipeline functions (`specialize`, `bind`, `calculate_*`, `estimate_entrypoint_arguments_size`, etc.) and re-exports `generate_code`, `generate_constants`, `KernelGenException` from `generator.py` for backward compatibility.
+
+| File | Role |
+|------|------|
+| [generator.py](slangpy/core/generator.py) | All code emission: `generate_code()`, `_emit_trampoline()`, `_emit_entry_point_signature()`, `_emit_kernel_body()`, `_emit_shape_and_metadata_params()`, `_emit_link_time_constants()`, `_emit_call_data_definitions()`, `_emit_trampoline_loads/stores()`, `_data_name()`, `_validate_and_compute_group_shape()`, `gen_call_data_code()`, `gen_calldata_type_name()`, `KernelGenException` |
+| [callsignature.py](slangpy/core/callsignature.py) | Binding pipeline: `specialize()`, `bind()`, `apply_explicit_vectorization()`, `apply_implicit_vectorization()`, `finalize_mappings()`, `calculate_differentiability()`, `calculate_direct_binding()`, `estimate_entrypoint_arguments_size()`, `calculate_call_dimensionality()`, `create_return_value_binding()` |
+| [calldata.py](slangpy/core/calldata.py) | `CallData` class orchestrating build pipeline; wildcard-imports from `callsignature.py` |
+| [codegen.py](slangpy/bindings/codegen.py) | `CodeGen` class with `skip_call_data`, `entry_point_params` attributes |
+| [boundvariable.py](slangpy/bindings/boundvariable.py) | `BoundVariable` methods delegate to `gen_call_data_code()` and `gen_calldata_type_name()` in `generator.py` |
+
+---
+
+### Current Kernel Structure (post-Phase 1)
+
+For `int add(int a, int b)` with scalar args `(1, 2)`:
+
+```slang
+import "module";
+import "slangpy";
+
+typealias _t_a = int;             // Phase 1: raw type (was ValueType<int>)
+typealias _t__result = RWValueRef<int>;  // writable _result still wrapped
+static const int _m__result = 0;         // mapping constant only for _result
+
+struct CallData {
+    _t_a a;
+    _t_a b;
+    _t__result _result;
+    uint3 _thread_count;
+};
+
+void _trampoline(Context __slangpy_context__, CallData __calldata__) {
+    int a;
+    a = __calldata__.a;            // Phase 1: direct assignment
+    int b;
+    b = __calldata__.b;            // Phase 1: direct assignment
+    int _result;
+    _result = add(a, b);
+    __calldata__._result.__slangpy_store(__slangpy_context__.map(_m__result), _result);
+}
+
+[shader("compute")] [numthreads(32,1,1)]
+void compute_main(int3 flat_call_thread_id: SV_DispatchThreadID, ..., uniform CallData call_data) {
+    if (any(flat_call_thread_id >= call_data._thread_count)) return;
+    Context __slangpy_context__ = {flat_call_thread_id};
+    _trampoline(__slangpy_context__, call_data);
+}
+```
+
+### Target Kernel (Phase 2 fast path, prim mode, all direct-bind)
+
+```slang
+import "module";
+
+[shader("compute")]
+[numthreads(32, 1, 1)]
+void compute_main(int3 tid: SV_DispatchThreadID,
+    uniform uint3 _thread_count,
+    uniform int a,
+    uniform int b,
+    uniform RWStructuredBuffer<int> _result)
+{
+    if (any(tid >= _thread_count)) return;
+    _result[0] = add(a, b);
+}
+```
+
+### Target Kernel (Phase 2 fast path, prim mode, mixed direct/non-direct-bind)
+
+When some args are not direct-bind (e.g., WangHashArg needs per-thread `thread_id` via `__slangpy_load`), the non-direct-bind args still use their wrapper types as entry-point params. Context is needed:
+
+```slang
+import "module";
+import "slangpy";
+
+typealias _t_rng = WangHashArgType;  // non-direct-bind wrapper type
+static const int _m_rng = 0;
+
+[shader("compute")]
+[numthreads(32, 1, 1)]
+void compute_main(int3 flat_call_thread_id: SV_DispatchThreadID,
+    uniform uint3 _thread_count,
+    uniform _t_rng rng,
+    uniform int x,
+    uniform RWStructuredBuffer<int> _result)
+{
+    if (any(flat_call_thread_id >= _thread_count)) return;
+    Context __slangpy_context__ = {flat_call_thread_id};
+    int _rng_val;
+    rng.__slangpy_load(__slangpy_context__.map(_m_rng), _rng_val);
+    int _x_val;
+    _x_val = x;
+    int _result_val;
+    _result_val = func(_rng_val, _x_val);
+    _result[0] = _result_val;
+}
+```
+
+### Target Kernel (Phase 2 fallback path, prim mode)
+
+When entry-point param size exceeds the platform limit, all args go into `ParameterBlock<CallData>`. The trampoline is still eliminated in prim mode — the load/call/store is inlined into `compute_main`, reading from `call_data`:
+
+```slang
+import "module";
+import "slangpy";
+
+typealias _t_a = int;
+typealias _t__result = RWValueRef<int>;
+static const int _m__result = 0;
+
+struct CallData {
+    _t_a a;
+    _t_a b;
+    _t__result _result;
+    uint3 _thread_count;
+};
+ParameterBlock<CallData> call_data;
+
+[shader("compute")]
+[numthreads(32, 1, 1)]
+void compute_main(int3 flat_call_thread_id: SV_DispatchThreadID, ...) {
+    if (any(flat_call_thread_id >= call_data._thread_count)) return;
+    Context __slangpy_context__ = {flat_call_thread_id};
+    int a;
+    a = call_data.a;
+    int b;
+    b = call_data.b;
+    int _result;
+    _result = add(a, b);
+    call_data._result.__slangpy_store(__slangpy_context__.map(_m__result), _result);
+}
+```
+
+---
+
+### Step 2.0: Gating tests ✅
+
+**Status: DONE**
+
+Tests added to [slangpy/tests/slangpy_tests/test_kernel_gen.py](slangpy/tests/slangpy_tests/test_kernel_gen.py). All 21 parametrized cases (7 tests × 3 device types) pass.
+
+| Test | Source | Args | Original assertion | Status |
+|------|--------|------|--------------------|--------|
+| `test_gate_p2_calldata_struct_present` | `int add(int a, int b)` | `(1, 2)` | `struct CallData` in code | ✅ Flipped — now asserts `struct CallData` ABSENT (Step 2.2 done) |
+| `test_gate_p2_calldata_uniform_param` | same | same | `uniform CallData call_data` or `ParameterBlock<CallData>` | ✅ Flipped — now asserts both ABSENT (Step 2.2 done) |
+| `test_gate_p2_thread_count_in_calldata` | same | same | `call_data._thread_count` | ✅ Flipped — now asserts ABSENT (Step 2.2 done) |
+| `test_gate_p2_trampoline_present_for_prim` | same | same | `void _trampoline(` present | Still asserts present (Step 2.3 pending) |
+| `test_gate_p2_kernel_calls_trampoline` | same | same | `_trampoline(` in `compute_main` body | Still asserts present (Step 2.3 pending) |
+| `test_gate_p2_sv_group_id_present` | same | same | `SV_GroupID` in `compute_main` signature | ✅ Flipped — now asserts ABSENT for dim-0 calls (Step 2.2 done) |
+
+Negative gates (must stay passing after Phase 2):
+
+| Test | Asserts |
+|------|---------|
+| `test_gate_p2_wanghasharg_keeps_load` | Non-direct-bind arg still uses `__slangpy_load` |
+
+Bwds gates:
+
+| Test | Status |
+|------|--------|
+| `test_gate_scalar_uses_valuetype` | ✅ Passing — asserts fast-path trampoline with `__in_` prefix params |
+| `test_gate_bwds_scalar_uses_valuetype` | ✅ Passing — bwds trampoline has `no_diff` on all params (Step 2.4 done) |
+
+---
+
+### Step 2.1: Determine fast vs fallback path ✅
+
+**Status: DONE**
+
+In [slangpy/core/calldata.py](slangpy/core/calldata.py), after `calculate_direct_binding(bindings)`:
+
+1. **Query a runtime per-device threshold** for max entry-point parameter inline-uniform size. This is a property of the device/backend: large for CUDA (thousands of bytes), but as low as 128–256 bytes on Vulkan and D3D12.
+2. **Accumulate inline-uniform byte size** of each bound variable's `calldata_type_name`, plus `_thread_count` (12 bytes) and shape arrays (`call_data_len * 3 * sizeof(int)` for `_grid_stride`, `_grid_dim`, `_call_dim`). **Resource types** (`RWStructuredBuffer`, `Texture2D`, `TensorView`, etc.) don't count — they are bound as descriptors, not inline data.
+3. **Decision**: If total size ≤ threshold → `self.use_entrypoint_args = True` (fast path). Otherwise → `self.use_entrypoint_args = False` (fallback path — current behavior).
+4. **Store** `use_entrypoint_args` on the `CallData` instance and propagate to C++ `NativeCallData`.
+
+`PackedArg` / param-block types are excluded from this accounting — they stay as `ParameterBlock<T>` regardless.
+
+**Implementation details:**
+
+- `DeviceLimits.max_entry_point_uniform_size` added to C++ struct ([device.h](src/sgl/device/device.h)) with per-backend defaults: Vulkan=128, D3D12=256, CUDA=4096 bytes ([device.cpp](src/sgl/device/device.cpp)).
+- `estimate_entrypoint_arguments_size()` in [callsignature.py](slangpy/core/callsignature.py) — sums `vector_type.uniform_layout.size` for each depth-0 bound variable (skipping `PackedArg`), plus 12 bytes for `_thread_count` and `call_dimensionality * 4 * 3` for shape arrays.
+- `use_entrypoint_args` property added to `NativeCallData` C++ class ([slangpy.h](src/slangpy_ext/utils/slangpy.h)) with Python binding.
+- `CallData.__init__()` in [calldata.py](slangpy/core/calldata.py) sets `self.use_entrypoint_args = inline_size <= threshold` after `calculate_direct_binding()`.
+
+**Tests** (7 tests × 3 device types = 21 parametrized cases, all pass):
+
+| Test | Asserts |
+|------|---------|
+| `test_step21_scalar_uses_entrypoint_args` | Simple `int add(int,int)` with `(1,2)` → `use_entrypoint_args=True` |
+| `test_step21_threshold_property_positive` | `device.info.limits.max_entry_point_uniform_size > 0` |
+| `test_step21_vector_uses_entrypoint_args` | `float3` args → `use_entrypoint_args=True` |
+| `test_step21_struct_uses_entrypoint_args` | All-scalar struct dict → `use_entrypoint_args=True` |
+| `test_step21_tensor_uses_entrypoint_args` | Tensor (descriptor-only, 0 inline bytes) → `use_entrypoint_args=True` |
+| `test_step21_many_float4x4_may_exceed_vulkan` | 8×float4x4 (8 × 64 = 512 bytes, 524 with the 12-byte `_thread_count`) exceeds Vulkan/D3D12 thresholds, not CUDA |
+| `test_step21_wanghasharg_uses_entrypoint_args` | Non-direct-bind WangHashArg with small inline size → `use_entrypoint_args=True` |
+
+---
+
+### Step 2.2: Code generation — entry-point params (fast path) ✅
+
+**Status: DONE**
+
+In [generator.py](slangpy/core/generator.py) `generate_code()` (line 778), when `use_entrypoint_args == True`:
+
+**CodeGen changes** in [slangpy/bindings/codegen.py](slangpy/bindings/codegen.py):
+- `self.skip_call_data: bool = False` — when `True`, don't emit `struct CallData` / `begin_block()` and gate `end_block()` in `finish()`.
+- `self.entry_point_params: list[str] = []` — collects individual uniform param declarations.
+- `finish()` ignores the `call_data` block and `use_param_block_for_call_data` when `skip_call_data` is set.
+
+**CallData struct elimination**: `generate_code()` sets `cg.skip_call_data = True` when `use_entrypoint_args`. No `struct CallData` emitted.
+
+**`_emit_call_data_code`** in [generator.py](slangpy/core/generator.py#L345): At `depth == 0`, when `use_entrypoint_args`, appends to `cg.entry_point_params` instead of `cg.call_data.declare(...)`. The `call_data_structs` block (type aliases, wrapper structs, mapping constants) still gets emitted at module scope.
+
+**`_thread_count` and shape arrays**: `_emit_shape_and_metadata_params()` ([generator.py](slangpy/core/generator.py#L466)) appends to `cg.entry_point_params` instead of `cg.call_data`. Same for `_grid_stride`, `_grid_dim`, `_call_dim` when `call_data_len > 0`.
+
+**Entry-point signature**: `_emit_entry_point_signature()` ([generator.py](slangpy/core/generator.py#L652)) emits `compute_main` signature as:
+```slang
+void compute_main(
+    int3 flat_call_thread_id: SV_DispatchThreadID,
+    [int3 flat_call_group_id: SV_GroupID,]          // only when call_data_len > 0
+    [int flat_call_group_thread_id: SV_GroupIndex,]  // only when call_data_len > 0
+    uniform uint3 _thread_count,
+    [uniform int[N] _grid_stride, ...]               // only when call_data_len > 0
+    uniform _t_a a,
+    uniform _t_b b,
+    uniform _t__result _result
+)
+```
+
+Drop `SV_GroupID` and `SV_GroupIndex` when `call_data_len == 0` — they feed `init_thread_local_call_shape_info` which isn't called when there are no shape arrays.
+
+**Bounds check**: Changes from `call_data._thread_count` to just `_thread_count`.
+
+**Shape info init**: Changes from `call_data._grid_stride` etc. to just `_grid_stride`, `_grid_dim`, `_call_dim`.
+
+**Fallback path** (`use_entrypoint_args == False`): `struct CallData` is emitted with `ParameterBlock<CallData> call_data` at module scope on ALL backends (including CUDA). The old `CallDataMode` distinction between `entry_point` (CUDA) and `global_data` (non-CUDA) is removed — `ParameterBlock` works on CUDA, and in practice CUDA will never hit the fallback due to its large (~4KB) inline-uniform limit.
+
+See [slangpy/tests/device/test_pipeline_utils.slang](slangpy/tests/device/test_pipeline_utils.slang) for examples of manually-written compute shaders that use entry point parameters on all backends (CUDA, Vulkan, D3D12).
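+
+As a rough sketch of the fast-path output at this stage (Step 2.2 done, Step 2.3 pending), the dim-0 `add(1, 2)` call might generate something like the following. This is assembled from the signature template above and the gating-test assertions, not copied from real generator output; in particular, the trampoline parameter types and the body of `_trampoline` are assumptions:
+
+```slang
+import "module";
+import "slangpy";
+
+typealias _t_a = int;
+typealias _t_b = int;
+typealias _t__result = RWValueRef<int>;
+static const int _m__result = 0;
+
+// Assumed shape: the prim trampoline still exists, but takes individual
+// `__in_`-prefixed params instead of a CallData struct.
+void _trampoline(Context __slangpy_context__, _t_a __in_a, _t_b __in_b, _t__result __in__result) {
+    int a;
+    a = __in_a;
+    int b;
+    b = __in_b;
+    int _result;
+    _result = add(a, b);
+    __in__result.__slangpy_store(__slangpy_context__.map(_m__result), _result);
+}
+
+[shader("compute")]
+[numthreads(32, 1, 1)]
+void compute_main(int3 flat_call_thread_id: SV_DispatchThreadID,   // no SV_GroupID / SV_GroupIndex for dim-0
+    uniform uint3 _thread_count,
+    uniform _t_a a,
+    uniform _t_b b,
+    uniform _t__result _result)
+{
+    if (any(flat_call_thread_id >= _thread_count)) return;   // bare name, no `call_data.` prefix
+    Context __slangpy_context__ = {flat_call_thread_id};
+    _trampoline(__slangpy_context__, a, b, _result);
+}
+```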
+
+---
+
+### Step 2.3: Trampoline elimination for prim mode
+
+**Status: NOT STARTED** — Trampoline is still generated for prim mode on both paths. The load/call/store sequence needs to be inlined into `compute_main`.
+
+When `call_mode == prim` — on **both** fast and fallback paths:
+
+- Don't generate the `_trampoline` function.
+- Inline the load/call/store sequence directly into `compute_main` after the bounds check and (if needed) Context construction.
+- The load/call/store codegen reuses the same logic currently in `_emit_trampoline_loads()` ([generator.py](slangpy/core/generator.py#L528)) and `_emit_trampoline_stores()` ([generator.py](slangpy/core/generator.py#L551)), but emitted into `cg.kernel` instead of `cg.trampoline` with adjusted `data_name` from `_data_name()` ([generator.py](slangpy/core/generator.py#L513)):
+
+| Path | `data_name` for non-param-block args |
+|------|-------------------------------------|
+| Fast | `x.variable_name` (entry-point param name directly) |
+| Fallback | `call_data.{x.variable_name}` (global `ParameterBlock<CallData>`, all backends) |
+| Param blocks | `_param_{x.variable_name}` (unchanged) |
+
+**Context construction**: Needed only when any arg is non-direct-bind (i.e., calls `__slangpy_load`/`__slangpy_store`). When all args satisfy `direct_bind == True`, skip Context construction entirely — no `Context __slangpy_context__` declaration, no `import "slangpy"`.
+
+**Key functions to modify in [generator.py](slangpy/core/generator.py)**:
+- `_emit_trampoline()` (line 578): gate on `call_mode != prim` — only emit for bwds mode.
+- `_emit_kernel_body()` (line 708): when prim mode, inline the load/call/store sequence directly instead of calling `_trampoline()`.
+- `generate_code()` (line 778): skip `_emit_trampoline()` call when prim mode.
+
+**Note**: The trampoline elimination does NOT depend on `direct_bind`. Even non-direct-bind args with `__slangpy_load` work inline in `compute_main` — the `__slangpy_load` call just needs the data reference and a `Context` value, both available in `compute_main`.
+
+---
+
+### Step 2.4: Trampoline with individual params for bwds mode ✅
+
+**Status: DONE** — Fast-path trampoline takes individual params with `no_diff` on all params. All 3 device types pass.
+
+When `call_mode == bwds`:
+
+- Still generate a `[Differentiable]` trampoline function via `_emit_trampoline()` ([generator.py](slangpy/core/generator.py#L578)).
+- **Fast path**: Trampoline takes individual params instead of a struct. All params get `no_diff` — entry-point uniforms are never differentiable. Differentiation happens through local variable assignments inside the trampoline body, matching the struct-based approach where `CallData` was implicitly non-differentiable. No `in`/`out`/`inout` modifiers are added — `compute_main` passes its uniforms straight through:
+  ```slang
+  [Differentiable]
+  void _trampoline(Context __slangpy_context__, no_diff float __in_a, no_diff float __in_b, no_diff NoneType __in__result)
+  ```
+  `compute_main` calls `bwd_diff(_trampoline)(__slangpy_context__, a, b, _result)` passing entry-point param names directly.
+- **Fallback path**: Trampoline reads from global `ParameterBlock<CallData> call_data` as it does today (on all backends). `compute_main` calls `bwd_diff(_trampoline)(__slangpy_context__, call_data)`.
+- `_gen_trampoline_argument()` in `boundvariable.py` remains unused dead code — the inline generation in `_emit_trampoline()` ([generator.py](slangpy/core/generator.py#L578)) is simpler and avoids the `in`/`out`/`inout` modifiers that caused Slang autodiff errors.
+
+**Key insight**: Adding `in`/`out`/`inout` modifiers to trampoline params caused Slang autodiff issues (e.g., `out` params get reversed to `in` by `bwd_diff`, changing arity). The trampoline params are just pass-through uniforms — all data flow logic (loads, stores, differentiation) is handled internally via local variables.
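+
+Putting the pieces of this step together, the fast-path bwds kernel could be wired roughly as below. The trampoline signature and the `bwd_diff` call are taken from the text above; the `compute_main` parameter list and the elided trampoline body are assumptions for illustration only:
+
+```slang
+import "module";
+import "slangpy";
+
+[Differentiable]
+void _trampoline(Context __slangpy_context__, no_diff float __in_a, no_diff float __in_b, no_diff NoneType __in__result)
+{
+    // Loads, the call to the user function, and stores all go through
+    // local variables here; the params themselves are pass-through only.
+}
+
+[shader("compute")]
+[numthreads(32, 1, 1)]
+void compute_main(int3 flat_call_thread_id: SV_DispatchThreadID,
+    uniform uint3 _thread_count,
+    uniform float a,            // assumed entry-point types, mirroring the trampoline params
+    uniform float b,
+    uniform NoneType _result)
+{
+    if (any(flat_call_thread_id >= _thread_count)) return;
+    Context __slangpy_context__ = {flat_call_thread_id};
+    // Entry-point param names are passed straight through, with no in/out/inout modifiers.
+    bwd_diff(_trampoline)(__slangpy_context__, a, b, _result);
+}
+```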
+
+---
+
+### Step 2.5: C++ dispatch changes ✅
+
+**Status: DONE** — `CallDataMode` enum fully removed. Fast path uses `find_entry_point(0)` on all backends. Fallback path uses global `ParameterBlock<CallData>` on all backends.
+
+In [src/slangpy_ext/utils/slangpy.cpp](src/slangpy_ext/utils/slangpy.cpp), store `m_use_entrypoint_args` on `NativeCallData` (received from Python `CallData`). Also add to [slangpy.h](src/slangpy_ext/utils/slangpy.h).
+
+Modify `bind_call_data` lambda in `exec()`:
+
+**Fast path** (`m_use_entrypoint_args == true`):
+- All backends: Navigate via `cursor.find_entry_point(0)`. This is the entry-point cursor.
+- Write `_thread_count` as an entry-point param: `entry_point_cursor["_thread_count"]`.
+- Write shape arrays as entry-point params: `entry_point_cursor["_grid_stride"]`, etc.
+- Pass `entry_point_cursor` as the `call_data_cursor` argument to `m_runtime->write_shader_cursor_pre_dispatch()`. Each `NativeBoundVariableRuntime` already navigates `cursor[m_variable_name]`, so it finds the entry-point param by name automatically. **No marshall code changes needed.**
+- Cache entry-point param field indices on first call (analogous to existing `m_cached_call_data_offsets`).
+- The `reserve_data` + raw-pointer optimization for `_thread_count` and shape arrays may not work for individual entry-point params at disjoint offsets. Use cursor-based writes for these metadata fields (they're small, performance impact minimal), or check if `reserve_data` still works across the entry-point shader object.
+
+**Fallback path** (`m_use_entrypoint_args == false`):
+- All backends: Navigate to global `call_data` field via `cursor.find_field("call_data")`, dereference (it's a `ParameterBlock`), write struct data. The old `CallDataMode` branch (CUDA using `find_entry_point(0)` for call_data) is removed. Remove `m_call_data_mode`, `CallDataMode` enum, and all associated branches from `slangpy.h`, `slangpy.cpp`, `calldata.py`, and `callsignature.py`.
+
+---
+
+### Step 2.6: `_result` handling
+
+**Status: NOT STARTED**
+
+Auto-created `_result` is a writable `ValueRef`, currently NOT direct-bind eligible (needs `RWValueRef<T>` wrapper with buffer logic). Phase 2 handles this differently on the two paths:
+
+**Fast path**: `_result` is emitted as `uniform RWValueRef<int> _result` on the entry point. In prim mode, the inlined code stores via `_result.__slangpy_store(...)`. In the all-direct-bind case where Context is omitted, add a new code path: emit `uniform RWStructuredBuffer<T> _result` with `_result[0] = value` for the store. This requires `ValueRefMarshall` to support writable direct-bind for the entry-point-param case specifically, using `RWStructuredBuffer<T>` instead of `RWValueRef<T>`.
+
+**Fallback path**: `_result` stays as `RWValueRef<T>` inside `CallData`, same as current behavior.
+
+**Implementation note**: The `RWStructuredBuffer<T>` approach for `_result` is only used when `use_entrypoint_args == True` AND all other args are direct-bind (so Context can be omitted). When non-direct-bind args are present, Context exists and `_result` can continue to use `RWValueRef<T>.__slangpy_store(context, value)`.
+
+---
+
+### Step 2.7: Tests
+
+**Status: NOT STARTED**
+
+**Post-implementation tests** — should pass AFTER Phase 2 is complete:
+
+| Test | Verifies |
+|------|----------|
+| `test_phase2_no_calldata_struct` | `struct CallData` absent for eligible call |
+| `test_phase2_uniform_params_on_entry` | Individual `uniform` params on `compute_main` |
+| `test_phase2_no_trampoline_prim` | No `void _trampoline(` for prim-mode calls |
+| `test_phase2_inline_call` | Function call inlined directly in `compute_main` |
+| `test_phase2_thread_count_as_uniform` | `uniform uint3 _thread_count` as entry-point param |
+| `test_phase2_no_context_all_direct` | No `Context __slangpy_context__` when all args direct-bind |
+| `test_phase2_context_kept_non_direct` | `Context` present when some args use `__slangpy_load` |
+| `test_phase2_bwds_trampoline_individual` | Bwds trampoline has individual params with `no_diff` |
+| `test_phase2_bwds_bwd_diff_call` | `bwd_diff(_trampoline)(ctx, a, b, ...)` in kernel |
+| `test_phase2_no_sv_group_when_dim0` | No `SV_GroupID`/`SV_GroupIndex` when `call_data_len == 0` |
+| `test_phase2_sv_group_when_vectorized` | `SV_GroupID`/`SV_GroupIndex` present when `call_data_len > 0` |
+| `test_phase2_fallback_keeps_calldata` | Force fallback → `struct CallData` still emitted |
+| `test_phase2_fallback_no_trampoline_prim` | Even fallback path eliminates trampoline in prim mode |
+| `test_phase2_functional_scalar_add` | `add(1, 2) == 3` end-to-end dispatch |
+| `test_phase2_functional_bwds` | Backward pass correct gradients |
+| `test_phase2_functional_vectorized` | Vectorized call (shapes) with entry-point params |
+| `test_phase2_functional_mixed_direct` | Mix of direct-bind + non-direct-bind args |
+
+---
+
+### Implementation Order
+
+1. **Step 2.0** ✅ — Gating tests (baseline documentation)
+2. **Step 2.1** ✅ — Fast/fallback determination + size query
+3. **Step 2.2 + 2.5** ✅ — Code gen + C++ dispatch for entry-point params + `CallDataMode` removal (landed together)
+4. **Step 2.4** ✅ — Bwds trampoline with individual params (fast path) — `no_diff` on all params
+5. **Step 2.3** — Trampoline elimination for prim mode (both paths)
+6. **Step 2.6** — `_result` as `RWStructuredBuffer<T>` for all-direct-bind case
+7. **Step 2.7** — Post-implementation tests + functional tests
+
+**Note:** Implementation order deviated from original plan — Steps 2.2 + 2.5 were done before 2.3 (trampoline elimination), combined with `CallDataMode` removal. Step 2.4 done — all trampoline params use `no_diff` without IO modifiers.
+
+---
+
+### Key Files
+
+| File | Changes |
+|------|---------|
+| [slangpy/core/generator.py](slangpy/core/generator.py) | ✅ All code generation logic extracted here from `callsignature.py`. `generate_code()` orchestrator (line 778), `_emit_trampoline()` (line 578), `_emit_entry_point_signature()` (line 652), `_emit_kernel_body()` (line 708), `_emit_shape_and_metadata_params()` (line 466), `_emit_link_time_constants()` (line 424), `_emit_call_data_definitions()` (line 503), `_emit_trampoline_loads/stores()`, `_data_name()`, `_validate_and_compute_group_shape()`, `gen_call_data_code()`, `gen_calldata_type_name()`, `KernelGenException`. Entry-point params fast/fallback code paths. Bwds `no_diff` on all trampoline params (Step 2.4). Trampoline still generated for prim mode (Step 2.3 pending). |
+| [slangpy/core/calldata.py](slangpy/core/calldata.py) | ✅ `use_entrypoint_args` flag, size threshold check, `CallDataMode` removed |
+| [slangpy/core/callsignature.py](slangpy/core/callsignature.py) | ✅ Binding-pipeline functions only (`specialize`, `bind`, `calculate_*`, `estimate_entrypoint_arguments_size`). Re-exports `generate_code`, `generate_constants`, `KernelGenException` from `generator.py`. |
+| [slangpy/bindings/codegen.py](slangpy/bindings/codegen.py) | ✅ `skip_call_data` flag, `entry_point_params` list |
+| [slangpy/bindings/boundvariable.py](slangpy/bindings/boundvariable.py) | ✅ `gen_call_data_code` and `gen_calldata_type_name` delegate to `generator.py`. `_gen_trampoline_argument()` unused dead code. |
+| [slangpy/bindings/marshall.py](slangpy/bindings/marshall.py) | ✅ `use_entrypoint_args` field on `BindContext`, `CallDataMode` removed |
+| [src/slangpy_ext/utils/slangpy.cpp](src/slangpy_ext/utils/slangpy.cpp) | ✅ `use_entrypoint_args` binding; `bind_call_data` fast path via `find_entry_point(0)`, `CallDataMode` branches removed |
+| [src/slangpy_ext/utils/slangpy.h](src/slangpy_ext/utils/slangpy.h) | ✅ `m_use_entrypoint_args` on `NativeCallData`; `m_call_data_mode` removed |
+| [src/sgl/device/device.h](src/sgl/device/device.h) | ✅ `max_entry_point_uniform_size` on `DeviceLimits` |
+| [src/sgl/device/device.cpp](src/sgl/device/device.cpp) | ✅ Per-backend defaults for `max_entry_point_uniform_size` |
+| [src/slangpy_ext/device/device.cpp](src/slangpy_ext/device/device.cpp) | ✅ Python binding for `max_entry_point_uniform_size` |
+| [src/sgl/utils/slangpy.h](src/sgl/utils/slangpy.h) | ✅ `CallDataMode` enum removed |
+| [slangpy/core/dispatchdata.py](slangpy/core/dispatchdata.py) | ✅ `CallDataMode` removed; imports `generate_constants` from `generator.py` |
+| [slangpy/core/packedarg.py](slangpy/core/packedarg.py) | ✅ `CallDataMode` removed |
+| [slangpy/core/function.py](slangpy/core/function.py) | ✅ `CallDataMode` removed from imports |
+| [slangpy/slangpy/__init__.pyi](slangpy/slangpy/__init__.pyi) | ✅ `CallDataMode` class and `call_data_mode` property removed |
+| [slangpy/tests/slangpy_tests/test_type_resolution.py](slangpy/tests/slangpy_tests/test_type_resolution.py) | ✅ `CallDataMode` removed from `BindContext` creation |
+| [slangpy/tests/slangpy_tests/test_kernel_gen.py](slangpy/tests/slangpy_tests/test_kernel_gen.py) | ✅ Gating tests + Step 2.1 tests updated for new behavior; post-implementation tests (Step 2.7) pending |
+
+---
+
+### Verification
+
+```powershell
+# Build first (required)
+cmake --build --preset windows-msvc-debug
+
+# Run kernel gen tests
+$env:PRINT_TEST_KERNEL_GEN="1"; pytest slangpy/tests/slangpy_tests/test_kernel_gen.py -v
+
+# Run full test suite
+pytest slangpy/tests -v
+
+# Run pre-commit
+pre-commit run --all-files
+```
diff --git a/.github/prompts/plan-simplifyKernelGen-phase3.prompt.md b/.github/prompts/plan-simplifyKernelGen-phase3.prompt.md
new file mode 100644
index 000000000..8736dca4a
--- /dev/null
+++ b/.github/prompts/plan-simplifyKernelGen-phase3.prompt.md
@@ -0,0 +1,48 @@
+## Phase 3: Direct Compute Kernel Invocation
+
+**Goal**: When the user's Slang function is ALREADY a `[shader("compute")]` entry point (or can trivially be one), skip kernel generation entirely and dispatch the pre-written shader directly.
+
+**Parent plan**: [plan-simplifyKernelGen.prompt.md](plan-simplifyKernelGen.prompt.md)
+
+---
+
+### Step 3.1: Detection
+
+In the function resolution phase, detect when the target Slang function:
+- Has `[shader("compute")]` attribute
+- Has parameter types that SlangPy can bind directly (uniforms, buffers, textures)
+- Has explicit thread count specified by the user (already supported via `function.set_thread_count()`)
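+
+For illustration, a user-written entry point that would plausibly satisfy the first two criteria might look like the sketch below. The function name, parameters, and body are hypothetical, and whether SlangPy would accept exactly this signature is an assumption:
+
+```slang
+// Hypothetical pre-written kernel: every parameter is a plain uniform or a buffer.
+[shader("compute")]
+[numthreads(32, 1, 1)]
+void scale_buffer(uint3 tid: SV_DispatchThreadID,
+    uniform uint count,
+    uniform float scale,
+    uniform RWStructuredBuffer<float> data)
+{
+    // Guard against the padded tail of the last thread group.
+    if (tid.x >= count) return;
+    data[tid.x] = data[tid.x] * scale;
+}
+```
+
+Nothing in such a signature implies how many threads to launch, which is why the third criterion requires the caller to supply the thread count explicitly.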
+
+---
+
+### Step 3.2: Direct dispatch path
+
+When eligible:
+- Skip kernel generation entirely (no `compute_main` wrapper is emitted)
+- Create a `ComputePipeline` directly from the user's shader
+- Map Python arguments to entry point parameters using the type marshalling but without code generation
+- Dispatch directly
+
+---
+
+### Step 3.3: Argument binding
+
+Leverage Phase 2's per-argument binding infrastructure — the same cursor write logic that writes individual uniform params would write to the pre-written shader's entry point params.
+
+---
+
+### Step 3.4: Tests
+
+**Gating test** — assert CURRENT behavior so it breaks when Phase 3 is implemented:
+
+| Test | Slang Source | Args | Asserts (current behavior) | Breaks when |
+|------|-------------|------|---------------------------|-------------|
+| `test_gate_compute_shader_generates_wrapper` | Source with `[shader("compute")] void my_kernel(...)` function, test calls a helper function in the same module | N/A | SlangPy generates its own `compute_main` wrapper; user's `[shader("compute")]` is ignored | Step 3.1 |
+
+**Post-implementation tests** — should pass AFTER Phase 3 is complete:
+
+- `test_phase3_direct_dispatch`: dispatch a pre-written `[shader("compute")]` kernel directly, verify no wrapper generated
+- `test_phase3_requires_thread_count`: verify error when thread count not specified
+- `test_phase3_scalar_params`: verify scalar uniform params bind correctly
+- `test_phase3_buffer_params`: verify `RWStructuredBuffer` params bind correctly
+- `test_phase3_texture_params`: verify texture params bind correctly
diff --git a/.github/prompts/plan-simplifyKernelGen.prompt.md b/.github/prompts/plan-simplifyKernelGen.prompt.md
new file mode 100644
index 000000000..a21cee212
--- /dev/null
+++ b/.github/prompts/plan-simplifyKernelGen.prompt.md
@@ -0,0 +1,253 @@
+## Plan: Simplify Generated SlangPy Kernels
+
+**TL;DR**: A three-phase effort to make generated kernels resemble hand-written GPU code. Phase 1 adds direct type marshalling (bypassing `ValueType<T>` wrappers and `__slangpy_load`/`__slangpy_store`) for dim-0 non-composite types, following the pattern already used by `TensorView`. Phase 2 eliminates the `CallData` struct when all arguments are direct-eligible, passing them as individual uniforms/globals. Phase 3 enables calling pre-written compute kernels directly without generating wrapper shaders.
+
+**Target example** — `add(int a, int b) -> int` with scalar args should go from 40+ lines of boilerplate to approximately:
+
+```slang
+import "module";
+[shader("compute")]
+[numthreads(32, 1, 1)]
+void compute_main(int3 tid: SV_DispatchThreadID, uniform uint3 _thread_count, uniform int a, uniform int b, uniform RWStructuredBuffer<int> _result)
+{
+    if (any(tid >= _thread_count)) return;
+    _result[0] = add(a, b);
+}
+```
+
+---
+
+### Phase Plans
+
+- [Phase 1: Direct Type Marshalling](plan-simplifyKernelGen-phase1.prompt.md) — **✅ merged (PR #863)**
+- [Phase 2: Eliminate CallData Struct](plan-simplifyKernelGen-phase2.prompt.md) — **not started**
+- [Phase 3: Direct Compute Kernel Invocation](plan-simplifyKernelGen-phase3.prompt.md) — **not started**
+
+---
+
+### Phase 1 Summary (Complete — PR #863)
+
+Phase 1 introduced **direct binding**: eligible dim-0 arguments are bound using raw Slang types instead of `ValueType<T>` wrappers, which eliminates `__slangpy_load`/`__slangpy_store` indirection, `Context.map()` calls, and mapping constants for those arguments. PR #863 was merged to `main` on 2026-03-11 (+2,044 / −122 lines, 18 files changed, squash-merged).
+
+#### What Phase 1 Changed
+
+**Architecture**: A marshall-driven `can_direct_bind(binding)` virtual method (default `False`) combined with a single depth-first `calculate_direct_bind()` pass on the `BoundVariable` tree. This follows the same pattern as `calculate_differentiability`. The `direct_bind` boolean is stored on `BoundVariable` (Python) and propagated to `NativeBoundVariableRuntime` (C++).
+
+**Eligibility**: A variable is direct-bind eligible if:
+- `call_dimensionality == 0` (not vectorized)
+- Not composite with children (unless all children are also direct-bind AND the composite is dim-0 with a concrete Slang struct type and read-only access)
+- Not a param block (`PackedArg`)
+- The marshall opts in via `can_direct_bind()` override
+- For `ValueRefMarshall`: `access[0] == AccessType.read` (writable value refs need buffer logic)
+
+**Code generation effects** — when `binding.direct_bind == True`:
+- `gen_calldata` emits `typealias _t_{name} = {raw_slang_type}` instead of `ValueType<T>` / `VectorValueType<T,N>` / `ValueRef<T>`
+- `gen_trampoline_load` emits `{value_name} = {data_name};` (direct assignment) instead of `{data_name}.__slangpy_load(context.map(_m_{name}), {name})`
+- `gen_trampoline_store` returns `True` (suppresses store for read-only types)
+- Mapping constants (`static const int _m_{name} = 0`) are skipped
+- `create_calldata` returns the raw value instead of `{"value": data}`
+
+**C++ fast path**: `NativeValueMarshall::ensure_cached` reads `binding->direct_bind()` to decide cursor navigation — `cursor[variable_name]` for direct-bind vs `cursor[variable_name]["value"]` for wrapper path.
+
+**Composite (struct/dict) handling**: When `calculate_direct_bind()` visits a composite, it recurses into the children first. If all children are direct-bind AND the composite is dim-0 with a concrete Slang struct type and read-only access → the composite itself is direct-bind (emits raw `typealias`). Otherwise the composite is NOT direct-bind, but children **retain** their individual `direct_bind` status: the parent's `__slangpy_load`/`__slangpy_store` body uses `gen_trampoline_load`/`gen_trampoline_store` for each child, so direct-bind children get direct assignment (e.g., `value.y = y;`) while non-direct-bind children use `__slangpy_load(context.map(...))`.
+
+**API changes to `gen_trampoline_load`/`gen_trampoline_store`**: Signature changed from `(cgb, binding, is_entry_point)` → `(cgb, binding, data_name, value_name)`. The caller now computes `data_name` (e.g., `__calldata__.x` or `call_data.x`) and `value_name` (e.g., `x` or `value.x`), allowing these methods to work both at the root trampoline level and inside composite `__slangpy_load`/`__slangpy_store` bodies.
+
+**`read_output` fix** (C++): `NativeBoundVariableRuntime::read_output` was simplified — composites no longer attempt to read output directly (it is handled by their children). The old composite branch had a logic error (checking `res.contains(name)` before insertion).
+
+#### Control Flow (post-Phase 1)
+
+```
+CallData.build()
+  → calculate_differentiability(context, bindings)
+  → calculate_direct_binding(bindings)           ← Phase 1
+  → generate_code(...)
+      → gen_call_data_code()    — reads binding.direct_bind
+      → gen_trampoline()        — reads binding.direct_bind
+  → BoundCallRuntime(bindings)  — propagates binding.direct_bind to C++ runtime
+```
+
+At dispatch time, `NativeValueMarshall::ensure_cached()` reads `binding->direct_bind()` to decide cursor navigation:
+- `direct_bind == false`: `cursor[variable_name]["value"]` (wrapper path)
+- `direct_bind == true`: `cursor[variable_name]` (raw type path)
+
+#### Implemented Steps
+
+| Step | Status | Summary |
+|------|--------|---------|
+| 1.1 | ✅ Done | `Marshall.can_direct_bind(binding)` virtual method. `can_direct_bind_common(binding)` helper. `BoundVariable.calculate_direct_bind()` depth-first tree pass. `calculate_direct_binding(call)` in `callsignature.py`. |
+| 1.2 | ✅ Done | `ValueMarshall`: `can_direct_bind`, `gen_calldata`, `gen_trampoline_load/store` read `binding.direct_bind`. |
+| 1.2a | ✅ Done | C++ fast path: `NativeValueMarshall::ensure_cached` reads `binding->direct_bind()` from `NativeBoundVariableRuntime`. `m_direct_bind` **removed** from `NativeValueMarshall`. |
+| 1.3 | ✅ Done | `VectorMarshall`/`MatrixMarshall`/`ArrayMarshall`: inherit from `ValueMarshall`. `VectorMarshall.gen_calldata` emits raw vector type (e.g., `vector<float,3>`). |
+| 1.4 | ✅ Done | `StructMarshall`: `can_direct_bind` checks all children. `BoundVariable.gen_call_data_code` uses `self.direct_bind`. Non-direct-bind composites delegate to children's `gen_trampoline_load/store`. |
+| 1.5 | ✅ Done | `ValueRefMarshall`: `can_direct_bind` requires `access[0] == AccessType.read`. Writable value refs (including auto-created `_result`) use `RWValueRef<T>`. |
+| 1.6 | ✅ Done | Tensor dim-0: `can_direct_bind` added to `tensorcommon.py`. `gen_trampoline_load/store` extended for dim-0 tensors (`ITensorType`, `TensorViewType`, `DiffTensorViewType`). |
+| 1.7 | ✅ Done | Mapping constants (`static const int _m_{name}`) skipped when `self.direct_bind`. |
+| 1.8 | ⬜ Deferred | Autodiff derivative fields still use `ValueType` wrappers. Bwds primals use direct bind. |
+| 1.9 | ✅ Done | 77 tests (×3 device types = 231 cases). All pass on d3d12/vulkan/cuda. |
+
+#### Files Modified (PR #863)
+
+| File | Changes |
+|------|---------|
+| `src/slangpy_ext/utils/slangpy.h` | `m_direct_bind` member, `direct_bind()`, `set_direct_bind()` on `NativeBoundVariableRuntime` |
+| `src/slangpy_ext/utils/slangpy.cpp` | Nanobind `direct_bind` r/w property on `NativeBoundVariableRuntime`. `read_output` composite branch simplified. |
+| `src/slangpy_ext/utils/slangpyvalue.h` | `CachedValueWrite.direct_bind` field added. `m_direct_bind`/`direct_bind()`/`set_direct_bind()` **removed** from `NativeValueMarshall`. |
+| `src/slangpy_ext/utils/slangpyvalue.cpp` | `ensure_cached` reads `binding->direct_bind()` for cursor path; caches `direct_bind` value. |
+| `slangpy/bindings/marshall.py` | `can_direct_bind(binding)` virtual method (default `False`). `gen_trampoline_load/store` signature changed to `(cgb, binding, data_name, value_name)`. |
+| `slangpy/bindings/boundvariable.py` | `can_direct_bind_common()` helper. `BoundVariable.direct_bind` attribute. `BoundVariable.calculate_direct_bind()` method. `gen_call_data_code` handles direct-bind composites (raw typealias) and delegates to children's `gen_trampoline_load/store`. Mapping constant emission gated on `not self.direct_bind`. |
+| `slangpy/bindings/boundvariableruntime.py` | `self.direct_bind = source.direct_bind` propagation to C++ runtime. |
+| `slangpy/bindings/__init__.py` | Exports `can_direct_bind_common`. |
+| `slangpy/core/callsignature.py` | `calculate_direct_binding(call)` function. Trampoline code gen refactored: `data_name` computed before `gen_trampoline_load` call. Store path moved after `data_name` computation. |
+| `slangpy/core/calldata.py` | `calculate_direct_binding(bindings)` call after `calculate_differentiability`. `self.code = code` stored for debugging. |
+| `slangpy/builtin/value.py` | `can_direct_bind`, `gen_trampoline_load`, `gen_trampoline_store` added. `gen_calldata` gates on `binding.direct_bind`. |
+| `slangpy/builtin/valueref.py` | `can_direct_bind` (read-only gate), `gen_trampoline_load`, `gen_trampoline_store` added. `gen_calldata`, `create_calldata`, `read_calldata` gate on `binding.direct_bind`. `self._direct_bind` removed. |
+| `slangpy/builtin/struct.py` | `can_direct_bind` (children check + `AccessType.read` gate). `gen_trampoline_load`, `gen_trampoline_store` delegate to `ValueMarshall` when direct-bind. |
+| `slangpy/builtin/tensor.py` | `can_direct_bind` delegates to `tensorcommon`. `gen_trampoline_load/store` signature updated. |
+| `slangpy/builtin/tensorcommon.py` | `can_direct_bind()` function added. `gen_trampoline_load/store` signature changed, condition changed from `isinstance(vector_type, TensorViewType)` to `binding.direct_bind`. |
+| `slangpy/torchintegration/torchtensormarshall.py` | `can_direct_bind` delegates to `tensorcommon`. `gen_trampoline_load/store` signature updated. |
+| `slangpy/benchmarks/test_benchmark_autograd.py` | Removed accidental blank line (1-line whitespace change). |
+| `slangpy/tests/slangpy_tests/test_kernel_gen.py` | New file: 77 tests covering all Phase 1 scenarios. |
+
+#### Test Coverage Summary
+
+The test file (`test_kernel_gen.py`) provides 77 test functions × 3 device types = 231 parametrized cases covering:
+
+**Code-gen assertion tests** (`test_gate_*`): Verify generated Slang code patterns — type aliases, trampoline load/store statements, mapping constants, wrapper types, `__slangpy_load`/`__slangpy_store` presence/absence.
+
+**Binding flag tests**: Verify `direct_bind`, `call_dimensionality`, and `vector_type` on `BoundVariable` instances for: scalars, vectors, tensors (dim-0 and vectorized), structs (all-scalar, mixed, nested, deeply nested), writable ValueRef, auto-created `_result`, WangHashArg, bwds primal args.
+
+**Functional GPU dispatch tests** (`test_phase1_functional_*`): End-to-end dispatch verifying correct GPU results for: scalar add/mul, vector scale, struct sum, ValueRef write, mixed scalar+tensor, mixed struct fields, tensor dim-0, 2D/3D tensor→vector, 2D tensor→scalar, 2D tensor→array, nested/deeply-nested structs, struct with matrix/vector/array fields, struct return types, struct with vectorized 2D tensor child.
+
+**Negative gates** (`test_gate_*_keeps_*`): Verify that types that are NOT direct-bind eligible continue to use wrappers: WangHashArg, vectorized scalar (dim > 0), vectorized dict.
+
+**Helper infrastructure**: `assert_contains`, `assert_not_contains`, `assert_trampoline_has`, `generate_code`, `generate_bwds_code`.
+
+#### Known Issues (from review, not yet addressed)
+
+1. **`set_direct_bind` exposed as read-write nanobind property** — After first dispatch, mutating `direct_bind` would invalidate the cached cursor offset. Consider making it read-only.
+
+2. **C++ cache safety** — `NativeValueMarshall::ensure_cached` caches `direct_bind` but has no debug assertion verifying it matches on subsequent calls.
+
+3. **Dead `binding.direct_bind` checks in writable ValueRef paths** — `create_calldata` and `read_calldata` in `valueref.py` have `assert not binding.direct_bind` in writable code paths (reachable only as assertions, since `can_direct_bind` rejects non-read access).
+
+---
+
+### What Phase 2 Needs to Know
+
+Phase 2 builds on Phase 1's `direct_bind` infrastructure. Key context for implementation:
+
+**Current kernel structure** (post-Phase 1, for `int add(int a, int b)` with args `(1, 2)`):
+```slang
+import "module";
+import "slangpy";
+// CallData struct with per-arg type aliases and mapping constants
+struct CallData {
+    typealias _t_a = int;         // Phase 1: raw type (was ValueType<int>)
+    _t_a a;
+    typealias _t_b = int;         // Phase 1: raw type (was ValueType<int>)
+    _t_b b;
+    typealias _t__result = RWValueRef<int>;  // writable _result still wrapped
+    _t__result _result;
+    static const int _m__result = 0;         // mapping constant only for _result
+    uint3 _thread_count;
+    // ... shape arrays if call_data_len > 0 ...
+};
+void _trampoline(CallData call_data /*or __calldata__ on CUDA*/) {
+    int a;
+    a = call_data.a;              // Phase 1: direct assignment (was __slangpy_load)
+    int b;
+    b = call_data.b;              // Phase 1: direct assignment
+    int _result;
+    _result = add(a, b);
+    call_data._result.__slangpy_store(__slangpy_context__.map(_m__result), _result);
+}
+[shader("compute")] [numthreads(32,1,1)]
+void compute_main(..., uniform CallData call_data) {
+    // thread bounds check, context construction
+    _trampoline(call_data);
+}
+```
+
+**Phase 2 goal**: Eliminate the `CallData` struct entirely when ALL args are direct-bind eligible. Pass args as individual `uniform` parameters on the entry point. Inline the function call into `compute_main` (skip trampoline for prim mode).
+
+**Blocking issue for Phase 2**: Auto-created `_result` is a writable `ValueRef` → NOT direct-bind (needs `RWValueRef<T>` wrapper with buffer). Phase 2 must either:
+- Accept that `_result` prevents full CallData elimination for functions with return values, and use a hybrid approach (direct args + `_result` in CallData or as a separate `RWStructuredBuffer` entry point param), OR
+- Add a new code path for `_result` that emits `uniform RWStructuredBuffer<T> _result` as an entry point param with `_result[0] = ...` for the store
+
+**Key files for Phase 2**:
+- `slangpy/core/callsignature.py` — `generate_code()` builds the trampoline and compute_main
+- `slangpy/core/calldata.py` — `CallData.build()` orchestrates the pipeline
+- `slangpy/bindings/codegen.py` — `CodeGen` class manages `call_data_structs` block
+- `src/slangpy_ext/utils/slangpy.cpp` — `NativeCallData::exec()` dispatches; cursor navigation for uniforms
+
+**`BoundVariable.direct_bind`** is already computed for all args by Phase 1. Phase 2 can check `all(arg.direct_bind for arg in all_args)` to decide whether to use the direct-args path.
+
+**Entry point parameter precedent**: See `slangpy/tests/device/test_pipeline_utils.slang` — manually written compute shaders already use individual `uniform` entry point params on all backends (CUDA, Vulkan, D3D12).
+
+**Design decisions deferred to Phase 2**:
+- Whether to support hybrid kernels (some args as entry-point params, some in CallData) or only all-or-nothing
+- How to handle entry-point parameter size limits (CUDA ~4 KB kernel parameter space, D3D12 64-DWORD root signature limit)
+- Whether to inline the function call directly in compute_main for prim mode, or keep a simplified trampoline
+
+---
+
+### Gating Tests — Pre-Implementation Checklist
+
+Before implementing any phase, add **gating tests** to [slangpy/tests/slangpy_tests/test_kernel_gen.py](slangpy/tests/slangpy_tests/test_kernel_gen.py) that assert the CURRENT generated kernel patterns. These tests document the baseline and will intentionally break as each simplification step is implemented.
+
+**Design principles:**
+- All gating tests are code-generation-only (no GPU dispatch) — fast and deterministic
+- All tests use the existing `generate_code()` helper → `func.debug_build_call_data()` → `cd.code`
+- Tests are parametrized across `helpers.DEFAULT_DEVICE_TYPES`
+- String matching (substring checks) rather than regex or golden files
+- Named `test_gate_*` for easy identification
+- WangHashArg and dict/composite tests serve as "negative gates" — they remain passing after simplification
+
+**Test infrastructure** (already present in `test_kernel_gen.py`):
+```python
+def assert_contains(code: str, *patterns: str) -> None: ...
+def assert_not_contains(code: str, *patterns: str) -> None: ...
+def assert_trampoline_has(code: str, *stmts: str) -> None: ...
+def generate_code(device, func_name, module_source, *args, **kwargs) -> str: ...
+def generate_bwds_code(device, func_name, module_source, *args, **kwargs) -> str: ...
+```
+
+**Phase 2 gating tests to add** (assert CURRENT behavior, will break on implementation):
+
+| Test | Asserts (current behavior) | Breaks when |
+|------|---------------------------|-------------|
+| `test_gate_calldata_struct_present` | `struct CallData` present | Step 2.1 |
+| `test_gate_calldata_uniform_param` | `uniform CallData call_data` in `compute_main` | Step 2.2 |
+| `test_gate_thread_count_in_calldata` | `call_data._thread_count` in kernel body | Step 2.4 |
+| `test_gate_context_from_calldata` | `Context __slangpy_context__` present | Step 2.4 |
+| `test_gate_trampoline_present_for_prim` | `void _trampoline(` present | Step 2.5 |
+| `test_gate_trampoline_calls_function` | `_result = add(a, b)` inside trampoline | Step 2.5 |
+| `test_gate_kernel_calls_trampoline` | `_trampoline(` inside `compute_main` | Step 2.5 |
+| `test_gate_wanghasharg_forces_calldata` (negative) | `struct CallData` present with non-eligible arg | Must stay passing |
+
+---
+
+### Verification (all phases)
+
+```powershell
+# Build first (required)
+cmake --build --preset windows-msvc-debug
+
+# Run kernel gen tests
+$env:PRINT_TEST_KERNEL_GEN="1"; pytest slangpy/tests/slangpy_tests/test_kernel_gen.py -v
+
+# Run full test suite
+pytest slangpy/tests -v
+
+# Run pre-commit
+pre-commit run --all-files
+```
+
+### Key Decisions
+
+- Phase 1 changes both `gen_calldata` and trampoline load/store (TensorView-complete pattern, not partial)
+- All dim-0 non-composite types are eligible, excluding writable value refs (which need buffer logic)
+- Phase 2 targets both `entry_point` (CUDA) and `global_data` (Vulkan/D3D12) modes
+- Autograd (bwds mode) is included in simplification, but implemented after prim mode within each phase
+- WangHashArg explicitly excluded from direct binding (needs per-thread `thread_id` computation)
diff --git a/.github/prompts/plan-simplifyKernelGenPhase2-cleanup.prompt.md b/.github/prompts/plan-simplifyKernelGenPhase2-cleanup.prompt.md
new file mode 100644
index 000000000..9aedfefcd
--- /dev/null
+++ b/.github/prompts/plan-simplifyKernelGenPhase2-cleanup.prompt.md
@@ -0,0 +1,611 @@
+## Phase 2: Eliminate CallData Struct
+
+**Goal**: Move kernel uniforms out of the `CallData` struct into individual entry-point parameters. Eliminate the trampoline in forward (prim) mode. Fall back to `ParameterBlock<CallData>` when total inline-uniform size exceeds a runtime per-device threshold.
+
+**Parent plan**: [plan-simplifyKernelGen.prompt.md](plan-simplifyKernelGen.prompt.md)
+
+---
+
+### Key Architectural Decisions
+
+These decisions correct several assumptions in the original plan:
+
+1. **Entry-point param placement is orthogonal to `direct_bind`.** Any type — wrapped or raw — can be an entry-point parameter (e.g., `uniform ValueType<int> a` or `uniform int a` or `uniform Tensor<float,2> t`). `direct_bind` governs whether `__slangpy_load`/`__slangpy_store` is needed inside the kernel; entry-point placement governs where the uniform lives in the shader layout.
+
+2. **Trampoline elimination is independent of `direct_bind`.** The current trampoline body is: declare locals → load (direct assignment or `__slangpy_load`) → call function → store (`__slangpy_store`). All of that can appear directly in `compute_main`. The trampoline only exists because bwds mode needs a `[Differentiable]` wrapper for `bwd_diff()`. In prim mode, it is eliminated regardless of whether args use wrappers.
+
+3. **All-or-nothing fallback.** When total inline-uniform size exceeds the platform threshold, ALL args go back into `ParameterBlock<CallData>` (the current path). No hybrid mixing of entry-point params and CallData.
+
+4. **Shape arrays and `_thread_count` obey the same rules** as user args — they become entry-point params by default, and go into `CallData` on fallback. Phase 2 is NOT scoped only to `call_data_len == 0`.
+
+5. **Two code paths based on where data lives:**
+   - **Fast path** (entry-point params): In Slang, uniforms are entry-point parameters and can be used directly (in forward) or passed directly to the trampoline (in backward).
+   - **Fallback path** (`ParameterBlock<CallData>`): In Slang, uniforms live in a `CallData` struct. They must be read into local variables before being used (in forward) or passed to the trampoline (in backward). This is the current behavior.
+
+6. **C++ dispatch changes are isolated to `NativeCallData::exec`.** Marshalls receive a `ShaderCursor` pointing to wherever their data lives — they don't care whether it's inside a `CallData` struct or an entry-point param. In the fast path, `m_runtime->write_shader_cursor_pre_dispatch()` receives the entry-point cursor directly. No marshall code changes needed.
+
+7. **`CallDataMode` is eliminated.** The `global_data` vs `entry_point` distinction is removed entirely. On the fast path, all backends use entry-point params uniformly. On the fallback path, all backends use `ParameterBlock<CallData>` — CUDA supports `ParameterBlock` and in practice will never hit the fallback (CUDA's inline-uniform limit is ~4KB). This removes the `CallDataMode` enum, the CUDA-specific `is_entry_point` codegen branch in `callsignature.py`, and the corresponding C++ branch in `slangpy.cpp`.
+
+8. **`PackedArg` / param-block types are unchanged.** They stay as `ParameterBlock<T>` at module scope, orthogonal to Phase 2.
+
+---
+
+### Current Kernel Structure (post-Phase 1)
+
+For `int add(int a, int b)` with scalar args `(1, 2)`:
+
+```slang
+import "module";
+import "slangpy";
+
+typealias _t_a = int;             // Phase 1: raw type (was ValueType<int>)
+typealias _t_b = int;             // Phase 1: raw type (was ValueType<int>)
+typealias _t__result = RWValueRef<int>;  // writable _result still wrapped
+static const int _m__result = 0;         // mapping constant only for _result
+struct CallData {
+    _t_a a;
+    _t_b b;
+    _t__result _result;
+    uint3 _thread_count;
+};
+
+void _trampoline(Context __slangpy_context__, CallData __calldata__) {
+    int a;
+    a = __calldata__.a;            // Phase 1: direct assignment
+    int b;
+    b = __calldata__.b;            // Phase 1: direct assignment
+    int _result;
+    _result = add(a, b);
+    __calldata__._result.__slangpy_store(__slangpy_context__.map(_m__result), _result);
+}
+
+[shader("compute")] [numthreads(32,1,1)]
+void compute_main(int3 flat_call_thread_id: SV_DispatchThreadID, ..., uniform CallData call_data) {
+    if (any(flat_call_thread_id >= call_data._thread_count)) return;
+    Context __slangpy_context__ = {flat_call_thread_id};
+    _trampoline(__slangpy_context__, call_data);
+}
+```
+
+### Target Kernel (Phase 2 fast path, prim mode, all direct-bind)
+
+```slang
+import "module";
+
+[shader("compute")]
+[numthreads(32, 1, 1)]
+void compute_main(int3 tid: SV_DispatchThreadID,
+    uniform uint3 _thread_count,
+    uniform int a,
+    uniform int b,
+    uniform RWStructuredBuffer<int> _result)
+{
+    if (any(tid >= _thread_count)) return;
+    _result[0] = add(a, b);
+}
+```
+
+### Target Kernel (Phase 2 fast path, prim mode, mixed direct/non-direct-bind)
+
+When some args are not direct-bind (e.g., WangHashArg needs per-thread `thread_id` via `__slangpy_load`), the non-direct-bind args still use their wrapper types as entry-point params. Context is needed:
+
+```slang
+import "module";
+import "slangpy";
+
+typealias _t_rng = WangHashArgType;  // non-direct-bind wrapper type
+static const int _m_rng = 0;
+
+[shader("compute")]
+[numthreads(32, 1, 1)]
+void compute_main(int3 flat_call_thread_id: SV_DispatchThreadID,
+    uniform uint3 _thread_count,
+    uniform _t_rng rng,
+    uniform int x,
+    uniform RWStructuredBuffer<int> _result)
+{
+    if (any(flat_call_thread_id >= _thread_count)) return;
+    Context __slangpy_context__ = {flat_call_thread_id};
+    int _rng_val;
+    rng.__slangpy_load(__slangpy_context__.map(_m_rng), _rng_val);
+    int _x_val;
+    _x_val = x;
+    int _result_val;
+    _result_val = func(_rng_val, _x_val);
+    _result[0] = _result_val;
+}
+```
+
+### Target Kernel (Phase 2 fallback path, prim mode)
+
+When entry-point param size exceeds the platform limit, all args go into `ParameterBlock<CallData>`. The trampoline is still eliminated in prim mode — the load/call/store is inlined into `compute_main`, reading from `call_data`:
+
+```slang
+import "module";
+import "slangpy";
+
+typealias _t_a = int;
+typealias _t_b = int;
+typealias _t__result = RWValueRef<int>;
+static const int _m__result = 0;
+struct CallData {
+    _t_a a;
+    _t_b b;
+    _t__result _result;
+    uint3 _thread_count;
+};
+ParameterBlock<CallData> call_data;
+
+[shader("compute")]
+[numthreads(32, 1, 1)]
+void compute_main(int3 flat_call_thread_id: SV_DispatchThreadID, ...) {
+    if (any(flat_call_thread_id >= call_data._thread_count)) return;
+    Context __slangpy_context__ = {flat_call_thread_id};
+    int a;
+    a = call_data.a;
+    int b;
+    b = call_data.b;
+    int _result;
+    _result = add(a, b);
+    call_data._result.__slangpy_store(__slangpy_context__.map(_m__result), _result);
+}
+```
+
+---
+
+### Step 2.0: Gating tests ✅
+
+**Status: DONE**
+
+Tests added to [slangpy/tests/slangpy_tests/test_kernel_gen.py](slangpy/tests/slangpy_tests/test_kernel_gen.py). All 21 parametrized cases (7 tests × 3 device types) pass.
+
+| Test | Source | Args | Original assertion | Status |
+|------|--------|------|--------------------|--------|
+| `test_gate_p2_calldata_struct_present` | `int add(int a, int b)` | `(1, 2)` | `struct CallData` in code | ✅ Flipped — now asserts `struct CallData` ABSENT (Step 2.2 done) |
+| `test_gate_p2_calldata_uniform_param` | same | same | `uniform CallData call_data` or `ParameterBlock<CallData>` | ✅ Flipped — now asserts both ABSENT (Step 2.2 done) |
+| `test_gate_p2_thread_count_in_calldata` | same | same | `call_data._thread_count` | ✅ Flipped — now asserts ABSENT (Step 2.2 done) |
+| `test_gate_p2_trampoline_present_for_prim` | same | same | `void _trampoline(` present | Still asserts present (Step 2.3 pending) |
+| `test_gate_p2_kernel_calls_trampoline` | same | same | `_trampoline(` in `compute_main` body | Still asserts present (Step 2.3 pending) |
+| `test_gate_p2_sv_group_id_present` | same | same | `SV_GroupID` in `compute_main` signature | ✅ Flipped — now asserts ABSENT for dim-0 calls (Step 2.2 done) |
+
+Negative gates (must stay passing after Phase 2):
+
+| Test | Asserts |
+|------|---------|
+| `test_gate_p2_wanghasharg_keeps_load` | Non-direct-bind arg still uses `__slangpy_load` |
+
+Bwds gates:
+
+| Test | Status |
+|------|--------|
+| `test_gate_scalar_uses_valuetype` | ✅ Passing — asserts fast-path trampoline with `__in_` prefix params |
+| `test_gate_bwds_scalar_uses_valuetype` | ✅ Passing — bwds trampoline has `no_diff` on all params (Step 2.4 done) |
+
+---
+
+### Step 2.1: Determine fast vs fallback path ✅
+
+**Status: DONE**
+
+In [slangpy/core/calldata.py](slangpy/core/calldata.py), after `calculate_direct_binding(bindings)`:
+
+1. **Query a runtime per-device threshold** for max entry-point parameter inline-uniform size. This is a property of the device/backend: large on CUDA (about 4 KB), but as low as 128–256 bytes on Vulkan and D3D12 (see the per-backend defaults under **Implementation details**).
+2. **Accumulate inline-uniform byte size** of each bound variable's `calldata_type_name`, plus `_thread_count` (12 bytes) and shape arrays (`call_data_len * 3 * sizeof(int)` for `_grid_stride`, `_grid_dim`, `_call_dim`). **Resource types** (`RWStructuredBuffer`, `Texture2D`, `TensorView`, etc.) don't count — they are bound as descriptors, not inline data.
+3. **Decision**: If total size ≤ threshold → `self.use_entrypoint_args = True` (fast path). Otherwise → `self.use_entrypoint_args = False` (fallback path — current behavior).
+4. **Store** `use_entrypoint_args` on the `CallData` instance and propagate to C++ `NativeCallData`.
+
+`PackedArg` / param-block types are excluded from this accounting — they stay as `ParameterBlock<T>` regardless.
+
+**Implementation details:**
+
+- `DeviceLimits.max_entry_point_uniform_size` added to C++ struct ([device.h](src/sgl/device/device.h)) with per-backend defaults: Vulkan=128, D3D12=256, CUDA=4096 bytes ([device.cpp](src/sgl/device/device.cpp)).
+- `calculate_inline_uniform_size()` added to [callsignature.py](slangpy/core/callsignature.py) — sums `vector_type.uniform_layout.size` for each depth-0 bound variable (skipping `PackedArg`), plus 12 bytes for `_thread_count` and `call_dimensionality * 4 * 3` for shape arrays.
+- `use_entrypoint_args` property added to `NativeCallData` C++ class ([slangpy.h](src/slangpy_ext/utils/slangpy.h)) with Python binding.
+- `CallData.__init__()` in [calldata.py](slangpy/core/calldata.py) sets `self.use_entrypoint_args = inline_size <= threshold` after `calculate_direct_binding()`.
+
+**Tests** (7 tests × 3 device types = 21 parametrized cases, all pass):
+
+| Test | Asserts |
+|------|---------|
+| `test_step21_scalar_uses_entrypoint_args` | Simple `int add(int,int)` with `(1,2)` → `use_entrypoint_args=True` |
+| `test_step21_threshold_property_positive` | `device.info.limits.max_entry_point_uniform_size > 0` |
+| `test_step21_vector_uses_entrypoint_args` | `float3` args → `use_entrypoint_args=True` |
+| `test_step21_struct_uses_entrypoint_args` | All-scalar struct dict → `use_entrypoint_args=True` |
+| `test_step21_tensor_uses_entrypoint_args` | Tensor (descriptor-only, 0 inline bytes) → `use_entrypoint_args=True` |
+| `test_step21_many_float4x4_may_exceed_vulkan` | 8×float4x4 (8 × 64 = 512 bytes, plus 12 for `_thread_count` = 524 bytes) exceeds Vulkan/D3D12 thresholds, not CUDA |
+| `test_step21_wanghasharg_uses_entrypoint_args` | Non-direct-bind WangHashArg with small inline size → `use_entrypoint_args=True` |
+
+---
+
+### Step 2.2: Code generation — entry-point params (fast path) ✅
+
+**Status: DONE**
+
+In [slangpy/core/callsignature.py](slangpy/core/callsignature.py) `generate_code()`, when `use_entrypoint_args == True`:
+
+**CodeGen changes** in [slangpy/bindings/codegen.py](slangpy/bindings/codegen.py):
+- Add a `skip_call_data` flag to `CodeGen.__init__`. When `True`, don't emit `struct CallData` / `begin_block()` and gate `end_block()` in `finish()`.
+- Add `self.entry_point_params: list[str] = []` to collect individual uniform param declarations.
+- `finish()` ignores the `call_data` block and `use_param_block_for_call_data` when `skip_call_data` is set.
+
+**CallData struct elimination**: Set `cg.skip_call_data = True` when `use_entrypoint_args`. No `struct CallData` emitted.
+
+**`gen_call_data_code` change** in [slangpy/bindings/boundvariable.py](slangpy/bindings/boundvariable.py): At `depth == 0`, when `use_entrypoint_args`, append to `cg.entry_point_params` instead of `cg.call_data.declare(...)`. The `call_data_structs` block (type aliases, wrapper structs, mapping constants) still gets emitted at module scope.
+
+**`_thread_count` and shape arrays**: Instead of `cg.call_data.append_statement("uint3 _thread_count")`, append to `cg.entry_point_params`. Same for `_grid_stride`, `_grid_dim`, `_call_dim` when `call_data_len > 0`.
+
+**Entry-point signature**: `compute_main` signature becomes:
+```slang
+void compute_main(
+    int3 flat_call_thread_id: SV_DispatchThreadID,
+    [int3 flat_call_group_id: SV_GroupID,]          // only when call_data_len > 0
+    [int flat_call_group_thread_id: SV_GroupIndex,]  // only when call_data_len > 0
+    uniform uint3 _thread_count,
+    [uniform int[N] _grid_stride, ...]               // only when call_data_len > 0
+    uniform _t_a a,
+    uniform _t_b b,
+    uniform _t__result _result
+)
+```
+
+Drop `SV_GroupID` and `SV_GroupIndex` when `call_data_len == 0` — they feed `init_thread_local_call_shape_info` which isn't called when there are no shape arrays.
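+
+As a minimal sketch of the conditional signature described above, the parameter list could be assembled as follows. The helper name `build_compute_main_signature` is hypothetical, and the assumption that `entry_point_params` holds declarations without the `uniform` keyword is ours, not taken from `callsignature.py`:
+
+```python
+def build_compute_main_signature(entry_point_params: list[str], call_data_len: int) -> str:
+    """Assemble a compute_main signature (hypothetical helper, illustration only)."""
+    params = ["int3 flat_call_thread_id: SV_DispatchThreadID"]
+    if call_data_len > 0:
+        # The group semantics only feed init_thread_local_call_shape_info,
+        # which is not called when there are no shape arrays.
+        params.append("int3 flat_call_group_id: SV_GroupID")
+        params.append("int flat_call_group_thread_id: SV_GroupIndex")
+    # Assumed form of collected declarations: "uint3 _thread_count", "_t_a a", ...
+    params.extend(f"uniform {p}" for p in entry_point_params)
+    body = ",\n".join(f"    {p}" for p in params)
+    return f"void compute_main(\n{body}\n)"
+
+
+print(build_compute_main_signature(["uint3 _thread_count", "_t_a a"], call_data_len=0))
+```
+
+With `call_data_len=0` the printed signature contains only `SV_DispatchThreadID` plus the uniforms, matching the dim-0 case.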
+
+**Bounds check**: Changes from `call_data._thread_count` to just `_thread_count`.
+
+**Shape info init**: Changes from `call_data._grid_stride` etc. to just `_grid_stride`, `_grid_dim`, `_call_dim`.
+
+**Fallback path** (`use_entrypoint_args == False`): `struct CallData` is emitted with `ParameterBlock<CallData> call_data` at module scope on ALL backends (including CUDA). The old `CallDataMode` distinction between `entry_point` (CUDA) and `global_data` (non-CUDA) is removed — `ParameterBlock` works on CUDA, and in practice CUDA will never hit the fallback due to its large (~4KB) inline-uniform limit.
+
+See [slangpy/tests/device/test_pipeline_utils.slang](slangpy/tests/device/test_pipeline_utils.slang) for examples of manually-written compute shaders that use entry point parameters on all backends (CUDA, Vulkan, D3D12).
+
+---
+
+### Step 2.3: Trampoline elimination for prim mode
+
+**Status: NOT STARTED** — Trampoline is still generated for prim mode on both paths. The load/call/store sequence needs to be inlined into `compute_main`.
+
+When `call_mode == prim` — on **both** fast and fallback paths:
+
+- Don't generate the `_trampoline` function.
+- Inline the load/call/store sequence directly into `compute_main` after the bounds check and (if needed) Context construction.
+- The load/call/store codegen reuses the same logic currently in [callsignature.py lines 378–449](slangpy/core/callsignature.py#L378-L449), but emitted into `cg.kernel` instead of `cg.trampoline` with adjusted `data_name`:
+
+| Path | `data_name` for non-param-block args |
+|------|-------------------------------------|
+| Fast | `x.variable_name` (entry-point param name directly) |
+| Fallback | `call_data.{x.variable_name}` (global `ParameterBlock<CallData>`, all backends) |
+| Param blocks | `_param_{x.variable_name}` (unchanged) |
+
+**Context construction**: Needed only when any arg is non-direct-bind (i.e., calls `__slangpy_load`/`__slangpy_store`). When all args satisfy `direct_bind == True`, skip Context construction entirely — no `Context __slangpy_context__` declaration, no `import "slangpy"`.
+
+**Note**: The trampoline elimination does NOT depend on `direct_bind`. Even non-direct-bind args with `__slangpy_load` work inline in `compute_main` — the `__slangpy_load` call just needs the data reference and a `Context` value, both available in `compute_main`.
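+
+The `data_name` table and the Context rule above can be sketched together. `Arg` is a hypothetical stand-in for `BoundVariable` that carries only the fields this illustration needs, and the function names are assumptions rather than existing helpers:
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class Arg:
+    """Hypothetical stand-in for BoundVariable (illustration only)."""
+    variable_name: str
+    direct_bind: bool
+    create_param_block: bool = False
+
+
+def prim_data_name(arg: Arg, use_entrypoint_args: bool) -> str:
+    """Name used to reference an argument's data inside compute_main."""
+    if arg.create_param_block:
+        return f"_param_{arg.variable_name}"      # unchanged on both paths
+    if use_entrypoint_args:
+        return arg.variable_name                  # fast path: entry-point param
+    return f"call_data.{arg.variable_name}"       # fallback: global ParameterBlock
+
+
+def needs_context(args: list[Arg]) -> bool:
+    """Context is required as soon as one argument is not direct-bind."""
+    return any(not a.direct_bind for a in args)
+
+
+args = [Arg("a", direct_bind=True), Arg("rng", direct_bind=False)]
+print([prim_data_name(a, use_entrypoint_args=True) for a in args], needs_context(args))
+```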
+
+---
+
+### Step 2.4: Trampoline with individual params for bwds mode ✅
+
+**Status: DONE** — Fast-path trampoline takes individual params with `no_diff` on all params. All 3 device types pass.
+
+When `call_mode == bwds`:
+
+- Still generate a `[Differentiable]` trampoline function.
+- **Fast path**: Trampoline takes individual params instead of a struct. All params get `no_diff` — entry-point uniforms are never differentiable. Differentiation happens through local variable assignments inside the trampoline body, matching the struct-based approach where `CallData` was implicitly non-differentiable. No `in`/`out`/`inout` modifiers are added — `compute_main` passes its uniforms straight through:
+  ```slang
+  [Differentiable]
+  void _trampoline(Context __slangpy_context__, no_diff float __in_a, no_diff float __in_b, no_diff NoneType __in__result)
+  ```
+  `compute_main` calls `bwd_diff(_trampoline)(__slangpy_context__, a, b, _result)` passing entry-point param names directly.
+- **Fallback path**: Trampoline reads from global `ParameterBlock<CallData> call_data` as it does today (on all backends). `compute_main` calls `bwd_diff(_trampoline)(__slangpy_context__, call_data)`.
+- `_gen_trampoline_argument()` in `boundvariable.py` remains unused dead code — the inline generation in `callsignature.py` is simpler and avoids the `in`/`out`/`inout` modifiers that caused Slang autodiff errors.
+
+**Key insight**: Adding `in`/`out`/`inout` modifiers to trampoline params caused Slang autodiff issues (e.g., `out` params get reversed to `in` by `bwd_diff`, changing arity). The trampoline params are just pass-through uniforms — all data flow logic (loads, stores, differentiation) is handled internally via local variables.
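+
+To illustrate the fast-path shape, here is a minimal sketch that builds the trampoline signature and the matching `bwd_diff` call from `(type, name)` pairs. Both function names are hypothetical; the actual inline generation lives in `callsignature.py` and may be structured differently:
+
+```python
+def bwds_trampoline_signature(params: list[tuple[str, str]]) -> str:
+    """Build the fast-path trampoline signature (hypothetical helper)."""
+    args = ["Context __slangpy_context__"]
+    # Every pass-through uniform gets no_diff and the __in_ prefix;
+    # no in/out/inout modifiers are added.
+    args += [f"no_diff {type_name} __in_{name}" for type_name, name in params]
+    return "[Differentiable]\nvoid _trampoline(" + ", ".join(args) + ")"
+
+
+def bwds_trampoline_call(params: list[tuple[str, str]]) -> str:
+    """Build the call emitted in compute_main (hypothetical helper)."""
+    names = ", ".join(name for _, name in params)
+    return f"bwd_diff(_trampoline)(__slangpy_context__, {names});"
+
+
+params = [("float", "a"), ("float", "b"), ("NoneType", "_result")]
+print(bwds_trampoline_signature(params))
+print(bwds_trampoline_call(params))
+```
+
+For the three parameters shown, this reproduces the signature and call quoted in the fast-path bullet above.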
+
+---
+
+### Step 2.5: C++ dispatch changes ✅
+
+**Status: DONE** — `CallDataMode` enum fully removed. Fast path uses `find_entry_point(0)` on all backends. Fallback path uses global `ParameterBlock<CallData>` on all backends.
+
+In [src/slangpy_ext/utils/slangpy.cpp](src/slangpy_ext/utils/slangpy.cpp), store `m_use_entrypoint_args` on `NativeCallData` (received from Python `CallData`). Also add to [slangpy.h](src/slangpy_ext/utils/slangpy.h).
+
+Modify `bind_call_data` lambda in `exec()`:
+
+**Fast path** (`m_use_entrypoint_args == true`):
+- All backends: Navigate via `cursor.find_entry_point(0)`. This is the entry-point cursor.
+- Write `_thread_count` as an entry-point param: `entry_point_cursor["_thread_count"]`.
+- Write shape arrays as entry-point params: `entry_point_cursor["_grid_stride"]`, etc.
+- Pass `entry_point_cursor` as the `call_data_cursor` argument to `m_runtime->write_shader_cursor_pre_dispatch()`. Each `NativeBoundVariableRuntime` already navigates `cursor[m_variable_name]`, so it finds the entry-point param by name automatically. **No marshall code changes needed.**
+- Cache entry-point param field indices on first call (analogous to existing `m_cached_call_data_offsets`).
+- The `reserve_data` + raw-pointer optimization for `_thread_count` and shape arrays may not work for individual entry-point params at disjoint offsets. Use cursor-based writes for these metadata fields (they're small, performance impact minimal), or check if `reserve_data` still works across the entry-point shader object.
+
+**Fallback path** (`m_use_entrypoint_args == false`):
+- All backends: Navigate to global `call_data` field via `cursor.find_field("call_data")`, dereference (it's a `ParameterBlock`), write struct data. The old `CallDataMode` branch (CUDA using `find_entry_point(0)` for call_data) is removed. Remove `m_call_data_mode`, `CallDataMode` enum, and all associated branches from `slangpy.h`, `slangpy.cpp`, `calldata.py`, and `callsignature.py`.
+
+---
+
+### Step 2.6: `_result` handling
+
+**Status: NOT STARTED**
+
+Auto-created `_result` is a writable `ValueRef`, currently NOT direct-bind eligible (needs `RWValueRef<T>` wrapper with buffer logic). Phase 2 handles this differently on the two paths:
+
+**Fast path**: `_result` is emitted as `uniform RWValueRef<int> _result` on the entry point. In prim mode, the inlined code stores via `_result.__slangpy_store(...)`. In the all-direct-bind case where Context is omitted, add a new code path: emit `uniform RWStructuredBuffer<T> _result` with `_result[0] = value` for the store. This requires `ValueRefMarshall` to support writable direct-bind for the entry-point-param case specifically, using `RWStructuredBuffer<T>` instead of `RWValueRef<T>`.
+
+**Fallback path**: `_result` stays as `RWValueRef<T>` inside `CallData`, same as current behavior.
+
+**Implementation note**: The `RWStructuredBuffer<T>` approach for `_result` is only used when `use_entrypoint_args == True` AND all other args are direct-bind (so Context can be omitted). When non-direct-bind args are present, Context exists and `_result` can continue to use `RWValueRef<T>.__slangpy_store(context, value)`.
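+
+The three cases above reduce to a small decision, sketched here under the assumption that the choice depends only on `use_entrypoint_args` and on whether every other argument is direct-bind. The function `result_declaration` is hypothetical and not part of `ValueRefMarshall`:
+
+```python
+def result_declaration(element_type: str, use_entrypoint_args: bool, all_direct_bind: bool) -> str:
+    """Pick the Slang declaration for the auto-created _result (illustration only)."""
+    if use_entrypoint_args and all_direct_bind:
+        # Context is omitted, so the store is written as `_result[0] = value`.
+        return f"uniform RWStructuredBuffer<{element_type}> _result"
+    if use_entrypoint_args:
+        # Context exists, so `_result.__slangpy_store(...)` keeps working.
+        return f"uniform RWValueRef<{element_type}> _result"
+    # Fallback path: a field inside struct CallData, same as current behavior.
+    return f"RWValueRef<{element_type}> _result"
+
+
+for fast, direct in [(True, True), (True, False), (False, True)]:
+    print(result_declaration("int", fast, direct))
+```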
+
+---
+
+### Step 2.7: Tests
+
+**Status: PARTIAL** — Tests for completed Phase 2 steps added to [test_code_gen.py](slangpy/tests/slangpy_tests/test_code_gen.py). Remaining tests for Step 2.3 (trampoline elimination) and Step 2.6 (`_result` as `RWStructuredBuffer`) will be added when those steps are implemented.
+
+**Tests added** (in [test_code_gen.py](slangpy/tests/slangpy_tests/test_code_gen.py), tests 35–38, 40):
+
+| Test | Verifies | Merges from test_kernel_gen.py |
+|------|----------|-------------------------------|
+| `test_entrypoint_params_scalar_dim0` (#35) | Fast path: no `struct CallData`, individual `uniform` params, `_thread_count` direct, `SV_GroupID` absent at dim-0, `use_entrypoint_args=True` | `test_gate_p2_calldata_struct_absent_fast_path`, `test_gate_p2_individual_uniform_params`, `test_gate_p2_thread_count_direct`, `test_gate_p2_sv_group_id_absent_dim0`, `test_step21_scalar_uses_entrypoint_args` |
+| `test_entrypoint_params_vectorized` (#36) | Vectorized fast path: shape arrays as entry-point params, `SV_GroupID`/`SV_GroupIndex` present, no `struct CallData` | (new — covers vectorized entry-point param path) |
+| `test_entrypoint_params_non_direct_bind` (#37) | Non-direct-bind arg (WangHashArg) on fast path: no `struct CallData`, wrapper type used, `__slangpy_load`/`Context` present | `test_gate_p2_wanghasharg_keeps_load`, `test_step21_wanghasharg_uses_entrypoint_args` |
+| `test_bwds_entrypoint_no_diff_params` (#38) | Bwds fast path: trampoline params have `no_diff` and `__in_` prefix, `bwd_diff(_trampoline)` passes individual args, `[Differentiable]` before trampoline | (new — covers Step 2.4 bwds trampoline) |
+| `test_fallback_calldata_large_params` (#40) | Fallback path: 8×float4x4 exceeds threshold → `ParameterBlock<CallData>`, `call_data._thread_count`; CUDA stays fast path | `test_step21_many_float4x4_may_exceed_vulkan` (adds codegen assertions) |
+
+**Post-implementation tests** — to be added when remaining steps are complete:
+
+| Test | Verifies | Blocked on |
+|------|----------|------------|
+| `test_phase2_no_trampoline_prim` | No `void _trampoline(` for prim-mode calls | Step 2.3 |
+| `test_phase2_inline_call` | Function call inlined directly in `compute_main` | Step 2.3 |
+| `test_phase2_no_context_all_direct` | No `Context __slangpy_context__` when all args direct-bind | Step 2.3 |
+| `test_phase2_fallback_no_trampoline_prim` | Even fallback path eliminates trampoline in prim mode | Step 2.3 |
+
+---
+
+### Implementation Order
+
+1. **Step 2.0** ✅ — Gating tests (baseline documentation)
+2. **Step 2.1** ✅ — Fast/fallback determination + size query
+3. **Step 2.2 + 2.5** ✅ — Code gen + C++ dispatch for entry-point params + `CallDataMode` removal (landed together)
+4. **Step 2.4** ✅ — Bwds trampoline with individual params (fast path) — `no_diff` on all params
+5. **Step 2.3** — Trampoline elimination for prim mode (both paths)
+6. **Step 2.6** — `_result` as `RWStructuredBuffer<T>` for all-direct-bind case
+7. **Step 2.7** — Post-implementation tests + functional tests
+
+**Note:** Implementation order deviated from original plan — Steps 2.2 + 2.5 were done before 2.3 (trampoline elimination), combined with `CallDataMode` removal. Step 2.4 done — all trampoline params use `no_diff` without IO modifiers.
+
+---
+
+### Key Files
+
+| File | Changes |
+|------|---------|
+| [slangpy/core/calldata.py](slangpy/core/calldata.py) | ✅ `use_entrypoint_args` flag, size threshold check, `CallDataMode` removed |
+| [slangpy/core/callsignature.py](slangpy/core/callsignature.py) | ✅ Entry-point params, fast/fallback code paths, `is_entry_point` branch removed. Trampoline still generated (Step 2.3 pending). Bwds `no_diff` on all trampoline params (Step 2.4 done). |
+| [slangpy/bindings/codegen.py](slangpy/bindings/codegen.py) | ✅ `skip_call_data` flag, `entry_point_params` list |
+| [slangpy/bindings/boundvariable.py](slangpy/bindings/boundvariable.py) | ✅ `gen_call_data_code` depth-0 entry-point path. `_gen_trampoline_argument()` unused — inline generation in `callsignature.py` used instead. |
+| [slangpy/bindings/marshall.py](slangpy/bindings/marshall.py) | ✅ `use_entrypoint_args` field on `BindContext`, `CallDataMode` removed |
+| [src/slangpy_ext/utils/slangpy.cpp](src/slangpy_ext/utils/slangpy.cpp) | ✅ `use_entrypoint_args` binding; `bind_call_data` fast path via `find_entry_point(0)`, `CallDataMode` branches removed |
+| [src/slangpy_ext/utils/slangpy.h](src/slangpy_ext/utils/slangpy.h) | ✅ `m_use_entrypoint_args` on `NativeCallData`; `m_call_data_mode` removed |
+| [src/sgl/device/device.h](src/sgl/device/device.h) | ✅ `max_entry_point_uniform_size` on `DeviceLimits` |
+| [src/sgl/device/device.cpp](src/sgl/device/device.cpp) | ✅ Per-backend defaults for `max_entry_point_uniform_size` |
+| [src/slangpy_ext/device/device.cpp](src/slangpy_ext/device/device.cpp) | ✅ Python binding for `max_entry_point_uniform_size` |
+| [src/sgl/utils/slangpy.h](src/sgl/utils/slangpy.h) | ✅ `CallDataMode` enum removed |
+| [slangpy/core/dispatchdata.py](slangpy/core/dispatchdata.py) | ✅ `CallDataMode` removed |
+| [slangpy/core/packedarg.py](slangpy/core/packedarg.py) | ✅ `CallDataMode` removed |
+| [slangpy/core/function.py](slangpy/core/function.py) | ✅ `CallDataMode` removed from imports |
+| [slangpy/slangpy/__init__.pyi](slangpy/slangpy/__init__.pyi) | ✅ `CallDataMode` class and `call_data_mode` property removed |
+| [slangpy/tests/slangpy_tests/test_type_resolution.py](slangpy/tests/slangpy_tests/test_type_resolution.py) | ✅ `CallDataMode` removed from `BindContext` creation |
+| [slangpy/tests/slangpy_tests/test_kernel_gen.py](slangpy/tests/slangpy_tests/test_kernel_gen.py) | ✅ Gating tests + Step 2.1 tests updated for new behavior |
+| [slangpy/tests/slangpy_tests/test_code_gen.py](slangpy/tests/slangpy_tests/test_code_gen.py) | ✅ Phase 2 tests 35–38, 40 added (Step 2.7 partial) |
+
+---
+
+### Verification
+
+```powershell
+# Build first (required)
+cmake --build --preset windows-msvc-debug
+
+# Run kernel gen tests
+$env:PRINT_TEST_KERNEL_GEN="1"; pytest slangpy/tests/slangpy_tests/test_kernel_gen.py -v
+
+# Run full test suite
+pytest slangpy/tests -v
+
+# Run pre-commit
+pre-commit run --all-files
+```
+
+---
+
+### PR #862 Code Review — Proposed Improvements
+
+#### High Severity
+
+**1. Potential correctness bug — fast-path shape offset caching guarded by runtime data**
+
+In [slangpy.cpp](src/slangpy_ext/utils/slangpy.cpp) `bind_call_data`, the fast-path caching block guards shape offset caching with `call_shape.size() > 0`. If the *first* call to a multi-dimensional `NativeCallData` uses `has_thread_count=true` (which returns empty `call_shape`), shape offsets won't be cached. A subsequent normal call would find `is_valid == true` but shape offsets would be uninitialized, leading to writes at garbage offsets. The fallback path is more robust, using `call_dim.is_valid()` instead.
+
+**DO NOT FIX**: Reason: The '_thread_count' is written to the call signature, so by definition a given call data would never be used in both situations.
+
+**2. Benchmark changes are debugging artifacts**
+
+[test_benchmark_autograd.py](slangpy/benchmarks/test_benchmark_autograd.py) changes `ITERATIONS` 10→100, `WARMUPS` 10→1000, `RUN_SLANGTORCH_BENCHMARK` False→True. This will make CI benchmarks 10–100× slower. Revert to original values.
+
+**FIXED**: Restored `ITERATIONS=10`, `WARMUPS=10`, `RUN_SLANGTORCH_BENCHMARK=False`.
+
+**3. Overly broad `except Exception` in calldata.py fallback**
+
+[calldata.py](slangpy/core/calldata.py): The fallback from fast path to `ParameterBlock<CallData>` catches `except Exception`, which swallows `TypeError`, `KeyError`, `AttributeError`, etc. The caught exception `e` is never logged.
+
+**FIXED**: Narrowed to `except RuntimeError as e` and included `str(e)` in the debug message.
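+
+A minimal sketch of the narrowed handler, assuming the fast-path build signals failure with `RuntimeError`. The callables `build_fast` and `build_fallback` and the logger name are hypothetical placeholders, not the actual `calldata.py` code:
+
+```python
+import logging
+from typing import Callable
+
+logger = logging.getLogger("calldata_sketch")
+
+
+def build_with_fallback(build_fast: Callable[[], None], build_fallback: Callable[[], None]) -> bool:
+    """Return True if the fast path was used, False if the fallback was needed."""
+    try:
+        build_fast()
+        return True
+    except RuntimeError as e:
+        # Only RuntimeError triggers the fallback; TypeError, KeyError,
+        # AttributeError and other programming errors propagate to the caller.
+        logger.debug("Fast path failed, using ParameterBlock<CallData>: %s", str(e))
+        build_fallback()
+        return False
+```
+
+With this shape, a `TypeError` raised inside the fast-path build surfaces immediately instead of being silently converted into a fallback.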
+
+---
+
+#### Medium Severity — Structural
+
+**4. `generate_code()` in callsignature.py is too long (~334 lines)**
+
+Extract into sub-functions:
+
+| Lines | Extract to | Purpose |
+|-------|-----------|---------|
+| ~L294–L339 | `_validate_and_compute_group_shape()` | Group shape validation & stride computation |
+| ~L341–L388 | `_generate_link_time_constants()` | Link-time constants (group shape/stride arrays) |
+| ~L390–L409 | `_generate_shape_params()` | Shape array & `_thread_count` param gen (fast/fallback) |
+| ~L415–L517 | `_generate_trampoline()` | Trampoline function (signature, loads, call, stores) |
+| ~L520–L565 | `_generate_entry_point_signature()` | Compute/ray-tracing entry-point signature |
+| ~L567–L604 | `_generate_kernel_body()` | Kernel body (bounds check, shape init, dispatch) |
+
+Additionally, the duplicated `data_name` computation at ~L449 and ~L497 should be extracted:
+```python
+def _data_name(x: BoundVariable, use_entrypoint_args: bool) -> str:
+    if x.create_param_block:
+        return f"_param_{x.variable_name}"
+    return f"__in_{x.variable_name}" if use_entrypoint_args else f"call_data.{x.variable_name}"
+```
+
+**DO NOT FIX**: Reason: This is a complex change and will be deferred to a later step.
+
+**5. `bind_call_data` in slangpy.cpp has ~70 lines of duplicated write logic**
+
+The `reserve_data` + `write_strided_array_helper` ×3 + `write_value_helper` + `write_shader_cursor_pre_dispatch` sequence is identical between fast and fallback paths. Extract a helper that takes a `ShaderCursor`:
+
+```cpp
+auto write_uniforms = [&](ShaderCursor target) {
+    ShaderObject* so = target.shader_object();
+    void* base = so->reserve_data(offsets.field_offset, offsets.field_size);
+    // ... write shape arrays, thread_count ...
+    m_runtime->write_shader_cursor_pre_dispatch(context, cursor, target, ...);
+};
+```
+
+Fast path → `write_uniforms(ep)`, fallback → `write_uniforms(call_data_cursor)`.
+
+**FIXED**: Extracted `write_uniforms` lambda taking `(ShaderCursor target, ShaderCursor root_cursor)`. Fast path calls `write_uniforms(ep, cursor)`, fallback calls `write_uniforms(call_data_cursor, cursor)`.
+
+**6. `_try_build_shader` parameter pattern in calldata.py**
+
+Takes `use_entrypoint_args` parameter then immediately sets `self.use_entrypoint_args` and `context.use_entrypoint_args`. The method never reads the flag except to store it.
+
+**FIXED**: Caller sets `self.use_entrypoint_args` before calling; `_try_build_shader` reads `self.use_entrypoint_args` and sets `context.use_entrypoint_args`. Parameter removed.
+
+---
+
+#### Low Severity
+
+**7. Unconditional `print(code)` in test_kernel_gen.py L107** — should be guarded by `PRINT_TEST_KERNEL_GEN` env var.
+
+**FIXED**: Guarded with `if PRINT_TEST_KERNEL_GEN:` (existing module-level flag).
+
+**8. Test duplication** — ~30 tests near-identical between test_kernel_gen.py and test_code_gen.py. The merged tests in test_code_gen.py should replace the originals.
+
+**DO NOT FIX**: Reason: The kernel gen tests are temporary, designed for gating, and will be deleted once phases are complete.
+
+**9. Unused `nodes` variable** — [callsignature.py L278](slangpy/core/callsignature.py): `nodes: list[BoundVariable] = []` declared but never used.
+
+**FIXED**: Deleted unused variable.
+
+**10. Stale docstring** — [callsignature.py L275](slangpy/core/callsignature.py): Says "Generate a list of call data nodes" — doesn't match what the function does.
+
+**FIXED**: Updated to "Generate Slang kernel code for the given function call signature."
+
+**11. Missing return type annotations** — `generate_code()`, `generate_constants()`, `CallData.build()` all need `-> None`.
+
+**FIXED**: Added `-> None` to `generate_code()`, `generate_constants()`, `CallData.build()`, and `_try_build_shader()`.
+
+**12. `type_conformances: Any`** — [calldata.py](slangpy/core/calldata.py) should be `list[TypeConformance]`.
+
+**FIXED**: Changed to `list["TypeConformance"]` and added `TypeConformance` to the `from slangpy import (...)` block.
+
+**13. Bare `except:`** — [callsignature.py L59](slangpy/core/callsignature.py): `is_generic_vector` catches all exceptions including `SystemExit`. Use `except Exception:`.
+
+**FIXED**: Changed to `except Exception:`.
+
+**14. Typo: `santized_module`** — [calldata.py](slangpy/core/calldata.py): Missing 'i'. Pre-existing.
+
+**DO NOT FIX**: Reason: Cosmetic typo in a variable name that's used in multiple places. Fixing would require renaming across the file, which is low value and risks introducing bugs.
+
+**15. D3D12 `max_entry_point_uniform_size = 256` may be optimistic** — root descriptors consume some of the 64-DWORD root signature budget. Comment should note shared budget; consider smaller default.
+
+**DO NOT FIX**: Reason: More complex logic is actually needed and can be addressed later.
+
+**16. Fallback path always includes `SV_GroupID`/`SV_GroupIndex`** — even when `call_data_len == 0`. Asymmetric with fast path.
+
+**DO NOT FIX**: Reason: Can be addressed later.
+
+**17. Hash salt `"[CallData]\n"`** — emitted even when CallData struct is absent. Cosmetic.
+
+**FIXED**: Removed `"[CallData]\n"` prefix from hash salt.
+
+**18. `Tuple` import in test_code_gen.py** — should use lowercase `tuple[...]` for consistency.
+
+**FIXED**: Changed to `tuple[...]` and removed `Tuple` from typing import.
+
+---
+
+#### Additional Findings (subagent review, March 2026)
+
+**19. Latent correctness bug — `can_direct_bind_common()` missing write-access guard**
+
+[boundvariable.py](slangpy/bindings/boundvariable.py) `can_direct_bind_common()` does not check whether the binding has write access. This creates an inconsistency:
+
+- `ValueRefMarshall.can_direct_bind()` explicitly rejects writable bindings — correct
+- `StructMarshall.can_direct_bind()` with children checks `access[0] == AccessType.read` — correct
+- `StructMarshall.can_direct_bind()` without children falls through to `can_direct_bind_common()` — **missing access check**
+- `ValueMarshall.can_direct_bind()` delegates entirely to `can_direct_bind_common()` — safe in practice (`ValueMarshall.is_writable = False`) but fragile
+
+If a writable dim-0 leaf binding gets `direct_bind=True`, `ValueMarshall.gen_trampoline_store()` returns `True` without emitting store code, silently dropping writes.
+
+**DO NOT FIX**: Reason: This logic is subtle but correct, based on the desired behaviour.
+
+**20. Dead `_gen_trampoline_argument()` method**
+
+[boundvariable.py](slangpy/bindings/boundvariable.py) `_gen_trampoline_argument()` is never called anywhere in the codebase. The inline generation in [callsignature.py](slangpy/core/callsignature.py) replaced it.
+
+**FIXED**: Deleted the method.
+
+**21. Redundant `hasattr` guard in `calculate_direct_bind()`**
+
+[boundvariable.py](slangpy/bindings/boundvariable.py) `calculate_direct_bind()` uses `hasattr(self.python, "can_direct_bind")`, which is always `True` because `Marshall` base class defines `can_direct_bind()`. Simplify to `if self.python is not None:`.
+
+**DO NOT FIX**: Reason: For marshalls that inherit directly from NativeMarshall, this is not necessarily true.
+
+**22. Unnecessary `getattr` in `can_direct_bind_common()`**
+
+[boundvariable.py](slangpy/bindings/boundvariable.py) `can_direct_bind_common()` uses `getattr(binding, "create_param_block", False)`. `BoundVariable.__init__()` always sets `create_param_block`, so `binding.create_param_block` suffices.
+
+**FIXED**: Replaced `getattr(binding, "create_param_block", False)` with `binding.create_param_block`.
+
+**23. Wasteful `CodeGen.call_data` initialization when `skip_call_data=True`**
+
+[codegen.py](slangpy/bindings/codegen.py) `__init__` unconditionally calls `self.call_data.append_line("struct CallData")` and `begin_block()`, even when `skip_call_data=True`. The block is never serialized so there's no output impact, but it allocates a dangling block object.
+
+**DO NOT FIX**: Reason: Harmless — the block is never emitted. Restructuring `__init__` to conditionally skip initialization adds complexity for no functional benefit.
+
+**24. `entry_point_params` ownership pattern undocumented**
+
+[codegen.py](slangpy/bindings/codegen.py) collects `entry_point_params` via `boundvariable.py`, but [callsignature.py](slangpy/core/callsignature.py) reads and emits them. This cross-module ownership pattern is unconventional and lacks a comment explaining the flow.
+
+**DO NOT FIX**: Reason: `CodeGen` is already a shared state bag consumed by multiple modules. Adding a comment is fine but not blocking.
+
+**25. `direct_bind` and `use_entrypoint_args` exposed as read-write in `.pyi` stubs**
+
+[__init__.pyi](slangpy/slangpy/__init__.pyi) exposes `direct_bind` on `NativeBoundVariableRuntime` and `use_entrypoint_args` on `NativeCallData` with setters. Mutating these after first dispatch could invalidate cached cursor offsets in `NativeValueMarshall::ensure_cached`.
+
+**DO NOT FIX**: Reason: These are set during `CallData` construction before first dispatch. The cached `NativeCallData` is per-signature, so a new signature gets a fresh instance. Post-construction mutation would require going through `debug_build_call_data` which rebuilds everything. Not a practical concern.
+
+**26. No fallback-path codegen test in `test_code_gen.py`**
+
+[test_code_gen.py](slangpy/tests/slangpy_tests/test_code_gen.py) has no test that forces `use_entrypoint_args=False` (e.g., by exceeding `max_entry_point_uniform_size`) and asserts the `ParameterBlock<CallData>` codegen. The `test_step21_many_float4x4_may_exceed_vulkan` in `test_kernel_gen.py` checks the flag but not the generated code.
+
+**FIXED**: Added `test_fallback_calldata_large_params` (#40) in `test_code_gen.py` — asserts `ParameterBlock<CallData>` codegen on Vulkan/D3D12 and fast-path codegen on CUDA.
+
+**27. No test for writable `inout` struct at dim-0**
+
+No test verifies the behavior of a writable (inout) dim-0 struct with all-scalar fields. This is the scenario where Fix 19 would prevent silent write loss.
+
+**Fix**: Once Fix 19 is applied, add a test of a writable dim-0 struct dict that confirms `direct_bind=False`.
+
+**Status: NOT FIXED** — blocked on Fix 19.