
Commit 6f2cf3f

docs: update README + features-matrix + ai-skills for Phase 0-9 surface
- README: new sibling-crate rows (cub/cutlass/flashattn/tensorrt/telemetry), aggregate-features paragraph (`core-libs`/`training-libs`/`full-cuda`/`observability-full`), expanded layout diagram, twelve-crate Status block linking to gpu-testing.md.
- docs/features-matrix.md: rewritten for the 12-crate workspace with per-goal recipes (fa, CUTLASS, TensorRT, observability), full Phase 1-9 cargo-feature reference, and per-crate sections for cub/cutlass/flashattn/tensorrt/telemetry.
- ai-skills: three new SKILL.md files (atomr-accel-flashattn, atomr-accel-cutlass, atomr-accel-tensorrt) covering dispatch keys, request lifecycle, IBuilderConfig, mock-vs-real wiring, and common pitfalls. ai-skills/README.md picks up matching rows + AGENTS.md bullets.
1 parent 9d92885 commit 6f2cf3f

6 files changed

Lines changed: 945 additions & 76 deletions


README.md

Lines changed: 63 additions & 18 deletions
@@ -78,6 +78,11 @@ supervision, typed messages, async/await throughout.
7878
| `atomr-accel-train` | Distributed-training blueprints — `DataParallelTrainer`, `PipelineParallelTrainer`, `TensorParallelTrainer`, `AsyncParameterServer`, optimizer + loss enums |
7979
| `atomr-accel-agents` | LLM blueprints — `RagPipeline` (with `EmbeddingCache` LRU + `CpuVectorIndex`), `SharedGpuStateCoordinator`, `LangGraphGpuActor` (DAG executor with cycle detection) |
8080
| `atomr-accel-cuda-realtime` | NVRTC-backed realtime sims — `ImageFilterPipeline`, `ParticleSystemActor`, `ClothSimulationActor`, `FluidSimulationActor`, `SpatialIndexActor`, `GpuHashMapActor`, `GpuSparseStructureActor`, `MultiPassAnalysisActor`, `VideoEffectsGraph` |
81+
| `atomr-accel-cub` | CUB device-wide primitives — `CubActor` with reduce / scan / sort / histogram / select / partition / segmented-reduce dispatchers, NVRTC-templated per `(op, dtype, length-class)` |
82+
| `atomr-accel-cutlass` | CUTLASS kernel-template instantiation — `CutlassActor` for GEMM, grouped-GEMM, implicit-GEMM convolution, EVT (epilogue visitor tree), via NVRTC against vendored headers |
83+
| `atomr-accel-flashattn` | FlashAttention v2 + v3 kernels — `FlashAttnActor` with forward/backward, paged KV-cache, chunked prefill, varlen, ALiBi, sliding window, sink tokens, MQA/GQA, fp8 (fa3 only) |
84+
| `atomr-accel-tensorrt` | TensorRT engine builder + runtime — `TrtActor`, `IBuilderConfig` (fp32/fp16/bf16/int8/fp8/best), ONNX import, INT8 calibration, FP8 PTQ, `IPluginV3` Rust trampolines |
85+
| `atomr-accel-telemetry` | Observability backends — `NvtxKernelTrace` for kernel-range markers, `NvmlActor` for power/temp/ECC/clocks, `CuptiSession` for activity tracing |
8186
| `atomr-accel-py` | Python bindings via PyO3 — `atomr_accel.{System, Device, GpuBuffer}`, typed exceptions, GIL-released kernel paths |
8287

8388
Plus a Python facade — `pip install atomr-accel` — that exposes the
@@ -207,9 +212,19 @@ the GIL-release contract, and mock-mode tests.
207212
| [CUDA Graphs][cuda-graph] | `GraphActor` | [`cuGraphInstantiate` / `cuGraphLaunch`][cuda-graph-api] | always-on |
208213
| [Peer-to-peer][cuda-p2p] | `P2pTopology` | [`cuMemcpyPeerAsync`][cuda-memcpy-peer] | always-on |
209214

210-
Aggregate features: `core-libs` = `cudnn` + `cufft` + `curand` +
211-
`cusparse`. `training-libs` = `core-libs` + `cusolver` + `cublaslt` +
212-
`nvrtc` + `cutensor`. `full-cuda` = `training-libs` + `nccl`.
215+
Aggregate features:
216+
- `core-libs` = `cudnn` + `cufft` + `curand` + `cusparse` + `cutensor` + `cuda-managed`.
217+
- `training-libs` = `core-libs` + `cusolver` + `cublaslt` + `nvrtc`.
218+
- `full-cuda` = `training-libs` + `nccl` + `cuda-ipc` + `graphs-conditional`.
219+
- `observability-full` = `telemetry` + `nvtx-trace` + `nvml` + `cupti`.
220+
221+
Sibling-crate gates (off by default; pull each in by enabling the
222+
matching feature on `atomr-accel-cuda`):
223+
224+
- `cutlass` (+ `cutlass-evt`, `cutlass-grouped`, `cutlass-prebuilt`).
225+
- `flashattn` (+ `flashattn-fp8`, `flashattn-paged`).
226+
- `tensorrt` (+ `tensorrt-onnx`, `tensorrt-plugin`, `tensorrt-int8`, `tensorrt-fp8`).
227+
- `nvtx-trace`, `nvml`, `cupti` — Phase 9 telemetry backends, layered on `telemetry`.
213228

214229
## atomr integrations
215230

@@ -363,36 +378,62 @@ use atomr_accel_cuda_realtime::prelude::*; // particles, cloth, sparse
363378
```
364379

365380
If you're using an AI coding assistant (Claude Code, Cursor, etc.),
366-
[`ai-skills/`](ai-skills/) ships seven `SKILL.md` files your tool can
381+
[`ai-skills/`](ai-skills/) ships ten `SKILL.md` files your tool can
367382
pick up so the assistant gives you idiomatic atomr-accel guidance
368383
instead of guessing.
369384

370385
## Layout
371386

372387
```
373-
crates/ Rust workspace
374-
crates/atomr-accel/ Backend-agnostic core (umbrella)
375-
crates/atomr-accel-cuda/ NVIDIA CUDA implementation
376-
crates/atomr-accel-* Blueprints (patterns / train / agents / cuda-realtime)
377-
crates/atomr-accel-py/ PyO3 bridge (Python module: atomr_accel)
378-
ai-skills/ Vendor-neutral SKILL.md files for AI assistants
379-
docs/ Architecture, getting-started, concepts, features-matrix
380-
xtask/ Cargo xtask (bump, verify)
388+
crates/ Rust workspace
389+
crates/atomr-accel/ Backend-agnostic core (umbrella)
390+
crates/atomr-accel-cuda/ NVIDIA CUDA implementation
391+
crates/atomr-accel-patterns/ Universal blueprints (batching / cascade / scheduler / …)
392+
crates/atomr-accel-train/ Distributed-training blueprints
393+
crates/atomr-accel-agents/ LLM blueprints (RAG / DAG)
394+
crates/atomr-accel-cuda-realtime/ NVRTC-backed realtime sims
395+
crates/atomr-accel-cub/ CUB device-wide primitives (Phase 5)
396+
crates/atomr-accel-cutlass/ CUTLASS templates via NVRTC (Phase 6)
397+
crates/atomr-accel-flashattn/ FlashAttention v2 + v3 kernels (Phase 7)
398+
crates/atomr-accel-tensorrt/ TensorRT engine builder + runtime (Phase 8)
399+
crates/atomr-accel-telemetry/ NVTX / NVML / CUPTI observability (Phase 9)
400+
crates/atomr-accel-py/ PyO3 bridge (Python module: atomr_accel)
401+
ai-skills/ Vendor-neutral SKILL.md files for AI assistants
402+
docs/ Architecture, getting-started, concepts, features-matrix, gpu-testing
403+
xtask/ Cargo xtask (bump, verify, gpu-probe, gpu-test, gpu-bench)
381404
```
382405

383406
## Status
384407

385-
`F2 – F9 implemented + atomr adoption complete.` The full feature
386-
matrix builds clean; 60+ tests pass on a no-GPU CI; the GPU-runtime
387-
suite covers SGEMM, FFT, RNG, pinned memcpy, SpMV, tensor contraction,
388-
SVD, and the multi-actor end-to-end smoke.
408+
Phases 0 – 9 of the CUDA-coverage roadmap are merged. The workspace
409+
ships **twelve library crates** spanning the foundation actor surface
410+
(`atomr-accel`, `atomr-accel-cuda`), the blueprint sub-crates
411+
(`atomr-accel-patterns`, `atomr-accel-train`, `atomr-accel-agents`,
412+
`atomr-accel-cuda-realtime`, `atomr-accel-py`), Phase 1 – 4 library
413+
expansions (full cuBLAS / cuBLASLt / cuFFT / cuRAND / cuSOLVER dtype
414+
matrix, cuDNN frontend graph, NCCL collective set, cuTENSOR
415+
contraction + reduce + permute, cuSPARSE generic API + cuSPARSELt
416+
2:4), Phase 5 foundations (NVRTC v2 + Hopper/Blackwell +
417+
`atomr-accel-cub`), and Phase 6 – 9 sibling crates
418+
(`atomr-accel-cutlass`, `atomr-accel-flashattn`,
419+
`atomr-accel-tensorrt`, `atomr-accel-telemetry`).
420+
421+
The full feature matrix builds clean on a no-GPU host. ≈ 175 unit
422+
tests pass with the headline feature combo
423+
(`f16,cudnn,curand,cufft,nvrtc,cusolver,cusparse,cusparse-generic,cutensor,cublaslt,nccl,nvtx,cuda-ipc,cuda-managed,graphs-conditional`).
424+
The opt-in GPU integration suite — invoked via `cargo xtask gpu-test`
425+
— covers SGEMM, FFT, RNG, pinned memcpy, SpMV, tensor contraction,
426+
SVD, the dispatch tables for FlashAttention / CUTLASS / CUB, and
427+
real NVML probes against installed devices. See
428+
[`docs/gpu-testing.md`](docs/gpu-testing.md) for the suite catalog
429+
and the rationale for keeping it out of CI.
389430

390431
## Releasing
391432

392433
`v*.*.*` git tags trigger a single `release.yml` pipeline that runs
393434
the verify gate, builds Python wheels (manylinux x86_64, musllinux
394435
x86_64, macOS universal2, Windows x86_64) + an sdist, creates a
395-
GitHub Release, publishes the six Rust crates to crates.io in
436+
GitHub Release, publishes the workspace crates to crates.io in
396437
topological order, and uploads wheels + sdist to PyPI via trusted
397438
publishing. See [`RELEASING.md`](RELEASING.md) for the end-to-end
398439
flow.
@@ -412,9 +453,13 @@ flow.
412453
smallest dep footprint that fits your goal.
413454
- [`docs/python-bridge.md`](docs/python-bridge.md) — Python bindings
414455
surface and GIL strategy.
456+
- [`docs/gpu-testing.md`](docs/gpu-testing.md) — opt-in GPU
457+
integration suite, the three-layer gating model, and why the suite
458+
is intentionally not part of CI.
415459
- [`ai-skills/README.md`](ai-skills/README.md) — install the skill
416460
bundle into Claude Code, Cursor, Codex CLI, Gemini CLI, or any
417-
harness that reads `SKILL.md`.
461+
harness that reads `SKILL.md`. Covers the foundation actors plus
462+
per-crate skills for FlashAttention, CUTLASS, and TensorRT.
418463
- [`RELEASING.md`](RELEASING.md) — release pipeline, secrets,
419464
yanking, post-release verification.
420465

ai-skills/README.md

Lines changed: 6 additions & 0 deletions
@@ -21,6 +21,9 @@ internal release workflow.
2121
| `atomr-accel-python` | Using the Python bindings — `System`/`Device`/`GpuBuffer`, numpy float32 roundtrip, GIL release, mock-mode tests |
2222
| `atomr-accel-troubleshooting` | Diagnosing failures — feature-flag misses, `GpuRefStale`, mailbox stalls, OOM loops, no-GPU CI vs GPU-runtime gate |
2323
| `atomr-accel-backends` | Choosing between portable (`AccelBackend` trait) and vendor-specific (`atomr-accel-cuda`) APIs; future ROCm/Metal/oneAPI/Vulkan story |
24+
| `atomr-accel-flashattn` | Wiring or extending FlashAttention v2 / v3 — `FlashAttnActor`, the `(arch, dtype, head_dim, …)` dispatch table, paged KV cache, chunked prefill, varlen, fa2-vs-fa3 picking |
25+
| `atomr-accel-cutlass` | Wiring or extending CUTLASS templates — `CutlassActor`, `GemmRequest` / `GroupedGemmRequest` / `Conv*Request`, the EVT emitter, Strategy A (NVRTC) vs Strategy B (`cutlass-prebuilt`) |
26+
| `atomr-accel-tensorrt` | Wiring or extending TensorRT — `TrtActor` lifecycle (`Build` / `Deserialize` / `CreateContext` / `EnqueueOnStream` / `Refit`), ONNX import, INT8 / FP8 PTQ, IPluginV3, `DeviceActor` stream sharing |
2427

2528
Each `SKILL.md` is a thin router: it points at canonical docs in
2629
this repo (`docs/*.md`, `examples/*`) and at the relevant crate's
@@ -100,6 +103,9 @@ When working on atomr-accel, consult the matching skill in
100103
- Python bindings / numpy / GIL → atomr-accel-python
101104
- portable vs vendor-specific API choice → atomr-accel-backends
102105
- feature flags / OOM / CI vs GPU → atomr-accel-troubleshooting
106+
- FlashAttention v2 / v3 / paged KV → atomr-accel-flashattn
107+
- CUTLASS templates / EVT / arch matrix → atomr-accel-cutlass
108+
- TensorRT engines / ONNX / INT8 / FP8 → atomr-accel-tensorrt
103109
```
104110

105111
### Gemini CLI
Lines changed: 208 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,208 @@
1+
---
2+
name: atomr-accel-cutlass
3+
description: Use when wiring or extending CUTLASS kernel templates through `atomr-accel-cutlass` — the `CutlassActor`, `GemmRequest<T>` / `GroupedGemmRequest<T>` / `ConvFwdRequest<T>` / `Dgrad` / `Wgrad`, the EVT (epilogue visitor tree) emitter, the `(template, shape, dtype, arch)` plan cache, and the Strategy A (NVRTC at runtime) vs Strategy B (`cutlass-prebuilt`, nvcc at build time) compilation choice. Triggers on adding a CUTLASS template, picking arch×dtype, hitting a plan-cache miss, choosing fp8 vs fp4, or fitting an EVT chain.
4+
---
5+
6+
# CUTLASS templates
7+
8+
This skill covers the Phase 6 sibling crate. Enable the `cutlass`
9+
feature on `atomr-accel-cuda` and `CutlassActor` becomes available
10+
alongside the other kernel actors. For the per-library kernel
11+
actor pattern see [`atomr-accel-kernels`](../atomr-accel-kernels/SKILL.md);
12+
for portable trait surface considerations see
13+
[`atomr-accel-backends`](../atomr-accel-backends/SKILL.md).
14+
15+
## Compilation strategies
16+
17+
| Strategy | When | Trade-off |
18+
|---|---|---|
19+
| **A — NVRTC at runtime** (default) | First call to a new `(template, shape, dtype, arch)` triggers an NVRTC compile, then the cubin is cached on disk via the Phase 0.6 cache. Subsequent calls are warm. | First-call latency 30–60s per kernel; downstream builds run on no-GPU hosts. |
20+
| **B — nvcc at build time** (`cutlass-prebuilt` feature) | `build.rs` walks a generator and emits a static archive of pre-instantiated kernels for a fixed `(op × dtype × arch)` matrix. | Fast cold start, no NVRTC at runtime. Requires `nvcc` on the build host — CI on no-GPU runners breaks. |
21+
22+
Default to A. Switch to B for production deployments where every
23+
serving instance hits the same kernel matrix.
24+
25+
## Cargo features
26+
27+
Add to `atomr-accel-cuda` features:
28+
29+
```toml
# Alternatives: keep exactly one `features` line in your manifest.
features = ["cutlass", "f16"]                     # GEMM only
features = ["cutlass", "cutlass-grouped", "f16"]  # + grouped GEMM
features = ["cutlass", "cutlass-evt", "f16"]      # + EVT epilogues
features = ["cutlass", "cutlass-prebuilt", "f16"] # Strategy B
```
35+
36+
## arch × dtype support matrix
37+
38+
| dtype | sm_80 | sm_86 | sm_89 | sm_90a | sm_100 |
|---|:-:|:-:|:-:|:-:|:-:|
| f32, f64, f16, bf16 | ✓ | ✓ | ✓ | ✓ | ✓ |
| fp8 e4m3 / e5m2 | | | ✓ | ✓ | ✓ |
| fp4 e2m1 | | | | | ✓ |
| int8 → int32 | ✓ | ✓ | ✓ | ✓ | ✓ |
44+
45+
Use `is_supported_for(dtype, arch)` (or `is_fp8_supported` /
46+
`is_fp4_supported`) before constructing a request — building a
47+
`GemmRequest` in an unsupported cell still succeeds, but the
48+
NVRTC compile will reject the template instantiation.
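
As a rough illustration of the pre-flight check described above, the sketch below encodes the support rules stated in this skill (fp8 absent on sm_80 / sm_86, fp4 only on sm_100 among the listed architectures) in plain Rust. The names `Arch`, `Dtype`, `supported_for`, and `check_cell` are hypothetical stand-ins, not the crate's `SmArch` / `is_supported_for` API; the real signatures may differ.

```rust
/// Hypothetical stand-in for the architectures listed in the matrix.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Arch {
    Sm80,
    Sm86,
    Sm89,
    Sm90a,
    Sm100,
}

/// Hypothetical stand-in for the dtype rows of the matrix.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Dtype {
    F32,
    F64,
    F16,
    Bf16,
    Fp8E4m3,
    Fp8E5m2,
    Fp4E2m1,
    Int8,
}

/// fp8 is unavailable on sm_80 / sm_86 and available on the other listed archs.
pub fn fp8_supported(arch: Arch) -> bool {
    !matches!(arch, Arch::Sm80 | Arch::Sm86)
}

/// fp4 is Blackwell-only; of the archs modelled here that means sm_100.
pub fn fp4_supported(arch: Arch) -> bool {
    matches!(arch, Arch::Sm100)
}

/// Look up one cell of the arch x dtype matrix.
pub fn supported_for(dtype: Dtype, arch: Arch) -> bool {
    match dtype {
        Dtype::Fp8E4m3 | Dtype::Fp8E5m2 => fp8_supported(arch),
        Dtype::Fp4E2m1 => fp4_supported(arch),
        // f32, f64, f16, bf16 and int8 are supported on every listed arch.
        _ => true,
    }
}

/// Reject an unsupported cell up front instead of waiting for the compile to fail.
pub fn check_cell(dtype: Dtype, arch: Arch) -> Result<(), String> {
    if supported_for(dtype, arch) {
        Ok(())
    } else {
        Err(format!("{dtype:?} is not supported on {arch:?}"))
    }
}
```

Calling a check like this before building the request turns a late template-instantiation failure into an immediate, readable error.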
49+
50+
## Request types
51+
52+
Every request is generic over `T: GemmSupported` (currently `f32`,
53+
`f64`, `f16`, `bf16`, plus the fp8 / fp4 markers under the matching
54+
feature) and produces a `PlanKey` for the plan cache.
55+
56+
| Module | Request | Dispatch trait | Gate |
|---|---|---|---|
| `gemm` | `GemmRequest<T>` | `CutlassGemmDispatch` | always-on |
| `grouped_gemm` | `GroupedGemmRequest<T>` | `CutlassGroupedGemmDispatch` | `grouped` |
| `conv` | `ConvFwdRequest<T>` / `ConvDgradRequest<T>` / `ConvWgradRequest<T>` | `CutlassConvDispatch` | always-on |
| `evt` | `EpilogueVisitorTree`, `EvtBuilder`, `EpilogueOp` | n/a (composes onto `GemmRequest`) | `evt` |
62+
63+
## A simple GEMM
64+
65+
```rust
use atomr_accel_cutlass::{
    CutlassMsg, GemmEpilogue, GemmLayout, GemmRequest, GemmShape, SmArch,
};
use half::f16;

// Describe a 4096 x 4096 x 4096 f16 GEMM for Hopper with a fused ReLU epilogue.
let req = GemmRequest::<f16> {
    arch: SmArch::Sm90a,
    shape: GemmShape::new(4096, 4096, 4096),
    layout_a: GemmLayout::RowMajor,
    layout_b: GemmLayout::ColMajor,
    layout_c: GemmLayout::RowMajor,
    epilogue: GemmEpilogue::LinearReLU { alpha: 1.0, beta: 0.0 },
    /* a/b/c GpuRefs, reply channel … */
};

// Requests are boxed before being sent to the actor.
cutlass.tell(CutlassMsg::Gemm(Box::new(req)));
```
83+
84+
## EVT — fused epilogue chains
85+
86+
`cutlass-evt` unlocks the epilogue visitor tree emitter — the way
87+
to chain post-GEMM ops (bias-add, activation, dropout, scale,
88+
quantize, reduce) into a single launch. Build with `EvtBuilder`:
89+
90+
```rust
#[cfg(feature = "cutlass-evt")]
use atomr_accel_cutlass::{EpilogueOp, EpilogueVisitorTree, EvtBuilder};

// `?` assumes the enclosing function returns a compatible `Result`.
let tree: EpilogueVisitorTree = EvtBuilder::new()
    .scale(1.0 / 8.0)
    .add_bias(/* bias GpuRef */)
    .activation(EpilogueOp::Gelu)
    .quantize_to_fp8()
    .build()?;

let req = GemmRequest { epilogue: tree.into_epilogue(), /* … */ };
```
103+
104+
Each EVT chain produces a unique `PlanKey` — the cache discriminates
105+
GEMM-with-EVT-A from GEMM-with-EVT-B without collision.
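
One way to picture why two EVT chains never collide is to treat the epilogue chain as part of a hashable key. The sketch below is a standalone model under that assumption; `Op`, `Key`, `key_with`, and `distinct_plans` are hypothetical names and say nothing about how the crate's `PlanKey` is actually laid out.

```rust
use std::collections::HashSet;

/// Hypothetical epilogue step. The scale factor is stored as its f32 bit
/// pattern because `f32` implements neither `Eq` nor `Hash`.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum Op {
    Scale(u32),
    AddBias,
    Gelu,
    Relu,
    QuantizeFp8,
}

/// Hypothetical key with the same components the skill lists for the plan cache.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Key {
    pub template_id: &'static str,
    pub shape: (u32, u32, u32),
    pub dtype: &'static str,
    pub arch: &'static str,
    pub layout: &'static str,
    pub epilogue: Vec<Op>,
}

/// Convert a scale factor into a hashable epilogue step.
pub fn scale(factor: f32) -> Op {
    Op::Scale(factor.to_bits())
}

/// Build a key for a fixed GEMM, varying only the epilogue chain.
pub fn key_with(epilogue: Vec<Op>) -> Key {
    Key {
        template_id: "gemm",
        shape: (4096, 4096, 4096),
        dtype: "f16",
        arch: "sm_90a",
        layout: "row-col-row",
        epilogue,
    }
}

/// Count how many distinct plans a list of keys would occupy in a cache.
pub fn distinct_plans(keys: &[Key]) -> usize {
    keys.iter().collect::<HashSet<_>>().len()
}
```

Under this model, two requests that differ only in their final activation hash to different entries, while repeating the same chain reuses the existing one.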
106+
107+
## The plan cache
108+
109+
`PlanCache` (LRU, capacity set at `CutlassActor` construction)
110+
stores rendered `.cu` source + lowered kernel name keyed by
111+
`(template_id, shape, dtype, arch, layout, epilogue)`. The cache
112+
saves the per-call NVRTC compile — under Strategy A a warm cache
113+
hit is microseconds, a miss is tens of seconds.
114+
115+
```rust
let props = atomr_accel_cutlass::props(/* plan_cache_capacity */ 256);
let cutlass: ActorRef<CutlassMsg> = system.actor_of(props, "cutlass");
```
119+
120+
The cache is **per-actor**, not global. If you spawn multiple
121+
`CutlassActor`s for parallelism, each gets its own cache. The
122+
underlying NVRTC disk cache is shared (Phase 0.6), so the second
123+
actor's first call reads from disk — fast, but not as fast as an
124+
in-process LRU hit.
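
To make the LRU and per-actor behaviour concrete, here is a minimal standalone sketch of a bounded least-recently-used map. It is an assumption-laden model, not the crate's `PlanCache`: the `Lru` type and its methods are hypothetical, and recency tracking is a simple O(n) scan to keep the example short.

```rust
use std::collections::{HashMap, VecDeque};
use std::hash::Hash;

/// Hypothetical bounded cache: evicts the least recently used entry when full.
pub struct Lru<K, V> {
    capacity: usize,
    map: HashMap<K, V>,
    order: VecDeque<K>, // front = least recently used, back = most recent
}

impl<K: Clone + Eq + Hash, V> Lru<K, V> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Move `key` to the most-recently-used position.
    fn touch(&mut self, key: &K) {
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            self.order.remove(pos);
        }
        self.order.push_back(key.clone());
    }

    /// A hit refreshes recency; a miss leaves the cache untouched.
    pub fn get(&mut self, key: &K) -> Option<&V> {
        if self.map.contains_key(key) {
            self.touch(key);
        }
        self.map.get(key)
    }

    /// Insert or overwrite, evicting the oldest entry if a new key overflows.
    pub fn insert(&mut self, key: K, value: V) {
        if !self.map.contains_key(&key) && self.map.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        self.touch(&key);
        self.map.insert(key, value);
    }

    pub fn len(&self) -> usize {
        self.map.len()
    }
}
```

Two `Lru` values share nothing, which mirrors the per-actor point: a plan that is warm in one actor's cache is still an in-process miss in another's, even when a shared disk cache makes that miss cheap.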
125+
126+
## Refitting weights without recompile
127+
128+
```rust
use atomr_accel_cutlass::{CutlassMsg, RefitMsg};

cutlass.tell(CutlassMsg::Refit {
    msg: RefitMsg {
        plan_key: cached_key, // from a previous Gemm dispatch
        weights: new_bytes,   // host-side; the actor stages them
    },
    reply: Box::new(|res| { /* … */ }),
});
```
139+
140+
Refit is for already-compiled plans. The plan key carries the
141+
template + shape + dtype + arch fingerprint; new weight bytes are
142+
copied into the kernel's bound workspace. No NVRTC pass.
143+
144+
## Wiring into `ContextActor`
145+
146+
```rust
let cutlass = system.actor_of(atomr_accel_cutlass::props(64), "cutlass");
context.tell(ContextMsg::RegisterExtra {
    name: "cutlass",
    actor: cutlass.clone().into_dyn(),
});
```
153+
154+
`KernelChildren::register_extra` exists exactly for siblings like
155+
this — the cutlass actor lives next to `BlasActor` / `CudnnActor`
156+
and dies with them when the context rebuilds.
157+
158+
## Mock vs real
159+
160+
`CutlassInner::compile_sink` is `Option<...>` so the actor records
161+
rendered `.cu` source + lowered kernel name into the plan cache
162+
even without an NVRTC actor wired in. This is the host-only test
163+
path — the smoke test exercises plan-cache discrimination without a
164+
GPU. In production set `compile_sink` to a closure that forwards
165+
to `atomr_accel_cuda::kernel::NvrtcActor`.
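
The optional-sink idea can be sketched without any GPU types. In the model below, `Inner`, `CompileSink`, `dispatch`, and `counting_sink` are hypothetical names chosen for illustration; the actual type behind `CutlassInner::compile_sink` is elided in this skill and may look different.

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical sink signature: receives rendered source and kernel name.
pub type CompileSink = Box<dyn Fn(&str, &str) -> Result<(), String>>;

/// Hypothetical actor state with an optional compile sink.
pub struct Inner {
    pub compile_sink: Option<CompileSink>,
    /// plan key -> (rendered source, lowered kernel name)
    pub plans: HashMap<String, (String, String)>,
}

impl Inner {
    pub fn new(compile_sink: Option<CompileSink>) -> Self {
        Self { compile_sink, plans: HashMap::new() }
    }

    /// Always record the plan; forward to the sink only when one is wired in.
    /// Returns `Ok(true)` if a compile was requested, `Ok(false)` in mock mode.
    pub fn dispatch(&mut self, key: &str, source: &str, kernel: &str) -> Result<bool, String> {
        self.plans
            .insert(key.to_owned(), (source.to_owned(), kernel.to_owned()));
        match &self.compile_sink {
            Some(sink) => sink(source, kernel).map(|()| true),
            None => Ok(false),
        }
    }
}

/// Test helper: a sink that only counts how often it was invoked.
pub fn counting_sink(counter: Rc<Cell<usize>>) -> CompileSink {
    Box::new(move |_source: &str, _kernel: &str| -> Result<(), String> {
        counter.set(counter.get() + 1);
        Ok(())
    })
}
```

With `None`, host-only tests can still assert on what was recorded; with a sink, the same dispatch path also triggers compilation.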
166+
167+
## Canonical references
168+
169+
- `crates/atomr-accel-cutlass/src/lib.rs` — public surface,
170+
Strategy A/B explainer, arch×dtype matrix.
171+
- `crates/atomr-accel-cutlass/src/{gemm,grouped_gemm,conv,evt}.rs`
172+
— one request type per file.
173+
- `crates/atomr-accel-cutlass/src/plan_cache.rs`: `PlanCache`
174+
+ `PlanKey` (`(template_id, shape, dtype, arch, layout,
175+
epilogue)`).
176+
- `crates/atomr-accel-cutlass/src/dtype.rs`: `CutlassDtype`,
177+
`is_supported_for`, `GemmSupported`, `SmArch`.
178+
- `crates/atomr-accel-cutlass/cutlass/include/` — vendored CUTLASS
179+
headers (BSD-3-Clause).
180+
- `crates/atomr-accel-cutlass/tests/cutlass_smoke.rs` — arch×dtype
181+
smoke test (host-only).
182+
- [`docs/features-matrix.md`](../../../docs/features-matrix.md) §
183+
`atomr-accel-cutlass` — feature flags + transitive deps.
184+
185+
## Common pitfalls
186+
187+
- **Cold-start latency under Strategy A.** The first call to a new
188+
shape kicks off a 30–60s NVRTC compile. Pre-warm at startup by
189+
issuing a no-op `GemmRequest` for each canonical shape, or
190+
switch to Strategy B if your shape catalogue is fixed.
191+
- **Forgetting `cutlass-prebuilt` requires nvcc.** CI fails on
192+
no-GPU runners. Either keep Strategy A in CI and B in production,
193+
or self-host a CUDA-equipped builder.
194+
- **Mixing fp8 with sm_80 / sm_86.** `is_fp8_supported(arch)` is
195+
false there. The smoke test enforces this; production code
196+
should call `is_supported_for` before submitting.
197+
- **fp4 outside Blackwell.** Only sm_100 / sm_120 support
198+
`F4E2m1`. `is_fp4_supported(arch)` returns false elsewhere.
199+
- **EVT without the feature.** Building an `EvtBuilder` chain
200+
errors at compile time when `cutlass-evt` is off — it's not
201+
plumbed through plain `GemmEpilogue`. Add the feature explicitly.
202+
- **Plan-cache reuse across GPUs of different arch.** `PlanKey`
203+
includes `arch`, so swapping a sm_80 cubin into a sm_90a context
204+
is a cache miss (correctly). Don't try to lift a cached plan to
205+
a different arch by editing the key.
206+
- **Holding a `PlanKey` past a context rebuild.** Same `KernelHandle`
207+
story as NVRTC actor — re-resolve through the actor after
208+
`ContextReady` cycles.
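
The pre-warm advice in the first pitfall can be sketched as a small startup loop. This is an illustrative model only: `prewarm` and `Shape` are hypothetical, and the `submit` closure stands in for whatever code actually sends a warm-up request to the actor and waits for its reply.

```rust
use std::collections::HashSet;

/// Hypothetical (m, n, k) shape tuple.
pub type Shape = (u32, u32, u32);

/// Submit one warm-up request per distinct canonical shape, in order.
/// Returns the shapes that failed together with their error messages,
/// so startup code can decide whether to abort or continue cold.
pub fn prewarm<F>(shapes: &[Shape], mut submit: F) -> Vec<(Shape, String)>
where
    F: FnMut(Shape) -> Result<(), String>,
{
    let mut seen = HashSet::new();
    let mut failures = Vec::new();
    for &shape in shapes {
        // Skip duplicates: a second request for the same shape is already warm.
        if !seen.insert(shape) {
            continue;
        }
        if let Err(err) = submit(shape) {
            failures.push((shape, err));
        }
    }
    failures
}
```

Collecting failures instead of returning on the first error lets one unsupported shape be reported without leaving the remaining shapes cold.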
