rustakka
diff --git a/README.md b/README.md
Lines changed: 63 additions & 18 deletions
diff --git a/ai-skills/README.md b/ai-skills/README.md
Lines changed: 6 additions & 0 deletions
diff --git a/ai-skills/skills/atomr-accel-cutlass/SKILL.md b/ai-skills/skills/atomr-accel-cutlass/SKILL.md
Lines changed: 208 additions & 0 deletions
@@ -78,6 +78,11 @@ supervision, typed messages, async/await throughout.
 | `atomr-accel-train`         | Distributed-training blueprints — `DataParallelTrainer`, `PipelineParallelTrainer`, `TensorParallelTrainer`, `AsyncParameterServer`, optimizer + loss enums |
 | `atomr-accel-agents`        | LLM blueprints — `RagPipeline` (with `EmbeddingCache` LRU + `CpuVectorIndex`), `SharedGpuStateCoordinator`, `LangGraphGpuActor` (DAG executor with cycle detection) |
 | `atomr-accel-cuda-realtime` | NVRTC-backed realtime sims — `ImageFilterPipeline`, `ParticleSystemActor`, `ClothSimulationActor`, `FluidSimulationActor`, `SpatialIndexActor`, `GpuHashMapActor`, `GpuSparseStructureActor`, `MultiPassAnalysisActor`, `VideoEffectsGraph` |
+| `atomr-accel-cub`           | CUB device-wide primitives — `CubActor` with reduce / scan / sort / histogram / select / partition / segmented-reduce dispatchers, NVRTC-templated per `(op, dtype, length-class)` |
+| `atomr-accel-cutlass`       | CUTLASS kernel-template instantiation — `CutlassActor` for GEMM, grouped-GEMM, implicit-GEMM convolution, EVT (epilogue visitor tree), via NVRTC against vendored headers |
+| `atomr-accel-flashattn`     | FlashAttention v2 + v3 kernels — `FlashAttnActor` with forward/backward, paged KV-cache, chunked prefill, varlen, ALiBi, sliding window, sink tokens, MQA/GQA, fp8 (fa3 only) |
+| `atomr-accel-tensorrt`      | TensorRT engine builder + runtime — `TrtActor`, `IBuilderConfig` (fp32/fp16/bf16/int8/fp8/best), ONNX import, INT8 calibration, FP8 PTQ, `IPluginV3` Rust trampolines |
+| `atomr-accel-telemetry`     | Observability backends — `NvtxKernelTrace` for kernel-range markers, `NvmlActor` for power/temp/ECC/clocks, `CuptiSession` for activity tracing |
 | `atomr-accel-py`            | Python bindings via PyO3 — `atomr_accel.{System, Device, GpuBuffer}`, typed exceptions, GIL-released kernel paths    |
 
 Plus a Python facade — `pip install atomr-accel` — that exposes the
@@ -207,9 +212,19 @@ the GIL-release contract, and mock-mode tests.
 | [CUDA Graphs][cuda-graph]          | `GraphActor`       | [`cuGraphInstantiate` / `cuGraphLaunch`][cuda-graph-api] | always-on |
 | [Peer-to-peer][cuda-p2p]           | `P2pTopology`      | [`cuMemcpyPeerAsync`][cuda-memcpy-peer]           | always-on    |
 
-Aggregate features: `core-libs` = `cudnn` + `cufft` + `curand` +
-`cusparse`. `training-libs` = `core-libs` + `cusolver` + `cublaslt` +
-`nvrtc` + `cutensor`. `full-cuda` = `training-libs` + `nccl`.
+Aggregate features:
+- `core-libs` = `cudnn` + `cufft` + `curand` + `cusparse` + `cutensor` + `cuda-managed`.
+- `training-libs` = `core-libs` + `cusolver` + `cublaslt` + `nvrtc`.
+- `full-cuda` = `training-libs` + `nccl` + `cuda-ipc` + `graphs-conditional`.
+- `observability-full` = `telemetry` + `nvtx-trace` + `nvml` + `cupti`.
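
To see what each aggregate ends up enabling, the nesting above can be flattened mechanically. The sketch below is a standalone illustration, not part of the workspace: `members` and `expand` are hypothetical helpers, and any name that is not one of the four aggregates is assumed to be a leaf feature.

```rust
use std::collections::BTreeSet;

/// Direct members of each aggregate, transcribed from the list above.
/// Assumption: any other name is treated as a leaf feature.
fn members(feature: &str) -> &'static [&'static str] {
    match feature {
        "core-libs" => &["cudnn", "cufft", "curand", "cusparse", "cutensor", "cuda-managed"],
        "training-libs" => &["core-libs", "cusolver", "cublaslt", "nvrtc"],
        "full-cuda" => &["training-libs", "nccl", "cuda-ipc", "graphs-conditional"],
        "observability-full" => &["telemetry", "nvtx-trace", "nvml", "cupti"],
        _ => &[],
    }
}

/// Recursively expand a feature into the set of leaf features it enables.
fn expand(feature: &str) -> BTreeSet<String> {
    let direct = members(feature);
    if direct.is_empty() {
        // A leaf expands to itself.
        return BTreeSet::from([feature.to_string()]);
    }
    direct.iter().flat_map(|f| expand(f)).collect()
}
```

Under these assumptions `expand("full-cuda")` yields twelve leaves: the six from `core-libs`, the three added by `training-libs`, and the three added by `full-cuda` itself.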
+
+Sibling-crate gates (off by default; pull each in by enabling the
+matching feature on `atomr-accel-cuda`):
+
+- `cutlass` (+ `cutlass-evt`, `cutlass-grouped`, `cutlass-prebuilt`).
+- `flashattn` (+ `flashattn-fp8`, `flashattn-paged`).
+- `tensorrt` (+ `tensorrt-onnx`, `tensorrt-plugin`, `tensorrt-int8`, `tensorrt-fp8`).
+- `nvtx-trace`, `nvml`, `cupti` — Phase 9 telemetry backends, layered on `telemetry`.
 
 ## atomr integrations
 
@@ -363,36 +378,62 @@ use atomr_accel_cuda_realtime::prelude::*;   // particles, cloth, sparse
 ```
 
 If you're using an AI coding assistant (Claude Code, Cursor, etc.),
-[`ai-skills/`](ai-skills/) ships seven `SKILL.md` files your tool can
+[`ai-skills/`](ai-skills/) ships ten `SKILL.md` files your tool can
 pick up so the assistant gives you idiomatic atomr-accel guidance
 instead of guessing.
 
 ## Layout
 
 ```
-crates/                     Rust workspace
-crates/atomr-accel/         Backend-agnostic core (umbrella)
-crates/atomr-accel-cuda/    NVIDIA CUDA implementation
-crates/atomr-accel-*        Blueprints (patterns / train / agents / cuda-realtime)
-crates/atomr-accel-py/      PyO3 bridge (Python module: atomr_accel)
-ai-skills/                  Vendor-neutral SKILL.md files for AI assistants
-docs/                       Architecture, getting-started, concepts, features-matrix
-xtask/                      Cargo xtask (bump, verify)
+crates/                       Rust workspace
+crates/atomr-accel/           Backend-agnostic core (umbrella)
+crates/atomr-accel-cuda/      NVIDIA CUDA implementation
+crates/atomr-accel-patterns/  Universal blueprints (batching / cascade / scheduler / …)
+crates/atomr-accel-train/     Distributed-training blueprints
+crates/atomr-accel-agents/    LLM blueprints (RAG / DAG)
+crates/atomr-accel-cuda-realtime/  NVRTC-backed realtime sims
+crates/atomr-accel-cub/       CUB device-wide primitives (Phase 5)
+crates/atomr-accel-cutlass/   CUTLASS templates via NVRTC (Phase 6)
+crates/atomr-accel-flashattn/ FlashAttention v2 + v3 kernels (Phase 7)
+crates/atomr-accel-tensorrt/  TensorRT engine builder + runtime (Phase 8)
+crates/atomr-accel-telemetry/ NVTX / NVML / CUPTI observability (Phase 9)
+crates/atomr-accel-py/        PyO3 bridge (Python module: atomr_accel)
+ai-skills/                    Vendor-neutral SKILL.md files for AI assistants
+docs/                         Architecture, getting-started, concepts, features-matrix, gpu-testing
+xtask/                        Cargo xtask (bump, verify, gpu-probe, gpu-test, gpu-bench)
 ```
 
 ## Status
 
-`F2 – F9 implemented + atomr adoption complete.` The full feature
-matrix builds clean; 60+ tests pass on a no-GPU CI; the GPU-runtime
-suite covers SGEMM, FFT, RNG, pinned memcpy, SpMV, tensor contraction,
-SVD, and the multi-actor end-to-end smoke.
+Phases 0 – 9 of the CUDA-coverage roadmap are merged. The workspace
+ships **twelve library crates** spanning the foundation actor surface
+(`atomr-accel`, `atomr-accel-cuda`), the blueprint sub-crates
+(`atomr-accel-patterns`, `atomr-accel-train`, `atomr-accel-agents`,
+`atomr-accel-cuda-realtime`, `atomr-accel-py`), Phase 1 – 4 library
+expansions (full cuBLAS / cuBLASLt / cuFFT / cuRAND / cuSOLVER dtype
+matrix, cuDNN frontend graph, NCCL collective set, cuTENSOR
+contraction + reduce + permute, cuSPARSE generic API + cuSPARSELt
+2:4), Phase 5 foundations (NVRTC v2 + Hopper/Blackwell +
+`atomr-accel-cub`), and Phase 6 – 9 sibling crates
+(`atomr-accel-cutlass`, `atomr-accel-flashattn`,
+`atomr-accel-tensorrt`, `atomr-accel-telemetry`).
+
+The full feature matrix builds clean on a no-GPU host. ≈ 175 unit
+tests pass with the headline feature combo
+(`f16,cudnn,curand,cufft,nvrtc,cusolver,cusparse,cusparse-generic,cutensor,cublaslt,nccl,nvtx,cuda-ipc,cuda-managed,graphs-conditional`).
+The opt-in GPU integration suite — invoked via `cargo xtask gpu-test`
+— covers SGEMM, FFT, RNG, pinned memcpy, SpMV, tensor contraction,
+SVD, the dispatch tables for FlashAttention / CUTLASS / CUB, and
+real NVML probes against installed devices. See
+[`docs/gpu-testing.md`](docs/gpu-testing.md) for the suite catalog
+and the rationale for keeping it out of CI.
 
 ## Releasing
 
 `v*.*.*` git tags trigger a single `release.yml` pipeline that runs
 the verify gate, builds Python wheels (manylinux x86_64, musllinux
 x86_64, macOS universal2, Windows x86_64) + an sdist, creates a
-GitHub Release, publishes the six Rust crates to crates.io in
+GitHub Release, publishes the workspace crates to crates.io in
 topological order, and uploads wheels + sdist to PyPI via trusted
 publishing. See [`RELEASING.md`](RELEASING.md) for the end-to-end
 flow.
@@ -412,9 +453,13 @@ flow.
   smallest dep footprint that fits your goal.
 - [`docs/python-bridge.md`](docs/python-bridge.md) — Python bindings
   surface and GIL strategy.
+- [`docs/gpu-testing.md`](docs/gpu-testing.md) — opt-in GPU
+  integration suite, the three-layer gating model, and why the suite
+  is intentionally not part of CI.
 - [`ai-skills/README.md`](ai-skills/README.md) — install the skill
   bundle into Claude Code, Cursor, Codex CLI, Gemini CLI, or any
-  harness that reads `SKILL.md`.
+  harness that reads `SKILL.md`. Covers the foundation actors plus
+  per-crate skills for FlashAttention, CUTLASS, and TensorRT.
 - [`RELEASING.md`](RELEASING.md) — release pipeline, secrets,
   yanking, post-release verification.
 
 
@@ -21,6 +21,9 @@ internal release workflow.
 | `atomr-accel-python` | Using the Python bindings — `System`/`Device`/`GpuBuffer`, numpy float32 roundtrip, GIL release, mock-mode tests |
 | `atomr-accel-troubleshooting` | Diagnosing failures — feature-flag misses, `GpuRefStale`, mailbox stalls, OOM loops, no-GPU CI vs GPU-runtime gate |
 | `atomr-accel-backends` | Choosing between portable (`AccelBackend` trait) and vendor-specific (`atomr-accel-cuda`) APIs; future ROCm/Metal/oneAPI/Vulkan story |
+| `atomr-accel-flashattn` | Wiring or extending FlashAttention v2 / v3 — `FlashAttnActor`, the `(arch, dtype, head_dim, …)` dispatch table, paged KV cache, chunked prefill, varlen, fa2-vs-fa3 picking |
+| `atomr-accel-cutlass` | Wiring or extending CUTLASS templates — `CutlassActor`, `GemmRequest` / `GroupedGemmRequest` / `Conv*Request`, the EVT emitter, Strategy A (NVRTC) vs Strategy B (`cutlass-prebuilt`) |
+| `atomr-accel-tensorrt` | Wiring or extending TensorRT — `TrtActor` lifecycle (`Build` / `Deserialize` / `CreateContext` / `EnqueueOnStream` / `Refit`), ONNX import, INT8 / FP8 PTQ, IPluginV3, `DeviceActor` stream sharing |
 
 Each `SKILL.md` is a thin router: it points at canonical docs in
 this repo (`docs/*.md`, `examples/*`) and at the relevant crate's
@@ -100,6 +103,9 @@ When working on atomr-accel, consult the matching skill in
 - Python bindings / numpy / GIL           → atomr-accel-python
 - portable vs vendor-specific API choice  → atomr-accel-backends
 - feature flags / OOM / CI vs GPU         → atomr-accel-troubleshooting
+- FlashAttention v2 / v3 / paged KV       → atomr-accel-flashattn
+- CUTLASS templates / EVT / arch matrix   → atomr-accel-cutlass
+- TensorRT engines / ONNX / INT8 / FP8    → atomr-accel-tensorrt
 ```
 
 ### Gemini CLI
 
@@ -0,0 +1,208 @@
+---
+name: atomr-accel-cutlass
+description: Use when wiring or extending CUTLASS kernel templates through `atomr-accel-cutlass` — the `CutlassActor`, `GemmRequest<T>` / `GroupedGemmRequest<T>` / `ConvFwdRequest<T>` / `Dgrad` / `Wgrad`, the EVT (epilogue visitor tree) emitter, the `(template, shape, dtype, arch)` plan cache, and the Strategy A (NVRTC at runtime) vs Strategy B (`cutlass-prebuilt`, nvcc at build time) compilation choice. Triggers on adding a CUTLASS template, picking arch×dtype, hitting a plan-cache miss, choosing fp8 vs fp4, or fitting an EVT chain.
+---
+
+# CUTLASS templates
+
+This skill covers the Phase 6 sibling crate. Enable the `cutlass`
+feature on `atomr-accel-cuda` and `CutlassActor` becomes available
+alongside the other kernel actors. For the per-library kernel
+actor pattern see [`atomr-accel-kernels`](../atomr-accel-kernels/SKILL.md);
+for portable trait surface considerations see
+[`atomr-accel-backends`](../atomr-accel-backends/SKILL.md).
+
+## Compilation strategies
+
+| Strategy | When | Trade-off |
+|---|---|---|
+| **A — NVRTC at runtime** (default) | First call to a new `(template, shape, dtype, arch)` triggers an NVRTC compile, then the cubin is cached on disk via the Phase 0.6 cache. Subsequent calls are warm. | First-call latency 30–60s per kernel; downstream builds run on no-GPU hosts. |
+| **B — nvcc at build time** (`cutlass-prebuilt` feature) | `build.rs` walks a generator and emits a static archive of pre-instantiated kernels for a fixed `(op × dtype × arch)` matrix. | Fast cold start, no NVRTC at runtime. Requires `nvcc` on the build host — CI on no-GPU runners breaks. |
+
+Default to A. Switch to B for production deployments where every
+serving instance hits the same kernel matrix.
+
+## Cargo features
+
+Pick one of the following feature sets for the `atomr-accel-cuda`
+dependency (the lines are alternatives, not a single manifest):
+
+```toml
+features = ["cutlass", "f16"]                                  # GEMM only
+features = ["cutlass", "cutlass-grouped", "f16"]               # + grouped GEMM
+features = ["cutlass", "cutlass-evt", "f16"]                   # + EVT epilogues
+features = ["cutlass", "cutlass-prebuilt", "f16"]              # Strategy B
+```
+
+## arch × dtype support matrix
+
+| dtype | sm_80 | sm_86 | sm_89 | sm_90a | sm_100 |
+|---|:-:|:-:|:-:|:-:|:-:|
+| f32, f64, f16, bf16 | ✔ | ✔ | ✔ | ✔ | ✔ |
+| fp8 e4m3 / e5m2 | | | ✔ | ✔ | ✔ |
+| fp4 e2m1 | | | | | ✔ |
+| int8 → int32 | ✔ | ✔ | ✔ | ✔ | ✔ |
+
+Use `is_supported_for(dtype, arch)` (or `is_fp8_supported` /
+`is_fp4_supported`) before constructing a request — building a
+`GemmRequest` in an unsupported cell still succeeds, but the
+NVRTC compile will reject the template instantiation.
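
The guard described above can be pictured as a plain lookup over the table. The sketch below is a standalone mirror of the matrix, not the crate's implementation: `Arch`, `Dtype`, `table_supports`, and `check_before_submit` are hypothetical names, and only the five architecture columns shown in the table are modelled.

```rust
/// Hypothetical stand-in for the architecture columns of the table.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Arch { Sm80, Sm86, Sm89, Sm90a, Sm100 }

/// Hypothetical stand-in for the dtype rows of the table.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Dtype { F32, F64, F16, Bf16, Fp8E4m3, Fp8E5m2, Fp4E2m1, Int8 }

/// Returns `true` when the (dtype, arch) cell is ticked in the table.
fn table_supports(dtype: Dtype, arch: Arch) -> bool {
    match dtype {
        Dtype::F32 | Dtype::F64 | Dtype::F16 | Dtype::Bf16 | Dtype::Int8 => true,
        Dtype::Fp8E4m3 | Dtype::Fp8E5m2 => {
            matches!(arch, Arch::Sm89 | Arch::Sm90a | Arch::Sm100)
        }
        Dtype::Fp4E2m1 => arch == Arch::Sm100,
    }
}

/// Fail early on the host instead of waiting for the compile to reject the template.
fn check_before_submit(dtype: Dtype, arch: Arch) -> Result<(), String> {
    if table_supports(dtype, arch) {
        Ok(())
    } else {
        Err(format!("{dtype:?} is not supported on {arch:?}"))
    }
}
```

Checking on the host turns a compile-time rejection that may arrive after a long wait into an immediate error at the call site.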
+
+## Request types
+
+Every request is generic over `T: GemmSupported` (currently `f32`,
+`f64`, `f16`, `bf16`, plus the fp8 / fp4 markers under the matching
+feature) and produces a `PlanKey` for the plan cache.
+
+| Module | Request | Dispatch trait | Gate |
+|---|---|---|---|
+| `gemm` | `GemmRequest<T>` | `CutlassGemmDispatch` | always-on |
+| `grouped_gemm` | `GroupedGemmRequest<T>` | `CutlassGroupedGemmDispatch` | `grouped` |
+| `conv` | `ConvFwdRequest<T>` / `ConvDgradRequest<T>` / `ConvWgradRequest<T>` | `CutlassConvDispatch` | always-on |
+| `evt` | `EpilogueVisitorTree`, `EvtBuilder`, `EpilogueOp` | n/a (composes onto `GemmRequest`) | `evt` |
+
+## A simple GEMM
+
+```rust
+use atomr_accel_cutlass::{
+    CutlassMsg, GemmEpilogue, GemmLayout, GemmRequest, GemmShape, SmArch,
+};
+use half::f16;
+
+let req = GemmRequest::<f16> {
+    arch: SmArch::Sm90a,
+    shape: GemmShape::new(4096, 4096, 4096),
+    layout_a: GemmLayout::RowMajor,
+    layout_b: GemmLayout::ColMajor,
+    layout_c: GemmLayout::RowMajor,
+    epilogue: GemmEpilogue::LinearReLU { alpha: 1.0, beta: 0.0 },
+    /* a/b/c GpuRefs, reply channel … */
+};
+
+cutlass.tell(CutlassMsg::Gemm(Box::new(req)));
+```
+
+## EVT — fused epilogue chains
+
+`cutlass-evt` unlocks the epilogue visitor tree emitter — the way
+to chain post-GEMM ops (bias-add, activation, dropout, scale,
+quantize, reduce) into a single launch. Build with `EvtBuilder`:
+
+```rust
+#[cfg(feature = "cutlass-evt")]
+use atomr_accel_cutlass::{EpilogueOp, EpilogueVisitorTree, EvtBuilder};
+
+let tree: EpilogueVisitorTree = EvtBuilder::new()
+    .scale(1.0 / 8.0)
+    .add_bias(/* bias GpuRef */)
+    .activation(EpilogueOp::Gelu)
+    .quantize_to_fp8()
+    .build()?;
+
+let req = GemmRequest { epilogue: tree.into_epilogue(), /* … remaining fields */ };
+```
+
+Each EVT chain produces a unique `PlanKey` — the cache discriminates
+GEMM-with-EVT-A from GEMM-with-EVT-B without collision.
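
One way a key can discriminate between epilogue chains is to make the chain part of a derived `Hash + Eq` key. The sketch below is a standalone illustration under that assumption: `Op`, `Key`, `scale`, and `register` are hypothetical and say nothing about how `PlanKey` is actually laid out.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one node of an epilogue chain.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Op {
    /// Scale factor stored as raw bits, because `f32` is neither `Eq` nor `Hash`.
    Scale(u32),
    AddBias,
    Gelu,
    QuantizeFp8,
}

/// Hypothetical stand-in for a plan key: GEMM shape plus the ordered epilogue chain.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Key {
    m: u32,
    n: u32,
    k: u32,
    epilogue: Vec<Op>,
}

/// Build a scale node from a float by keeping its bit pattern.
fn scale(factor: f32) -> Op {
    Op::Scale(factor.to_bits())
}

/// Record a kernel name under `key`; returns `true` if the key was new.
fn register(plans: &mut HashMap<Key, String>, key: Key, kernel: &str) -> bool {
    plans.insert(key, kernel.to_string()).is_none()
}
```

Because the chain is an ordered `Vec`, two GEMMs that apply the same ops in a different order hash to different keys, which is the behaviour needed for a fused epilogue whose op order changes the result.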
+
+## The plan cache
+
+`PlanCache` (LRU, capacity set at `CutlassActor` construction)
+stores rendered `.cu` source + lowered kernel name keyed by
+`(template_id, shape, dtype, arch, layout, epilogue)`. The cache
+avoids a per-call NVRTC compile: under Strategy A a warm cache
+hit takes microseconds, while a miss takes tens of seconds.
+
+```rust
+let props = atomr_accel_cutlass::props(/* plan_cache_capacity */ 256);
+let cutlass: ActorRef<CutlassMsg> = system.actor_of(props, "cutlass");
+```
+
+The cache is **per-actor**, not global. If you spawn multiple
+`CutlassActor`s for parallelism, each gets its own cache. The
+underlying NVRTC disk cache is shared (Phase 0.6), so the second
+actor's first call reads from disk — fast, but not as fast as an
+in-process LRU hit.
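
The hit/miss behaviour of a bounded, per-owner LRU can be sketched with the standard library alone. This is an illustration of the general technique, not the crate's `PlanCache`: `Lru` and `get_or_compile` are hypothetical names, and the linear-time recency update is chosen for brevity rather than speed.

```rust
use std::collections::{HashMap, VecDeque};
use std::hash::Hash;

/// Minimal LRU map: the front of `order` is the least recently used key.
struct Lru<K, V> {
    capacity: usize,
    map: HashMap<K, V>,
    order: VecDeque<K>,
}

impl<K: Clone + Eq + Hash, V> Lru<K, V> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Move `key` to the most-recently-used position (O(n) scan).
    fn touch(&mut self, key: &K) {
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            self.order.remove(pos);
        }
        self.order.push_back(key.clone());
    }

    /// Return the cached value, running `compile` only on a miss.
    fn get_or_compile(&mut self, key: K, compile: impl FnOnce() -> V) -> &V {
        if !self.map.contains_key(&key) {
            // Evict the least recently used entry once the cache is full.
            if self.map.len() == self.capacity {
                if let Some(evicted) = self.order.pop_front() {
                    self.map.remove(&evicted);
                }
            }
            self.map.insert(key.clone(), compile());
        }
        self.touch(&key);
        &self.map[&key]
    }
}
```

Each owner holding its own `Lru` gets an independent recency order, which is why a second actor starts with an empty in-process cache even when a shared on-disk cache is already populated.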
+
+## Refitting weights without recompile
+
+```rust
+use atomr_accel_cutlass::{CutlassMsg, RefitMsg};
+
+cutlass.tell(CutlassMsg::Refit {
+    msg: RefitMsg {
+        plan_key: cached_key,    // from a previous Gemm dispatch
+        weights: new_bytes,      // host-side; the actor stages them
+    },
+    reply: Box::new(|res| { /* … */ }),
+});
+```
+
+Refit is for already-compiled plans. The plan key carries the
+template + shape + dtype + arch fingerprint; new weight bytes are
+copied into the kernel's bound workspace. No NVRTC pass.
+
+## Wiring into `ContextActor`
+
+```rust
+let cutlass = system.actor_of(atomr_accel_cutlass::props(64), "cutlass");
+context.tell(ContextMsg::RegisterExtra {
+    name: "cutlass",
+    actor: cutlass.clone().into_dyn(),
+});
+```
+
+`KernelChildren::register_extra` exists exactly for siblings like
+this — the cutlass actor lives next to `BlasActor` / `CudnnActor`
+and dies with them when the context rebuilds.
+
+## Mock vs real
+
+`CutlassInner::compile_sink` is `Option<...>` so the actor records
+rendered `.cu` source + lowered kernel name into the plan cache
+even without an NVRTC actor wired in. This is the host-only test
+path — the smoke test exercises plan-cache discrimination without a
+GPU. In production set `compile_sink` to a closure that forwards
+to `atomr_accel_cuda::kernel::NvrtcActor`.
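
The optional-sink pattern described above can be sketched in isolation. The types below are hypothetical stand-ins, not `CutlassInner` itself: the `CompileSink` signature, `Inner`, and `dispatch` are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical sink signature: receives the kernel name and the rendered source.
type CompileSink = Box<dyn FnMut(&str, &str)>;

struct Inner {
    /// `None` in host-only tests; `Some` forwards to a real compiler in production.
    compile_sink: Option<CompileSink>,
    /// Kernel name -> rendered source, recorded whether or not a sink is set.
    plans: HashMap<String, String>,
}

impl Inner {
    fn new(compile_sink: Option<CompileSink>) -> Self {
        Self { compile_sink, plans: HashMap::new() }
    }

    /// Record the plan first, then forward it to the sink if one is wired in.
    fn dispatch(&mut self, kernel_name: &str, source: &str) {
        self.plans.insert(kernel_name.to_string(), source.to_string());
        if let Some(sink) = self.compile_sink.as_mut() {
            sink(kernel_name, source);
        }
    }
}
```

Recording before forwarding is what lets a host-only test assert on the cache contents with the sink left as `None`.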
+
+## Canonical references
+
+- `crates/atomr-accel-cutlass/src/lib.rs` — public surface,
+  Strategy A/B explainer, arch×dtype matrix.
+- `crates/atomr-accel-cutlass/src/{gemm,grouped_gemm,conv,evt}.rs`
+  — one request type per file.
+- `crates/atomr-accel-cutlass/src/plan_cache.rs` — `PlanCache`
+  + `PlanKey` (`(template_id, shape, dtype, arch, layout,
+  epilogue)`).
+- `crates/atomr-accel-cutlass/src/dtype.rs` — `CutlassDtype`,
+  `is_supported_for`, `GemmSupported`, `SmArch`.
+- `crates/atomr-accel-cutlass/cutlass/include/` — vendored CUTLASS
+  headers (BSD-3-Clause).
+- `crates/atomr-accel-cutlass/tests/cutlass_smoke.rs` — arch×dtype
+  smoke test (host-only).
+- [`docs/features-matrix.md`](../../../docs/features-matrix.md) §
+  `atomr-accel-cutlass` — feature flags + transitive deps.
+
+## Common pitfalls
+
+- **Cold-start latency under Strategy A.** The first call to a new
+  shape kicks off a 30–60s NVRTC compile. Pre-warm at startup by
+  issuing a no-op `GemmRequest` for each canonical shape, or
+  switch to Strategy B if your shape catalogue is fixed.
+- **Forgetting that `cutlass-prebuilt` requires nvcc.** CI fails on
+  no-GPU runners. Either keep Strategy A in CI and B in production,
+  or self-host a CUDA-equipped builder.
+- **Mixing fp8 with sm_80 / sm_86.** `is_fp8_supported(arch)` is
+  false there. The smoke test enforces this; production code
+  should call `is_supported_for` before submitting.
+- **fp4 outside Blackwell.** Only sm_100 / sm_120 support
+  `F4E2m1`. `is_fp4_supported(arch)` returns false elsewhere.
+- **EVT without the feature.** An `EvtBuilder` chain fails to
+  compile when `cutlass-evt` is off, because EVT is not plumbed
+  through plain `GemmEpilogue`. Add the feature explicitly.
+- **Plan-cache reuse across GPUs of different arch.** `PlanKey`
+  includes `arch`, so swapping an sm_80 cubin into an sm_90a context
+  is a cache miss (correctly). Don't try to lift a cached plan to
+  a different arch by editing the key.
+- **Holding a `PlanKey` past a context rebuild.** The same rule
+  applies as for a `KernelHandle` from the NVRTC actor: re-resolve
+  through the actor after `ContextReady` cycles.
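
The pre-warm advice in the first pitfall amounts to touching every canonical shape once at startup. The sketch below models only the bookkeeping, with no GPU or compiler involved: `Warmable`, `request`, and `prewarm` are hypothetical names, and a shape triple stands in for the full plan key.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for an actor's compiled-plan state.
struct Warmable {
    compiled: HashSet<(u32, u32, u32)>,
    compile_count: usize,
}

impl Warmable {
    fn new() -> Self {
        Self { compiled: HashSet::new(), compile_count: 0 }
    }

    /// Returns `true` when the call had to compile (a cold start).
    fn request(&mut self, shape: (u32, u32, u32)) -> bool {
        let cold = self.compiled.insert(shape);
        if cold {
            self.compile_count += 1;
        }
        cold
    }

    /// Touch every canonical shape once so that serving traffic starts warm.
    fn prewarm(&mut self, shapes: &[(u32, u32, u32)]) {
        for &shape in shapes {
            self.request(shape);
        }
    }
}
```

After `prewarm`, only shapes outside the canonical list pay the compile cost, which is the case where a fixed, prebuilt kernel matrix stops being sufficient.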