Phase 1.5 — atomr-accel-cuda Python method-level depth

Phase 1 (just landed) shipped **structural parity** for the CUDA backend: every kernel actor that is auto-spawned by `ContextActor` has a Python handle class, plus multi-dtype buffers and one representative method per actor. This issue tracks the **method-level depth** needed to bring the Python surface to full parity with the Rust API.

## Scope per actor

### cuBLAS (`Device.blas()` → `Blas`)
Phase 1: `gemm_f32`, `gemm_f64`, `axpy_f32`.

Add:
- Strided-batched gemm (`gemm_strided_batched_{f32,f64,f16,bf16}`).
- L1: `dot`, `nrm2`, `scal`, `asum`, `iamax`, `iamin`, `copy`, `swap`, `rot` across f32/f64/f16/bf16.
- L2: `gemv`, `ger`.
- L3: `geam`, `syrk`, `trsm`.
- f16 / bf16 gemm (currently gated behind `f16` feature; expose once).
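
A pure-Python reference can serve as a test oracle for the strided-batched variants. The sketch below is not the binding itself. It assumes column-major storage with tight leading dimensions (`lda = m`, `ldb = k`, `ldc = m`) and no transposes, and the function name and argument order are hypothetical:

```python
from typing import List


def gemm_strided_batched_ref(
    m: int, n: int, k: int,
    alpha: float,
    a: List[float], stride_a: int,
    b: List[float], stride_b: int,
    beta: float,
    c: List[float], stride_c: int,
    batch: int,
) -> None:
    """Compute C[i] = alpha * A[i] @ B[i] + beta * C[i] in place for each batch i.

    Assumption: column-major layout, lda = m, ldb = k, ldc = m, no transposes.
    Matrix i of each operand starts at offset i * stride_* in its flat list.
    """
    for i in range(batch):
        a0, b0, c0 = i * stride_a, i * stride_b, i * stride_c
        for col in range(n):
            for row in range(m):
                acc = 0.0
                for p in range(k):
                    # A[row, p] lives at row + p * m; B[p, col] at p + col * k.
                    acc += a[a0 + row + p * m] * b[b0 + p + col * k]
                idx = c0 + row + col * m
                c[idx] = alpha * acc + beta * c[idx]
```

Comparing a `gemm_strided_batched_f32` result against this oracle on small shapes (with a dtype-appropriate tolerance) is one way to catch stride or layout mistakes in the marshalling layer.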

### cuDNN (`Device.cudnn()` → `Cudnn`, feat: `cudnn`)
Phase 1: `conv2d_fwd_f32`.

Add:
- `conv2d_bwd_data_*`, `conv2d_bwd_filter_*`.
- `pool_fwd` / `pool_bwd` (max + avg).
- `batch_norm`, `layer_norm`, `instance_norm`, `group_norm` (fwd + bwd).
- `softmax_fwd` / `softmax_bwd`, `dropout_fwd`.
- `lrn_fwd`.
- `rnn_fwd` / `rnn_bwd` (LSTM/GRU, uni/bi-directional).
- `multihead_attn_fwd` / `multihead_attn_bwd`.
- All across f32/f64/f16/bf16/i8.
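
With this many op × dtype combinations, a small checklist helper can report which methods a handle still lacks. This is a sketch that assumes every method follows the `<op>_<dtype>` suffix convention seen in `conv2d_fwd_f32`; `FakeCudnn` is a stand-in, not the real handle class:

```python
from itertools import product
from typing import Iterable, List

# Dtype suffixes listed for cuDNN in this issue.
CUDNN_DTYPES = ("f32", "f64", "f16", "bf16", "i8")


def expected_methods(ops: Iterable[str], dtypes: Iterable[str]) -> List[str]:
    """Expand op names and dtype suffixes into '<op>_<dtype>' method names."""
    return [f"{op}_{dt}" for op, dt in product(ops, dtypes)]


def missing_methods(handle: object, ops: Iterable[str], dtypes: Iterable[str]) -> List[str]:
    """Return the expected method names that `handle` does not expose as callables."""
    return [
        name
        for name in expected_methods(ops, dtypes)
        if not callable(getattr(handle, name, None))
    ]


class FakeCudnn:
    """Hypothetical stand-in mirroring Phase 1, which ships only conv2d_fwd_f32."""

    def conv2d_fwd_f32(self, *args):
        raise NotImplementedError("placeholder for illustration")
```

Running `missing_methods(FakeCudnn(), ["conv2d_fwd", "conv2d_bwd_data"], CUDNN_DTYPES)` lists every name except `conv2d_fwd_f32`, which could back a parity test that shrinks as methods land.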

### cuFFT (`Device.fft()` → `Fft`, feat: `cufft`)
Phase 1: handle only.

Add:
- `plan(rank, n, batch, kind)` → `FftPlan` Python class.
- `execute(plan, input, output, direction)` for R2C/C2R/C2C.
- 1-D / 2-D / 3-D plans, plan-many, callback support.
- Across f32/f64/f16/bf16.
- numpy↔complex marshalling (`numpy.complex64` ↔ `cufft_sys::float2`).
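
The complex marshalling can likely be zero-copy on the NumPy side, since `numpy.complex64` stores each element as two consecutive 32-bit floats (real, then imaginary), which is the same layout as a CUDA `float2`. A minimal sketch, with hypothetical helper names, of viewing one representation as the other:

```python
import numpy as np


def complex_to_pairs(z: np.ndarray) -> np.ndarray:
    """View a complex64 array as float32 with a trailing (re, im) axis of length 2.

    No copy is made when `z` is already C-contiguous complex64.
    """
    z = np.ascontiguousarray(z, dtype=np.complex64)
    return z.view(np.float32).reshape(z.shape + (2,))


def pairs_to_complex(p: np.ndarray) -> np.ndarray:
    """Inverse of complex_to_pairs: float32 (..., 2) pairs back to complex64 (...)."""
    p = np.ascontiguousarray(p, dtype=np.float32)
    if p.ndim == 0 or p.shape[-1] != 2:
        raise ValueError("last axis must hold (re, im) pairs")
    return p.view(np.complex64).reshape(p.shape[:-1])
```

How the Rust side receives the buffer is outside this sketch; the point is only that the Python layer need not repack elements before handing the memory over.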

### cuRAND (`Device.rng()` → `RngGenerator`, feat: `curand`)
Phase 1: `set_seed`, `uniform_f32`, `normal_f32`.

Add:
- `set_generator(kind)` (pseudo / quasi).
- Per-dtype variants: `uniform_f64`, `uniform_u32`, `normal_f64`.
- Distributions: `log_normal`, `poisson`, `exponential`, `beta`, `cauchy`, `gamma`, `discrete`.

### cuSOLVER (`Solver`, feat: `cusolver`)
Phase 1: handle class only — actor not auto-spawned by `ContextActor` today.

Add:
- Spawn path: extend `ContextActor` (or go through `KernelChildren::register_extra`) so that `Device.solver()` resolves.
- `lu` (`getrf`/`getrs`), `qr`, `cholesky` (`potrf`), `svd`, `syevd`, `sygvd`, `hegvd`.
- Batched variants (`getrf_batched`, `potrf_batched`, `gesvdj_batched`).
- Across f32/f64/c32/c64.

### NVRTC (`NvrtcKernel`, feat: `nvrtc`)
Phase 1: `name` + `generation` probe.

Add:
- `Device.compile_kernel(name, src, headers)` → `NvrtcKernel`.
- `kernel.launch(grid, block, shared, args)` with typed `KernelArg` marshalling (scalars + `GpuBuffer*`).
- Compile cache (`NvrtcOpts`).
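
One possible shape for the compile cache is a dictionary keyed by a hash of everything that affects the compiled output. The sketch below is an assumption, not the planned design: `cache_key`, `CompileCache`, and the `opts` sequence are hypothetical, and how `NvrtcOpts` actually feeds into the key is not specified in this issue.

```python
import hashlib
from typing import Callable, Dict, Mapping, Optional, Sequence


def _feed(h, text: str) -> None:
    """Hash a string with a length prefix so adjacent fields cannot be confused."""
    data = text.encode("utf-8")
    h.update(len(data).to_bytes(8, "little"))
    h.update(data)


def cache_key(name: str, src: str, headers: Mapping[str, str], opts: Sequence[str] = ()) -> str:
    """Derive a stable key from kernel name, source, headers, and compile options."""
    h = hashlib.sha256()
    _feed(h, name)
    _feed(h, src)
    h.update(len(opts).to_bytes(8, "little"))  # separates options from headers
    for opt in opts:  # option order is preserved; it may matter to the compiler
        _feed(h, opt)
    for header_name in sorted(headers):  # header order should not change the key
        _feed(h, header_name)
        _feed(h, headers[header_name])
    return h.hexdigest()


class CompileCache:
    """Memoize a compile function (hypothetical signature) by cache_key."""

    def __init__(self, compile_fn: Callable[..., object]) -> None:
        self._compile = compile_fn
        self._entries: Dict[str, object] = {}
        self.misses = 0

    def get(self, name: str, src: str,
            headers: Optional[Mapping[str, str]] = None,
            opts: Sequence[str] = ()) -> object:
        headers = headers or {}
        key = cache_key(name, src, headers, opts)
        if key not in self._entries:
            self.misses += 1
            self._entries[key] = self._compile(name, src, headers, opts)
        return self._entries[key]
```

Keying on content rather than on the kernel name alone means an edited source string or a changed option triggers a recompile, while repeated calls with identical inputs return the cached kernel.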

### CUDA graphs
Phase 1: not exposed.

Add:
- `Device.capture_graph()` context manager.
- `capture.instantiate()` → `GraphInstance`.
- `instance.launch()`, `instance.upload(stream)`.
- Replay on a stream.
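
The intended call sequence (capture, instantiate, launch) can be illustrated with a pure-Python mock. Everything here is a stand-in: `MockDevice`, `Capture`, and especially the explicit `record` method are hypothetical, since a real capture would presumably pick up work submitted to the stream rather than require manual recording.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


class GraphInstance:
    """Mock of an instantiated graph: replays the recorded operations in order."""

    def __init__(self, ops: List[Callable[[], object]]) -> None:
        self._ops = tuple(ops)

    def launch(self) -> list:
        return [op() for op in self._ops]


class Capture:
    """Mock capture session that records operations instead of running them."""

    def __init__(self) -> None:
        self._ops: List[Callable[[], object]] = []
        self._open = True

    def record(self, op: Callable[[], object]) -> None:
        # Hypothetical: a real capture would intercept stream work implicitly.
        if not self._open:
            raise RuntimeError("capture already ended")
        self._ops.append(op)

    def instantiate(self) -> GraphInstance:
        if self._open:
            raise RuntimeError("instantiate() called before capture ended")
        return GraphInstance(self._ops)


class MockDevice:
    @contextmanager
    def capture_graph(self) -> Iterator[Capture]:
        capture = Capture()
        try:
            yield capture
        finally:
            capture._open = False  # capture ends when the with-block exits
```

In this mock, nothing recorded inside `with dev.capture_graph() as capture:` runs until `capture.instantiate().launch()` is called, and each further `launch()` replays the same sequence, which mirrors the capture-once, replay-many usage the list above describes.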

### Memory ops
Phase 1: not exposed.

Add:
- `Device.allocate_managed_*(len)` → managed (UVM) buffers.
- `Device.prefetch(buf, target_device_or_cpu)`.
- `Device.advise(buf, kind)`.
- IPC: `buf.ipc_handle()` / `Device.import_ipc(handle)`.

## Non-goals
- Patterns / train / agents / telemetry / cuda-realtime (Phase 2–3, separate issues).
- cub / cutlass / flashattn / tensorrt (Phase 4).
- Async API (Phase 5).
