Phase 1 (just landed) shipped structural parity for the CUDA backend: every kernel actor that's auto-spawned by ContextActor has a Python handle class, plus multi-dtype buffers and a representative method per actor. This issue tracks the method-level depth to bring the Python surface to full parity with the Rust API.
Scope per actor
cuBLAS (Device.blas() → Blas)
Phase 1: gemm_f32, gemm_f64, axpy_f32.
Add:
- Strided-batched gemm (gemm_strided_batched_{f32,f64,f16,bf16}).
- L1: dot, nrm2, scal, asum, iamax, iamin, copy, swap, rot across f32/f64/f16/bf16.
- L2: gemv, ger.
- L3: geam, syrk, trsm.
- f16 / bf16 gemm (currently gated behind f16 feature; expose once).
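A pure-Python reference for the strided-batched gemm semantics may be useful as a test oracle for the new binding. This is only a sketch: it assumes cuBLAS-style column-major storage with a fixed element stride between consecutive matrices, and the function name and argument order are hypothetical, not the planned Python signature.

```python
def gemm_strided_batched_ref(m, n, k, alpha, a, lda, stride_a,
                             b, ldb, stride_b, beta, c, ldc, stride_c, batch):
    """Reference C[p] = alpha * A[p] @ B[p] + beta * C[p] over flat lists.

    Hypothetical helper. Matrices are column-major; matrix p of each operand
    starts at element offset p * stride_x in its flat buffer.
    """
    for p in range(batch):
        a0, b0, c0 = p * stride_a, p * stride_b, p * stride_c
        for j in range(n):          # column of C
            for i in range(m):      # row of C
                acc = 0.0
                for l in range(k):
                    acc += a[a0 + i + l * lda] * b[b0 + l + j * ldb]
                idx = c0 + i + j * ldc
                if beta == 0.0:
                    # Ignore whatever was stored in C when beta is zero.
                    c[idx] = alpha * acc
                else:
                    c[idx] = alpha * acc + beta * c[idx]
    return c
```

A binding test could run the GPU call and this reference on the same small inputs and compare results within a dtype-appropriate tolerance.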
cuDNN (Device.cudnn() → Cudnn, feat: cudnn)
Phase 1: conv2d_fwd_f32.
Add:
- conv2d_bwd_data_*, conv2d_bwd_filter_*.
- pool_fwd / pool_bwd (max + avg).
- batch_norm, layer_norm, instance_norm, group_norm (fwd + bwd).
- softmax_fwd / softmax_bwd, dropout_fwd.
- lrn_fwd.
- rnn_fwd / rnn_bwd (LSTM/GRU, uni/bi-directional).
- multihead_attn_fwd / bwd.
- All across f32/f64/f16/bf16/i8.
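The backward convolution bindings will likely need to validate gradient shapes against the forward output shape. The sketch below computes one spatial output dimension with the standard convolution size formula; the helper name and its use inside the bindings are assumptions, not part of the existing API.

```python
def conv_out_dim(in_dim, kernel, pad=0, stride=1, dilation=1):
    """Output size of one spatial dimension of a convolution.

    Hypothetical helper: 1 + (in + 2*pad - (dilation*(kernel-1) + 1)) // stride.
    """
    effective_kernel = dilation * (kernel - 1) + 1
    span = in_dim + 2 * pad - effective_kernel
    if span < 0:
        raise ValueError("kernel does not fit inside the padded input")
    return span // stride + 1
```

For example, a 3x3 kernel with padding 1 and stride 1 preserves a 32-wide input, so a conv2d_bwd_data_* call for that layer would expect a 32-wide output gradient.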
cuFFT (Device.fft() → Fft, feat: cufft)
Phase 1: handle only.
Add:
- plan(rank, n, batch, kind) → FftPlan Python class.
- execute(plan, input, output, direction) for R2C/C2R/C2C.
- 1-D / 2-D / 3-D plans, plan-many, callback support.
- Across f32/f64/f16/bf16.
- numpy↔complex marshalling (numpy.complex64 ↔ cufft_sys::float2).
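For the complex marshalling item, the layout on both sides is a pair of consecutive f32 values, real part first. The stdlib-only sketch below illustrates that interleaved layout and the R2C output length; the helper names are hypothetical, little-endian byte order is assumed, and a real binding would more likely hand the numpy buffer over without copying.

```python
import struct


def pack_complex64(values):
    """Pack complex numbers as interleaved little-endian f32 pairs (re, im)."""
    flat = []
    for z in values:
        flat.extend((z.real, z.imag))
    return struct.pack(f"<{len(flat)}f", *flat)


def unpack_complex64(raw):
    """Inverse of pack_complex64: 8 bytes per complex element."""
    if len(raw) % 8 != 0:
        raise ValueError("byte length must be a multiple of 8")
    flat = struct.unpack(f"<{len(raw) // 4}f", raw)
    return [complex(re, im) for re, im in zip(flat[0::2], flat[1::2])]


def r2c_output_len(n):
    """Complex elements produced by a 1-D R2C transform of n real inputs."""
    return n // 2 + 1
```

The R2C length helper matters for output buffer allocation: a 1-D real transform of n points only stores the non-redundant half of the spectrum.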
cuRAND (Device.rng() → RngGenerator, feat: curand)
Phase 1: set_seed, uniform_f32, normal_f32.
Add:
- set_generator(kind) (pseudo / quasi).
- Per-dtype variants: uniform_f64, uniform_u32, normal_f64.
- Distributions: log_normal, poisson, exponential, beta, cauchy, gamma, discrete.
cuSOLVER (Solver, feat: cusolver)
Phase 1: handle class only — actor not auto-spawned by ContextActor today.
Add:
- Spawn path: extend ContextActor (or via KernelChildren::register_extra) so Device.solver() resolves.
- lu (getrf/getrs), qr, cholesky (potrf), svd, syevd, sygvd, hegvd.
- Batched variants (getrf_batched, potrf_batched, gesvdj_batched).
- Across f32/f64/c32/c64.
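A small CPU reference for the getrf/getrs pair could serve as a test oracle for the lu binding. This is a sketch under stated assumptions: it works on row-major lists of lists and returns 0-based pivot indices, whereas LAPACK-style routines use column-major storage and 1-based pivots, and the function names are hypothetical.

```python
def getrf_ref(a):
    """In-place LU factorization with partial pivoting.

    Hypothetical reference. On return, the strict lower triangle of `a`
    holds L (unit diagonal implied) and the upper triangle holds U.
    Returns the list of 0-based row swaps applied at each step.
    """
    n = len(a)
    piv = []
    for k in range(n):
        # Pick the row with the largest magnitude in column k.
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        piv.append(p)
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return piv


def getrs_ref(lu, piv, b):
    """Solve A x = b from the output of getrf_ref."""
    n = len(lu)
    x = list(b)
    for k, p in enumerate(piv):      # apply the recorded row swaps
        x[k], x[p] = x[p], x[k]
    for i in range(n):               # forward substitution with unit L
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):     # back substitution with U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x
```

Comparing the GPU result against this reference on a few small, well-conditioned systems would catch layout and pivot-indexing mistakes in the binding.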
NVRTC (NvrtcKernel, feat: nvrtc)
Phase 1: name + generation probe.
Add:
- Device.compile_kernel(name, src, headers) → NvrtcKernel.
- kernel.launch(grid, block, shared, args) with typed KernelArg marshalling (scalars + GpuBuffer*).
- Compile cache (NvrtcOpts).
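One possible shape for the typed KernelArg marshalling is sketched below: each argument is either a dtype-tagged scalar or a buffer passed as its raw device address, and both are encoded to the bytes the launch call would point at. The class names, the dtype tags, and the little-endian 64-bit pointer encoding are assumptions for illustration, not the existing Rust KernelArg definition.

```python
import struct
from dataclasses import dataclass
from typing import Union

# Hypothetical dtype tags mapped to struct formats (little-endian).
_SCALAR_FORMATS = {"i32": "<i", "u32": "<I", "f32": "<f", "f64": "<d"}


@dataclass(frozen=True)
class ScalarArg:
    dtype: str
    value: Union[int, float]


@dataclass(frozen=True)
class BufferArg:
    device_ptr: int  # raw device address of a GPU buffer


def marshal_arg(arg):
    """Encode one kernel argument as the bytes passed by value to the kernel."""
    if isinstance(arg, ScalarArg):
        fmt = _SCALAR_FORMATS.get(arg.dtype)
        if fmt is None:
            raise TypeError(f"unsupported scalar dtype: {arg.dtype}")
        return struct.pack(fmt, arg.value)
    if isinstance(arg, BufferArg):
        return struct.pack("<Q", arg.device_ptr)
    raise TypeError(f"unsupported kernel argument: {type(arg).__name__}")
```

Rejecting untagged Python numbers up front avoids silently passing an f64 where the kernel expects an f32, which is the main reason to make the argument type explicit.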
CUDA graphs
Phase 1: not exposed.
Add:
- Device.capture_graph() context manager.
- capture.instantiate() → GraphInstance.
- instance.launch(), instance.upload(stream).
- Replay on a stream.
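The capture, instantiate, launch lifecycle listed above could look roughly like the CPU-only mock below. It only models the intended Python ergonomics: work issued inside the with block is recorded rather than executed, and the instantiated graph replays it on each launch. The class names mirror the list, but the record method and every other detail are hypothetical.

```python
class GraphInstance:
    """Hypothetical stand-in for an instantiated graph."""

    def __init__(self, ops):
        self._ops = tuple(ops)

    def launch(self):
        # Replay every recorded operation in capture order.
        for op in self._ops:
            op()


class GraphCapture:
    """Hypothetical stand-in for the object returned by Device.capture_graph()."""

    def __init__(self):
        self._ops = []
        self._active = False

    def __enter__(self):
        self._active = True
        return self

    def __exit__(self, exc_type, exc, tb):
        self._active = False
        return False  # never swallow exceptions raised during capture

    def record(self, op):
        if not self._active:
            raise RuntimeError("capture is not active")
        self._ops.append(op)

    def instantiate(self):
        if self._active:
            raise RuntimeError("end the capture before instantiating")
        return GraphInstance(self._ops)
```

A usage sketch under the same assumptions:

```python
log = []
with GraphCapture() as capture:
    capture.record(lambda: log.append("kernel_a"))
    capture.record(lambda: log.append("kernel_b"))
instance = capture.instantiate()
instance.launch()
```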
Memory ops
Phase 1: not exposed.
Add:
- Device.allocate_managed_*(len) → managed (UVM) buffers.
- Device.prefetch(buf, target_device_or_cpu).
- Device.advise(buf, kind).
- IPC: buf.ipc_handle() / Device.import_ipc(handle).
Non-goals
- Patterns / train / agents / telemetry / cuda-realtime (Phase 2–3, separate issues).
- cub / cutlass / flashattn / tensorrt (Phase 4).
- Async API (Phase 5).