Phase 1.5 — atomr-accel-cuda Python method-level depth #1

@cognect

Phase 1 (just landed) shipped structural parity for the CUDA backend: every kernel actor that's auto-spawned by ContextActor has a Python handle class, plus multi-dtype buffers and a representative method per actor. This issue tracks the method-level depth needed to bring the Python surface to full parity with the Rust API.

Scope per actor

cuBLAS (Device.blas() → Blas)

Phase 1: gemm_f32, gemm_f64, axpy_f32.

Add:

  • Strided-batched gemm (gemm_strided_batched_{f32,f64,f16,bf16}).
  • L1: dot, nrm2, scal, asum, iamax, iamin, copy, swap, rot across f32/f64/f16/bf16.
  • L2: gemv, ger.
  • L3: geam, syrk, trsm.
  • f16 / bf16 gemm (currently gated behind f16 feature; expose once).
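
For anyone less familiar with the strided-batched layout, here is a pure-Python reference for the semantics that the gemm_strided_batched_* bindings would need to match. It is a sketch and not the binding itself: the name gemm_strided_batched_ref and its argument order are hypothetical, and it assumes no transposes, column-major storage, and tight leading dimensions (lda = m, ldb = k, ldc = m).

```python
def gemm_strided_batched_ref(m, n, k, alpha, a, stride_a, b, stride_b,
                             beta, c, stride_c, batch):
    """Reference semantics only: C[i] = alpha * A[i] @ B[i] + beta * C[i].

    Hypothetical helper, not part of the library. Buffers are flat Python
    lists in column-major order; matrix i of each operand starts at
    offset i * stride.
    """
    for i in range(batch):
        a0, b0, c0 = i * stride_a, i * stride_b, i * stride_c
        for col in range(n):
            for row in range(m):
                acc = 0.0
                for p in range(k):
                    # Column-major: element (r, c) lives at r + c * ld.
                    acc += a[a0 + row + p * m] * b[b0 + p + col * k]
                idx = c0 + row + col * m
                c[idx] = alpha * acc + beta * c[idx]
    return c


# Two 2x2 problems packed back to back (stride = 4 elements each).
a = [1.0, 3.0, 2.0, 4.0,   # A[0] = [[1, 2], [3, 4]] in column-major order
     1.0, 0.0, 0.0, 1.0]   # A[1] = identity
b = [1.0, 0.0, 0.0, 1.0,   # B[0] = identity
     5.0, 7.0, 6.0, 8.0]   # B[1] = [[5, 6], [7, 8]] in column-major order
c = [0.0] * 8
gemm_strided_batched_ref(2, 2, 2, 1.0, a, 4, b, 4, 0.0, c, 4, batch=2)
```

A reference like this could double as a test oracle for small shapes before comparing against the GPU result.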

cuDNN (Device.cudnn() → Cudnn, feat: cudnn)

Phase 1: conv2d_fwd_f32.

Add:

  • conv2d_bwd_data_*, conv2d_bwd_filter_*.
  • pool_fwd / pool_bwd (max + avg).
  • batch_norm, layer_norm, instance_norm, group_norm (fwd + bwd).
  • softmax_fwd / softmax_bwd, dropout_fwd.
  • lrn_fwd.
  • rnn_fwd / rnn_bwd (LSTM/GRU, uni/bi-directional).
  • multihead_attn_fwd / bwd.
  • All across f32/f64/f16/bf16/i8.

cuFFT (Device.fft() → Fft, feat: cufft)

Phase 1: handle only.

Add:

  • plan(rank, n, batch, kind) → FftPlan Python class.
  • execute(plan, input, output, direction) for R2C/C2R/C2C.
  • 1-D / 2-D / 3-D plans, plan-many, callback support.
  • Across f32/f64/f16/bf16.
  • numpy↔complex marshalling (numpy.complex64 ↔ cufft_sys::float2).
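
On the marshalling item: a numpy.complex64 element is two consecutive float32 values (real, then imaginary), which is the same interleaved layout as a float2 with x = real and y = imag. A minimal sketch of the zero-copy reinterpretation on the NumPy side follows; the helper names complex64_to_float2 and float2_to_complex64 are hypothetical and only illustrate the layout, not how the binding will hand the buffer to the device.

```python
import numpy as np


def complex64_to_float2(z):
    """View a complex64 array as (..., 2) float32 pairs: [real, imag]."""
    z = np.ascontiguousarray(z, dtype=np.complex64)
    return z.view(np.float32).reshape(z.shape + (2,))


def float2_to_complex64(pairs):
    """Inverse: reinterpret (..., 2) float32 pairs as complex64."""
    pairs = np.ascontiguousarray(pairs, dtype=np.float32)
    return pairs.view(np.complex64).reshape(pairs.shape[:-1])


z = np.array([1 + 2j, 3 - 4j], dtype=np.complex64)
pairs = complex64_to_float2(z)       # shape (2, 2), dtype float32
back = float2_to_complex64(pairs)    # shape (2,), dtype complex64
```

Because both directions are views over the same memory when the input is already contiguous and of the right dtype, no copy is needed on the host side.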

cuRAND (Device.rng() → RngGenerator, feat: curand)

Phase 1: set_seed, uniform_f32, normal_f32.

Add:

  • set_generator(kind) (pseudo / quasi).
  • Per-dtype variants: uniform_f64, uniform_u32, normal_f64.
  • Distributions: log_normal, poisson, exponential, beta, cauchy, gamma, discrete.

cuSOLVER (Solver, feat: cusolver)

Phase 1: handle class only — actor not auto-spawned by ContextActor today.

Add:

  • Spawn path: extend ContextActor (or via KernelChildren::register_extra) so Device.solver() resolves.
  • lu (getrf/getrs), qr, cholesky (potrf), svd, syevd, sygvd, hegvd.
  • Batched variants (getrf_batched, potrf_batched, gesvdj_batched).
  • Across f32/f64/c32/c64.

NVRTC (NvrtcKernel, feat: nvrtc)

Phase 1: name + generation probe.

Add:

  • Device.compile_kernel(name, src, headers) → NvrtcKernel.
  • kernel.launch(grid, block, shared, args) with typed KernelArg marshalling (scalars + GpuBuffer*).
  • Compile cache (NvrtcOpts).
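
One possible shape for the compile cache, as a sketch only: key each entry on a digest of everything that can change the build output (kernel name, source, headers, options). The class CompileCache, its compile_fn callback, and the opts tuple are all hypothetical stand-ins here; in particular, opts is a placeholder for whatever NvrtcOpts actually carries.

```python
import hashlib


class CompileCache:
    """In-memory cache keyed by a digest of every input that affects the build."""

    def __init__(self, compile_fn):
        # Hypothetical callback: (name, src, headers, opts) -> compiled artifact.
        self._compile_fn = compile_fn
        self._entries = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(name, src, headers, opts):
        # Header count is included so headers and options cannot be confused.
        parts = [name, src, str(len(headers))]
        for header_name in sorted(headers):
            parts += [header_name, headers[header_name]]
        parts += list(opts)  # option order is kept: it may matter to the compiler
        h = hashlib.sha256()
        for part in parts:
            data = part.encode("utf-8")
            # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
            h.update(len(data).to_bytes(8, "little"))
            h.update(data)
        return h.hexdigest()

    def get_or_compile(self, name, src, headers=None, opts=()):
        headers = headers or {}
        k = self.key(name, src, headers, opts)
        if k in self._entries:
            self.hits += 1
        else:
            self.misses += 1
            self._entries[k] = self._compile_fn(name, src, headers, opts)
        return self._entries[k]


calls = []


def fake_compile(name, src, headers, opts):
    # Stand-in for the real NVRTC call; records that a compile happened.
    calls.append(name)
    return f"ptx:{name}:{len(src)}"


cache = CompileCache(fake_compile)
src = 'extern "C" __global__ void scale(float *x, float a) { x[threadIdx.x] *= a; }'
first = cache.get_or_compile("scale", src, opts=("--use_fast_math",))
second = cache.get_or_compile("scale", src, opts=("--use_fast_math",))
third = cache.get_or_compile("scale", src, opts=())
```

The second call is served from the cache, while the third recompiles because the options differ. Whether the real cache should also key on device generation is an open question this sketch does not address.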

CUDA graphs

Phase 1: not exposed.

Add:

  • Device.capture_graph() context manager.
  • capture.instantiate() → GraphInstance.
  • instance.launch(), instance.upload(stream).
  • Replay on a stream.
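
To make the intended usage concrete, here is a CPU-only mock of the capture / instantiate / launch flow. It assumes the context manager records work without running it and that an instance can be launched repeatedly. Everything except the names capture_graph, instantiate, launch, and GraphInstance is hypothetical, including the record method and the use of plain callables as graph nodes.

```python
from contextlib import contextmanager


class GraphInstance:
    """Mock of an instantiated graph: an immutable list of recorded ops."""

    def __init__(self, ops):
        self._ops = tuple(ops)  # frozen at instantiate time

    def launch(self):
        # Replay every recorded op in capture order.
        return [op() for op in self._ops]


class GraphCapture:
    """Mock capture handle; record() is a hypothetical stand-in for enqueued work."""

    def __init__(self):
        self._ops = []
        self._open = True

    def record(self, op):
        if not self._open:
            raise RuntimeError("capture already ended")
        self._ops.append(op)  # recorded only, not executed

    def instantiate(self):
        if self._open:
            raise RuntimeError("instantiate() called before capture ended")
        return GraphInstance(self._ops)


@contextmanager
def capture_graph():
    cap = GraphCapture()
    try:
        yield cap
    finally:
        cap._open = False  # leaving the block ends capture, even on error


log = []
with capture_graph() as cap:
    cap.record(lambda: log.append("h2d"))
    cap.record(lambda: log.append("gemm"))
ran_during_capture = list(log)   # snapshot: nothing has executed yet

instance = cap.instantiate()
instance.launch()
instance.launch()                # replay the same graph a second time
```

In the real binding the recorded work would presumably be whatever the other actor methods enqueue while capture is active, rather than explicit record calls.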

Memory ops

Phase 1: not exposed.

Add:

  • Device.allocate_managed_*(len) → managed (UVM) buffers.
  • Device.prefetch(buf, target_device_or_cpu).
  • Device.advise(buf, kind).
  • IPC: buf.ipc_handle() / Device.import_ipc(handle).

Non-goals

  • Patterns / train / agents / telemetry / cuda-realtime (Phase 2–3, separate issues).
  • cub / cutlass / flashattn / tensorrt (Phase 4).
  • Async API (Phase 5).
