NTT digit-reversal (reorder_digits_and_normalize) is memory-bound on Blackwell — ~2× recoverable via coalesced permutation #1046

@pscamillo


Summary

While characterizing ICICLE v4.0.0's BabyBear NTT on a consumer Blackwell GPU (RTX 5070, sm_120), I found that reorder_digits_and_normalize is the dominant cost of the forward NTT and leaves significant bandwidth on the table, while the butterfly kernels are already near the memory ceiling. A focused prototype suggests ~2× is recoverable on that pass via a coalesced (shared-memory-staged / COBRA-style) permutation.

Sharing in case it's useful. Full write-up, profiles, and reproducible harnesses: https://github.com/pscamillo/icicle-blackwell-ntt

Measurements (RTX 5070, 2^22, BabyBear, mixed-radix forward NTT)

Per-kernel Nsight Compute:

  • ntt64: ~25% of NTT kernel time, 84.7% of DRAM peak; at the memory ceiling.
  • ntt32dit: ~28% of NTT kernel time, 76.5% of DRAM peak; near the ceiling.
  • reorder_digits_and_normalize: ~39% of NTT kernel time, 46.2% of DRAM peak, with a global-store efficiency of 4 of 32 bytes per sector (12.5%) across 99.8% of sectors. This is the scatter signature of the digit-reversal permutation.
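To illustrate where the 4-of-32 figure comes from, here is a minimal CPU-side sketch that counts how many 32-byte sectors one warp's store touches. It is a model, not ICICLE code: it assumes 32 threads per warp, 4-byte elements, an aligned base, and plain bit reversal as a stand-in for the mixed-radix digit reversal. The names `bit_reverse` and `useful_bytes_per_sector` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Reverse the low `bits` bits of x (plain bit reversal, used here as a
// stand-in for the mixed-radix digit reversal in the real kernel).
inline uint32_t bit_reverse(uint32_t x, unsigned bits) {
    uint32_t r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

// Average useful bytes per 32-byte sector for one warp-wide store.
// Assumptions: 32 lanes, 4-byte elements, 32-byte sectors, aligned buffer.
inline double useful_bytes_per_sector(uint32_t warp_base, unsigned log_n,
                                      bool scatter) {
    constexpr uint32_t kWarp = 32, kElemBytes = 4, kSectorBytes = 32;
    std::set<uint64_t> sectors;  // distinct sectors touched by the warp
    for (uint32_t lane = 0; lane < kWarp; ++lane) {
        const uint32_t src = warp_base + lane;
        const uint32_t dst = scatter ? bit_reverse(src, log_n) : src;
        sectors.insert(uint64_t(dst) * kElemBytes / kSectorBytes);
    }
    return double(kWarp * kElemBytes) / double(sectors.size());
}
```

Under these assumptions, `useful_bytes_per_sector(0, 22, false)` models the coalesced case and `useful_bytes_per_sector(0, 22, true)` the scattered one: consecutive lanes differ in their low bits, which bit reversal moves to the top of the destination index, so every lane lands in its own sector.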

Recoverable headroom (prototype, same GPU / size / dtype)

| kernel | global-store bytes/sector | throughput |
|---|---|---|
| coalesced copy (ceiling) | 32 / 32 | ~1320 GB/s |
| scattered bit-reversal (= reorder pattern; same 4/32 signature ncu flags) | 4 / 32 | ~375 GB/s |
| shared-memory tiled transpose (coalesced permutation) | 32 / 32 | ~780 GB/s (~2.1×) |
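For readers less familiar with the tiled transpose, a CPU-side sketch of the access pattern follows. The local `tile` array stands in for a shared-memory staging buffer. This is an illustration only, not the prototype kernel from the repo; `tiled_transpose` and `kTile` are hypothetical names, and the sketch assumes both dimensions are multiples of the tile size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Tile edge; 32 matches a warp-wide row in the GPU analogue (assumption).
constexpr std::size_t kTile = 32;

// CPU model of a tiled transpose. Assumes rows and cols are multiples of kTile.
inline std::vector<uint32_t> tiled_transpose(const std::vector<uint32_t>& in,
                                             std::size_t rows,
                                             std::size_t cols) {
    std::vector<uint32_t> out(in.size());
    uint32_t tile[kTile][kTile];  // plays the role of shared memory
    for (std::size_t r0 = 0; r0 < rows; r0 += kTile) {
        for (std::size_t c0 = 0; c0 < cols; c0 += kTile) {
            // Load: each tile row is a contiguous run of kTile inputs.
            for (std::size_t r = 0; r < kTile; ++r)
                for (std::size_t c = 0; c < kTile; ++c)
                    tile[r][c] = in[(r0 + r) * cols + (c0 + c)];
            // Store: walk the tile by column so each run of kTile
            // outputs is contiguous as well.
            for (std::size_t c = 0; c < kTile; ++c)
                for (std::size_t r = 0; r < kTile; ++r)
                    out[(c0 + c) * rows + (r0 + r)] = tile[r][c];
        }
    }
    return out;
}
```

The point of the staging buffer is that the strided access happens inside the tile, while both the global reads and the global writes move in contiguous runs. On a GPU the tile would typically be padded to avoid shared-memory bank conflicts, which this CPU model does not need.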

Since the reorder is ~39% of NTT kernel time, halving it would remove roughly 0.39 × 0.5 ≈ 19.5% of that time, hence the estimated ~15–20% end-to-end NTT speedup. The transpose is just the coalescing primitive; a drop-in replacement would be a COBRA-style digit-reversal (tiled staging + index reversal).
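As a rough sketch of what "tiled staging + index reversal" means, here is an out-of-place CPU reference in the COBRA style. It makes several assumptions: plain bit reversal rather than ICICLE's mixed-radix digit reversal, no normalization step, and `log_n >= 2 * q`. The names `reverse_low_bits` and `cobra_bit_reverse` are hypothetical. The index is split as `a | b | c`, with `a` and `c` each `q` bits wide, and the element at `(a, b, c)` moves to `(rev(c), rev(b), rev(a))`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reverse the low `bits` bits of x.
inline uint32_t reverse_low_bits(uint32_t x, unsigned bits) {
    uint32_t r = 0;
    for (unsigned i = 0; i < bits; ++i) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

// COBRA-style out-of-place bit reversal of 2^log_n elements.
// Assumes in.size() == 2^log_n and log_n >= 2 * q.
inline std::vector<uint32_t> cobra_bit_reverse(const std::vector<uint32_t>& in,
                                               unsigned log_n, unsigned q) {
    const unsigned m = log_n - 2 * q;         // width of the middle field b
    const uint32_t t = 1u << q;               // tile edge
    std::vector<uint32_t> out(in.size());
    std::vector<uint32_t> tile(std::size_t(t) * t);  // staging buffer
    for (uint32_t b = 0; b < (1u << m); ++b) {
        const uint32_t br = reverse_low_bits(b, m);
        // Load: for each a, the t inputs over c are contiguous.
        for (uint32_t a = 0; a < t; ++a) {
            const uint32_t ar = reverse_low_bits(a, q);
            for (uint32_t c = 0; c < t; ++c)
                tile[std::size_t(ar) * t + c] =
                    in[(a << (m + q)) | (b << q) | c];
        }
        // Store: for each c, the t outputs over rev(a) are contiguous.
        for (uint32_t c = 0; c < t; ++c) {
            const uint32_t cr = reverse_low_bits(c, q);
            for (uint32_t ar = 0; ar < t; ++ar)
                out[(cr << (m + q)) | (br << q) | ar] =
                    tile[std::size_t(ar) * t + c];
        }
    }
    return out;
}
```

In a GPU version the tile would live in shared memory and one thread block would handle one value of `b`, so both the load and the store phases issue contiguous runs of `2^q` elements. Whether this carries over cleanly to the mixed-radix digit order used by `reorder_digits_and_normalize` would need to be checked against the actual kernel.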

Environment

ICICLE v4.0.0 (babybear frontend built from source; CUDA backend ubuntu22-cuda122), CUDA 12.9, driver 575.57.08, RTX 5070 (sm_120), Ubuntu 24.04. Single machine; all numbers and harnesses are in the linked repo.

Questions

  1. Would a coalesced reimplementation of the digit-reversal/reorder pass be welcome, or is it already being addressed in newer work?
  2. (Minor, separate) The shipped v4.0.0 CUDA backend has no sm_120 cubin — it runs on Blackwell via compute_89 PTX-JIT. Are official sm_120 / sm_100 binaries planned?

Happy to share more detail or test patches on Blackwell.
