Summary
While characterizing ICICLE v4.0.0's BabyBear NTT on a consumer Blackwell GPU (RTX 5070, sm_120), I found that reorder_digits_and_normalize is the dominant cost of the forward NTT and leaves significant bandwidth on the table, while the butterfly kernels are already near the memory ceiling. A focused prototype suggests ~2× is recoverable on that pass via a coalesced (shared-memory-staged / COBRA-style) permutation.
Sharing in case it's useful. Full write-up, profiles, and reproducible harnesses: https://github.com/pscamillo/icicle-blackwell-ntt
Measurements (RTX 5070, 2^22, BabyBear, mixed-radix forward NTT)
Per-kernel Nsight Compute:
- ntt64: ~25% of NTT kernel time, 84.7% of DRAM peak (at the memory ceiling).
- ntt32dit: ~28% of NTT kernel time, 76.5% of DRAM peak (near the ceiling).
- reorder_digits_and_normalize: ~39% of NTT kernel time, 46.2% of DRAM peak, with global-store efficiency of 4 of 32 bytes per sector (12.5%) across 99.8% of sectors. This is the scatter signature of the digit-reversal permutation.
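To illustrate where the 4-of-32 figure comes from, here is a rough host-side model in Python. It is not ICICLE code. It assumes each element is stored as one 32-bit word, that a warp of 32 threads stores one element per lane, and that plain bit-reversal stands in for the mixed-radix digit reversal; `bit_reverse` and `store_efficiency` are hypothetical helper names.

```python
SECTOR_BYTES = 32  # sector granularity used in the bytes-per-sector metric
ELEM_BYTES = 4     # assumption: one field element stored as a 32-bit word
WARP = 32          # threads per warp


def bit_reverse(i: int, bits: int) -> int:
    """Reverse the low `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def store_efficiency(dest_index, log_n: int, warp_id: int = 0) -> float:
    """Bytes written by one warp divided by bytes in the sectors it touches."""
    base = warp_id * WARP
    # Byte address that each lane of the warp stores to.
    addrs = [dest_index(base + lane, log_n) * ELEM_BYTES for lane in range(WARP)]
    sectors = {addr // SECTOR_BYTES for addr in addrs}
    return (WARP * ELEM_BYTES) / (len(sectors) * SECTOR_BYTES)


# Identity mapping: consecutive lanes store to consecutive addresses.
coalesced = store_efficiency(lambda i, log_n: i, 22)
# Bit-reversed mapping at size 2^22: every lane lands in a different sector.
scattered = store_efficiency(bit_reverse, 22)

print(f"coalesced: {coalesced:.3f}, scattered: {scattered:.3f}")
```

In this model the coalesced case fills every touched sector, while the bit-reversed case uses one 4-byte word per 32-byte sector, which matches the pattern in the profile above.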
Recoverable headroom (prototype, same GPU / size / dtype)
| kernel | global-store bytes/sector | throughput |
|---|---|---|
| coalesced copy (ceiling) | 32 / 32 | ~1320 GB/s |
| scattered bit-reversal (= reorder pattern; same 4/32 signature ncu flags) | 4 / 32 | ~375 GB/s |
| shared-memory tiled transpose (coalesced permutation) | 32 / 32 | ~780 GB/s (~2.1×) |
Since the reorder is ~39% of NTT kernel time, halving it would remove roughly 0.39 × 0.5 ≈ 20% of that time, which I estimate as a ~15–20% end-to-end NTT speedup. The transpose only demonstrates the coalescing primitive; a drop-in replacement would be a COBRA-style digit reversal (tiled staging plus index reversal).
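For concreteness, here is a minimal sketch of the COBRA-style index arithmetic, written as a host-side Python model rather than a CUDA kernel. It assumes a power-of-two size and plain bit-reversal (the real pass is a mixed-radix digit reversal with normalization, which this does not attempt). The `tile` list stands in for the shared-memory staging buffer, and `cobra_style_bit_reverse` is a hypothetical name.

```python
def bit_reverse(i: int, bits: int) -> int:
    """Reverse the low `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def cobra_style_bit_reverse(src, log_n: int, q: int):
    """Return out with out[bit_reverse(i)] = src[i], using 2^q x 2^q tiles.

    Each index is split as i = a:b:c, with a and c being q bits each.
    Its reversal is rev(c):rev(b):rev(a), so global reads are contiguous
    in c and global writes are contiguous in rev(a).
    """
    assert 2 * q <= log_n
    mid_bits = log_n - 2 * q
    dim = 1 << q
    rev_q = [bit_reverse(x, q) for x in range(dim)]
    out = [None] * (1 << log_n)
    for b in range(1 << mid_bits):
        b_rev = bit_reverse(b, mid_bits)
        # Load phase: each tile row a is a contiguous run of `dim` source elements.
        tile = [
            [src[(a << (log_n - q)) | (b << q) | c] for c in range(dim)]
            for a in range(dim)
        ]
        # Store phase: each destination row is a contiguous run of `dim` elements.
        for c in range(dim):
            row_base = (rev_q[c] << (log_n - q)) | (b_rev << q)
            for a_rev in range(dim):
                # The permuted access happens inside the tile, not in global memory.
                out[row_base | a_rev] = tile[rev_q[a_rev]][c]
    return out


log_n, q = 10, 3
src = list(range(1 << log_n))
out = cobra_style_bit_reverse(src, log_n, q)
```

The point of the split is that the only non-contiguous accesses are the tile reads, which on a GPU would hit shared memory instead of DRAM sectors.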
Environment
ICICLE v4.0.0 (babybear frontend built from source; CUDA backend ubuntu22-cuda122), CUDA 12.9, driver 575.57.08, RTX 5070 (sm_120), Ubuntu 24.04. Single machine; all numbers and harnesses are in the linked repo.
Questions
- Would a coalesced reimplementation of the digit-reversal/reorder pass be welcome, or is it already being addressed in newer work?
- (Minor, separate) The shipped v4.0.0 CUDA backend has no sm_120 cubin; it runs on Blackwell via compute_89 PTX-JIT. Are official sm_120 / sm_100 binaries planned?
Happy to share more detail or test patches on Blackwell.