NTT digit-reversal (reorder_digits_and_normalize) is memory-bound on Blackwell: ~2× recoverable via coalesced permutation #1046

## Summary

While characterizing ICICLE v4.0.0's BabyBear NTT on a consumer Blackwell GPU (RTX 5070, sm_120), I found that `reorder_digits_and_normalize` is the largest single cost of the forward NTT (~39% of kernel time) yet reaches only 46.2% of DRAM peak, while the butterfly kernels are already near the memory ceiling. A focused prototype suggests ~2× is recoverable on that pass via a coalesced (shared-memory-staged / COBRA-style) permutation.

Sharing in case it's useful. Full write-up, profiles, and reproducible harnesses: https://github.com/pscamillo/icicle-blackwell-ntt

## Measurements (RTX 5070, 2^22, BabyBear, mixed-radix forward NTT)

Per-kernel Nsight Compute:

- `ntt64` — ~25% of NTT kernel time, **84.7%** of DRAM peak — at the memory ceiling.
- `ntt32dit` — ~28%, **76.5%** of DRAM peak — near the ceiling.
- `reorder_digits_and_normalize` — **~39% of NTT kernel time, 46.2% of DRAM peak**, with global-store efficiency of **4 of 32 bytes per sector (12.5%)** across 99.8% of sectors — the scatter signature of the digit-reversal permutation.
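
As a rough cross-check of the 4-of-32-bytes figure, the sketch below models one warp writing 4-byte elements to bit-reversed destinations and counts the 32-byte sectors it touches. It assumes a plain radix-2 bit reversal over 2^22 elements as a stand-in for the kernel's actual mixed-radix digit reversal, and the function names are illustrative, not ICICLE's.

```python
def bit_reverse(i: int, bits: int) -> int:
    """Reverse the low `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def sectors_touched_by_warp(first_thread: int, log_n: int,
                            elem_bytes: int = 4, sector_bytes: int = 32,
                            warp_size: int = 32) -> int:
    """Count distinct sectors written by one warp doing out[rev(i)] = in[i]."""
    sectors = set()
    for lane in range(warp_size):
        dst_index = bit_reverse(first_thread + lane, log_n)
        sectors.add(dst_index * elem_bytes // sector_bytes)
    return len(sectors)


LOG_N = 22  # assumed problem size, matching the 2^22 measurement
touched = sectors_touched_by_warp(0, LOG_N)
useful_bytes_per_sector = 32 * 4 / touched  # 128 useful bytes per warp store
efficiency = useful_bytes_per_sector / 32
print(touched, useful_bytes_per_sector, efficiency)
```

Under these assumptions every lane lands in its own sector, so a warp touches 32 sectors for 128 useful bytes, i.e. 4 bytes per sector (12.5%), which is the same signature `ncu` reports.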

## Recoverable headroom (prototype, same GPU / size / dtype)

| kernel | global-store bytes/sector | throughput |
|---|---|---|
| coalesced copy (ceiling) | 32 / 32 | ~1320 GB/s |
| scattered bit-reversal (= reorder pattern; same 4/32 signature `ncu` flags) | 4 / 32 | ~375 GB/s |
| shared-memory tiled transpose (coalesced permutation) | 32 / 32 | ~780 GB/s (**~2.1×**) |
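
To make the third row more concrete, here is a minimal CPU sketch of the staging idea applied to a plain bit reversal (COBRA-style). It is not the prototype kernel and not ICICLE code: it assumes radix-2 bit reversal rather than mixed-radix digit reversal, skips normalization, and `cobra_bit_reverse` and the tile parameter `q` are hypothetical names. The index is split into top `q` bits `a`, middle bits `b`, and low `q` bits `c`; a tile is filled with contiguous reads, then drained with contiguous writes.

```python
def bit_reverse(i: int, bits: int) -> int:
    """Reverse the low `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def cobra_bit_reverse(src, log_n: int, q: int):
    """Out-of-place bit-reversal permutation staged through 2^q x 2^q tiles."""
    assert len(src) == 1 << log_n and 2 * q <= log_n
    mid_bits = log_n - 2 * q
    tile_dim = 1 << q
    rev_q = [bit_reverse(i, q) for i in range(tile_dim)]
    dst = [None] * len(src)
    for b in range(1 << mid_bits):
        b_rev = bit_reverse(b, mid_bits)
        # Stage: row a of the tile is a contiguous run of 2^q source elements.
        tile = []
        for a in range(tile_dim):
            base = (a << (mid_bits + q)) | (b << q)
            tile.append(src[base:base + tile_dim])
        # Drain: for each source column c, write one contiguous run of 2^q
        # destination elements; the reversal happens inside the tile lookup.
        for c in range(tile_dim):
            base = (rev_q[c] << (mid_bits + q)) | (b_rev << q)
            dst[base:base + tile_dim] = [tile[rev_q[a_rev]][c]
                                         for a_rev in range(tile_dim)]
    return dst


log_n, q = 10, 3
src = list(range(1 << log_n))
expected = [src[bit_reverse(j, log_n)] for j in range(1 << log_n)]
assert cobra_bit_reverse(src, log_n, q) == expected
```

With 4-byte elements and `q >= 3`, each contiguous run covers at least one full 32-byte sector on both the read and the write side. On a GPU the tile would presumably live in shared memory, where the strided tile lookup is cheap compared with scattered global stores.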

Since the reorder pass is ~39% of NTT kernel time, halving it maps to an estimated **~15–20% end-to-end NTT speedup**. The transpose here only demonstrates the coalescing primitive; a drop-in replacement would be a COBRA-style digit-reversal (tiled staging + index reversal).
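
The estimate can be sanity-checked with an Amdahl-style calculation. The sketch below takes the ~39% share and the ~2× to ~2.1× per-pass gain from the numbers above and assumes the other kernels are unchanged; `end_to_end_gain` is an illustrative helper, not part of any harness.

```python
def end_to_end_gain(pass_fraction: float, pass_speedup: float):
    """Return (fraction of total time saved, overall speedup factor)."""
    new_time = (1.0 - pass_fraction) + pass_fraction / pass_speedup
    return 1.0 - new_time, 1.0 / new_time


for s in (2.0, 2.1):
    saved, factor = end_to_end_gain(0.39, s)
    print(f"reorder {s:.1f}x faster: {saved:.1%} of kernel time saved, "
          f"{factor:.2f}x overall")
```

This bounds the saving at roughly a fifth of NTT kernel time. Anything outside the three profiled kernels dilutes that, which is consistent with quoting a range rather than a single figure.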

## Environment

ICICLE v4.0.0 (babybear frontend built from source; CUDA backend `ubuntu22-cuda122`), CUDA 12.9, driver 575.57.08, RTX 5070 (sm_120), Ubuntu 24.04. Single machine; all numbers and harnesses are in the linked repo.

## Questions

1. Would a coalesced reimplementation of the digit-reversal/reorder pass be welcome, or is it already being addressed in newer work?
2. (Minor, separate) The shipped v4.0.0 CUDA backend has no sm_120 cubin — it runs on Blackwell via `compute_89` PTX-JIT. Are official sm_120 / sm_100 binaries planned?

Happy to share more detail or test patches on Blackwell.
