[Megatron] Add RMSNorm benchmark + measured H100 results#1257

Merged
vaibhavjindal merged 1 commit into main from feat/megatron-rms-norm-benchmark
Jun 10, 2026
@vaibhavjindal vaibhavjindal commented Jun 10, 2026

Collaborator

Summary

Adds benchmark/scripts/benchmark_megatron_rms_norm.py — the RMSNorm parallel to the Megatron cross-entropy benchmark landed in #1207. Provides empirical speed/memory numbers so we can point at concrete data when claiming RMSNorm wins, instead of just citing the kernel sweep.

Output goes to the shared benchmark/data/all_benchmark_data.csv (tagged kernel_name="megatron_rms_norm"), so the standard visualizer renders the plots:

python benchmark/benchmarks_visualizer.py \
    --kernel-name megatron_rms_norm --metric-name speed
python benchmark/benchmarks_visualizer.py \
    --kernel-name megatron_rms_norm --metric-name memory

Providers compared

  • liger: LigerMegatronRMSNorm (Liger's Triton RMSNorm via the Megatron-shaped wrapper from #1254)
  • torch: vanilla torch.nn.RMSNorm
  • megatron: Megatron's WrappedTorchNorm. Its __new__ returns torch.nn.RMSNorm, so timings should match torch exactly. Included for explicit parity confirmation, since WrappedTorchNorm is the symbol Liger displaces in the local-backend path.

If megatron-core is not installed, the megatron provider is silently dropped and the run proceeds with liger + torch.
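The optional-provider behaviour described above can be sketched as follows. This is a minimal illustration, not the benchmark script's actual code: the function name `available_providers` and the probing approach via `importlib.util.find_spec` are assumptions made for the example.

```python
import importlib.util


def available_providers(optional_module="megatron.core"):
    """Return the provider names to benchmark.

    Hypothetical helper: 'liger' and 'torch' are always present, and
    'megatron' is appended only if the optional module can be located.
    """
    providers = ["liger", "torch"]
    try:
        # find_spec imports the parent package of a dotted name, so a
        # missing parent raises ModuleNotFoundError instead of returning None.
        found = importlib.util.find_spec(optional_module) is not None
    except ModuleNotFoundError:
        found = False
    if found:
        providers.append("megatron")
    return providers


print(available_providers())
```

Probing for the module up front keeps the provider list static for the whole sweep, so every hidden size is measured against the same set of providers.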

Results on H100 80GB (S=4096, B=1, bf16)

60 rows committed in the CSV (5 hidden sizes × 3 providers × 4 measurements).
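The row count can be reproduced by enumerating the sweep. The sketch below assumes the four measurements are the three speed modes plus full-pass memory (as listed under Testing Done); the labels used for them are illustrative, not the CSV's actual metric names.

```python
from itertools import product

hidden_sizes = [1024, 2048, 4096, 8192, 16384]
providers = ["liger", "torch", "megatron"]
# Illustrative labels: three speed modes plus one memory measurement.
measurements = ["speed/forward", "speed/backward", "speed/full", "memory/full"]

# One CSV row per (hidden size, provider, measurement) combination.
rows = list(product(hidden_sizes, providers, measurements))
print(len(rows))  # 5 * 3 * 4 = 60
```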

Speed — forward

| H | liger | torch | speedup |
|---|---|---|---|
| 1024 | 0.012 ms | 0.013 ms | ~flat |
| 2048 | 0.018 ms | 0.019 ms | ~6% |
| 4096 | 0.029 ms | 0.033 ms | ~12% |
| 8192 | 0.051 ms | 0.074 ms | ~31% |
| 16384 | 0.095 ms | 0.149 ms | ~36% |

Liger forward wins, and the gap widens with hidden size.
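The speedup column appears to be the relative reduction in forward time, 1 − liger/torch. A short sketch recomputing it from the rounded table values is below. Because the inputs are already rounded to 0.001 ms, the small-H rows can differ slightly from the reported figures, which were presumably computed from unrounded timings.

```python
# Forward timings in ms, copied from the table: H -> (liger, torch).
forward_ms = {
    1024: (0.012, 0.013),
    2048: (0.018, 0.019),
    4096: (0.029, 0.033),
    8192: (0.051, 0.074),
    16384: (0.095, 0.149),
}


def time_reduction(liger_ms, torch_ms):
    """Fraction of torch's time that Liger saves (positive means Liger is faster)."""
    return 1.0 - liger_ms / torch_ms


for h, (liger_ms, torch_ms) in forward_ms.items():
    print(f"H={h}: {time_reduction(liger_ms, torch_ms):.1%}")
```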

Speed — full (fwd + bwd)

| H | liger | torch | notes |
|---|---|---|---|
| 1024 | 0.30 ms | 0.075 ms | Liger SLOWER (Triton launch overhead) |
| 4096 | 0.31 ms | 0.20 ms | Liger slower |
| 8192 | 0.32 ms | 0.40 ms | crossover |
| 16384 | 0.59 ms | 0.77 ms | ~24% faster |

The flat Liger curve at small H is the giveaway: kernel launch overhead dominates. nn.RMSNorm's backward is a single fused C++/CUDA kernel, whereas Liger's backward launches multiple Triton kernels (dx + dw reduction + element_mul). At small H the actual compute is tiny relative to per-launch overhead, so the per-kernel gains from Triton are swamped by launch cost. Once H reaches roughly 6K, compute dominates and Liger wins.
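The ~6K crossover can be sanity-checked from the table by linearly interpolating the liger − torch gap between the H=4096 and H=8192 rows. This is only a rough estimate: it assumes both curves are approximately linear in H over that interval, and the helper name `crossover` is introduced here for illustration.

```python
def crossover(h_lo, h_hi, gap_lo, gap_hi):
    """Estimate the H where gap = liger_ms - torch_ms crosses zero.

    Uses straight-line interpolation between two measured points, so the
    result is only as good as the linearity assumption.
    """
    if gap_lo * gap_hi > 0:
        raise ValueError("gap does not change sign between the two points")
    frac = gap_lo / (gap_lo - gap_hi)
    return h_lo + frac * (h_hi - h_lo)


# Full (fwd + bwd) timings in ms from the table above.
h_star = crossover(4096, 8192, 0.31 - 0.20, 0.32 - 0.40)
print(f"estimated crossover near H = {h_star:.0f}")  # roughly 6.5K
```

The estimate lands between 6K and 7K, in line with the stated crossover.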

Memory — full

Approximately neutral across all H — Liger uses 0.5–1% more than torch, within measurement noise.

nn.RMSNorm is already a single fused CUDA kernel with minimal intermediates, so Liger doesn't get the activation-memory win it gets when replacing eager-PyTorch RMSNorm (which materializes variance + rsqrt + scale separately). Speed is the win in this comparison, not memory.
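For reference, RMSNorm in its usual formulation is shown below. The symbols (input row x of length H, weight w, and stabilizer ε) are notation introduced here, not taken from the benchmark script.

```latex
y_i = \frac{x_i}{\sqrt{\frac{1}{H}\sum_{j=1}^{H} x_j^{2} + \varepsilon}} \; w_i ,
\qquad i = 1, \dots, H
```

An eager implementation evaluates the mean square, the reciprocal square root, and the scaled product as separate tensor operations, each with its own intermediate. A fused kernel computes y in one pass, so those intermediates never reach global memory.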

Parity sanity check

torch and megatron rows are bit-identical across the entire sweep — exactly what we'd expect since WrappedTorchNorm is a factory that returns nn.RMSNorm. Confirms Liger is replacing the right baseline.
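A parity check of this kind can be scripted against the CSV. The sketch below is an assumption-laden illustration: only `kernel_name` is confirmed by this PR, while the other column names (`kernel_provider`, `metric_name`, `x_value`, `y_value`) and the helper `parity_mismatches` are hypothetical. The sample rows reuse the forward timings from the table above.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-in for the shared CSV; column names other than
# kernel_name are assumed for illustration.
SAMPLE = """kernel_name,kernel_provider,metric_name,x_value,y_value
megatron_rms_norm,torch,speed_forward,1024,0.013
megatron_rms_norm,megatron,speed_forward,1024,0.013
megatron_rms_norm,torch,speed_forward,4096,0.033
megatron_rms_norm,megatron,speed_forward,4096,0.033
"""


def parity_mismatches(csv_text, a="torch", b="megatron"):
    """Return (metric, x) keys where providers a and b report different values."""
    by_key = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["kernel_name"] != "megatron_rms_norm":
            continue
        key = (row["metric_name"], row["x_value"])
        # Compare the raw strings so the check is exact, not approximate.
        by_key[key][row["kernel_provider"]] = row["y_value"]
    return [k for k, v in by_key.items() if a in v and b in v and v[a] != v[b]]


print(parity_mismatches(SAMPLE))  # empty list means the providers agree
```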

Honest read

Production LLMs run at H ≥ 4096. At those sizes:

  • Forward: Liger wins ~12–36%
  • Full: Liger wins from ~H=6K onward
  • Memory: neutral

So Liger's RMSNorm is a real speed improvement for typical training shapes. The regression on full at small H (through H=4096 in this sweep) is launch overhead, not a numerical/math issue.

Plots

  • Memory: megatron_rms_norm_memory_full_token_length
  • Backward speed: megatron_rms_norm_speed_backward_token_length
  • Forward speed: megatron_rms_norm_speed_forward_token_length
  • Forward + Backward speed: megatron_rms_norm_speed_full_token_length

Testing Done

  • Hardware Type: H100 80GB HBM3
  • Benchmark runs end-to-end on H100 with all 3 providers active
  • Standard visualizer renders all 4 PNGs (3 speed modes + memory) from the shared CSV
  • torch / megatron providers produce bit-identical numbers (parity confirmation)
  • make checkstyle passes

Adds benchmark/scripts/benchmark_megatron_rms_norm.py — the RMSNorm
parallel to the megatron CE benchmark landed in #1207. Compares three
providers on the [seq, batch, hidden] shape used by Megatron's
TransformerBlock:

  - **liger**     LigerMegatronRMSNorm (Liger Triton kernel via the
                  Megatron-shaped wrapper)
  - **torch**     vanilla torch.nn.RMSNorm
  - **megatron**  Megatron's WrappedTorchNorm — its __new__ returns
                  torch.nn.RMSNorm, so timings should match `torch`;
                  included for explicit parity confirmation since
                  WrappedTorchNorm is the specific symbol Liger
                  displaces in the local-backend path

If megatron-core is not installed, the `megatron` provider is silently
dropped and the run proceeds with liger + torch.

Output goes to the shared benchmark/data/all_benchmark_data.csv,
tagged kernel_name="megatron_rms_norm". Standard visualizer renders
the plots:

    python benchmark/benchmarks_visualizer.py \
        --kernel-name megatron_rms_norm --metric-name speed
    python benchmark/benchmarks_visualizer.py \
        --kernel-name megatron_rms_norm --metric-name memory

H100 results (S=4096, B=1, bf16; 60 rows committed in the CSV):

  Forward (Liger wins, gap widens with H):
    H=1024:   liger 0.012 ms  vs  torch 0.013 ms  ≈ flat
    H=4096:   liger 0.029 ms  vs  torch 0.033 ms  ~12% faster
    H=16384:  liger 0.095 ms  vs  torch 0.149 ms  ~36% faster

  Full (fwd+bwd) — crossover around H≈6K:
    H=1024:   liger 0.30 ms  vs  torch 0.075 ms  Liger SLOWER
                  (Triton launch overhead dominates at tiny hidden)
    H=8192:   liger 0.32 ms  vs  torch 0.40 ms  ~20% faster
    H=16384:  liger 0.59 ms  vs  torch 0.77 ms  ~24% faster

  Memory: ~neutral across all H (Liger 0.5–1% higher than torch).
  nn.RMSNorm is already a single fused CUDA kernel, so Liger doesn't
  reduce activation memory in this comparison — speed is the win.

  Parity check: `torch` and `megatron` rows are bit-identical, as
  expected (WrappedTorchNorm returns nn.RMSNorm).

Production LLMs run at H >= 4096, where Liger wins on forward and, from
H≈6K onward, on full. The small-H regression on full is launch overhead,
not a math regression: Liger's backward launches multiple Triton
kernels while PyTorch's fused C++ backward is a single launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vaibhavjindal vaibhavjindal marked this pull request as ready for review June 10, 2026 21:20
@vaibhavjindal vaibhavjindal added this pull request to the merge queue Jun 10, 2026
Merged via the queue into main with commit 27fe47e Jun 10, 2026
5 checks passed
@vaibhavjindal vaibhavjindal deleted the feat/megatron-rms-norm-benchmark branch June 10, 2026 21:23