[Megatron] Add RMSNorm benchmark + measured H100 results by vaibhavjindal · Pull Request #1257 · linkedin/Liger-Kernel

vaibhavjindal · 2026-06-10T21:14:40Z

Summary

Adds benchmark/scripts/benchmark_megatron_rms_norm.py, the RMSNorm counterpart to the Megatron cross-entropy benchmark that landed in #1207. It provides measured speed and memory numbers, so claims of RMSNorm wins can point at concrete data instead of only citing the kernel sweep.

Output goes to the shared benchmark/data/all_benchmark_data.csv (tagged kernel_name="megatron_rms_norm"), so the standard visualizer renders the plots:

python benchmark/benchmarks_visualizer.py \
    --kernel-name megatron_rms_norm --metric-name speed
python benchmark/benchmarks_visualizer.py \
    --kernel-name megatron_rms_norm --metric-name memory

Providers compared

liger: LigerMegatronRMSNorm (Liger's Triton RMSNorm via the Megatron-shaped wrapper added in #1254, "[Megatron] Add RMSNorm integration")
torch: vanilla torch.nn.RMSNorm
megatron: Megatron's WrappedTorchNorm. Its __new__ returns torch.nn.RMSNorm, so timings should match torch exactly. Included for explicit parity confirmation, since WrappedTorchNorm is the symbol Liger displaces in the local-backend path.

If megatron-core is not installed, the megatron provider is silently dropped and the run proceeds with liger + torch.
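The silent-drop behaviour can be sketched as an availability check that runs before the provider list is built. This is a minimal illustration, not the benchmark script's actual code: the function name `available_providers` is hypothetical, and `megatron.core` is assumed to be the import name provided by the megatron-core package.

```python
import importlib.util


def available_providers(optional_module: str = "megatron.core") -> list[str]:
    """Return the provider names to benchmark (hypothetical helper).

    liger and torch are always included; megatron is added only when the
    optional module can be located.
    """
    providers = ["liger", "torch"]
    try:
        # find_spec imports the parent package first, so a missing parent
        # raises instead of returning None; treat both cases as "not installed".
        found = importlib.util.find_spec(optional_module) is not None
    except (ImportError, ValueError):
        found = False
    if found:
        providers.append("megatron")
    return providers


print(available_providers())
```

Checking availability up front keeps the sweep loop itself free of import handling, at the cost of not reporting why a provider is missing.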

Results on H100 80GB (S=4096, B=1, bf16)

60 rows committed in the CSV (5 hidden sizes × 3 providers × 4 measurements).
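The row count can be reproduced by enumerating the sweep. The hidden sizes come from the forward-speed table below; the split of the four measurements into three speed modes plus one memory measurement is an assumption based on the "3 speed modes + memory" note under Testing Done.

```python
from itertools import product

hidden_sizes = [1024, 2048, 4096, 8192, 16384]
providers = ["liger", "torch", "megatron"]
# Assumed breakdown of the "4 measurements" per (hidden size, provider) pair.
measurements = [
    ("speed", "forward"),
    ("speed", "backward"),
    ("speed", "full"),
    ("memory", "full"),
]

rows = list(product(hidden_sizes, providers, measurements))
print(len(rows))  # 60
```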

Speed — forward

H	liger	torch	speedup
1024	0.012 ms	0.013 ms	~flat
2048	0.018 ms	0.019 ms	~6%
4096	0.029 ms	0.033 ms	~12%
8192	0.051 ms	0.074 ms	~31%
16384	0.095 ms	0.149 ms	~36%

Liger forward wins, and the gap widens with hidden size.
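The speedup column appears to be the relative reduction in time versus torch, i.e. (torch - liger) / torch. That definition is an assumption; it reproduces the quoted figures for H ≥ 4096, while the small-H entries are sensitive to rounding in the three-decimal timings. A quick check against the table values:

```python
# (liger_ms, torch_ms) per hidden size, copied from the forward-speed table.
forward_ms = {
    1024: (0.012, 0.013),
    2048: (0.018, 0.019),
    4096: (0.029, 0.033),
    8192: (0.051, 0.074),
    16384: (0.095, 0.149),
}


def time_reduction_pct(liger_ms: float, torch_ms: float) -> float:
    """Percentage of torch's time that Liger saves (assumed definition)."""
    return 100.0 * (torch_ms - liger_ms) / torch_ms


for h, (liger_ms, torch_ms) in forward_ms.items():
    pct = time_reduction_pct(liger_ms, torch_ms)
    print(f"H={h}: Liger takes {pct:.1f}% less time than torch")
```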

Speed — full (fwd + bwd)

H	liger	torch	notes
1024	0.30 ms	0.075 ms	Liger ~4x slower (Triton launch overhead)
4096	0.31 ms	0.20 ms	Liger ~1.5x slower
8192	0.32 ms	0.40 ms	Liger ~20% faster (just past the crossover)
16384	0.59 ms	0.77 ms	Liger ~24% faster

The flat Liger curve at small H is the giveaway: Liger's full time stays at roughly 0.30–0.32 ms from H=1024 through H=8192, which means kernel launch overhead dominates. nn.RMSNorm's backward is a single fused C++/CUDA kernel; Liger's backward launches multiple Triton kernels (dx + dw reduction + element_mul). At small H the actual compute is tiny relative to the per-launch overhead, so the faster Triton math is swamped by the fixed launch cost. At H ≥ ~6K compute dominates and Liger wins.
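The ~6K crossover can be sanity-checked from the table alone. The sketch below assumes both curves are linear between the two measured sizes that bracket the crossover (H=4096 and H=8192), which is only a rough approximation; the helper name `crossover_h` is hypothetical.

```python
def crossover_h(h_lo, h_hi, liger_lo, liger_hi, torch_lo, torch_hi):
    """Estimate the hidden size where the Liger and torch timings meet.

    Both curves are linearly interpolated between h_lo and h_hi.
    """
    gap_lo = liger_lo - torch_lo  # positive: Liger slower at h_lo
    gap_hi = liger_hi - torch_hi  # negative: Liger faster at h_hi
    if gap_lo == gap_hi or gap_lo * gap_hi > 0:
        raise ValueError("no single crossover inside this interval")
    t = gap_lo / (gap_lo - gap_hi)  # fraction of the way from h_lo to h_hi
    return h_lo + t * (h_hi - h_lo)


# Full (fwd + bwd) timings in ms from the table above.
estimate = crossover_h(4096, 8192, 0.31, 0.32, 0.20, 0.40)
print(f"estimated crossover near H={estimate:.0f}")
```

With the table values the estimate lands near 6.5K, consistent with the "~6K" figure quoted above.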

Memory — full

Approximately neutral across all H — Liger uses 0.5–1% more than torch, within measurement noise.

nn.RMSNorm is already a single fused CUDA kernel with minimal intermediates, so Liger doesn't get the activation-memory win it gets when replacing eager-PyTorch RMSNorm (which materializes variance + rsqrt + scale separately). Speed is the win in this comparison, not memory.
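To make the "separate intermediates" point concrete, here is the RMSNorm formula, y = x / sqrt(mean(x²) + eps) · weight, written step by step for a single row in plain Python. It is an illustrative sketch rather than PyTorch's or Liger's implementation; in an eager tensor version each step would typically allocate its own activation-sized tensor, which is the overhead a fused kernel avoids.

```python
import math


def rms_norm_eager(x, weight, eps=1e-6):
    """Reference RMSNorm over one row, with each intermediate kept separate."""
    # Intermediate 1: mean of squares (the "variance" term).
    mean_sq = sum(v * v for v in x) / len(x)
    # Intermediate 2: reciprocal square root.
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Intermediate 3: normalized activations, materialized as their own list.
    normed = [v * inv_rms for v in x]
    # Final step: elementwise scale by the learned weight.
    return [n * w for n, w in zip(normed, weight)]


print(rms_norm_eager([3.0, 4.0], [1.0, 1.0], eps=0.0))
```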

Parity sanity check

torch and megatron rows are bit-identical across the entire sweep — exactly what we'd expect since WrappedTorchNorm is a factory that returns nn.RMSNorm. Confirms Liger is replacing the right baseline.
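The factory behaviour behind this parity can be illustrated with a small stand-alone sketch. `BaseNorm` and `WrappedNorm` are hypothetical stand-ins for nn.RMSNorm and WrappedTorchNorm; the real Megatron class takes different arguments, so only the `__new__` mechanism is shown.

```python
class BaseNorm:
    """Stand-in for the underlying norm module (hypothetical)."""

    def __init__(self, hidden_size, eps=1e-6):
        self.hidden_size = hidden_size
        self.eps = eps


class WrappedNorm:
    """Factory class: constructing it hands back a BaseNorm instance."""

    def __new__(cls, hidden_size, eps=1e-6):
        # Returning an object that is not an instance of cls means Python
        # skips WrappedNorm.__init__, so the caller gets a plain BaseNorm.
        return BaseNorm(hidden_size, eps=eps)


norm = WrappedNorm(4096)
print(type(norm).__name__)  # BaseNorm
```

Because the caller ends up with the very same class, any benchmark of the wrapper exercises the same code path as the wrapped module.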

Honest read

Production LLMs run at H ≥ 4096. At those sizes:

Forward: Liger wins ~12–36%
Full: Liger is still slower at H=4096 (0.31 ms vs 0.20 ms) and wins from ~H=6K onward
Memory: neutral

So Liger's RMSNorm is a real speed improvement for typical training shapes once H exceeds ~6K. The regression on full below that point is launch overhead, not a numerical/math issue.

Plots

Memory

Backward speed

[Plot: megatron_rms_norm_speed_backward_token_length]

Forward speed

[Plot: megatron_rms_norm_speed_forward_token_length]

Forward + Backward speed

[Plot: megatron_rms_norm_speed_full_token_length]

Testing Done

Hardware Type: H100 80GB HBM3
Benchmark runs end-to-end on H100 with all 3 providers active
Standard visualizer renders all 4 PNGs (3 speed modes + memory) from the shared CSV
torch / megatron providers produce bit-identical numbers (parity confirmation)
make checkstyle passes
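The parity confirmation can be automated with a short check over the CSV. This is a sketch under assumptions: the column names (`kernel_name`, `kernel_provider`, `metric_name`, `kernel_operation_mode`, `x_value`, `y_value_50`) are assumed here and should be adjusted to the actual header of all_benchmark_data.csv, and the helper name `parity_mismatches` is hypothetical. The sample rows reuse the H=4096 forward timings reported above.

```python
import csv
import io
from collections import defaultdict


def parity_mismatches(csv_text, kernel_name="megatron_rms_norm",
                      a="torch", b="megatron"):
    """Return the (metric, mode, x) keys where providers a and b differ.

    Column names are assumptions about the CSV schema, not confirmed.
    """
    by_key = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["kernel_name"] != kernel_name:
            continue
        key = (row["metric_name"], row["kernel_operation_mode"], row["x_value"])
        by_key[key][row["kernel_provider"]] = row["y_value_50"]
    # Compare the raw strings so only exactly equal values count as parity.
    return [key for key, vals in by_key.items()
            if a in vals and b in vals and vals[a] != vals[b]]


SAMPLE = """kernel_name,kernel_provider,metric_name,kernel_operation_mode,x_value,y_value_50
megatron_rms_norm,torch,speed,forward,4096,0.033
megatron_rms_norm,megatron,speed,forward,4096,0.033
megatron_rms_norm,liger,speed,forward,4096,0.029
"""

print(parity_mismatches(SAMPLE))  # []
```

To run it against the real file, pass the file's contents, for example `parity_mismatches(open("benchmark/data/all_benchmark_data.csv").read())`.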

Adds benchmark/scripts/benchmark_megatron_rms_norm.py, the RMSNorm parallel to the megatron CE benchmark landed in #1207. Compares three providers on the [seq, batch, hidden] shape used by Megatron's TransformerBlock:

- **liger**: LigerMegatronRMSNorm (Liger Triton kernel via the Megatron-shaped wrapper)
- **torch**: vanilla torch.nn.RMSNorm
- **megatron**: Megatron's WrappedTorchNorm. Its __new__ returns torch.nn.RMSNorm, so timings should match `torch`; included for explicit parity confirmation since WrappedTorchNorm is the specific symbol Liger displaces in the local-backend path

If megatron-core is not installed, the `megatron` provider is silently dropped and the run proceeds with liger + torch.

Output goes to the shared benchmark/data/all_benchmark_data.csv, tagged kernel_name="megatron_rms_norm". Standard visualizer renders the plots:

python benchmark/benchmarks_visualizer.py \
    --kernel-name megatron_rms_norm --metric-name speed
python benchmark/benchmarks_visualizer.py \
    --kernel-name megatron_rms_norm --metric-name memory

H100 results (S=4096, B=1, bf16; 60 rows committed in the CSV):

Forward (Liger wins, gap widens with H):
  H=1024: liger 0.012 ms vs torch 0.013 ms, ≈ flat
  H=4096: liger 0.029 ms vs torch 0.033 ms, ~12% faster
  H=16384: liger 0.095 ms vs torch 0.149 ms, ~36% faster

Full (fwd+bwd), crossover around H≈6K:
  H=1024: liger 0.30 ms vs torch 0.075 ms, Liger SLOWER (Triton launch overhead dominates at tiny hidden)
  H=8192: roughly equal ~0.33 ms
  H=16384: liger 0.59 ms vs torch 0.77 ms, ~24% faster

Memory: ~neutral across all H (Liger 0.5–1% higher than torch). nn.RMSNorm is already a single fused CUDA kernel, so Liger doesn't reduce activation memory in this comparison; speed is the win.

Parity check: `torch` and `megatron` rows are bit-identical, as expected (WrappedTorchNorm returns nn.RMSNorm).

Production LLMs run at H >= 4096, where Liger wins on forward and breaks even / wins on full. The tiny-H regression is launch overhead, not a math regression: Liger's backward launches multiple Triton kernels while PyTorch's fused C++ backward is a single launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vaibhavjindal marked this pull request as ready for review June 10, 2026 21:20

vaibhavjindal requested review from Mecoli1219, kolehma8 and yueyiming2009 June 10, 2026 21:20

kolehma8 approved these changes Jun 10, 2026

vaibhavjindal added this pull request to the merge queue Jun 10, 2026

Merged via the queue into main with commit 27fe47e Jun 10, 2026
5 checks passed

vaibhavjindal deleted the feat/megatron-rms-norm-benchmark branch June 10, 2026 21:23

vaibhavjindal merged 1 commit into main from feat/megatron-rms-norm-benchmark
