2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -100,7 +100,7 @@ target_compile_options(
-std=c++17
-DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED
-DNDEBUG
- -Xptxas=-warn-double-usage,-warn-spills,-Werror,-v
+ -Xptxas=-warn-double-usage,-v
>
$<$<COMPILE_LANGUAGE:CXX>:
-Wno-unused-result
66 changes: 66 additions & 0 deletions benchmark/fuse_allreduce_rmsorm/README.md
@@ -0,0 +1,66 @@
# Fused AllReduce + RMSNorm Benchmark

This directory contains the benchmark that reproduces the latency of the fused
AllReduce + Residual + RMSNorm operator.

## Figure Mapping

- Operator: `RMSNorm(AllReduce(x) + residual, weight)`
- Hardware expectation: single 8-GPU SM90/H20 node with NVLink/NVSwitch
- Default dtype: BF16
- Default hidden size: `7168`
- Supported hidden sizes: `4096`, `5120`, `7168`
- Timing mode: CUDA Graph replay with per-step median latency by default
- Default samples: `--warmup 5 --iters 50 --rounds 3`
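
For orientation, the operator expression above can be read as the following pure-Python reference. This is only a sketch of the math, not the HPC-Ops kernel: the function names are hypothetical, AllReduce is assumed to be a sum over ranks, and the `eps` value and float accumulation are assumptions that this README does not specify.

```python
import math


def allreduce_sum(per_rank):
    """Element-wise sum over ranks: per_rank[r][i] -> out[i] (sum is assumed)."""
    return [sum(vals) for vals in zip(*per_rank)]


def rmsnorm(h, weight, eps=1e-6):
    """y[i] = h[i] / sqrt(mean(h^2) + eps) * weight[i]; eps is an assumed value."""
    mean_sq = sum(v * v for v in h) / len(h)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(h, weight)]


def fused_reference(per_rank, residual, weight, eps=1e-6):
    """RMSNorm(AllReduce(x) + residual, weight) for a single token."""
    reduced = allreduce_sum(per_rank)
    h = [a + b for a, b in zip(reduced, residual)]
    return rmsnorm(h, weight, eps)


# Two simulated ranks, one token, hidden size 4 (tiny, for illustration only).
x_rank0 = [1.0, 2.0, 3.0, 4.0]
x_rank1 = [1.0, 0.0, -1.0, 0.0]
residual = [0.0, 0.0, 0.0, 0.0]
weight = [1.0, 1.0, 1.0, 1.0]

y = fused_reference([x_rank0, x_rank1], residual, weight)
print(y)
```

With unit weights, the mean of the squared outputs is close to 1, which is a quick sanity check for any RMSNorm reference.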

The underlying HPC-Ops kernels enforce:

```cpp
TORCH_CHECK(hidden_size == 4096 || hidden_size == 5120 || hidden_size == 7168,
"unsupported hidden_size");
```

The benchmark rejects any other `--hidden` value before workers are launched,
so users see the list of supported hidden sizes directly instead of a
lower-level kernel error.
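
An early check of this kind might look like the sketch below. Only the `--hidden` flag, its default, and the three supported sizes come from this README; the helper names `build_parser` and `validate_hidden` are hypothetical, and the actual benchmark script may structure its validation differently.

```python
import argparse

# Mirrors the sizes accepted by the TORCH_CHECK shown above.
SUPPORTED_HIDDEN_SIZES = (4096, 5120, 7168)


def build_parser():
    # Hypothetical parser with only the flag relevant to this check.
    parser = argparse.ArgumentParser(description="hidden-size validation sketch")
    parser.add_argument("--hidden", type=int, default=7168)
    return parser


def validate_hidden(hidden):
    """Fail fast in the parent process, before any worker is spawned."""
    if hidden not in SUPPORTED_HIDDEN_SIZES:
        supported = ", ".join(str(s) for s in SUPPORTED_HIDDEN_SIZES)
        raise ValueError(
            f"unsupported --hidden {hidden}; supported hidden sizes: {supported}"
        )
    return hidden


args = build_parser().parse_args(["--hidden", "5120"])
hidden = validate_hidden(args.hidden)
print(hidden)
```

Validating in the parent process means the error is raised once, rather than once per worker.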

## Recommended Reproduction Command

Run from the repository root:

```bash
cd benchmark/fuse_allreduce_rmsorm/
python3 bench_allreduce_rmsnorm.py \
--hidden 7168 \
--tokens 8 32 128 512 4096 8192 16384 32768 \
--fi-backend mnnvl \
--csv allreduce_rmsnorm.csv \
--jsonl allreduce_rmsnorm.jsonl
```

The benchmark spawns 8 local worker processes itself, so `torchrun` is not required.

The default timing path follows the FusedMoE replay methodology at the CUDA Graph
level: warmup, graph capture, replay warmup, then per-step median latency. Each
measured graph replay is preceded by a rank-level synchronize and barrier, which
keeps the collective kernels in lockstep so that peer launch jitter is not counted
as device latency. Timing is repeated for several rounds (`--rounds`); the
benchmark reports the best round median and prints all round medians in the log,
which reduces sensitivity to occasional OS scheduling or fabric noise.

Use `--no-graph` for eager event timing. Nsight Systems profiling is not enabled
by default because this benchmark launches 8 local collective worker processes.
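
The round-level aggregation described above can be sketched as follows. The function name `summarize_rounds` and the sample values are made up for illustration, and "best" is assumed to mean the lowest median, since the metric is latency.

```python
from statistics import median


def summarize_rounds(round_samples_us):
    """Return (best round median, all round medians) for per-step latencies in us."""
    round_medians = [median(samples) for samples in round_samples_us]
    # Assumption: the "best" round is the one with the lowest median latency.
    return min(round_medians), round_medians


# Three rounds of made-up per-step latencies; the first round has one noisy step.
samples = [
    [10.0, 11.0, 50.0],
    [9.0, 9.5, 10.0],
    [12.0, 12.0, 13.0],
]

best_us, all_medians = summarize_rounds(samples)
print(f"round medians: {all_medians}, best: {best_us}")
```

Taking the median within a round discards isolated outliers such as the 50.0 us step, and taking the best across rounds discards a round that was disturbed as a whole.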

If a provider is not explicitly listed in `--skip` but fails to import, initialize,
or run in the current environment, the benchmark prints a warning and skips that
provider instead of aborting the whole sweep. This is useful when FlashInfer or
NCCL/HPC-Ops dependencies are not available in a local reproduction environment.
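
One way to get this warn-and-skip behavior is sketched below. The provider names, the `run_providers` helper, and the use of the `warnings` module are assumptions for illustration; the benchmark may report skipped providers differently.

```python
import warnings


def run_providers(providers, skip=()):
    """Run each provider; warn and continue if one fails instead of aborting."""
    results = {}
    for name, fn in providers.items():
        if name in skip:
            continue  # Explicitly skipped by the user: no warning needed.
        try:
            results[name] = fn()
        except Exception as exc:  # Import, initialization, or runtime failure.
            warnings.warn(f"skipping provider {name!r}: {exc}")
    return results


def broken_provider():
    # Stands in for a dependency that is missing in the local environment.
    raise ImportError("flashinfer is not installed")


providers = {
    "nccl": lambda: 42.0,
    "flashinfer": broken_provider,
    "hpc_ops_ll": lambda: 30.0,
}

# Record warnings so the demo can show how many providers were skipped.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    results = run_providers(providers, skip=("hpc_ops_ll",))

print(results, len(caught))
```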

## Output Fields

- `hpc_ops_ht_us`: latency of `fuse_allreduce_rmsnorm_high_throughout`
- `hpc_ops_ll_us`: latency of `fuse_allreduce_rmsnorm_low_latency`
- `nccl_us`: NCCL AllReduce + fused add/RMSNorm baseline
- `flashinfer_us`: FlashInfer baseline, if available
- `hpc_best_us`: `min(hpc_ops_ht_us, hpc_ops_ll_us)`
- `baseline_best_us`: `min(nccl_us, flashinfer_us)`
- `hpc_best_speedup`: `baseline_best_us / hpc_best_us`
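
The three derived fields in this list can be computed from the measured ones as in the sketch below. Representing an unavailable provider as `None` and ignoring it in the `min` is an assumption; this README does not state how the benchmark encodes missing values.

```python
def derive_fields(row):
    """Compute the derived fields from one row of measured latencies (us)."""

    def best(*keys):
        # Ignore providers that were skipped or unavailable (assumed to be None).
        values = [row[k] for k in keys if row.get(k) is not None]
        return min(values) if values else None

    hpc_best = best("hpc_ops_ht_us", "hpc_ops_ll_us")
    baseline_best = best("nccl_us", "flashinfer_us")
    speedup = None
    if hpc_best and baseline_best:
        speedup = baseline_best / hpc_best
    return {
        "hpc_best_us": hpc_best,
        "baseline_best_us": baseline_best,
        "hpc_best_speedup": speedup,
    }


# Made-up measurements; FlashInfer is treated as unavailable here.
row = {
    "hpc_ops_ht_us": 20.0,
    "hpc_ops_ll_us": 16.0,
    "nccl_us": 40.0,
    "flashinfer_us": None,
}

derived = derive_fields(row)
print(derived)
```

A speedup above 1 means the better HPC-Ops kernel is faster than the better baseline for that token count.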

Use `--no-check` to skip correctness checks.