Currently supports testing attention, gemm, fused MOE, normalization, and quantization.
  - `trtllm_fp8_block_scale_moe` - MOE with FP8 quantized weights and block-wise scaling.
  - `trtllm_fp8_per_tensor_scale_moe` - MOE with FP8 quantized weights and per-tensor scaling.
  - `cutlass_fused_moe` - CUTLASS fused MoE (base/fp8/nvfp4 variants with optional TP/EP).
- MOE Communication:
  - `moe_a2a_dispatch_combine` - MoE All-to-All dispatch + combine benchmark for multi-GPU expert-parallel inference. Requires `mpirun` for multi-GPU execution. Supports optional quantization (FP8, NVFP4, FP8 block-scale) and real MoE kernel computation (see the communication-pattern sketch after this list).
- Norm:
  - `rmsnorm` - Root Mean Square Layer Normalization.
  - `rmsnorm_quant` - RMSNorm with FP8 quantized output.

Notes:
- FP8 MOE kernels require integer values for group parameters, while FP4 MOE kernels accept optional values.
- CUTLASS fused MoE (`cutlass_fused_moe`) ignores `--routing_method`, `--n_group`, and `--topk_group`; it computes routing via softmax+top-k internally from the provided logits (see the sketch below).
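A minimal sketch of that softmax+top-k selection, assuming PyTorch and illustrative shapes; it mirrors the selection logic only, not the fused kernel implementation:

```python
import torch

# Illustrative softmax + top-k routing over router logits; the shapes and the
# use of PyTorch here are assumptions for exposition, not the kernel's code.
num_tokens, num_experts, top_k = 8, 64, 2
logits = torch.randn(num_tokens, num_experts)              # per-token router logits
probs = torch.softmax(logits.float(), dim=-1)              # expert probabilities
topk_weights, topk_ids = torch.topk(probs, top_k, dim=-1)  # selected experts per token
```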
### MoE Communication Flags (moe_a2a_dispatch_combine)

The `moe_a2a_dispatch_combine` routine benchmarks MoE All-to-All communication for multi-GPU expert-parallel inference. It must be launched with `mpirun`; an example invocation follows the flag table below.
| Flag | Description |
|------|-------------|
| `--real_math` | Run actual MoE kernels instead of fake computation. Requires `--intermediate_size` to be set and `--quant_dtype` to be `nvfp4` or `fp8_block_scale`. |
| `--intermediate_size` | Intermediate FFN size. Required if `--real_math` is set. |
| `--max_num_tokens` | Max tokens per rank for workspace allocation. Defaults to `--num_tokens`. |
| `--validate` | Run correctness validation before benchmarking, using a deterministic fake MoE. |
| `--per_phase_timing` | Enable per-phase timing (dispatch/combine/moe_kernel). Adds slight overhead from CUDA events. |
| `--nvtx` | Enable NVTX markers for Nsight Systems profiling. |
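As a usage sketch, a four-rank run with real MoE math and per-phase timing might look like the following; the driver script name and the `--routine` selector are assumptions based on the surrounding benchmark suite, so adjust them to the actual entry point:

```bash
# Hypothetical invocation: the script name and --routine are assumptions;
# only the flags documented above are taken from this README.
mpirun -np 4 python3 flashinfer_benchmark.py \
    --routine moe_a2a_dispatch_combine \
    --num_tokens 512 \
    --quant_dtype fp8_block_scale \
    --intermediate_size 4096 \
    --real_math \
    --per_phase_timing
```

With `--nvtx` added, the same command can be wrapped in `nsys profile` to capture the markers in Nsight Systems.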