[Bug] Autotune picks suboptimal tactic for trtllm_fp4_block_scale_moe kernel at small batch sizes #2504

@XiaotongJiang

Summary

When autotuning the trtllm_fp4_block_scale_moe kernel, we notice that at small batch sizes the autotuner consistently picks a suboptimal tactic, even though the optimal config falls under the same cache key.
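
For context, tuning is triggered roughly as sketched below. This is a minimal sketch assuming the `autotune` context manager from `flashinfer.autotuner`; the actual kernel invocation and its tensor setup are elided, since the argument list depends on the model config.

```python
# Minimal sketch of how the kernel gets autotuned (kernel arguments elided;
# the commented-out call stands in for the production invocation).
from flashinfer.autotuner import autotune

with autotune(True):  # tuning mode: the tuner profiles candidate tactics
    for _ in range(3):
        # trtllm_fp4_block_scale_moe(<batch-size-1 inputs, weights, scales, ...>)
        pass

# Calls made after the context exits reuse the tactic cached for this key,
# which is where we observe the suboptimal pick at batch size 1.
```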

Example

For batch size 1:

Our autotuner picks:

bmm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x32x256_s5_et128x32_m256x32x32_cga2x1x1_16dp256b_rM_TN_transOut_schedS_biasM_bN_tma_tmaOpt_clmp_swiGlu_dynBatch_sm100f

(tile 32, s5 scheduler)

whereas the optimal tactic is:

bmm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x16x256u2_s6_et128x16_m256x16x32_cga2x1x1_16dp256b_rM_TN_transOut_schedP2x1x2x3_biasM_bN_tma_tmaOpt_clmp_swiGlu_dynBatch_sm100f

(tile 16, s6 scheduler)
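
The gap between the two tactics can be measured with a small CUDA-event timing harness like the one below. Here `run_moe` is a hypothetical placeholder closure around our trtllm_fp4_block_scale_moe call at batch size 1; it is not a flashinfer API.

```python
import torch

def benchmark_ms(fn, iters=100, warmup=20):
    """Mean milliseconds per call of a CUDA callable, timed with CUDA events."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# run_moe = lambda: trtllm_fp4_block_scale_moe(...)  # placeholder for our call
# print(f"{benchmark_ms(run_moe):.3f} ms/call at batch size 1")
```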

Questions

  1. What could be causing the autotuner to select suboptimal tactics at small batch sizes?
  2. Is there a script or guidance available for generating tuning_configs for a kernel? We would like to work around the autotune issue with a pre-configured tuning file similar to:
    https://github.com/flashinfer-ai/flashinfer/blob/main/flashinfer/tuning_configs/v0_1_trtllm_fused_moe_NVIDIA_B200.py

Environment

  • Kernel: trtllm_fp4_block_scale_moe
  • Batch size: 1 (and other small batch sizes)
  • GPU: SM100 (B200)
