Summary
When using autotune for the trtllm_fp4_block_scale_moe kernel, we notice that for small batch sizes autotune consistently picks a suboptimal tactic compared to the optimal config, even for the same cache key.
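For reference, this is roughly how we enable autotuning (a minimal sketch, assuming flashinfer's autotune context manager and the flashinfer.fused_moe entry point; the actual argument list for trtllm_fp4_block_scale_moe is omitted here since it depends on the model's routing/weight tensors):

```python
import torch
from flashinfer.autotuner import autotune
from flashinfer.fused_moe import trtllm_fp4_block_scale_moe

def run_moe(moe_kwargs):
    # Warm-up call with tuning enabled: the autotuner profiles the candidate
    # tactics for the current cache key (here batch size 1) and caches the
    # winner. `moe_kwargs` is a placeholder for the real routing logits,
    # FP4 weights/scales, expert counts, etc.
    with torch.inference_mode(), autotune(True):
        trtllm_fp4_block_scale_moe(**moe_kwargs)
    # Subsequent calls reuse the cached tactic for the same cache key.
    with torch.inference_mode():
        return trtllm_fp4_block_scale_moe(**moe_kwargs)
```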
Example
For batch size 1:
- autotune picks: bmm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x32x256_s5_et128x32_m256x32x32_cga2x1x1_16dp256b_rM_TN_transOut_schedS_biasM_bN_tma_tmaOpt_clmp_swiGlu_dynBatch_sm100f (tile 32, s5, schedS)
- the optimal config is: bmm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x16x256u2_s6_et128x16_m256x16x32_cga2x1x1_16dp256b_rM_TN_transOut_schedP2x1x2x3_biasM_bN_tma_tmaOpt_clmp_swiGlu_dynBatch_sm100f (tile 16, s6, schedP2x1x2x3)
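To make the gap explicit, the short sketch below (plain Python, no flashinfer dependency; both strings copied verbatim from above) splits the two tactic names on '_' and prints the fields that differ, which appear to be the tile shape, stage count, epilogue tile, MMA shape, and scheduler:

```python
# The two tactic names from the example above, copied verbatim.
picked = (
    "bmm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x32x256_s5_et128x32_"
    "m256x32x32_cga2x1x1_16dp256b_rM_TN_transOut_schedS_biasM_bN_"
    "tma_tmaOpt_clmp_swiGlu_dynBatch_sm100f"
)
optimal = (
    "bmm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x16x256u2_s6_et128x16_"
    "m256x16x32_cga2x1x1_16dp256b_rM_TN_transOut_schedP2x1x2x3_"
    "biasM_bN_tma_tmaOpt_clmp_swiGlu_dynBatch_sm100f"
)

# Both names have the same number of '_'-separated fields, so a pairwise
# comparison lines them up; only the differing fields are printed.
for got, want in zip(picked.split("_"), optimal.split("_")):
    if got != want:
        print(f"autotune picked {got:<14} optimal is {want}")
```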
Questions
- What could be causing autotune to select suboptimal tactics for small batch sizes?
- Is there a script or guidance available for generating tuning_configs for kernels? We would like to bypass the autotune issue by using a pre-configured tuning file similar to:
  https://github.com/flashinfer-ai/flashinfer/blob/main/flashinfer/tuning_configs/v0_1_trtllm_fused_moe_NVIDIA_B200.py
Environment
- Kernel: trtllm_fp4_block_scale_moe
- Batch size: 1 (and other small batch sizes)
- GPU: SM100 (B200)