Summary
When using autotune for the trtllm_fp4_block_scale_moe kernel, we notice that for small batch sizes autotune consistently picks a suboptimal tactic compared to the optimal config, even for the same cache key.
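For reference, this is roughly how we enable autotuning (a minimal sketch, assuming flashinfer's autotune context manager and the flashinfer.fused_moe entry point; the actual argument list for trtllm_fp4_block_scale_moe is omitted here since it depends on the model's routing/weight tensors):

```python
import torch
from flashinfer.autotuner import autotune
from flashinfer.fused_moe import trtllm_fp4_block_scale_moe

def run_moe(moe_kwargs):
    # Warm-up call with tuning enabled: the autotuner profiles the candidate
    # tactics for the current cache key (here batch size 1) and caches the
    # winner. `moe_kwargs` is a placeholder for the real routing logits,
    # FP4 weights/scales, expert counts, etc.
    with torch.inference_mode(), autotune(True):
        trtllm_fp4_block_scale_moe(**moe_kwargs)
    # Subsequent calls reuse the cached tactic for the same cache key.
    with torch.inference_mode():
        return trtllm_fp4_block_scale_moe(**moe_kwargs)
```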
Example
For batch size 1:
- autotune picks: bmm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x32x256_s5_et128x32_m256x32x32_cga2x1x1_16dp256b_rM_TN_transOut_schedS_biasM_bN_tma_tmaOpt_clmp_swiGlu_dynBatch_sm100f (tile 32, s5, schedS)
- the optimal config is: bmm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x16x256u2_s6_et128x16_m256x16x32_cga2x1x1_16dp256b_rM_TN_transOut_schedP2x1x2x3_biasM_bN_tma_tmaOpt_clmp_swiGlu_dynBatch_sm100f (tile 16, s6, schedP2x1x2x3)
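To make the gap explicit, the short sketch below (plain Python, no flashinfer dependency; both strings copied verbatim from above) splits the two tactic names on '_' and prints the fields that differ, which appear to be the tile shape, stage count, epilogue tile, MMA shape, and scheduler:

```python
# The two tactic names from the example above, copied verbatim.
picked = (
    "bmm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x32x256_s5_et128x32_"
    "m256x32x32_cga2x1x1_16dp256b_rM_TN_transOut_schedS_biasM_bN_"
    "tma_tmaOpt_clmp_swiGlu_dynBatch_sm100f"
)
optimal = (
    "bmm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x16x256u2_s6_et128x16_"
    "m256x16x32_cga2x1x1_16dp256b_rM_TN_transOut_schedP2x1x2x3_"
    "biasM_bN_tma_tmaOpt_clmp_swiGlu_dynBatch_sm100f"
)

# Both names have the same number of '_'-separated fields, so a pairwise
# comparison lines them up; only the differing fields are printed.
for got, want in zip(picked.split("_"), optimal.split("_")):
    if got != want:
        print(f"autotune picked {got:<14} optimal is {want}")
```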
Questions
- What could be causing autotune to select suboptimal tactics for small batch sizes?
- Is there a script or guidance available for generating tuning_configs for kernels? We would like to bypass the autotune issue by using a pre-configured tuning file similar to:
  https://github.com/flashinfer-ai/flashinfer/blob/main/flashinfer/tuning_configs/v0_1_trtllm_fused_moe_NVIDIA_B200.py
Environment
- Kernel: trtllm_fp4_block_scale_moe
- Batch size: 1 (and other small batch sizes)
- GPU: SM100 (B200)