Mirror of flashinfer-ai#2200
NVIDIA/TensorRT-LLM#6231 added a swizzled_input_sf parameter to the CUTLASS fused MoE that specifies whether the input scaling factors are already swizzled. It would be great if this could be integrated into flashinfer.
Currently in sglang, when doing FP4 allgather or FP4 all-to-all (quantizing before communication), we have to swizzle the scaling factors after the communication, so the swizzle runs as a separate step and is not fused with anything. With this change, the swizzle would be fused into the MoE.
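For readers unfamiliar with the term, here is a rough NumPy sketch of what "swizzling" the scaling factors can look like. It assumes a 128x4 tiled layout (rows grouped into tiles of 128, columns into tiles of 4, with the 128 rows interleaved as 32x4 inside each tile). The function name `swizzle_sf_128x4` is made up for illustration and is not a flashinfer or TensorRT-LLM API, and the real kernels may handle padding and dtypes differently.

```python
import numpy as np

def swizzle_sf_128x4(sf):
    """Reorder a row-major (M, K) scale-factor matrix into an assumed 128x4 tiled layout.

    Hypothetical helper for illustration only; requires M % 128 == 0 and K % 4 == 0.
    """
    m, k = sf.shape
    assert m % 128 == 0 and k % 4 == 0, "this sketch does not handle padding"
    # Split rows into (m_tile, (row % 128) // 32, row % 32) and columns into (k_tile, col % 4).
    t = sf.reshape(m // 128, 4, 32, k // 4, 4)
    # Assumed target order: m_tile, k_tile, row % 32, (row % 128) // 32, col % 4.
    t = t.transpose(0, 3, 2, 1, 4)
    return np.ascontiguousarray(t).reshape(-1)

# Example: a linear (unswizzled) scale-factor matrix as it might arrive after communication.
sf_linear = np.arange(128 * 8).reshape(128, 8)
sf_swizzled = swizzle_sf_128x4(sf_linear)
```

Today this reordering is a standalone pass after the allgather or all-to-all; the request is for the MoE to accept the linear layout directly and perform the equivalent reordering internally.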