[MIRROR] Feature Request: Support swizzled_input_sf for cutlass fused moe. #9

@yzh119

Description

Mirror of flashinfer-ai#2200

NVIDIA/TensorRT-LLM#6231 added a swizzled_input_sf parameter to the cutlass fused MoE to specify whether the input scaling factors are already swizzled. It would be great if this could be integrated into flashinfer.

Currently in sglang, when doing an FP4 allgather or FP4 alltoall (quantizing before the communication), we have to swizzle the scaling factors as a separate step after the communication, so the swizzle is not fused with anything. With this change, the swizzle would be fused into the MoE.
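For readers unfamiliar with what the separate swizzle step does, here is a rough pure-Python model of it. It is only a sketch: the 128x4 tile layout and the offset formula are my assumption about the block-scaled layout (they are not stated in this issue), and `swizzled_offset` and `swizzle` are hypothetical helper names, not flashinfer, sglang, or TensorRT-LLM APIs.

```python
# Illustrative model of reordering row-major scale factors into a tiled
# ("swizzled") order. ASSUMPTION: 128x4 tiles with the offset formula below;
# this is not taken from the issue and the real kernels may differ.

ROWS_PER_TILE = 128
COLS_PER_TILE = 4


def swizzled_offset(row, col, num_cols):
    """Flat offset of scale factor (row, col) in the assumed swizzled layout."""
    col_tiles = num_cols // COLS_PER_TILE
    tile_index = (row // ROWS_PER_TILE) * col_tiles + (col // COLS_PER_TILE)
    r = row % ROWS_PER_TILE
    # Inside a tile: 32 row groups of 16 entries; each group interleaves
    # the four 32-row sub-blocks, 4 columns at a time.
    within_tile = (r % 32) * 16 + (r // 32) * 4 + (col % COLS_PER_TILE)
    return tile_index * ROWS_PER_TILE * COLS_PER_TILE + within_tile


def swizzle(linear_sf, num_rows, num_cols):
    """Reorder a row-major list of scale factors into the assumed swizzled order."""
    if num_rows % ROWS_PER_TILE or num_cols % COLS_PER_TILE:
        raise ValueError("sketch assumes dimensions are padded to 128 x 4 tiles")
    out = [None] * (num_rows * num_cols)
    for row in range(num_rows):
        for col in range(num_cols):
            out[swizzled_offset(row, col, num_cols)] = linear_sf[row * num_cols + col]
    return out
```

In the flow described above, the scaling factors come out of the allgather or alltoall in linear (row-major) order, and something like `swizzle(gathered_sf, num_rows, num_cols)` then has to run as its own pass before the MoE. A swizzled_input_sf=False style option would let the MoE consume the linear scaling factors directly instead.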
