Conversation

winglian commented Dec 5, 2025

This PR makes it possible to use a snippet like the one below as a drop-in replacement for all the Qwen2-based MoEs in transformers v5. The original ScatterMoEGatedMLP isn't directly usable there, since it expects a router attribute instead of gate, and input_linear/output_linear instead of the gate_up_proj/down_proj used in the v5 modeling code.

# Imports assume the Hugging Face `kernels` package and transformers v5.
from kernels import LayerRepository, Mode, register_kernel_mapping, replace_kernel_forward_from_hub
from transformers.models.qwen2_moe.modeling_qwen2_moe import Qwen2MoeSparseMoeBlock

# Point the "HFScatterMoEParallelExperts" kernel name at the Hub layer for both training and inference.
register_kernel_mapping({
    "HFScatterMoEParallelExperts": {
        "cuda": {
            Mode.TRAINING: LayerRepository(
                repo_id="axolotl-ai-co/scattermoe",
                layer_name="HFScatterMoEGatedMLP",
            ),
            Mode.INFERENCE: LayerRepository(
                repo_id="axolotl-ai-co/scattermoe",
                layer_name="HFScatterMoEGatedMLP",
            ),
        },
    }
})

# Mark transformers' Qwen2 MoE block so its forward is replaced by the kernel above when kernels are enabled.
replace_kernel_forward_from_hub(Qwen2MoeSparseMoeBlock, "HFScatterMoEParallelExperts")
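
As a rough end-to-end sketch (the checkpoint name here is illustrative, not something tested in this PR), the mapping is then picked up by loading the model with use_kernels=True:

from transformers import AutoModelForCausalLM

# Any qwen2_moe-based checkpoint should behave the same; this name is just an example.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B",
    use_kernels=True,  # forwards of Qwen2MoeSparseMoeBlock now go through HFScatterMoEGatedMLP
)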

winglian requested a review from MekkCyber as a code owner on December 5, 2025, 14:45
MekkCyber (Collaborator) commented

cc @shawntan

danieldk (Member) commented Dec 5, 2025

Hi, @shawntan, any chance you could review this PR?

winglian (Author) commented Dec 5, 2025

Here's a quick SFT test of olmoe.

[screenshot: results of the OLMoE SFT test run]

shawntan (Contributor) commented Dec 8, 2025

Yeah, this looks good and is a good idea, but have you tested it end-to-end with use_kernels=True?

I have an open issue with another community kernel: #76

winglian (Author) commented

Yes, this was tested with use_kernels=True.

shawntan (Contributor) commented Dec 12, 2025

Okay. I'm trying to test GraniteMoEHybrid with use_kernels=True to make sure these kernels work, but the SiLU kernel does not appear to work with non-contiguous tensors, and both GraniteMoEHybrid and Qwen2MoE seem to be affected. See this comment: #76 (comment)

The decision seems to be to assert hidden_states.is_contiguous(), which, as far as I understand, breaks both models.

UPDATE: confirmed that it breaks Qwen2MoE as well.
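
For illustration, a minimal sketch of the kind of input that trips the assert (the shapes and slicing pattern here are illustrative, not taken from the actual kernel):

import torch

# Slicing the last dimension of a fused projection output yields a non-contiguous view.
hidden_states = torch.randn(4, 16, 2 * 64)   # e.g. fused gate/up projection output
gate = hidden_states[..., :64]                # a view, not a copy
print(gate.is_contiguous())                   # False -> a kernel asserting contiguity rejects it

# An explicit copy restores contiguity, at the cost of extra memory traffic.
gate = gate.contiguous()
print(gate.is_contiguous())                   # True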

@winglian I'm not entirely sure how the error didn't show up when testing with use_kernels=True, but I'm interested to know how you got around it.

@MekkCyber FYI.
