
[TEST][GPU]qwen3 moe support #30448


Open
wants to merge 73 commits into base: master

Conversation

riverlijunjie
Contributor

@riverlijunjie commented May 7, 2025

Support the Qwen3 MoE model running with the GPU plugin.

Details:

  • Fuse the MoE subgraph into a single moe_expert op to decrease the total number of ops and improve compile_model and inference performance.
  • moe_expert primitive execution stages:
    • The first token adopts the oneDNN GEMM kernel pipeline plus optimized OpenCL kernels (gather, scatter) to execute the MoE; each expert is executed serially.
    • The second token adopts optimized OpenCL kernels (mlp_gate_up, mlp_down, softmax_topk, reduce) to execute multiple experts in parallel.
  • The MoE weights of each layer are allocated in a single USM memory, and sub-memories are created from it for each expert's weight/scale/zp, which helps the second token's expert kernels execute in parallel (see the sketch after this list).
  • Optimize the key_cache and value_cache inputs.
  • Only MoE with u4 weights, f16 scales, u4 zero-points and group_size=128 is supported, as required by the Qwen3 MoE 30B model.
  • Only systolic GPUs (A770/B580/ARL/LNL) are supported; MTL is not, because the first token needs to call the oneDNN GEMM kernel.
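A rough host-side sketch of the per-layer weight packing idea mentioned above (plain C++, not the actual cldnn USM API; `LayerWeights`, `ExpertView` and `pack_layer` are illustrative names): all experts' weight/scale/zp blocks live in one contiguous allocation and each expert gets a view at a fixed offset, which is what lets the second-token kernels address every expert from a single base pointer.

```cpp
#include <cstddef>
#include <vector>

struct ExpertView {
    std::byte* weight;  // u4-packed weight block
    std::byte* scale;   // f16 scales (group_size = 128)
    std::byte* zp;      // u4-packed zero-points
};

struct LayerWeights {
    std::vector<std::byte> pool;      // stands in for the single USM allocation per layer
    std::vector<ExpertView> experts;  // per-expert sub-views into the pool
};

LayerWeights pack_layer(std::size_t expert_num,
                        std::size_t weight_bytes,
                        std::size_t scale_bytes,
                        std::size_t zp_bytes) {
    LayerWeights lw;
    const std::size_t stride = weight_bytes + scale_bytes + zp_bytes;
    lw.pool.resize(expert_num * stride);
    lw.experts.reserve(expert_num);
    for (std::size_t e = 0; e < expert_num; ++e) {
        std::byte* base = lw.pool.data() + e * stride;
        // All experts live in one contiguous buffer, so a kernel that handles
        // several experts in parallel can address each one by a fixed offset.
        lw.experts.push_back({base, base + weight_bytes, base + weight_bytes + scale_bytes});
    }
    return lw;
}
```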

MoE fusion result

Original MoE exec graph (contains 128 experts): [screenshot]

With this PR, it becomes a single moe_expert op: [screenshot]

TODO:

  • Support more MoE patterns; currently only the Qwen3 MoE pattern is verified and supported.
  • Integrate the optimized CM kernel for second-token MoE.
  • Align the CM kernel to use the same scale/zp layout as the OpenCL kernel.
  • Support more MoE data types: u8 weights.
  • Support other subgroup sizes: 32, 64, 256...

Tickets:

luo-cheng2021 and others added 30 commits April 8, 2025 09:29
* Build Subgraph in parallel to improve compile_model performance

* SharedOpOptimization optimizes attribute visit

---------

Co-authored-by: Tingqian Li <[email protected]>
@riverlijunjie marked this pull request as ready for review May 12, 2025 13:20
@riverlijunjie requested review from a team as code owners May 12, 2025 13:20
@riverlijunjie requested review from itikhono and removed request for a team May 12, 2025 13:20
@itikhono
Contributor

@praasz @mitruska do we plan to introduce some "internal" specification for ov internal ops? Maybe not as formal as for the official opset.
PagedAttention is covered by some presentation; for RoPE we only have some comments in the code (as far as I know).

ov::pass::FuseMoeExpert::FuseMoeExpert() {
MATCHER_SCOPE(FuseMoeExpert);

auto expert_mask = makePattern(ov::Rank(3));
Contributor

@itikhono May 12, 2025

@CuriousPanCake could you provide a short transition guide from makePattern to our new symbolic infrastructure? does it cover all presented cases?
should we refine it now or handle it later?

@mitruska
Contributor

@praasz @mitruska do we plan to introduce some "internal" specification for ov internal ops? Maybe not as formal as for the official opset. PagedAttention is covered by some presentation; for RoPE we only have some comments in the code (as far as I know).

@itikhono A directory for internal op specifications already exists here: https://github.com/openvinotoolkit/openvino/tree/master/docs/articles_en/documentation/openvino-ir-format/operation-sets/operation-specs/internal
Feel free to contribute.

@peterchen-intel requested review from usstq and sshlyapn May 14, 2025 08:22
std::unordered_map<std::pair<int, int>, onednn_kernel, PairHash> _kernels;

onednn_kernel& get_kernel(int n_token, int expert_no, typed_primitive_inst<moe_expert>& instance) {
auto key = std::make_pair(n_token, expert_no);
Contributor

@sshlyapn May 19, 2025

What is the difference between the kernels for expert_no=N and expert_no=N+1? Can they be reused?

return std::hash<T1>()(p.first) ^ std::hash<T2>()(p.second);
}
};
std::unordered_map<std::pair<int, int>, onednn_kernel, PairHash> _kernels;
Contributor

This kernel cache doesn't have any size limit and could affect memory consumption after long execution with varied cases. I'd suggest reusing the existing LRU cache implementation (src/plugins/intel_gpu/include/intel_gpu/runtime/lru_cache.hpp) instead.

Contributor Author

updated
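For reference, a minimal sketch of the kind of size-bounded LRU cache the suggestion above refers to (illustrative only; the real helper in src/plugins/intel_gpu/include/intel_gpu/runtime/lru_cache.hpp may have a different interface, and `onednn_kernel`/`PairHash` come from the snippet under review):

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// Size-bounded LRU keyed by (n_token, expert_no); evicts the least recently
// used kernel once the capacity is exceeded.
template <typename Key, typename Value, typename Hash>
class SimpleLruCache {
public:
    explicit SimpleLruCache(std::size_t capacity) : m_capacity(capacity) {}

    // Returns nullptr on a miss; on a hit, marks the entry as most recently used.
    Value* get(const Key& key) {
        auto it = m_index.find(key);
        if (it == m_index.end())
            return nullptr;
        m_order.splice(m_order.begin(), m_order, it->second);
        return &it->second->second;
    }

    Value& put(const Key& key, Value value) {
        auto it = m_index.find(key);
        if (it != m_index.end()) {  // replace an existing entry
            m_order.erase(it->second);
            m_index.erase(it);
        }
        m_order.emplace_front(key, std::move(value));
        m_index[key] = m_order.begin();
        if (m_index.size() > m_capacity) {  // evict the least recently used entry
            m_index.erase(m_order.back().first);
            m_order.pop_back();
        }
        return m_order.front().second;
    }

private:
    std::size_t m_capacity;
    std::list<std::pair<Key, Value>> m_order;
    std::unordered_map<Key, typename std::list<std::pair<Key, Value>>::iterator, Hash> m_index;
};

// e.g. SimpleLruCache<std::pair<int, int>, onednn_kernel, PairHash> kernels(/*capacity=*/64);
```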

auto size = get_weights_size(op);
auto layout = cldnn::layout({1, 1, 1, static_cast<ov::Dimension::value_type>(size)}, ov::element::i8, cldnn::format::bfyx);
auto alloc_type = p.get_engine().get_preferred_memory_allocation_type(false);
auto mem = p.get_engine().allocate_memory(layout, alloc_type, false);
Contributor

As I understand, these weights will be used directly in the kernels as is, so it would be better to use usm_device instead of usm_host allocation.

Contributor Author

I remember get_preferred_memory_allocation_type returns usm_device; I will double-check.

Contributor

@riverlijunjie , you are right, sorry for the confusion! I mistakenly thought it was the get_lockable_preferred_memory_allocation_type call

Comment on lines 766 to 769
for (int i = 0; i < expert_num; i++) {
internal_buffers.emplace_back(index_layout, true); // batch
internal_buffers.emplace_back(index_layout, true); // topk
}
Contributor

I think it would be better to use the kernel's scalar arguments for this data instead of internal_buffers, as scalars should perform better than usm_host allocations.

Contributor Author

internal_buffers is a vector of cldnn::BufferDescriptor; it is not sent to the kernel.

Contributor

For sure; my point is that it allocates single-element buffers for the batch and top_k values. I think we shouldn't use buffers for passing these values at all - we can just pass scalar values to the kernel instead.
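To illustrate the point with the raw OpenCL host API (the cldnn argument-setting path differs, and the argument indices here are made-up placeholders): scalar values can be bound directly with clSetKernelArg, with no single-element buffer allocation at all.

```cpp
#include <CL/cl.h>

// Bind the per-expert batch count and top_k as plain scalar kernel arguments
// instead of single-element buffers. Argument indices 3 and 4 are hypothetical.
void set_expert_scalars(cl_kernel kernel, cl_int n_batch, cl_int top_k) {
    clSetKernelArg(kernel, 3, sizeof(cl_int), &n_batch);
    clSetKernelArg(kernel, 4, sizeof(cl_int), &top_k);
}
```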

auto expert_no = tok_p[t];
OPENVINO_ASSERT(expert_no < max_expert_num);
expert_mask.batch[expert_no].push_back(b);
expert_mask.topk[expert_no].push_back(t + b * max_topk);
Contributor

Can we have a single larger preallocated buffer for this, instead of push_back'ing data, to reduce runtime memory allocations?

Contributor Author

It is used to convert from map[token, top_expert] to map[expert, top_token]; it would become complex if we used a flat data/offset layout to access it.
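For readers following along, a simplified sketch of that inversion, condensed from the snippet above (the PR's expert_mask_cpu also carries a per-expert pred_flag and uses the plugin's own containers; `build_expert_mask` is an illustrative name):

```cpp
#include <cstdint>
#include <vector>

struct ExpertMask {
    std::vector<std::vector<int>> batch;  // per expert: which token (batch row) selected it
    std::vector<std::vector<int>> topk;   // per expert: flat index into the [tokens, max_topk] table
};

// topk_ids is the row-major [n_tokens, max_topk] expert-id output of softmax_topk.
ExpertMask build_expert_mask(const std::vector<int32_t>& topk_ids,
                             int n_tokens, int max_topk, int max_expert_num) {
    ExpertMask mask;
    mask.batch.resize(max_expert_num);
    mask.topk.resize(max_expert_num);
    for (int b = 0; b < n_tokens; ++b) {
        for (int t = 0; t < max_topk; ++t) {
            const int expert_no = topk_ids[b * max_topk + t];
            mask.batch[expert_no].push_back(b);
            mask.topk[expert_no].push_back(t + b * max_topk);
        }
    }
    return mask;
}
```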


int max_expert_num = static_cast<int>(config.expert_num), max_topk = static_cast<int>(config.topk), max_tokens = static_cast<int>(shape[0]);

expert_mask.pred_flag.resize(max_expert_num, 0);
Contributor

I think we can store these masks in the implementation to avoid reallocating them on every invocation

Contributor Author

That would still require resetting the memory on every invocation to avoid data corruption.

auto final_hidden_states_layout = instance.get_output_layout(0);

// onednn path will accumulate to the output
final_hidden_states_mem_ptr->fill(stream, false);
Contributor

In case of an out-of-order queue, if the output memory is reused between MoE and some previous operations, this call could start filling the memory while a previous operation is still executing and affect its accuracy, so we should add a barrier here for OOO queue synchronization.

Contributor Author

So we need to consider both OOO and in-order? Currently it goes in-order on systolic platforms, so filling here should be safe. If we need to support OOO as well, synchronization must be added here, but it will bring some performance cost.


expert_mask_cpu expert_mask;
// Wait for topk is ready
stream.finish();
Contributor

It would be better to use the softmax_topk output event for this synchronization, to reduce the number of finish() calls.

Contributor Author

updated

dnnl::memory::data_type m_a_type;
int bin_post_id;

static onednn_linear create(dnnl::engine eng,
Contributor

I think we should reuse existing oneDNN FC/MatMul implementations instead of using oneDNN interfaces here explicitly

Contributor Author

Do you mean using fully_connected_onednn? If so, we would need to extend fully_connected_onednn to meet moe_expert_impl's requirements, such as post-ops.
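For context on what "using oneDNN interfaces explicitly" refers to in this thread, a minimal stand-alone matmul-primitive sketch (types simplified to f16; the actual moe_expert path builds its primitives with compressed u4 weights, scales/zero-points and post-ops, so this is only the skeleton of the approach, and `make_expert_matmul` is an illustrative name):

```cpp
#include <oneapi/dnnl/dnnl.hpp>

// Build a plain f16 matmul primitive for one expert's projection: [M, K] x [K, N].
dnnl::matmul make_expert_matmul(const dnnl::engine& eng,
                                dnnl::memory::dim M,
                                dnnl::memory::dim K,
                                dnnl::memory::dim N) {
    using dt = dnnl::memory::data_type;
    using tag = dnnl::memory::format_tag;
    dnnl::memory::desc src_md({M, K}, dt::f16, tag::ab);
    dnnl::memory::desc wei_md({K, N}, dt::f16, tag::ab);
    dnnl::memory::desc dst_md({M, N}, dt::f16, tag::ab);
    dnnl::matmul::primitive_desc pd(eng, src_md, wei_md, dst_md);
    return dnnl::matmul(pd);
}

// Execution with a dnnl::stream strm and dnnl::memory objects created from the descs:
//   make_expert_matmul(eng, M, K, N).execute(
//       strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, wei}, {DNNL_ARG_DST, dst}});
```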


private:
Config m_config{};
std::vector<ConstsPerExpert> m_consts;
Contributor

Are there any reasons why these constants are not exposed as nodes in the graph, but are instead encapsulated within the operation in this way?

Contributor

The major reason is that there would be too many nodes in the graph: each expert has 3 FCs (up/gate/down), each FC has 3 consts (weight/scale/zp), each layer has 128 experts, and the model has 48 layers, so the total comes to 48 × 128 × 3 × 3 = 55,296 const nodes (about 55,000). Moving them inside the op is beneficial both for common transformations in the core part and for the graph-building stage in the plugin.

@github-actions bot removed the "category: build (OpenVINO cmake script / infra)" label May 19, 2025
Comment on lines 56 to 58
const std::vector<ConstsPerExpert>& get_consts() const {
return m_attrs.consts;
}
Contributor

This can be removed; just get the attributes and access the const member directly.

Contributor

The const nodes and config are frequently used in separate stages, and having separate helper functions makes the semantics clearer.

}

void add_consts(int expert_no, const ConstsPerExpert& consts) {
OPENVINO_ASSERT(expert_no == static_cast<int>(m_attrs.consts.size()));
Contributor

Some explanation of why adding a constant is not possible here would be useful.

Contributor

@luo-cheng2021 May 21, 2025

Do you mean adding constants one by one? The op may have 128 * 3 * 3 constant nodes, and that would make it very inefficient.

Labels
category: GPU (OpenVINO GPU plugin), category: IE Tests (OpenVINO Test: plugins and common), category: transformations (OpenVINO Runtime library - Transformations), do not merge, do_not_review
10 participants