[GPU]qwen3 moe support #30448
base: master
Conversation
… when expert_no == 0 for gpu
* Build Subgraph in parallel to improve compile_model performance
* SharedOpOptimization optimizes attribute visit
---------
Co-authored-by: Tingqian Li <[email protected]>
7126f13 to e7e909a
src/plugins/intel_gpu/tests/functional/subgraph_tests/dynamic/moe.cpp
@CuriousPanCake could you take a look?
auto prev_moe = pattern_map.at(final_hidden_states).get_node_shared_ptr();
auto moe = ov::as_type_ptr<op::internal::MOE>(prev_moe);
OPENVINO_ASSERT(config == moe->get_config(), "each expert config must be same");
moe->add_consts(static_cast<size_t>(expert_no), consts);
How long does this take (adding the consts and copying them)? If it is time-consuming, there is room to concatenate those weights and then copy them in parallel on the GPU.
We did try copying in parallel on the GPU, but ran into a race condition issue. Maybe we can leave it as a TODO?
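For reference, a minimal sketch of the concatenation suggested above, assuming each expert's weight is an `ov::op::v0::Constant` (or any node output) of identical shape; the helper name and the choice of axis 0 are hypothetical, not what the PR does:

```cpp
#include <openvino/core/node.hpp>
#include <openvino/op/concat.hpp>

#include <memory>

// Merge all per-expert weights into one Concat node so the device copy can be
// issued once for the whole block instead of once per expert.
std::shared_ptr<ov::Node> concat_expert_weights(const ov::OutputVector& expert_weights) {
    // Assumes every input has the same shape, e.g. [1, K, N] per expert,
    // and stacks them along axis 0 into [num_experts, K, N].
    return std::make_shared<ov::op::v0::Concat>(expert_weights, 0);
}
```

A single concatenated constant would let the plugin perform one host-to-device transfer, which is the effect the parallel-copy experiment was aiming for without the synchronization hazard.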
src/plugins/intel_gpu/include/intel_gpu/plugin/program_builder.hpp
src/common/transformations/src/transformations/common_optimizations/fuse_moe.cpp
: primitive_base(id, inputs, 1, {optional_data_type()}),
  _config(config),
  _mlp_params(param),
  _mlp_weights_mem(wei_mem) {
Can I ask why we need this special memory descriptor?
What if we did the following:
- primitive: let the merged weights be a regular input node (data)
- at transform: weight1, weight2, weight3 => concat
- then, at the post-weight-optimization phase in the GPU plugin transforms, let them be fused into one data node by the GPU
- then the GPU MoE primitive will have a single data input
- then it will be saved and loaded through the normal path
Could you please let me know why the above does not work?
The weight data needs to be repacked for all experts of the MoE, which is not easy to do in the transformation stage, so we do it in CreateMOEOp (src/plugins/intel_gpu/src/plugin/ops/moe.cpp) and then pass the result as mlp_weights_mem to the primitive.
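A rough illustration of the host-side repacking described above, assuming the per-expert weights have already been reordered and only need to be laid out back to back in one buffer; the function name, element type, and layout are hypothetical, not the PR's actual scheme:

```cpp
#include <openvino/runtime/tensor.hpp>

#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical repack: place every expert's weights contiguously in a single
// host tensor so one device allocation/copy can back the MOE primitive.
ov::Tensor repack_expert_weights(const std::vector<const uint8_t*>& expert_weights,
                                 size_t bytes_per_expert) {
    ov::Tensor packed(ov::element::u8,
                      ov::Shape{expert_weights.size() * bytes_per_expert});
    auto* dst = packed.data<uint8_t>();
    for (size_t e = 0; e < expert_weights.size(); ++e) {
        // Copy expert e's (already reordered) weights into its slot.
        std::memcpy(dst + e * bytes_per_expert, expert_weights[e], bytes_per_expert);
    }
    return packed;
}
```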
LGTM on the op implementation.
Support Qwen3 MoE model running with GPU plugin
Details:
MoE fusion result
Original MoE exec graph (contains 128 experts):

With this PR, it becomes a single moe_expert op:

TODO:
Tickets: