
[TEST][GPU]qwen3 moe support #30448


Open
wants to merge 73 commits into base: master

Conversation

riverlijunjie
Contributor

@riverlijunjie commented May 7, 2025

Support the Qwen3 MoE model running with the GPU plugin.

Details:

  • Fuse the MoE subgraph into a single moe_expert op to decrease the total number of ops and improve compile_model and inference performance.
  • moe_expert primitive execution stages:
    • The first token adopts the oneDNN GEMM kernel pipeline plus optimized OpenCL kernels (gather, scatter) to execute the MoE; each expert is executed serially.
    • The second token adopts optimized OpenCL kernels (mlp_gate_up, mlp_down, softmax_topk, reduce) to execute multiple experts in parallel.
  • The MoE weights of each layer are allocated in a single USM memory, and sub-memories are created from it for each expert's weight/scale/zp, which helps the second token's expert kernels execute in parallel (see the sketch after this list).
  • Optimize the key_cache and value_cache inputs.
  • Only MoE with u4 weights, f16 scales, u4 zero-points and group_size=128 is supported, as required by the Qwen3 MoE 30B model.
  • Only systolic GPUs (A770/B580/ARL/LNL) are supported; MTL is not, because the first token needs to call the oneDNN GEMM kernel.
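A rough host-side sketch of the per-layer weight packing idea mentioned above (plain C++, not the actual cldnn USM API; `LayerWeights`, `ExpertView` and `pack_layer` are illustrative names): all experts' weight/scale/zp blocks live in one contiguous allocation and each expert gets a view at a fixed offset, which is what lets the second-token kernels address every expert from a single base pointer.

```cpp
#include <cstddef>
#include <vector>

struct ExpertView {
    std::byte* weight;  // u4-packed weight block
    std::byte* scale;   // f16 scales (group_size = 128)
    std::byte* zp;      // u4-packed zero-points
};

struct LayerWeights {
    std::vector<std::byte> pool;      // stands in for the single USM allocation per layer
    std::vector<ExpertView> experts;  // per-expert sub-views into the pool
};

LayerWeights pack_layer(std::size_t expert_num,
                        std::size_t weight_bytes,
                        std::size_t scale_bytes,
                        std::size_t zp_bytes) {
    LayerWeights lw;
    const std::size_t stride = weight_bytes + scale_bytes + zp_bytes;
    lw.pool.resize(expert_num * stride);
    lw.experts.reserve(expert_num);
    for (std::size_t e = 0; e < expert_num; ++e) {
        std::byte* base = lw.pool.data() + e * stride;
        // All experts live in one contiguous buffer, so a kernel that handles
        // several experts in parallel can address each one by a fixed offset.
        lw.experts.push_back({base, base + weight_bytes, base + weight_bytes + scale_bytes});
    }
    return lw;
}
```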

MoE fusion result

Original MoE exec graph (contains 128 experts): [screenshot]

With this PR, it becomes a single moe_expert op: [screenshot]

TODO:

  • Support more MoE patterns; currently only the Qwen3 MoE pattern is verified and supported.
  • Integrate the optimized CM kernel for second-token MoE.
  • Align the CM kernel to use the same scale/zp layout as the OpenCL kernel.
  • Support more MoE data types: u8 weights.
  • Support other subgroup sizes: 32, 64, 256...

Tickets:

luo-cheng2021 and others added 30 commits April 8, 2025 09:29
* Build Subgraph in parallel to improve compile_model performance

* SharedOpOptimization optimizes attribute visit

---------

Co-authored-by: Tingqian Li <[email protected]>
@riverlijunjie marked this pull request as ready for review May 12, 2025 13:20
@riverlijunjie requested review from a team as code owners May 12, 2025 13:20
@riverlijunjie requested review from itikhono and removed request for a team May 12, 2025 13:20
@itikhono
Contributor

@praasz @mitruska do we plan to introduce some "internal" specification for ov internal ops? Maybe not as formal as for the official opset.
PagedAttention is covered by some presentation; for RoPE we only have some comments in the code (as far as I know).

ov::pass::FuseMoeExpert::FuseMoeExpert() {
MATCHER_SCOPE(FuseMoeExpert);

auto expert_mask = makePattern(ov::Rank(3));
Contributor

@itikhono May 12, 2025

@CuriousPanCake could you provide a short transition guide from makePattern to our new symbolic infrastructure? does it cover all presented cases?
should we refine it now or handle it later?

@mitruska
Contributor

@praasz @mitruska do we plan to introduce some "internal" specification for ov internal ops? Maybe not as formal as for the official opset. PagedAttention is covered by some presentation; for RoPE we only have some comments in the code (as far as I know).

@itikhono A directory for internal op specifications already exists here: https://github.com/openvinotoolkit/openvino/tree/master/docs/articles_en/documentation/openvino-ir-format/operation-sets/operation-specs/internal
Feel free to contribute.

@peterchen-intel requested review from usstq and sshlyapn May 14, 2025 08:22
std::unordered_map<std::pair<int, int>, onednn_kernel, PairHash> _kernels;

onednn_kernel& get_kernel(int n_token, int expert_no, typed_primitive_inst<moe_expert>& instance) {
auto key = std::make_pair(n_token, expert_no);
Contributor

@sshlyapn May 19, 2025

What is the difference between the kernels for expert_no=N and expert_no=N+1? Can they be reused?

return std::hash<T1>()(p.first) ^ std::hash<T2>()(p.second);
}
};
std::unordered_map<std::pair<int, int>, onednn_kernel, PairHash> _kernels;
Contributor

This kernel cache doesn't have any size limit and could affect memory consumption after long execution with varied cases. I'd suggest reusing the existing LRU cache implementation (src/plugins/intel_gpu/include/intel_gpu/runtime/lru_cache.hpp) instead.

Contributor Author

updated
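For reference, a minimal sketch of the kind of size-bounded LRU cache the suggestion above refers to (illustrative only; the real helper in src/plugins/intel_gpu/include/intel_gpu/runtime/lru_cache.hpp may have a different interface, and `onednn_kernel`/`PairHash` come from the snippet under review):

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// Size-bounded LRU keyed by (n_token, expert_no); evicts the least recently
// used kernel once the capacity is exceeded.
template <typename Key, typename Value, typename Hash>
class SimpleLruCache {
public:
    explicit SimpleLruCache(std::size_t capacity) : m_capacity(capacity) {}

    // Returns nullptr on a miss; on a hit, marks the entry as most recently used.
    Value* get(const Key& key) {
        auto it = m_index.find(key);
        if (it == m_index.end())
            return nullptr;
        m_order.splice(m_order.begin(), m_order, it->second);
        return &it->second->second;
    }

    Value& put(const Key& key, Value value) {
        auto it = m_index.find(key);
        if (it != m_index.end()) {  // replace an existing entry
            m_order.erase(it->second);
            m_index.erase(it);
        }
        m_order.emplace_front(key, std::move(value));
        m_index[key] = m_order.begin();
        if (m_index.size() > m_capacity) {  // evict the least recently used entry
            m_index.erase(m_order.back().first);
            m_order.pop_back();
        }
        return m_order.front().second;
    }

private:
    std::size_t m_capacity;
    std::list<std::pair<Key, Value>> m_order;
    std::unordered_map<Key, typename std::list<std::pair<Key, Value>>::iterator, Hash> m_index;
};

// e.g. SimpleLruCache<std::pair<int, int>, onednn_kernel, PairHash> kernels(/*capacity=*/64);
```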

auto size = get_weights_size(op);
auto layout = cldnn::layout({1, 1, 1, static_cast<ov::Dimension::value_type>(size)}, ov::element::i8, cldnn::format::bfyx);
auto alloc_type = p.get_engine().get_preferred_memory_allocation_type(false);
auto mem = p.get_engine().allocate_memory(layout, alloc_type, false);
Contributor

As I understand, these weights will be used directly in the kernels as is, so it would be better to use usm_device instead of usm_host allocation.

Contributor Author

I remember get_preferred_memory_allocation_type returns usm_device; I will double-check.

Contributor

@riverlijunjie , you are right, sorry for the confusion! I mistakenly thought it was the get_lockable_preferred_memory_allocation_type call

Comment on lines 766 to 769
for (int i = 0; i < expert_num; i++) {
internal_buffers.emplace_back(index_layout, true); // batch
internal_buffers.emplace_back(index_layout, true); // topk
}
Contributor

I think it would be better to use the kernel's scalar arguments for this data instead of internal_buffers, as scalars should perform better than usm_host allocations.

Contributor Author

internal_buffers is a vector of cldnn::BufferDescriptor; it is not sent to the kernel.

Contributor

For sure; my point is that it allocates single-element buffers for the batch and top_k values. I think we shouldn't use buffers for passing these values at all - we can just pass scalar values to the kernel instead.
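To illustrate the point with the raw OpenCL host API (the cldnn argument-setting path differs, and the argument indices here are made-up placeholders): scalar values can be bound directly with clSetKernelArg, with no single-element buffer allocation at all.

```cpp
#include <CL/cl.h>

// Bind the per-expert batch count and top_k as plain scalar kernel arguments
// instead of single-element buffers. Argument indices 3 and 4 are hypothetical.
void set_expert_scalars(cl_kernel kernel, cl_int n_batch, cl_int top_k) {
    clSetKernelArg(kernel, 3, sizeof(cl_int), &n_batch);
    clSetKernelArg(kernel, 4, sizeof(cl_int), &top_k);
}
```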

auto expert_no = tok_p[t];
OPENVINO_ASSERT(expert_no < max_expert_num);
expert_mask.batch[expert_no].push_back(b);
expert_mask.topk[expert_no].push_back(t + b * max_topk);
Contributor

Can we have a single larger preallocated buffer for this, instead of push_back'ing data, to reduce runtime memory allocations?

Contributor Author

It is used to convert from map[token, top_expert] to map[expert, top_token]; it would become complex if we used a flat data/offset layout to access it.
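For readers following along, a simplified sketch of that inversion, condensed from the snippet above (the PR's expert_mask_cpu also carries a per-expert pred_flag and uses the plugin's own containers; `build_expert_mask` is an illustrative name):

```cpp
#include <cstdint>
#include <vector>

struct ExpertMask {
    std::vector<std::vector<int>> batch;  // per expert: which token (batch row) selected it
    std::vector<std::vector<int>> topk;   // per expert: flat index into the [tokens, max_topk] table
};

// topk_ids is the row-major [n_tokens, max_topk] expert-id output of softmax_topk.
ExpertMask build_expert_mask(const std::vector<int32_t>& topk_ids,
                             int n_tokens, int max_topk, int max_expert_num) {
    ExpertMask mask;
    mask.batch.resize(max_expert_num);
    mask.topk.resize(max_expert_num);
    for (int b = 0; b < n_tokens; ++b) {
        for (int t = 0; t < max_topk; ++t) {
            const int expert_no = topk_ids[b * max_topk + t];
            mask.batch[expert_no].push_back(b);
            mask.topk[expert_no].push_back(t + b * max_topk);
        }
    }
    return mask;
}
```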


int max_expert_num = static_cast<int>(config.expert_num), max_topk = static_cast<int>(config.topk), max_tokens = static_cast<int>(shape[0]);

expert_mask.pred_flag.resize(max_expert_num, 0);
Contributor

I think we can store these masks in the implementation to avoid reallocating them on every invocation

Contributor Author

That would still require resetting the memory on every invocation to avoid data corruption.

auto final_hidden_states_layout = instance.get_output_layout(0);

// onednn path will accumulate to the output
final_hidden_states_mem_ptr->fill(stream, false);
Contributor

In case of an out-of-order queue, if the output memory is reused between MoE and some previous operations, this call could start filling the memory while a previous operation is still executing and affect its accuracy, so we should add a barrier here for OOO queue synchronization.

Contributor Author

So we need to consider both OOO and in-order? Currently it goes in-order on systolic platforms, so filling here should be safe. If we need to support OOO as well, synchronization must be added here, but it will bring some performance cost.


expert_mask_cpu expert_mask;
// Wait for topk is ready
stream.finish();
Contributor

It would be better to use the softmax_topk output event for this synchronization, to reduce the number of finish() calls.

Contributor Author

updated

dnnl::memory::data_type m_a_type;
int bin_post_id;

static onednn_linear create(dnnl::engine eng,
Contributor

I think we should reuse existing oneDNN FC/MatMul implementations instead of using oneDNN interfaces here explicitly

Contributor Author

Do you mean using fully_connected_onednn? If so, we would need to extend fully_connected_onednn to meet moe_expert_impl's requirements, such as post-ops.
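For context on what "using oneDNN interfaces explicitly" refers to in this thread, a minimal stand-alone matmul-primitive sketch (types simplified to f16; the actual moe_expert path builds its primitives with compressed u4 weights, scales/zero-points and post-ops, so this is only the skeleton of the approach, and `make_expert_matmul` is an illustrative name):

```cpp
#include <oneapi/dnnl/dnnl.hpp>

// Build a plain f16 matmul primitive for one expert's projection: [M, K] x [K, N].
dnnl::matmul make_expert_matmul(const dnnl::engine& eng,
                                dnnl::memory::dim M,
                                dnnl::memory::dim K,
                                dnnl::memory::dim N) {
    using dt = dnnl::memory::data_type;
    using tag = dnnl::memory::format_tag;
    dnnl::memory::desc src_md({M, K}, dt::f16, tag::ab);
    dnnl::memory::desc wei_md({K, N}, dt::f16, tag::ab);
    dnnl::memory::desc dst_md({M, N}, dt::f16, tag::ab);
    dnnl::matmul::primitive_desc pd(eng, src_md, wei_md, dst_md);
    return dnnl::matmul(pd);
}

// Execution with a dnnl::stream strm and dnnl::memory objects created from the descs:
//   make_expert_matmul(eng, M, K, N).execute(
//       strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, wei}, {DNNL_ARG_DST, dst}});
```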


private:
Config m_config{};
std::vector<ConstsPerExpert> m_consts;
Contributor

Are there any reasons why these constants are not exposed as nodes in the graph, but are instead encapsulated within the operation in this way?

Contributor

The major reason is that there would be too many nodes in the graph: each expert has 3 FCs (up/gate/down), each FC has 3 consts (weight/scale/zp), each layer has 128 experts, and the model has 48 layers, so the total comes to 48 × 128 × 3 × 3 = 55,296 const nodes (about 55,000). Moving them inside the op is beneficial both for common transformations in the core part and for the graph-building stage in the plugin.

@github-actions bot removed the "category: build (OpenVINO cmake script / infra)" label May 19, 2025
Comment on lines 56 to 58
const std::vector<ConstsPerExpert>& get_consts() const {
return m_attrs.consts;
}
Contributor

This can be removed; just get the attributes and access the const member directly.

Contributor

The const nodes and config are frequently used in separate stages, and having separate helper functions makes the semantics clearer.

}

void add_consts(int expert_no, const ConstsPerExpert& consts) {
OPENVINO_ASSERT(expert_no == static_cast<int>(m_attrs.consts.size()));
Contributor

Some explanation of why adding a constant is not possible here would be useful.

Contributor

@luo-cheng2021 May 21, 2025

Do you mean adding constants one by one? The op may have 128 * 3 * 3 constant nodes, and that would make it very inefficient.

Labels
category: GPU (OpenVINO GPU plugin), category: IE Tests (OpenVINO Test: plugins and common), category: transformations (OpenVINO Runtime library - Transformations), do not merge, do_not_review
10 participants