[TEST][GPU]qwen3 moe support #30448
base: master
Conversation
… when expert_no == 0 for gpu
* Build Subgraph in parallel to improve compile_model performance
* SharedOpOptimization optimizes attribute visit
---------
Co-authored-by: Tingqian Li <[email protected]>
src/common/transformations/src/transformations/common_optimizations/fuse_moe_expert.cpp
ov::pass::FuseMoeExpert::FuseMoeExpert() {
    MATCHER_SCOPE(FuseMoeExpert);

    auto expert_mask = makePattern(ov::Rank(3));
@CuriousPanCake could you provide a short transition guide from makePattern to our new symbolic infrastructure? Does it cover all presented cases?
Should we refine it now or handle it later?
@itikhono The directory for internal op specifications exists here: https://github.com/openvinotoolkit/openvino/tree/master/docs/articles_en/documentation/openvino-ir-format/operation-sets/operation-specs/internal
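For context, here is a minimal sketch of what a matcher looks like when written against the public ov::pass::pattern API (wrap_type/any_input) instead of the makePattern helper. The MatMul→Multiply pattern and the FuseExampleSketch name are purely illustrative, not the actual FuseMoeExpert pattern; whether the newer symbolic infrastructure covers all cases in this PR is exactly the open question above.

```cpp
#include "openvino/op/matmul.hpp"
#include "openvino/op/multiply.hpp"
#include "openvino/pass/graph_rewrite.hpp"
#include "openvino/pass/pattern/op/wrap_type.hpp"

// Illustrative only: matches MatMul -> Multiply, not the real MoE expert subgraph.
class FuseExampleSketch : public ov::pass::MatcherPass {
public:
    FuseExampleSketch() {
        using namespace ov::pass::pattern;

        auto weights = any_input();  // any producer is accepted here
        auto matmul = wrap_type<ov::op::v0::MatMul>({any_input(), weights});
        auto mul = wrap_type<ov::op::v1::Multiply>({matmul, any_input()});

        auto callback = [=](Matcher& m) {
            const auto& pattern_map = m.get_pattern_value_map();
            auto matmul_node = pattern_map.at(matmul).get_node_shared_ptr();
            // ... build the fused op and replace the matched subgraph here ...
            (void)matmul_node;
            return false;  // return true once a replacement is actually made
        };

        register_matcher(std::make_shared<Matcher>(mul, "FuseExampleSketch"), callback);
    }
};
```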
src/plugins/intel_gpu/src/graph/impls/ocl_v2/moe_expert_down_cm.cm
std::unordered_map<std::pair<int, int>, onednn_kernel, PairHash> _kernels;

onednn_kernel& get_kernel(int n_token, int expert_no, typed_primitive_inst<moe_expert>& instance) {
    auto key = std::make_pair(n_token, expert_no);
What is the difference between the kernels for expert_no=N and expert_no=N+1? Can they be reused?
        return std::hash<T1>()(p.first) ^ std::hash<T2>()(p.second);
    }
};
std::unordered_map<std::pair<int, int>, onednn_kernel, PairHash> _kernels;
This kernel cache doesn't have any size limitation and could affect memory consumption after long execution with varied inputs. I'd suggest reusing the existing LRU cache implementation (src/plugins/intel_gpu/include/intel_gpu/runtime/lru_cache.hpp) instead.
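To illustrate the suggestion, here is a minimal standalone sketch of a size-bounded LRU cache keyed by (n_token, expert_no). The actual change is expected to reuse intel_gpu's LruCache from lru_cache.hpp, whose exact interface is not reproduced here; onednn_kernel is a placeholder type and KernelLruCache is an illustrative name.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

struct onednn_kernel { /* placeholder for the compiled kernel handle */ };

struct PairHash {
    size_t operator()(const std::pair<int, int>& p) const {
        return std::hash<int>()(p.first) * 0x9e3779b97f4a7c15ull ^ std::hash<int>()(p.second);
    }
};

class KernelLruCache {
public:
    explicit KernelLruCache(size_t capacity) : m_capacity(capacity) {}

    // Returns the cached kernel or builds one via `build`, evicting the
    // least-recently-used entry once the capacity is exceeded.
    template <typename Builder>
    onednn_kernel& get(std::pair<int, int> key, Builder build) {
        auto it = m_index.find(key);
        if (it != m_index.end()) {
            m_order.splice(m_order.begin(), m_order, it->second);  // mark as most recently used
            return it->second->second;
        }
        m_order.emplace_front(key, build());
        m_index[key] = m_order.begin();
        if (m_index.size() > m_capacity) {
            m_index.erase(m_order.back().first);  // drop the least-recently-used kernel
            m_order.pop_back();
        }
        return m_order.front().second;
    }

private:
    size_t m_capacity;
    std::list<std::pair<std::pair<int, int>, onednn_kernel>> m_order;  // MRU entry at the front
    std::unordered_map<std::pair<int, int>, decltype(m_order)::iterator, PairHash> m_index;
};
```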
updated
auto size = get_weights_size(op);
auto layout = cldnn::layout({1, 1, 1, static_cast<ov::Dimension::value_type>(size)}, ov::element::i8, cldnn::format::bfyx);
auto alloc_type = p.get_engine().get_preferred_memory_allocation_type(false);
auto mem = p.get_engine().allocate_memory(layout, alloc_type, false);
As I understand, these weights will be used directly in the kernels as is, so it would be better to use usm_device instead of usm_host allocation.
I recall that get_preferred_memory_allocation_type will return usm_device; I will double-check.
@riverlijunjie, you are right, sorry for the confusion! I mistakenly thought it was the get_lockable_preferred_memory_allocation_type call.
for (int i = 0; i < expert_num; i++) {
    internal_buffers.emplace_back(index_layout, true);  // batch
    internal_buffers.emplace_back(index_layout, true);  // topk
}
I think it would be better to use the kernel's scalar arguments for this data instead of internal_buffers, as scalars should have better performance than usm_host allocations.
internal_buffers is a vector of cldnn::BufferDescriptor; it will not be sent to the kernel.
For sure; my point is that it allocates single-element buffers for the batch and top_k values. I think we shouldn't use buffers for passing these values at all - we can just pass scalar values to the kernel instead.
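A minimal sketch of that idea, expressed against the raw OpenCL C API rather than the cldnn kernel-argument plumbing; the helper name set_moe_scalar_args and the argument indices are illustrative only, and the kernel is assumed to take trailing (int batch, int topk) parameters.

```cpp
#include <CL/cl.h>

// Sketch: pass per-dispatch batch/topk values as scalar kernel arguments
// instead of allocating single-element buffers for them.
cl_int set_moe_scalar_args(cl_kernel kernel, cl_uint first_arg_index, int batch, int topk) {
    cl_int err = clSetKernelArg(kernel, first_arg_index + 0, sizeof(int), &batch);
    if (err != CL_SUCCESS)
        return err;
    return clSetKernelArg(kernel, first_arg_index + 1, sizeof(int), &topk);
}
```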
auto expert_no = tok_p[t];
OPENVINO_ASSERT(expert_no < max_expert_num);
expert_mask.batch[expert_no].push_back(b);
expert_mask.topk[expert_no].push_back(t + b * max_topk);
Can we have a single larger preallocated buffer for this, instead of pushing back data, to reduce runtime memory allocations?
It is used to convert from map[token, top_expert] to map[expert, top_token]; it would become complex if we used data/offset to access it.
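A sketch of the inversion described above, with illustrative names (ExpertMask, build_expert_mask, topk_ids) rather than the actual plugin types; the closing comment notes what a preallocated, CSR-style layout would additionally require.

```cpp
#include <cstdint>
#include <vector>

// The topk output is laid out as map[token][k] -> expert id; the kernels need
// map[expert] -> list of (token, topk position).
struct ExpertMask {
    std::vector<std::vector<int>> batch;  // per expert: source token (batch) index
    std::vector<std::vector<int>> topk;   // per expert: flat position in the topk tensor
};

ExpertMask build_expert_mask(const std::vector<int32_t>& topk_ids, int n_tokens, int max_topk, int max_expert_num) {
    ExpertMask mask;
    mask.batch.resize(max_expert_num);
    mask.topk.resize(max_expert_num);
    for (int b = 0; b < n_tokens; ++b) {
        for (int t = 0; t < max_topk; ++t) {
            int expert_no = topk_ids[b * max_topk + t];
            mask.batch[expert_no].push_back(b);
            mask.topk[expert_no].push_back(t + b * max_topk);
        }
    }
    // A single preallocated buffer (CSR-style offsets + indices) would avoid the
    // per-expert push_back growth, but needs an extra counting pass over topk_ids
    // to compute the offsets first - the added complexity mentioned above.
    return mask;
}
```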
int max_expert_num = static_cast<int>(config.expert_num), max_topk = static_cast<int>(config.topk), max_tokens = static_cast<int>(shape[0]);

expert_mask.pred_flag.resize(max_expert_num, 0);
I think we can store these masks in the implementation to avoid reallocating them on every invocation
That would require resetting the memory on every call to avoid data corruption.
auto final_hidden_states_layout = instance.get_output_layout(0);

// onednn path will accumulate to the output
final_hidden_states_mem_ptr->fill(stream, false);
In case of an out-of-order queue, if the output memory is reused between MoE and some previous operations, this call could start filling the memory while a previous operation is still executing and affect its accuracy, so we should add a barrier here for OOO queue synchronization.
So we need to consider both OOO and in-order? Currently it should go in-order on systolic platforms, so doing the fill here is safe. If we need to support OOO as well, synchronization has to be added here, but it will bring some performance cost.
expert_mask_cpu expert_mask;
// Wait until topk is ready
stream.finish();
It would be better to use the softmax_topk output event for this synchronization, to reduce the number of finish() calls.
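For illustration, the difference expressed with raw OpenCL calls (the cldnn stream/event wrappers differ, but the idea maps directly): block only on the completion event of the softmax_topk kernel instead of draining the whole queue.

```cpp
#include <CL/cl.h>

// Sketch: event-based vs. queue-wide synchronization before reading topk results.
void wait_for_topk(cl_command_queue queue, cl_event softmax_topk_done, bool use_finish) {
    if (use_finish) {
        clFinish(queue);                          // waits for *all* enqueued work
    } else {
        clWaitForEvents(1, &softmax_topk_done);   // waits only for the topk kernel
    }
}
```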
updated
dnnl::memory::data_type m_a_type;
int bin_post_id;

static onednn_linear create(dnnl::engine eng,
I think we should reuse the existing oneDNN FC/MatMul implementations instead of calling oneDNN interfaces here directly.
Does this mean using fully_connected_onednn? If so, we would need to extend fully_connected_onednn to meet the moe_expert_impl requirements, such as post-ops.
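To make the post-op requirement concrete, here is a rough sketch (oneDNN 3.x API, CPU engine just to stay self-contained) of a MatMul primitive with a sum post-op, i.e. the accumulate-into-output behaviour that the "onednn path will accumulate to the output" comment above relies on. This is not the actual moe_expert_impl code; shapes and names are illustrative.

```cpp
#include <oneapi/dnnl/dnnl.hpp>
#include <vector>

using namespace dnnl;

int main() {
    // dst += src * weights via a sum post-op attached to the MatMul primitive.
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    const memory::dim M = 4, K = 8, N = 16;
    memory::desc src_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    post_ops po;
    po.append_sum(1.0f);            // accumulate into the existing dst contents
    primitive_attr attr;
    attr.set_post_ops(po);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    matmul prim(pd);

    std::vector<float> src(M * K, 1.0f), wei(K * N, 1.0f), dst(M * N, 0.0f);
    memory src_m(src_md, eng, src.data());
    memory wei_m(wei_md, eng, wei.data());
    memory dst_m(dst_md, eng, dst.data());

    prim.execute(strm, {{DNNL_ARG_SRC, src_m}, {DNNL_ARG_WEIGHTS, wei_m}, {DNNL_ARG_DST, dst_m}});
    strm.wait();
    return 0;
}
```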
private:
    Config m_config{};
    std::vector<ConstsPerExpert> m_consts;
Are there any reasons why these constants are not exposed as nodes in the graph, but instead encapsulated within this operation in that way?
The major reason is that there would otherwise be many nodes in the graph: each expert has 3 FCs (up/gate/down), each FC has 3 consts (weight/scale/zp), each layer has 128 experts, and the model has 48 layers. The total number of these const nodes is about 55,000 (48 × 128 × 3 × 3 = 55,296). Moving them inside the op is beneficial both for common transformations in the core part and for the build stage in the plugin.
const std::vector<ConstsPerExpert>& get_consts() const {
    return m_attrs.consts;
}
It can be removed; callers can just get the attributes and access the consts member directly.
The const nodes and the config are frequently used at separate stages, and having separate helper functions makes the semantics clearer.
}

void add_consts(int expert_no, const ConstsPerExpert& consts) {
    OPENVINO_ASSERT(expert_no == static_cast<int>(m_attrs.consts.size()));
Some explanation of why adding constants is not possible would be useful.
Do you mean adding constants one by one? The op may have 128 * 3 * 3 constant nodes, and this would make it very inefficient.
Support Qwen3 MoE model running with GPU plugin
Details:
MoE fusion result
Original MoE (contains 128 experts) exec graph:

With this PR, it becomes a single moe_expert op:

TODO:
Tickets: