[NPUW] Optimize MoE (GPT-OSS-20B) TPS on NPU - DEVICE_ROUTED. #33847

dmatveev merged 3 commits into openvinotoolkit:master
Conversation
Convert gather to 2D. Gather before convert. Keep gather indices as constant. Use JustInferRequest for DEVICE_ROUTED mode. Clean up transformations for DEVICE_ROUTED. Update config for DEVICE_ROUTED: BEST_PERF + do not cut the LM head. Refactor the device-routed transformation. Refactor GatherTo2DGather. Apply MoE defaults if not explicitly set in the external config. Collect MoE nodes in a single loop.

Multiply considered.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
Rebased.
AlexanderKalistratov
left a comment

Looks fine to me, but please wait for the others' reviews.
```cpp
auto repeats_const =
    std::dynamic_pointer_cast<ov::op::v0::Constant>(tile->input_value(1).get_node_shared_ptr());
if (repeats_const) {
    auto repeats_data = repeats_const->cast_vector<int64_t>();
    if (!repeats_data.empty() && repeats_data[0] > k_value) {
```
So is it possible that `repeats_data[0] <= k_value`?
In fact, it's impossible.
In the MoE IR, `repeats_data[0]` should equal the total number of experts, while `k_value` is the number of active experts.
If `repeats_data[0] <= k_value`, the MoE IR is malformed.
```cpp
} else {
    // Constant reshape - check if dim 0 is expert dimension
    auto shape_data = shape_const->cast_vector<int64_t>();
    if (nodes.num_experts > 0 && !shape_data.empty() &&
```
`nodes.num_experts > 0` implicitly assumes that we find the Tile node first.
```cpp
    }
}
return nodes;
```
So the whole function relies on some non-obvious assumptions and on layer names.
Why did you prefer it over MatcherPass and pattern matching?
Personally, I agree that pattern matching is a more general approach.
But in practice, the MoE pattern contains a lot of nodes, and I would have to implement a lot of code to trace the node chains with the pattern-matching approach.
A slight pattern change could cause a failure.
It is also not easy to read or debug.
So in the end I'm using a hybrid method: name matching plus a little pattern checking.
```cpp
    }
}

void transform_dynamic_reshapes(LayerNodes& nodes) {
```
Why are we doing this?
Does it help us later?
It handles `Reshape_2` in the Expert graph, whose `target_shape` input comes from a Concat node instead of a Constant.

For the Constant -> Reshape case, `transform_constant_reshapes` patches the constant value for shape compatibility.
For the Concat -> Reshape case, `transform_dynamic_reshapes` handles it here by converting the Reshape to an Unsqueeze for shape compatibility.
Otherwise, we get a compilation error.
dmatveev
left a comment
Only reviewed the changes (not the existing code)
```cpp
// Helper to check if a constant is MoE Gather indices (marked by GatherTo2DGather pass)
auto is_moe_gather_const = [](const CTPtr& const_node) -> bool {
    const auto& rt_info = const_node->get_rt_info();
    return rt_info.count("npuw_moe_gather_indices") > 0;
};
```
For the record - why the existing algorithm didn't work here to save the MoE indices? What makes them special?
@dmatveev Originally, we only keep tiny shape constants in the function.
But the Gather indices constant is not so "tiny" here.
```cpp
// Apply DEVICE_ROUTED MoE transformations to models
void apply_moe_device_routed_transforms(std::vector<std::shared_ptr<ov::Model>>& model_variants) {
    LOG_INFO("Applying DEVICE_ROUTED MoE transformations...");
```
Shouldn't this be moved to some MoE transformations file or something? Why does this transformation need to happen at top LLM level?
Yes, it would be better to place this utility in the MoE transformations file.
The DEVICE_ROUTED transformation is applied to the LLM-level model, and then the partitioner performs the partitioning.
The partitioner and runtime treat a DEVICE_ROUTED MoE as a traditional LLM.
This avoids graph isolation, which benefits TPS (it avoids submission overhead).
@dmatveev
```cpp
if (npuw_llm_props.find("NPUW_LLM_GENERATE_HINT") == npuw_llm_props.end()) {
    m_cfg.update({{"NPUW_LLM_GENERATE_HINT", "BEST_PERF"}});
}
```
I think only DEVICE_ROUTED may work with BEST_PERF?
For HOST_ROUTED we still need the partitioning?
Should we force GENERATE_HINT here if and only if it is DEVICE_ROUTED?
MoE used to compile pretty fast in the past thanks to the partitioning. How long would it take if we force BEST_PERF here by default?
OK, so I think the HOST_ROUTED case in apply_moe_config would cancel this preset and select a partitioning pipeline?
I still have some concerns about enforcing BEST_PERF for DEVICE_ROUTED by default.
@dmatveev
For HOST_ROUTED, partitioning is working; we need it for further graph isolation, after which the transformation and execution are performed.
But for DEVICE_ROUTED, enforcing BEST_PERF by default aims to achieve the best TPS.
I hope we can re-enable partitioning by default for DEVICE_ROUTED once the TPS drop has been identified and solved.
BTW, compilation is not that slow:
For the 1st round:
[ INFO ] Pipeline initialization time: 71.67s
For the 2nd+ rounds (CACHE_DIR is not set):
[ INFO ] Pipeline initialization time: 9.70s
```cpp
void TearDown() override {}

// Helper: Save model to XML for debugging
void save_model(const std::shared_ptr<Model>& model, const std::string& prefix) {
```
There may be an "unused variable" warning otherwise.
3f9ce5a

[NPUW] Optimize MoE (GPT-OSS-20B) TPS on NPU - DEVICE_ROUTED (openvinotoolkit#33847)

### Details:

**Background:** openvinotoolkit#33372 implemented `HOST_ROUTED` processing for MoE decoding, but the non-trivial submission overhead limits the decoding throughput.

**Optimization:** This PR optimizes MoE TPS with `DEVICE_ROUTED` processing:
- Expert selection is performed dynamically on the device using `Gather` operations, avoiding graph splitting and reducing host-device overhead.
- Infer execution is the same as for a traditional LLM.

TPS can be improved from **12 t/s** to **17.9 t/s**.

NPUW config:
```
{
    "NPUW_DEVICES" : "NPU",
    "MAX_PROMPT_LEN" : 1024,
    "NPUW_MOE_TOKEN_CHUNK_SIZE" : 0,
    "NPUW_LLM_GENERATE_MOE_HINT" : "DEVICE_ROUTED",
    "NPUW_F16IC" : "YES",
    "NPUW_LLM_OPTIMIZE_V_TENSORS" : "YES",
    "NPU_TURBO" : "YES",
    "NPUW_DUMP_SUBS" : "YES",
    "NPUW_DUMP_IO" : "NO",
    "NPU_COMPILER_TYPE" : "DRIVER"
}
```

### Tickets:
- [EISW-198089](https://jira.devtools.intel.com/browse/EISW-198089)

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>