[NPUW] Optimize MoE (GPT-OSS-20B) TPS on NPU - DEVICE_ROUTED. #33847

dmatveev merged 3 commits into openvinotoolkit:master
Conversation
Convert gather to 2D. Gather before convert. Keep gather indices as constant. Use JustInferRequest for DEVICE_ROUTED mode. Clean up transformations for DEVICE_ROUTED. Update config for DEVICE_ROUTED: BEST_PERF + do not cut the LM head. Refactor the device-routed transformation. Refactor GatherTo2DGather. Apply MoE defaults if not explicitly set in the external config. Collect MoE nodes in a single loop.

Multiply considered.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
Rebased.
AlexanderKalistratov
left a comment

Looks fine to me, but please wait for the others' reviews.
```cpp
auto repeats_const =
    std::dynamic_pointer_cast<ov::op::v0::Constant>(tile->input_value(1).get_node_shared_ptr());
if (repeats_const) {
    auto repeats_data = repeats_const->cast_vector<int64_t>();
    if (!repeats_data.empty() && repeats_data[0] > k_value) {
```
So is it possible that `repeats_data[0] <= k_value`?
In fact, it's impossible.
In the MoE IR, `repeats_data[0]` should equal the total number of experts, while `k_value` is the number of active experts.
If `repeats_data[0] <= k_value`, the MoE IR is malformed.
```cpp
} else {
    // Constant reshape - check if dim 0 is expert dimension
    auto shape_data = shape_const->cast_vector<int64_t>();
    if (nodes.num_experts > 0 && !shape_data.empty() &&
```
`nodes.num_experts > 0` implicitly assumes that we find the Tile node first.
```cpp
    }
}
return nodes;
```
So the whole function relies on some non-obvious assumptions and on layer names.
Why did you prefer it over MatcherPass and pattern matching?
Personally, I agree that pattern matching is a more general approach.
But in practice, the MoE pattern contains a lot of nodes, and I would have to implement a lot of code to trace the node chains with the pattern-matching approach.
A slight pattern change could cause a failure.
It is also not easy to read or debug.
So in the end I'm using a hybrid method: name matching plus a little pattern checking.
```cpp
    }
}

void transform_dynamic_reshapes(LayerNodes& nodes) {
```
Why are we doing this?
Does it help us later?
It handles `Reshape_2` in the Expert graph, whose `target_shape` input comes from a Concat node instead of a Constant.

For the Constant -> Reshape case, `transform_constant_reshapes` patches the constant value for shape compatibility.
For the Concat -> Reshape case, `transform_dynamic_reshapes` handles it here by converting the Reshape to an Unsqueeze for shape compatibility.
Otherwise, we get a compilation error.
dmatveev
left a comment
Only reviewed the changes (not the existing code)
```cpp
// Helper to check if a constant is MoE Gather indices (marked by GatherTo2DGather pass)
auto is_moe_gather_const = [](const CTPtr& const_node) -> bool {
    const auto& rt_info = const_node->get_rt_info();
    return rt_info.count("npuw_moe_gather_indices") > 0;
};
```
For the record - why the existing algorithm didn't work here to save the MoE indices? What makes them special?
@dmatveev Originally, we only keep tiny shape constants in the function.
But the Gather indices constant is not so "tiny" here.
```cpp
// Apply DEVICE_ROUTED MoE transformations to models
void apply_moe_device_routed_transforms(std::vector<std::shared_ptr<ov::Model>>& model_variants) {
    LOG_INFO("Applying DEVICE_ROUTED MoE transformations...");
```
Shouldn't this be moved to some MoE transformations file or something? Why does this transformation need to happen at top LLM level?
Yes, it would be better to place this utility in the MoE transformations file.
The DEVICE_ROUTED transformation is applied to the LLM-level model, and then the partitioner performs the partitioning.
The partitioner and runtime treat a DEVICE_ROUTED MoE as a traditional LLM.
This avoids graph isolation, which benefits TPS (it avoids submission overhead).
@dmatveev
```cpp
if (npuw_llm_props.find("NPUW_LLM_GENERATE_HINT") == npuw_llm_props.end()) {
    m_cfg.update({{"NPUW_LLM_GENERATE_HINT", "BEST_PERF"}});
}
```
I think only DEVICE_ROUTED may work with BEST_PERF?
For HOST_ROUTED we still need the partitioning?
Should we force GENERATE_HINT here if and only if it is DEVICE_ROUTED?
MoE used to compile pretty fast in the past thanks to the partitioning. How long would it take if we force BEST_PERF here by default?
OK, so I think the HOST_ROUTED case in apply_moe_config would cancel this preset and select a partitioning pipeline?
I still have some concerns about enforcing BEST_PERF for DEVICE_ROUTED by default.
@dmatveev
For HOST_ROUTED, partitioning is working; we need it for further graph isolation, after which the transformation and execution are performed.
But for DEVICE_ROUTED, enforcing BEST_PERF by default aims to achieve the best TPS.
I hope we can re-enable partitioning by default for DEVICE_ROUTED once the TPS drop has been identified and solved.
BTW, compilation is not that slow:
For the 1st round:
[ INFO ] Pipeline initialization time: 71.67s
For the 2nd+ rounds (CACHE_DIR is not set):
[ INFO ] Pipeline initialization time: 9.70s
```cpp
void TearDown() override {}

// Helper: Save model to XML for debugging
void save_model(const std::shared_ptr<Model>& model, const std::string& prefix) {
```
There may be an "unused variable" warning otherwise.
3f9ce5a

[NPUW] Optimize MoE (GPT-OSS-20B) TPS on NPU - DEVICE_ROUTED (openvinotoolkit#33847)

### Details:

**Background:** openvinotoolkit#33372 implemented `HOST_ROUTED` processing for MoE decoding, but the non-trivial submission overhead limits the decoding throughput.

**Optimization:** This PR optimizes MoE TPS with `DEVICE_ROUTED` processing:
- Expert selection is performed dynamically on the device using `Gather` operations, avoiding graph splitting and reducing host-device overhead.
- Infer execution is the same as for a traditional LLM.

TPS can be improved from **12 t/s** to **17.9 t/s**.

NPUW config:
```
{
    "NPUW_DEVICES" : "NPU",
    "MAX_PROMPT_LEN" : 1024,
    "NPUW_MOE_TOKEN_CHUNK_SIZE" : 0,
    "NPUW_LLM_GENERATE_MOE_HINT" : "DEVICE_ROUTED",
    "NPUW_F16IC" : "YES",
    "NPUW_LLM_OPTIMIZE_V_TENSORS" : "YES",
    "NPU_TURBO" : "YES",
    "NPUW_DUMP_SUBS" : "YES",
    "NPUW_DUMP_IO" : "NO",
    "NPU_COMPILER_TYPE" : "DRIVER"
}
```

### Tickets:
- [EISW-198089](https://jira.devtools.intel.com/browse/EISW-198089)

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>