[NPUW]Optimize MoE (GPT-OSS-20B) TPS on NPU - DEVICE_ROUTED.#33847

Merged
dmatveev merged 3 commits into openvinotoolkit:master from intelgaoxiong:xiong/gpt-oss_device_routed
Feb 6, 2026

Conversation

@intelgaoxiong
Contributor

@intelgaoxiong intelgaoxiong commented Jan 28, 2026

Details:

Background:
#33372 implemented HOST_ROUTED processing for MoE decoding.
But the non-trivial submission overhead limits the decoding throughput.

Optimization:
This PR optimizes MoE TPS with DEVICE_ROUTED processing:

  • Expert selection is performed dynamically on the device using Gather operations, avoiding graph splitting and reducing host-device overhead.
  • Inference execution is the same as for a traditional LLM.

TPS improves from 12 t/s to 17.9 t/s.
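The device-side selection can be sketched with a stdlib-only toy (names and shapes are hypothetical; the real pass rewrites the ov::Model graph): the router's top-k indices drive a Gather over the stacked expert weights, so routing never leaves the device.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Toy sketch of DEVICE_ROUTED expert selection. The router's top-k
// indices feed a Gather over the stacked expert weights, so routing
// stays on the device and no host-side graph split is needed per token.
std::vector<std::size_t> topk_indices(const std::vector<float>& scores, std::size_t k) {
    std::vector<std::size_t> idx(scores.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](std::size_t a, std::size_t b) { return scores[a] > scores[b]; });
    idx.resize(k);
    return idx;
}

// Gather rows of a stacked [num_experts x dim] weight table by indices,
// mimicking what the on-device Gather op does with the router output.
std::vector<std::vector<float>> gather_rows(const std::vector<std::vector<float>>& table,
                                            const std::vector<std::size_t>& indices) {
    std::vector<std::vector<float>> out;
    out.reserve(indices.size());
    for (std::size_t i : indices) {
        out.push_back(table[i]);
    }
    return out;
}
```

Because both steps stay inside one compiled graph, each decoded token needs a single inference submission, which is where the TPS gain comes from.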

NPUW config:

```
{
	"NPUW_DEVICES" : "NPU",
	"MAX_PROMPT_LEN" : 1024,
	"NPUW_MOE_TOKEN_CHUNK_SIZE" : 0,
	"NPUW_LLM_GENERATE_MOE_HINT" : "DEVICE_ROUTED",
	"NPUW_F16IC" : "YES",
	"NPUW_LLM_OPTIMIZE_V_TENSORS" : "YES",
	"NPU_TURBO" : "YES",
	"NPUW_DUMP_SUBS" : "YES",
	"NPUW_DUMP_IO" : "NO",
	"NPU_COMPILER_TYPE" : "DRIVER"
}
```

Tickets:

@github-actions github-actions bot added category: build OpenVINO cmake script / infra category: samples OpenVINO Runtime Samples category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin and removed category: samples OpenVINO Runtime Samples labels Jan 28, 2026
@intelgaoxiong intelgaoxiong force-pushed the xiong/gpt-oss_device_routed branch 2 times, most recently from d8f7978 to 10e6b84 Compare January 30, 2026 05:43
@intelgaoxiong intelgaoxiong changed the title [NPUW]DEVICE_ROUTED mode for MoE (GPT-OSS-20B) decoding on NPU. [NPUW]Optimize MoE (GPT-OSS-20B) TPS on NPU - DEVICE_ROUTED. Jan 30, 2026
@intelgaoxiong intelgaoxiong force-pushed the xiong/gpt-oss_device_routed branch from 10e6b84 to 97b9ea3 Compare January 31, 2026 02:55
@github-actions github-actions bot removed the category: build OpenVINO cmake script / infra label Jan 31, 2026
@intelgaoxiong intelgaoxiong marked this pull request as ready for review January 31, 2026 03:06
@intelgaoxiong intelgaoxiong requested review from a team as code owners January 31, 2026 03:06
@dmatveev dmatveev added this to the 2026.1 milestone Feb 1, 2026
@intelgaoxiong intelgaoxiong force-pushed the xiong/gpt-oss_device_routed branch 3 times, most recently from 3dacd25 to 270410d Compare February 3, 2026 05:20
@github-actions github-actions bot added the category: build OpenVINO cmake script / infra label Feb 3, 2026
@intelgaoxiong
Contributor Author

#33924 is included.
This PR should be merged after #33924.

@intelgaoxiong intelgaoxiong force-pushed the xiong/gpt-oss_device_routed branch from 270410d to 7099966 Compare February 4, 2026 01:54
Convert gather to 2D.

Gather before convert.

Keep gather indices as constant.

Use JustInferRequest for DEVICE_ROUTED mode.

Clean up transformations for DEVICE_ROUTED.

Update config for DEVICE_ROUTED: BEST_PERF + not cut LM head.

Refactor device routed transformation.

Refactor GatherTo2DGather.

Apply MoE defaults if not explicitly set in external config.

Collect MoE nodes in single loop.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
multiply considered.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
@intelgaoxiong intelgaoxiong force-pushed the xiong/gpt-oss_device_routed branch from 7099966 to 6fbaa79 Compare February 4, 2026 13:21
@intelgaoxiong
Contributor Author

Rebased.

@esmirno esmirno self-requested a review February 5, 2026 12:28
Contributor

@AlexanderKalistratov AlexanderKalistratov left a comment

idk, looks fine to me.
But please wait for the others' reviews.

```cpp
auto repeats_const =
    std::dynamic_pointer_cast<ov::op::v0::Constant>(tile->input_value(1).get_node_shared_ptr());
if (repeats_const) {
    auto repeats_data = repeats_const->cast_vector<int64_t>();
    if (!repeats_data.empty() && repeats_data[0] > k_value) {
```
Contributor

So is it possible that repeats_data[0] <= k_value?

Contributor Author

In fact, it's impossible.
In the MoE IR, repeats_data[0] should equal the total number of experts, while k_value is the number of active experts.

If repeats_data[0] <= k_value, the MoE IR is malformed.
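A hypothetical stdlib-only sanity check capturing that invariant (the function name is an illustration, not code from the PR):

```cpp
#include <cstdint>
#include <vector>

// In a well-formed MoE IR, Tile repeats[0] equals the total expert count,
// which must exceed the active-expert count k. Anything else is malformed.
bool is_valid_moe_repeats(const std::vector<int64_t>& repeats, int64_t k_value) {
    return !repeats.empty() && repeats[0] > k_value;
}
```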

```cpp
} else {
    // Constant reshape - check if dim 0 is expert dimension
    auto shape_data = shape_const->cast_vector<int64_t>();
    if (nodes.num_experts > 0 && !shape_data.empty() &&
```
Contributor

nodes.num_experts > 0 implicitly assumes that we found the Tile node first.

```cpp
        }
    }

    return nodes;
```
Contributor

So the whole function relies on some non-obvious assumptions and layer names.
Why did you prefer it over MatcherPass and pattern matching?

Contributor Author

Personally, I agree pattern matching is a more general approach.
But in practice, the MoE pattern contains lots of nodes, and I would have to implement a lot of code to trace the node chains with the pattern-matching approach.
A slight pattern change may cause failure.
It's also not easy to read / debug.
So finally, I'm using a hybrid method: name matching plus a little pattern checking.
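The hybrid idea can be shown with a stdlib-only toy (the Node struct and names are hypothetical stand-ins for ov::Node): name matching narrows the candidates, and a cheap type check confirms the hit so a coincidental name match is rejected.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a graph node; the real code walks an ov::Model.
struct Node {
    std::string name;
    std::string type;  // e.g. "Tile", "Reshape"
};

// Hybrid lookup: the name fragment does the fast narrowing, the type check
// guards the structural assumption.
const Node* find_moe_node(const std::vector<Node>& graph,
                          const std::string& name_fragment,
                          const std::string& expected_type) {
    for (const auto& n : graph) {
        if (n.name.find(name_fragment) != std::string::npos && n.type == expected_type) {
            return &n;
        }
    }
    return nullptr;
}
```

The trade-off named in the comment above holds here too: this is shorter and easier to debug than a full MatcherPass pattern, but it depends on stable layer names.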

```cpp
    }
}

void transform_dynamic_reshapes(LayerNodes& nodes) {
```
Contributor

Why are we doing this?
Does it help us later?

Contributor Author

It's handling Reshape_2 in the expert graph, whose target_shape input comes from a Concat node instead of a Constant.

For the Constant - Reshape case, transform_constant_reshapes patches the constant value for shape compatibility.

For the Concat - Reshape case, transform_dynamic_reshapes handles it here by converting the reshape to an Unsqueeze for shape compatibility.

Otherwise, we would get a compilation error.
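The shape effect of that rewrite can be sketched as follows (a minimal stand-in for the graph transformation, not the PR's code): an Unsqueeze only inserts a length-1 axis, so its output shape is fully determined by the input shape, with no dynamically computed target shape for the NPU compiler to reject.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Unsqueeze semantics: insert a length-1 dimension at `axis`. Unlike a
// Reshape fed by a Concat of runtime dims, the result is static.
std::vector<int64_t> unsqueeze_shape(std::vector<int64_t> shape, std::size_t axis) {
    shape.insert(shape.begin() + axis, 1);
    return shape;
}
```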

Contributor

@dmatveev dmatveev left a comment

Only reviewed the changes (not the existing code)

Comment on lines +1414 to +1418
```cpp
// Helper to check if a constant is MoE Gather indices (marked by GatherTo2DGather pass)
auto is_moe_gather_const = [](const CTPtr& const_node) -> bool {
    const auto& rt_info = const_node->get_rt_info();
    return rt_info.count("npuw_moe_gather_indices") > 0;
};
```
Contributor

For the record - why didn't the existing algorithm work here to save the MoE indices? What makes them special?

Contributor Author

@intelgaoxiong intelgaoxiong Feb 7, 2026

@dmatveev Originally, we only kept tiny shape constants in the function.
But the Gather indices constant is not so "tiny" here.
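A minimal sketch of the decision being described (the struct and the size threshold are hypothetical; the real code inspects ov rt_info as in the snippet quoted above): tiny shape constants stay inline in the function body, and the rt_info marker set by GatherTo2DGather forces the larger MoE gather-indices constant to stay inline as well.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a constant node's metadata.
struct ConstInfo {
    std::size_t num_elements;
    std::map<std::string, bool> rt_info;  // stand-in for ov rt_info
};

// Keep a constant inside the function body if it is tiny, or if it carries
// the marker left by the GatherTo2DGather pass (tiny_limit is illustrative).
bool keep_in_function(const ConstInfo& c, std::size_t tiny_limit = 16) {
    if (c.rt_info.count("npuw_moe_gather_indices") > 0) {
        return true;  // MoE gather indices stay inline regardless of size
    }
    return c.num_elements <= tiny_limit;
}
```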

Comment on lines +1397 to +1399
```cpp
// Apply DEVICE_ROUTED MoE transformations to models
void apply_moe_device_routed_transforms(std::vector<std::shared_ptr<ov::Model>>& model_variants) {
    LOG_INFO("Applying DEVICE_ROUTED MoE transformations...");
```
Contributor

Shouldn't this be moved to some MoE transformations file or something? Why does this transformation need to happen at top LLM level?

Contributor Author

@intelgaoxiong intelgaoxiong Feb 7, 2026

Yes, it would be better to place this util in an MoE transformations file.

The DEVICE_ROUTED transformation is applied to the LLM-level model, and then the partitioner performs the partitioning. The partitioner and runtime treat a DEVICE_ROUTED MoE as a traditional LLM.

This avoids graph isolation, which benefits TPS (avoids the submission overhead).
@dmatveev

Comment on lines +1630 to +1632
```cpp
if (npuw_llm_props.find("NPUW_LLM_GENERATE_HINT") == npuw_llm_props.end()) {
    m_cfg.update({{"NPUW_LLM_GENERATE_HINT", "BEST_PERF"}});
}
```
Contributor

I think only DEVICE_ROUTED may work with BEST_PERF?

For HOST_ROUTED we still need the partitioning?

Should we force GENERATE_HINT here if and only if it is DEVICE_ROUTED?

MoE used to compile pretty fast in the past thanks to the partitioning. How long would it take if we force BEST_PERF here by default?

Contributor

ok.. so I think the HOST_ROUTED case in apply_moe_config would cancel this preset and select a partitioning pipeline?

I still have some concerns about enforcing BEST_PERF for DEVICE_ROUTED by default.

Contributor Author

@intelgaoxiong intelgaoxiong Feb 7, 2026

@dmatveev
For HOST_ROUTED, partitioning still works. We need it for further graph isolation before performing the transformation & execution.
But for DEVICE_ROUTED, enforcing BEST_PERF by default aims to achieve the best TPS.
I hope we can re-enable partitioning by default for DEVICE_ROUTED once the TPS drop has been identified and solved.

BTW, compilation is not that slow:
For the 1st round:
[ INFO ] Pipeline initialization time: 71.67s
For the 2nd+ rounds (CACHE_DIR is not set):
[ INFO ] Pipeline initialization time: 9.70s
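The default-if-unset behavior under discussion matches the snippet quoted at the top of this thread; a stdlib-only sketch of that pattern (FAST_COMPILE as the user-chosen alternative is an assumption here):

```cpp
#include <map>
#include <string>

using Config = std::map<std::string, std::string>;

// Apply the MoE preset only when the user did not set the key explicitly:
// DEVICE_ROUTED defaults NPUW_LLM_GENERATE_HINT to BEST_PERF, but an
// explicit user value always wins.
void apply_moe_defaults(Config& cfg) {
    if (cfg.count("NPUW_LLM_GENERATE_HINT") == 0) {
        cfg["NPUW_LLM_GENERATE_HINT"] = "BEST_PERF";
    }
}
```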

```cpp
void TearDown() override {}

// Helper: Save model to XML for debugging
void save_model(const std::shared_ptr<Model>& model, const std::string& prefix) {
```
Contributor

there may be an "unused variable" warning otherwise

@dmatveev dmatveev added this pull request to the merge queue Feb 6, 2026
Merged via the queue into openvinotoolkit:master with commit 3f9ce5a Feb 6, 2026
242 of 245 checks passed
@dmatveev dmatveev deleted the xiong/gpt-oss_device_routed branch February 6, 2026 23:21
Naseer-010 pushed a commit to Naseer-010/openvino that referenced this pull request Feb 18, 2026
…otoolkit#33847)

### Tickets:
 - *[EISW-198089](https://jira.devtools.intel.com/browse/EISW-198089)*

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

Labels

category: build OpenVINO cmake script / infra category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin

3 participants