[webgpu] Optimize string stream used in WebGPU EP (#27223)
### Description
Optimize the string stream used in WebGPU EP.
### Motivation and Context
The current implementation uses an `absl::OStringStream`, which is faster
than `std::ostringstream`. However, it is still slow when generating
the program cache key.
From the profiling data, `CalculateProgramCacheKey()` is extremely time
consuming. It can consume up to 1/3 of all CPU time inside
`WebGpuContext::Run()`:
<img width="1035" height="185" alt="image"
src="https://github.com/user-attachments/assets/5b9e33cc-cd0a-4ef8-9a92-2ee894b85156"
/>
Basic analysis shows that most of the time is spent in the
`std::basic_ostream` `operator<<()` implementation, which is much slower
than expected.
To optimize this, the PR introduces a simplified implementation,
`FastOStringStream`, which does not inherit from `std::basic_ostream`.
Instead, the class implements only the `operator<<` overloads needed to
generate the cache key and shader code, which avoids as much of the
stream overhead as possible.
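To illustrate the idea, here is a minimal sketch of a string builder that offers stream-style `operator<<` without deriving from `std::basic_ostream`. It is not the `FastOStringStream` from this PR: the class name `SimpleOStringStream`, the set of overloads, and the use of `std::to_chars` for integers are all assumptions made for illustration.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <string>
#include <string_view>
#include <system_error>
#include <type_traits>

// Hypothetical sketch, not the PR's FastOStringStream.
// Appends directly to a std::string; no locale, no formatting state,
// no virtual stream buffer.
class SimpleOStringStream {
 public:
  SimpleOStringStream& operator<<(std::string_view s) {
    buffer_.append(s.data(), s.size());
    return *this;
  }

  SimpleOStringStream& operator<<(const char* s) {
    buffer_.append(s);
    return *this;
  }

  SimpleOStringStream& operator<<(char c) {
    buffer_.push_back(c);
    return *this;
  }

  // Mirrors the default std::ostream behavior of writing 1 or 0.
  SimpleOStringStream& operator<<(bool b) {
    buffer_.push_back(b ? '1' : '0');
    return *this;
  }

  // Integers (other than char and bool) are formatted with std::to_chars.
  template <typename T,
            std::enable_if_t<std::is_integral_v<T> &&
                                 !std::is_same_v<T, char> &&
                                 !std::is_same_v<T, bool>,
                             int> = 0>
  SimpleOStringStream& operator<<(T value) {
    char tmp[24];  // enough for any 64-bit integer, including the sign
    auto [end, ec] = std::to_chars(tmp, tmp + sizeof(tmp), value);
    assert(ec == std::errc{});
    buffer_.append(tmp, static_cast<std::size_t>(end - tmp));
    return *this;
  }

  const std::string& str() const { return buffer_; }

 private:
  std::string buffer_;
};
```

Call sites can keep the familiar chaining syntax, for example `os << "key" << ':' << 42;`, while each insertion reduces to a plain append on the underlying `std::string`.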
<img width="1016" height="156" alt="image"
src="https://github.com/user-attachments/assets/32e3d345-c56d-4e6d-89e1-99cc7b150d8e"
/>
As a result, the CPU sample count for `CalculateProgramCacheKey()` in the
same test dropped from 2555 to 176. Generation throughput (TPS) in the
end-to-end model benchmark on Qwen3-0.6B increased from ~90 to ~130 on
Windows 11 / 13900K / RTX 4070.