Commit 0981bbf
[webgpu] Optimize matmulnbits with M > 1 (#23102)
This is the WebGPU native EP implementation of #23092.
I used https://github.com/fs-eire/ort-webgpu-nodejs-chatapp-prototype for testing, with fs-eire/ort-webgpu-nodejs-chatapp-prototype#2 applied to print the first-token time.
The results are below:
The latest main branch:
Intel Arc Graphics
```
659 tokens in 24.8sec, 26.57 tokens/sec
Decoding first token with input 449 tokens: 13.0 sec
Decoding remaining 210 tokens:
11.8 sec
17.79 tokens/sec
```
NV RTX 2000
```
659 tokens in 14.4sec, 45.85 tokens/sec
Decoding first token with input 449 tokens: 7.3 sec
Decoding remaining 210 tokens:
7.0 sec
29.81 tokens/sec
```
-------------------------------------------------------------------------
With this PR:
Intel Arc Graphics
```
657 tokens in 20.6sec, 31.92 tokens/sec
Decoding first token with input 449 tokens: 8.5 sec
Decoding remaining 208 tokens:
12.1 sec
17.23 tokens/sec
```
NV RTX 2000
```
659 tokens in 11.4sec, 57.93 tokens/sec
Decoding first token with input 449 tokens: 4.1 sec
Decoding remaining 210 tokens:
7.2 sec
28.98 tokens/sec
```
From the data above, you can see that with this PR the first-token time improves on both the Intel (13.0 s -> 8.5 s) and NV (7.3 s -> 4.1 s) GPUs.
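As a quick sanity check on the numbers above, the snippet below (not part of the PR, just a reader-side sketch) recomputes the reported throughput and the first-token speedups from the raw figures in the benchmark logs.

```python
# Recompute throughput and first-token speedup from the benchmark
# numbers quoted in this PR description.

def tokens_per_sec(tokens: int, seconds: float) -> float:
    """Throughput as reported in the logs: total tokens / wall time."""
    return tokens / seconds

# Overall throughput on the latest main branch, Intel Arc Graphics:
main_intel = tokens_per_sec(659, 24.8)  # matches the logged ~26.57 tokens/sec

# First-token latency speedup with this PR:
intel_speedup = 13.0 / 8.5  # Intel Arc Graphics, ~1.53x
nv_speedup = 7.3 / 4.1      # NV RTX 2000, ~1.78x

print(f"main Intel throughput: {main_intel:.2f} tokens/sec")
print(f"Intel first-token speedup: {intel_speedup:.2f}x")
print(f"NV first-token speedup: {nv_speedup:.2f}x")
```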
File tree: 2 files changed in onnxruntime/contrib_ops/webgpu/quantization (+151, -242 lines)