[js/webgpu] Optimize matmulnbits with M > 1 #23092

Closed
qjia7 wants to merge 1 commit into microsoft:main from qjia7:js-matmulnbits

Conversation


@qjia7 (Contributor) commented Dec 12, 2024

Description

This PR mainly optimizes decoding the first token in the phi3 model. For the first token, MatMulNBits is a matrix * matrix multiplication, which is very slow when the prompt is very long, e.g. more than 450 input tokens.
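For reference, here is a minimal CPU sketch (not the PR's WGSL kernel) of what MatMulNBits computes, assuming 4-bit block quantization with a per-block scale and the default zero point of 8; the packing layout and names are illustrative assumptions:

```ts
// Illustrative reference only: out = A (M x K, fp32) * dequant(B) (K x N).
// Assumed layout: B packs two 4-bit weights per byte, column-major along K.
function matMulNBitsRef(
  a: Float32Array,      // M x K activations, row-major
  bQuant: Uint8Array,   // 4-bit quantized weights
  scales: Float32Array, // one fp32 scale per (column, K-block)
  m: number, k: number, n: number,
  blockSize: number,    // quantization block size along K
): Float32Array {
  const out = new Float32Array(m * n);
  const blocksPerCol = Math.ceil(k / blockSize);
  for (let row = 0; row < m; row++) {
    for (let col = 0; col < n; col++) {
      let acc = 0;
      for (let i = 0; i < k; i++) {
        // Unpack one nibble and dequantize: (q - zeroPoint) * scale.
        const idx = col * k + i;
        const byte = bQuant[idx >> 1];
        const q = idx & 1 ? byte >> 4 : byte & 0x0f;
        const scale = scales[col * blocksPerCol + ((i / blockSize) | 0)];
        acc += a[row * k + i] * (q - 8) * scale;
      }
      out[row * n + col] = acc;
    }
  }
  return out;
}
```

With M = 1 (every decoding step after the first) this is a matrix-vector product; for the first token, M equals the prompt length, so the same op suddenly has hundreds of output rows to produce.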

Both Intel and NVIDIA GPUs see good improvements with this PR.

My test data is below. Decoding the first token with a 499-token input:
NV RTX 2000: 6.2 s -> 4.9 s
Intel UHD 770: 52.6 s -> 28.8 s
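The actual change lives in the WGSL shader generated by the js/webgpu kernel and isn't shown in this thread, so the sketch below only illustrates the reuse idea that makes M > 1 worth special-casing: a GEMV-style kernel run once per output row unpacks and dequantizes every weight M times, while walking K in an outer loop fetches each weight once and shares it across all M rows (B is assumed pre-dequantized here for brevity):

```ts
// Sketch of the reuse idea only, not the PR's shader.
function matMulRowsShared(
  a: Float32Array,        // M x K activations, row-major
  bDequant: Float32Array, // K x N weights, already dequantized for brevity
  m: number, k: number, n: number,
): Float32Array {
  const out = new Float32Array(m * n);
  for (let col = 0; col < n; col++) {
    for (let i = 0; i < k; i++) {
      const w = bDequant[i * n + col]; // one weight fetch/dequant ...
      for (let row = 0; row < m; row++) {
        out[row * n + col] += a[row * k + i] * w; // ... amortized over M rows
      }
    }
  }
  return out;
}
```

On a GPU the same idea appears as a workgroup computing a tile of output rows while sharing the dequantized weights through registers or workgroup memory.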


@qjia7 (Contributor, Author) commented Dec 12, 2024

@guschmue @sushraja-msft FYI. I noticed that @sushraja-msft has already done some very good optimizations on the WebGPU native EP. I will port this PR to the WebGPU EP to make some comparisons.

@sushanthr

> @guschmue @sushraja-msft FYI. I noticed that @sushraja-msft has already done some very good optimizations on the WebGPU native EP. I will port this PR to the WebGPU EP to make some comparisons.

Thanks JiaJia. JFYI, there is also this pending change, #23071, which improves on that previous change. Looking forward to comparing performance.

@guschmue added the ep:WebGPU (ort-web webgpu provider) label Dec 16, 2024
guschmue pushed a commit that referenced this pull request Dec 17, 2024
This is the WebGPU native EP implementation of #23092.

I used https://github.com/fs-eire/ort-webgpu-nodejs-chatapp-prototype to
test and applied fs-eire/ort-webgpu-nodejs-chatapp-prototype#2 to print
the first-token time.

The results are below.
The latest main branch:
Intel Arc Graphics
```
659 tokens in 24.8sec, 26.57 tokens/sec
    Decoding first token with input 449 tokens: 13.0 sec
    Decoding remaining 210 tokens:
        11.8 sec
        17.79 tokens/sec
```
NV RTX 2000
```
659 tokens in 14.4sec, 45.85 tokens/sec
    Decoding first token with input 449 tokens: 7.3 sec
    Decoding remaining 210 tokens:
        7.0 sec
        29.81 tokens/sec
```

-------------------------------------------------------------------------
With this PR:
Intel Arc Graphics
```
657 tokens in 20.6sec, 31.92 tokens/sec
    Decoding first token with input 449 tokens: 8.5 sec
    Decoding remaining 208 tokens:
        12.1 sec
        17.23 tokens/sec
```
NV RTX 2000
```
659 tokens in 11.4sec, 57.93 tokens/sec
    Decoding first token with input 449 tokens: 4.1 sec
    Decoding remaining 210 tokens:
        7.2 sec
        28.98 tokens/sec
```

From the above data, you can see that with this PR the first-token time
improves on both Intel (13.0 s -> 8.5 s) and NV (7.3 s -> 4.1 s) GPUs.
@guschmue (Contributor)

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@guschmue (Contributor)

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@guschmue (Contributor)

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

@azure-pipelines

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@guschmue (Contributor)

/azp run Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

guschmue pushed a commit that referenced this pull request Dec 20, 2024
This is the WebGPU native EP implementation of #23092.
tarekziade pushed a commit to tarekziade/onnxruntime that referenced this pull request Jan 10, 2025
This is the WebGPU native EP implementation of microsoft#23092.
@qjia7 closed this May 6, 2025