
Improve int4 AWQ GEMV performance on Radeon 8060S and similar GPUs #886

Open
eble-amd wants to merge 1 commit into ROCm:gfx11 from eble-amd:skinny-int4-perf



eble-amd commented Apr 17, 2026

Purpose

Improve GEMV performance on Radeon 8060S and similar GPUs.

Test Plan

  • vllm benchmark with Gemma 2B AWQ
  • pytest performance tests (new golden values included)

Test Results

Copied from commit message:

Measured on Radeon 8060S (gfx1151, 2 MiB L2):
- 1x2048x16384: 141 -> 149 GiB/s (+5%)
- 1x2560x2048:  162 -> 166 GiB/s (+2%)
- 1x32768x2048: 199 -> 200 GiB/s (+1%)
- Overall for Gemma-2B AWQ int4: 67.5% -> 69.3% roofline
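For readers unfamiliar with the units, figures like these are typically derived as bytes moved divided by kernel time, then compared against the device's peak memory bandwidth. The following is a minimal sketch of that arithmetic, not the benchmark code used for this PR. The function names are hypothetical, and the PR does not state which bytes are counted (packed weights only, or also scales, zero points and activations) or which peak bandwidth serves as the roofline.

```cpp
// Sketch only: hypothetical helpers, not taken from the PR or from vLLM.
constexpr double kGiB = 1024.0 * 1024.0 * 1024.0;

// Achieved bandwidth in GiB/s when `bytes` are moved in `seconds`.
constexpr double achieved_gib_per_s(double bytes, double seconds) {
    return bytes / seconds / kGiB;
}

// Fraction of the peak bandwidth that was reached (multiply by 100 for the
// roofline percentage). Both arguments must use the same unit.
constexpr double roofline_fraction(double achieved, double peak) {
    return achieved / peak;
}

// Compile-time checks with made-up inputs (not measurements):
static_assert(achieved_gib_per_s(kGiB, 1.0) == 1.0, "1 GiB in 1 s is 1 GiB/s");
static_assert(roofline_fraction(50.0, 200.0) == 0.25, "50 of 200 is 25%");
```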

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

eble-amd changed the title from "Skinny int4 perf" to "Improve int4 AWQ GEMV performance on Radeon 8060S and similar GPUs" Apr 20, 2026
@mgehre-amd

d6fd7b6 had tuned the non-quantized version of the skinny GEMM for K=2048 by increasing the AC template argument from 8 to 16. Would the same help here?

@mgehre-amd

The effect of the delay might be the same as the effect of staggering the start addresses of each block

@eble-amd
Author

> d6fd7b6 had tuned the non-quantized version of the skinny GEMM for K=2048 by increasing the AC template argument from 8 to 16. Would the same help here?

Unless I'm looking at the wrong thing, this MR bumps it from 16 to 32.

When the int4 weight matrix exceeds L2 cache, wider memory loads
(ACHUNK=32 vs 16) improve bandwidth on the wvSplitK_int4_g kernel.  The
L2 size is queried at runtime via hipDeviceProp, so the threshold adapts
to different GPUs.

Measured on Radeon 8060S (gfx1151, 2 MiB L2):
- 1x2048x16384: 141 -> 149 GiB/s (+5%)
- 1x2560x2048:  162 -> 166 GiB/s (+2%)
- 1x32768x2048: 199 -> 200 GiB/s (+1%)
- Overall for Gemma-2B AWQ int4: 67.5% -> 69.3% roofline

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
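
The selection rule described in the commit message above can be sketched as follows. This is a minimal illustration, not the code in the PR: `int4_weight_bytes` and `choose_achunk` are hypothetical names, the strict greater-than comparison against the full L2 size is an assumption (the actual threshold is not shown in this thread), and the last two numbers of each benchmark shape are assumed to be the weight matrix dimensions. In a HIP build the L2 size would come from the `l2CacheSize` field of `hipDeviceProp_t`; here it is a plain argument so the sketch compiles without HIP.

```cpp
#include <cstddef>

// Bytes occupied by a rows x cols int4 weight matrix: two 4-bit values per
// byte. Scales and zero points are ignored in this sketch.
constexpr std::size_t int4_weight_bytes(std::size_t rows, std::size_t cols) {
    return (rows * cols) / 2;
}

// Hypothetical rule: use the wider load (ACHUNK=32) only when the packed
// weights do not fit in L2, otherwise keep ACHUNK=16.
constexpr int choose_achunk(std::size_t weight_bytes, std::size_t l2_bytes) {
    return weight_bytes > l2_bytes ? 32 : 16;
}

// 2 MiB L2, as reported for the Radeon 8060S in the measurements.
constexpr std::size_t kL2Bytes = 2u * 1024u * 1024u;

// The three benchmarked shapes all exceed 2 MiB of packed weights.
static_assert(choose_achunk(int4_weight_bytes(2048, 16384), kL2Bytes) == 32, "");
static_assert(choose_achunk(int4_weight_bytes(2560, 2048), kL2Bytes) == 32, "");
static_assert(choose_achunk(int4_weight_bytes(32768, 2048), kL2Bytes) == 32, "");
// A made-up 1024 x 2048 matrix (1 MiB packed) would fit and keep ACHUNK=16.
static_assert(choose_achunk(int4_weight_bytes(1024, 2048), kL2Bytes) == 16, "");
```

Under these assumptions the three measured shapes occupy 16 MiB, 2.5 MiB and 32 MiB of packed weights, so each would take the wider-load path on a GPU with 2 MiB of L2.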
eble-amd force-pushed the skinny-int4-perf branch from 78a263f to acbca20 May 29, 2026 21:16
@eble-amd
Author

> The effect of the delay might be the same as the effect of staggering the start addresses of each block

Since we discussed (elsewhere) spending time to dig deeper into why the staggering is helping, I removed the staggering commit from this PR so that it doesn't delay merging the other improvement.

eble-amd marked this pull request as ready for review May 29, 2026 21:26
