Improve int4 AWQ GEMV performance on Radeon 8060S and similar GPUs #886
eble-amd wants to merge 1 commit into
Conversation
d6fd7b6 had tuned the non-quantized version of the skinny GEMM for K=2048 by increasing the AC template argument from 8 to 16. Would the same help here?
The effect of the delay might be the same as the effect of staggering the start addresses of each block.
Unless I'm looking at the wrong thing, this MR bumps it from 16 to 32.
When the int4 weight matrix exceeds L2 cache, wider memory loads (ACHUNK=32 vs 16) improve bandwidth on the wvSplitK_int4_g kernel. The L2 size is queried at runtime via hipDeviceProp, so the threshold adapts to different GPUs.

Measured on Radeon 8060S (gfx1151, 2 MiB L2):
- 1x2048x16384: 141 -> 149 GiB/s (+5%)
- 1x2560x2048: 162 -> 166 GiB/s (+2%)
- 1x32768x2048: 199 -> 200 GiB/s (+1%)
- Overall for Gemma-2B AWQ int4: 67.5% -> 69.3% roofline

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
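The selection rule described in the commit message can be sketched roughly as below. This is a minimal illustration, not the PR's actual code: `int4_weight_bytes`, `choose_achunk`, and `kL2Bytes8060S` are hypothetical names, the weight footprint ignores scales and zero points, and the L2 size is passed in as a plain number. In a HIP build it would presumably come from the `l2CacheSize` field of `hipDeviceProp_t` after calling `hipGetDeviceProperties`.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes occupied by a rows x cols int4 weight matrix (two values per byte).
// Assumption: scales and zero points are ignored to keep the sketch small.
constexpr std::size_t int4_weight_bytes(std::size_t rows, std::size_t cols) {
  return (rows * cols + 1) / 2;
}

// Hypothetical helper: use the wider load (ACHUNK=32) only when the weights
// do not fit in L2; otherwise keep the narrower load (ACHUNK=16).
constexpr int choose_achunk(std::size_t weight_bytes, std::size_t l2_bytes) {
  return weight_bytes > l2_bytes ? 32 : 16;
}

// 2 MiB L2, as reported for the Radeon 8060S (gfx1151) in the commit message.
// In real code this value would be queried at runtime, not hard-coded.
constexpr std::size_t kL2Bytes8060S = 2u * 1024u * 1024u;

// A 2048 x 16384 int4 matrix is 16 MiB, well past L2: wide loads.
static_assert(choose_achunk(int4_weight_bytes(2048, 16384), kL2Bytes8060S) == 32,
              "large matrix should select ACHUNK=32");
// A 1024 x 2048 int4 matrix is 1 MiB and fits in L2: narrow loads.
static_assert(choose_achunk(int4_weight_bytes(1024, 2048), kL2Bytes8060S) == 16,
              "small matrix should select ACHUNK=16");
```

Because ACHUNK is a template argument in the kernel, a launcher would likely branch on the returned value to pick between two instantiations instead of passing it as a runtime parameter.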
78a263f to acbca20
Since we discussed (elsewhere) spending time to dig deeper into why the staggering is helping, I removed the staggering commit from this PR so that it doesn't delay merging the other improvement.
Purpose
Improve GEMV performance on Radeon 8060S and similar GPUs.
Test Plan
Test Results
Copied from commit message:
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.