add tileN = 8,16 for SM120 blockscale GEMM. #3292

Open
b8zhong wants to merge 1 commit into NVIDIA:main from bzhng-development:brayden/sm120-tile-n-16

Conversation

@b8zhong b8zhong commented Jun 2, 2026

image

These tile sizes are intended for use with SwapAB.
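For readers unfamiliar with the term, here is a minimal host-side sketch of the operand-swap idea, assuming "SwapAB" refers to the common trick of computing C^T = B^T * A^T so that a small M dimension lands on the tile's N dimension. That reading is an assumption, not something stated in this PR, and `gemm`, `transpose`, and `gemm_swap_ab` are hypothetical helpers for illustration, not CUTLASS APIs.

```cpp
#include <vector>

// Naive row-major GEMM: C[M x N] = A[M x K] * B[K x N].
std::vector<float> gemm(const std::vector<float>& A,
                        const std::vector<float>& B,
                        int M, int N, int K) {
  std::vector<float> C(M * N, 0.0f);
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
      for (int k = 0; k < K; ++k)
        C[m * N + n] += A[m * K + k] * B[k * N + n];
  return C;
}

// Transpose a row-major matrix with `rows` rows and `cols` columns.
std::vector<float> transpose(const std::vector<float>& X, int rows, int cols) {
  std::vector<float> T(X.size());
  for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c)
      T[c * rows + r] = X[r * cols + c];
  return T;
}

// Swapped form (hypothetical helper): compute C^T = B^T * A^T, then
// transpose back. The original M becomes the N extent of the swapped
// problem, which is where a small tile N (8 or 16) would apply.
std::vector<float> gemm_swap_ab(const std::vector<float>& A,
                                const std::vector<float>& B,
                                int M, int N, int K) {
  std::vector<float> At = transpose(A, M, K);       // K x M
  std::vector<float> Bt = transpose(B, K, N);       // N x K
  std::vector<float> Ct = gemm(Bt, At, N, M, K);    // N x M
  return transpose(Ct, N, M);                       // M x N
}
```

Under that assumption, a batch-size-1 problem has M = 1, so after the swap the tile's N extent only has to cover a single column, and a smaller tile N wastes less work on padding.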

@b8zhong
Author

b8zhong commented Jun 2, 2026

Hi @depaulmillz, could you take a look at this PR? I noticed you were the last one to add TileN = 32. Thanks!

@depaulmillz
Contributor

Awesome. Have you been able to try with group GEMM as well?

@b8zhong
Author

b8zhong commented Jun 3, 2026

@depaulmillz Yes. Technically, it works (this PR is also compatible with the group GEMM changes). However, when testing two common cases, DSR1 with TP = 8 and Qwen-3 MoE with TP = 1, the speedup is only 3-5% at BS = 1. So it's faster, as expected, but not by much.

@depaulmillz
Contributor

It looks like you will need to add an assertion to prevent compiling ping-pong with MMA_N=8, which will expect a (2,2,1) layout shape for the MMA. I saw some reference-check errors when testing the MR on ping-pong MMA_N=8 kernels because of this.
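A compile-time guard of the kind requested could look roughly like the sketch below. The names `PingpongSchedule`, `CooperativeSchedule`, `is_supported_config`, and `BlockscaleMainloop` are hypothetical stand-ins for illustration; they are not the actual CUTLASS types, and the real check would need to live wherever the SM120 blockscale mainloop or builder is specialized.

```cpp
#include <type_traits>

// Hypothetical schedule tags; stand-ins for the real kernel schedule types.
struct CooperativeSchedule {};
struct PingpongSchedule {};

// True unless the combination is ping-pong with MMA_N == 8.
template <class Schedule, int MmaN>
constexpr bool is_supported_config() {
  return !(std::is_same_v<Schedule, PingpongSchedule> && MmaN == 8);
}

// Hypothetical mainloop wrapper: instantiating it with the rejected
// combination fails at compile time instead of producing wrong results.
template <class Schedule, int MmaN>
struct BlockscaleMainloop {
  static_assert(is_supported_config<Schedule, MmaN>(),
                "Ping-pong schedule with MMA_N = 8 is not supported.");
  static constexpr int kMmaN = MmaN;
};

// These combinations compile.
using CooperativeN8 = BlockscaleMainloop<CooperativeSchedule, 8>;
using PingpongN16 = BlockscaleMainloop<PingpongSchedule, 16>;
static_assert(CooperativeN8::kMmaN == 8);
static_assert(PingpongN16::kMmaN == 16);

// Using a member of BlockscaleMainloop<PingpongSchedule, 8> would trigger
// the static_assert above.
```

Failing at compile time keeps the unsupported combination from surfacing later as a reference-check mismatch.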
