Commit 41c5a96

Add gfx950 (MI355X/CDNA4) to is_cdna() for correct Triton num_warps
MI355X (gfx950) has the same 1024-thread workgroup limit as MI300X (gfx942), but it was missing from is_cdna(). As a result, all Triton kernels used num_warps=32 (2048 threads) instead of 16 (1024 threads) and crashed with OutOfResources.

Also includes the ROCm GPT-OSS BF16 routing and dequant buffer dtype fix from PR unslothai#4021 by @danielhanchen, cherry-picked for MI355X validation.

Tested on: 8x AMD Instinct MI355X (gfx950), ROCm 7.1
- Vision RL GRPO (Qwen2.5-VL-7B): 5/5 steps
- Code RL GRPO (gpt-oss-20b BF16): 20/20 steps
- gpt-oss-120b GRPO: 5/5 steps (B200 OOM'd on this)
- MoE expert LoRA + save_pretrained_merged: success
1 parent 8729bb5 commit 41c5a96
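
The thread counts quoted in the commit message follow from multiplying num_warps by the wavefront size. A minimal sketch of that arithmetic, assuming a 64-thread wavefront (implied by 32 warps giving 2048 threads) and the 1024-thread limit stated in the message; the constant and function names are hypothetical and do not come from the repository:

```python
# Assumed values, both taken from the commit message rather than queried from a device.
WAVEFRONT_SIZE = 64            # threads per warp, implied by 32 warps -> 2048 threads
MAX_WORKGROUP_THREADS = 1024   # workgroup limit stated for gfx942 and gfx950


def threads_per_workgroup(num_warps: int) -> int:
    """Return the number of threads a kernel launch requests per workgroup."""
    return num_warps * WAVEFRONT_SIZE


def fits_workgroup(num_warps: int) -> bool:
    """Return True if the launch stays within the assumed workgroup limit."""
    return threads_per_workgroup(num_warps) <= MAX_WORKGROUP_THREADS


if __name__ == "__main__":
    for warps in (16, 32):
        print(warps, threads_per_workgroup(warps), fits_workgroup(warps))
```

With these assumed values, 16 warps land exactly on the limit, while 32 warps request twice as many threads as the limit allows.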

1 file changed, +1 -0 lines changed

unsloth/kernels/utils.py

Lines changed: 1 addition & 0 deletions
@@ -82,6 +82,7 @@ def is_cdna():
         "gfx940",
         "gfx941",
         "gfx942",
+        "gfx950",  # CDNA4 (MI350/MI355X)
     )

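The effect of the added tuple entry can be illustrated without a GPU. The sketch below is not the repository's code: the real is_cdna() takes no arguments and only its tuple of architecture names is visible in this diff, so `CDNA_ARCHS`, `arch_is_cdna`, and `cap_num_warps` are hypothetical names, and the caps of 16 and 32 are assumptions read off the commit message.

```python
# Architecture names as they appear in the diff after this commit.
CDNA_ARCHS = ("gfx940", "gfx941", "gfx942", "gfx950")


def arch_is_cdna(arch: str) -> bool:
    """Hypothetical stand-in for is_cdna() that takes the arch name explicitly."""
    return arch in CDNA_ARCHS


def cap_num_warps(requested: int, arch: str) -> int:
    """Clamp num_warps to the assumed per-architecture maximum.

    Assumption from the commit message: CDNA parts are capped at 16 warps,
    and anything not recognized as CDNA is allowed up to 32.
    """
    limit = 16 if arch_is_cdna(arch) else 32
    return min(requested, limit)


if __name__ == "__main__":
    print(cap_num_warps(32, "gfx950"))
```

Before this commit "gfx950" was absent from the tuple, so in this sketch it would have taken the non-CDNA branch and kept the 32-warp request, which is the failure mode the commit message describes.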