Add gfx950 (MI355X/CDNA4) to is_cdna() for correct Triton num_warps

GoldenGrapeGentleman · GoldenGrapeGentleman · commit 41c5a9639fe0 · 2026-02-14T04:15:37.000-06:00
MI355X (gfx950) has the same 1024-thread workgroup limit as MI300X (gfx942), but it was missing from is_cdna(). As a result, all Triton kernels used num_warps=32 (2048 threads) instead of 16 (1024 threads) and crashed with an OutOfResources error.

Also includes the ROCm GPT-OSS BF16 routing and dequant buffer dtype fix from PR unslothai#4021 by @danielhanchen, cherry-picked for MI355X validation.

Tested on: 8x AMD Instinct MI355X (gfx950), ROCm 7.1
- Vision RL GRPO (Qwen2.5-VL-7B): 5/5 steps
- Code RL GRPO (gpt-oss-20b BF16): 20/20 steps
- gpt-oss-120b GRPO: 5/5 steps (B200 OOM'd on this)
- MoE expert LoRA + save_pretrained_merged: success
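The thread-count arithmetic behind the crash can be sketched roughly as follows. This is an illustration, not the actual unsloth code: `is_cdna_arch`, `pick_num_warps`, and `threads_per_workgroup` are hypothetical helpers, and the 64-thread warp size is an assumption inferred from the figures in the message (32 warps = 2048 threads). The arch list mirrors the tuple this commit extends.

```python
# Illustrative sketch only; helper names are hypothetical, not unsloth's API.

# Arch names listed in is_cdna(), including the gfx950 entry added by this commit.
CDNA_ARCHS = ("gfx940", "gfx941", "gfx942", "gfx950")

# Assumption: 64 threads per warp, inferred from "num_warps=32 (2048 threads)".
WAVEFRONT_SIZE = 64

# Workgroup limit stated in the commit message for MI300X and MI355X.
MAX_WORKGROUP_THREADS = 1024


def is_cdna_arch(arch: str) -> bool:
    """Return True if the arch string names one of the listed CDNA targets."""
    # Drop any feature suffix such as ":sramecc+:xnack-" before comparing.
    return arch.split(":")[0] in CDNA_ARCHS


def threads_per_workgroup(num_warps: int) -> int:
    """Threads launched per workgroup for a given num_warps."""
    return num_warps * WAVEFRONT_SIZE


def pick_num_warps(arch: str, requested: int = 32) -> int:
    """Cap num_warps on CDNA so the workgroup stays within the thread limit."""
    if is_cdna_arch(arch):
        return min(requested, MAX_WORKGROUP_THREADS // WAVEFRONT_SIZE)
    return requested


if __name__ == "__main__":
    for arch in ("gfx942", "gfx950"):
        warps = pick_num_warps(arch)
        print(arch, warps, threads_per_workgroup(warps))
```

Under these assumptions, an arch missing from the list falls through to the uncapped value, which is the behavior the message describes for gfx950 before this change.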
diff --git a/unsloth/kernels/utils.py b/unsloth/kernels/utils.py
@@ -82,6 +82,7 @@ def is_cdna():
         "gfx940",
         "gfx941",
         "gfx942",
+        "gfx950",  # CDNA4 (MI350/MI355X)
     )
 
 
