Commit 41c5a96
Add gfx950 (MI355X/CDNA4) to is_cdna() for correct Triton num_warps
MI355X (gfx950) has the same 1024-thread workgroup limit as MI300X (gfx942),
but was missing from is_cdna(). As a result, all Triton kernels used num_warps=32
(2048 threads) instead of 16 (1024 threads), which caused an OutOfResources crash.
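
The warp-count arithmetic behind this fix can be sketched as follows. This is a minimal illustration, not the project's actual code: the `CDNA_ARCHS` set, the `max_num_warps` helper, and the string-based `is_cdna(arch)` signature are assumptions made for the example. It relies on the 64-thread wavefront size of CDNA GPUs and the 1024-thread workgroup limit stated above.

```python
# Hypothetical sketch: names and signatures are illustrative only.
CDNA_ARCHS = {"gfx942", "gfx950"}  # gfx950 is the kind of entry this commit adds
WAVEFRONT_SIZE = 64                # threads per wavefront on CDNA GPUs
MAX_WORKGROUP_THREADS = 1024       # workgroup limit cited in the commit message
DEFAULT_MAX_WARPS = 32             # cap used when the arch is not detected as CDNA


def is_cdna(arch: str) -> bool:
    """Return True if the architecture string names a known CDNA target."""
    return arch in CDNA_ARCHS


def max_num_warps(arch: str) -> int:
    """Largest num_warps that stays within the workgroup thread limit."""
    if is_cdna(arch):
        # 1024 threads / 64 threads per wavefront = 16 warps
        return MAX_WORKGROUP_THREADS // WAVEFRONT_SIZE
    return DEFAULT_MAX_WARPS


if __name__ == "__main__":
    for arch in ("gfx942", "gfx950"):
        warps = max_num_warps(arch)
        print(arch, warps, warps * WAVEFRONT_SIZE)
```

Without the gfx950 entry, the lookup would fall through to the default cap of 32 warps, which at 64 threads per wavefront requests 2048 threads and exceeds the limit.
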
Also includes ROCm GPT-OSS BF16 routing and dequant buffer dtype fix from PR unslothai#4021
by @danielhanchen, cherry-picked for MI355X validation.
Tested on: 8x AMD Instinct MI355X (gfx950), ROCm 7.1
- Vision RL GRPO (Qwen2.5-VL-7B): 5/5 steps
- Code RL GRPO (gpt-oss-20b BF16): 20/20 steps
- gpt-oss-120b GRPO: 5/5 steps (B200 OOM'd on this)
- MoE expert LoRA + save_pretrained_merged: success

1 parent 8729bb5 commit 41c5a96
1 file changed, +1 -0 lines changed

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 82 | 82 | |
| 83 | 83 | |
| 84 | 84 | |
| | 85 | + |
| 85 | 86 | |
| 86 | 87 | |
| 87 | 88 | |