ROCm: default GPT-OSS to BF16 and disable AITER#4021

Open
danielhanchen wants to merge 12 commits into main from rocm-gpt-oss-bf16-aiter

Conversation

@danielhanchen
Contributor

@danielhanchen danielhanchen commented Feb 10, 2026

Summary

  • Default GPT-OSS model selection to BF16 on HIP to avoid MXFP4 and prequantized blocksize issues
  • Disable AITER and ROCm RoPE backend by default on HIP to avoid build locks and runtime faults
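
For reference, a minimal sketch of how such HIP-only defaults can be applied without clobbering user overrides (the env var names AITER_DISABLE and USE_ROCM_AITER_ROPE_BACKEND are taken from the discussion below; the real change lives in unsloth/device_type.py):

import os
import torch

# Illustrative sketch, not the actual device_type.py code.
IS_HIP = getattr(torch.version, "hip", None) is not None

if IS_HIP:
    # setdefault keeps any explicit user-provided values intact.
    os.environ.setdefault("AITER_DISABLE", "1")                # avoid AITER JIT build locks
    os.environ.setdefault("USE_ROCM_AITER_ROPE_BACKEND", "0")  # avoid ROCm RoPE runtime faults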

Testing

  • gpt-oss-(20B)-GRPO.ipynb (30 steps)
  • gpt-oss-(20B)-Fine-tuning.ipynb (30 steps)
  • Gemma3_(4B)-Vision.ipynb (30 steps)
  • Llama3.2_(1B_and_3B)-Conversational.ipynb (60 steps)

@gemini-code-assist
Contributor

Summary of Changes

Hello @danielhanchen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the stability and compatibility for users on ROCm/HIP platforms by addressing known issues with model quantization and specific backend functionalities. It ensures that GPT-OSS models load in a more robust BF16 format by default and proactively disables problematic AITER and ROCm RoPE backends to prevent system instability, providing a smoother experience for users.

Highlights

  • ROCm GPT-OSS Default to BF16: GPT-OSS model selection on HIP devices now defaults to BF16 to mitigate issues with MXFP4 and prequantized block sizes, ensuring more stable model loading.
  • AITER and ROCm RoPE Backend Disabled by Default: AITER and the ROCm RoPE backend are now disabled by default on HIP to prevent potential JIT build locks and runtime faults. Users can override these settings via environment variables.


Changelog
  • unsloth/device_type.py
    • Imported the 'os' module.
    • Added default environment variable settings to disable AITER and the ROCm RoPE backend when running on HIP devices.
  • unsloth/models/loader.py
    • Introduced conditional logic to force GPT-OSS models to load as BF16 on HIP devices if an exact model name is not specified, overriding other quantization settings.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces defaults for ROCm environments to improve stability and avoid issues with certain models. Specifically, it disables AITER to prevent build locks and runtime faults, and it defaults GPT-OSS models to their BF16 versions to work around MXFP4 and pre-quantization problems. The changes are sensible, but I've suggested a small improvement to make the model selection logic for GPT-OSS more robust.

Comment on lines 281 to 284
if "120b" in model_name.lower():
model_name = "unsloth/gpt-oss-120b-BF16"
else:
model_name = "unsloth/gpt-oss-20b-BF16"
Severity: medium

The current logic for selecting the BF16 version of a GPT-OSS model assumes that any model not containing '120b' is a '20b' model. This is not robust and could lead to incorrectly loading the 20b model for other sizes (e.g., a hypothetical 60b model). I suggest making this logic more explicit by handling the '20b' case and adding a warning for any other unhandled sizes.

Suggested change
-if "120b" in model_name.lower():
-    model_name = "unsloth/gpt-oss-120b-BF16"
-else:
-    model_name = "unsloth/gpt-oss-20b-BF16"
+if "120b" in model_name.lower():
+    model_name = "unsloth/gpt-oss-120b-BF16"
+elif "20b" in model_name.lower():
+    model_name = "unsloth/gpt-oss-20b-BF16"
+else:
+    logger.warning_once(
+        f"Unsloth: gpt-oss model '{model_name}' is not explicitly supported for BF16 defaulting on ROCm. "
+        f"Defaulting to 'unsloth/gpt-oss-20b-BF16', which might be incorrect."
+    )
+    model_name = "unsloth/gpt-oss-20b-BF16"

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4e1e24cc02

Comment on lines 282 to 284
model_name = "unsloth/gpt-oss-120b-BF16"
else:
model_name = "unsloth/gpt-oss-20b-BF16"

P1: Preserve non-base GPT-OSS IDs on HIP

This rewrite forces any HIP model_name that contains gpt-oss into one of two base BF16 checkpoints, so with default use_exact_model_name=False it silently changes explicit variants (for example unsloth/gpt-oss-safeguard-20b/120b in unsloth/models/mapper.py) and custom fine-tuned IDs or paths into unsloth/gpt-oss-20b-BF16 or -120b-BF16. In those cases users load a different model than requested and can lose safeguard behavior without an error, so the remap should be restricted to the base GPT-OSS IDs (or fail clearly for unsupported variants).
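
A minimal sketch of the restricted remap being suggested here, assuming an exact-match allowlist of base GPT-OSS IDs (the mapping below is illustrative, not the PR's code):

# Hypothetical allowlist: only base GPT-OSS checkpoints are remapped to BF16 on HIP;
# safeguard variants, fine-tunes, and local paths pass through unchanged.
_HIP_GPT_OSS_BF16_MAP = {
    "unsloth/gpt-oss-20b":  "unsloth/gpt-oss-20b-BF16",
    "unsloth/gpt-oss-120b": "unsloth/gpt-oss-120b-BF16",
    "openai/gpt-oss-20b":   "unsloth/gpt-oss-20b-BF16",
    "openai/gpt-oss-120b":  "unsloth/gpt-oss-120b-BF16",
}

def remap_gpt_oss_for_hip(model_name: str) -> str:
    # Exact match only: unknown variants are left as-is instead of being silently rewritten.
    return _HIP_GPT_OSS_BF16_MAP.get(model_name.lower(), model_name)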

@danielhanchen
Contributor Author

Validation update for ROCm:

  • Pushed commit 3daaff6b to guard patch_gradient_accumulation_fix when generated _unsloth___init__ is missing.
  • Revalidated with temp/run_229_verify_pr4021_llama3_2_conv on ROCm (Llama3.2_1B_and_3B_Conversational.py, 60 steps).
  • Training completed with sane loss trend (train_loss=0.8569) and inference outputs remained coherent.
  • Confirmed no leftover run process and temporary outputs/*lora* artifacts were cleaned automatically.
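
A rough sketch of the guard in the first bullet, assuming the patch works by exec-ing generated source and pulling _unsloth___init__ out of a local scope (helper and argument names here are illustrative, not the commit's exact code):

def apply_init_patch_safely(trainer_cls, generated_source):
    # Execute the generated source, but skip the patch instead of crashing
    # if the expected _unsloth___init__ never materializes.
    local_scope = {}
    try:
        exec(generated_source, globals(), local_scope)
    except Exception as e:
        print(f"Unsloth: skipping __init__ patch for {trainer_cls.__name__}: {e}")
        return
    patched = local_scope.get("_unsloth___init__")
    if patched is None:
        print(f"Unsloth: generated _unsloth___init__ missing for {trainer_cls.__name__}; skipping patch.")
        return
    trainer_cls.__init__ = patched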

NVIDIA compatibility notes:

  • Existing non-ROCm behavior in make_fast_generate_wrapper remains unchanged via DEVICE_TYPE == "hip" guards.
  • The new init patch guard is platform-agnostic and only changes failure handling from hard crash to safe skip when source patch generation fails.

@danielhanchen
Contributor Author

Recheck update for PR #4021 against current ROCm notebook reruns.

Re-evaluated items

  • AITER defaults in device_type (AITER_DISABLE=1 and USE_ROCM_AITER_ROPE_BACKEND=0 via setdefault).
  • Trainer init patch guard in models/_utils.py (local_scope + generated function fallback).
  • TorchAO Int4WeightOnlyConfig(version=2) with TypeError fallback.
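
The third item follows a straightforward compatibility pattern; a sketch assuming torchao's Int4WeightOnlyConfig (newer releases accept version=2, older ones raise TypeError on the unknown keyword):

from torchao.quantization import Int4WeightOnlyConfig

try:
    int4_config = Int4WeightOnlyConfig(version=2)   # newer torchao
except TypeError:
    int4_config = Int4WeightOnlyConfig()            # older torchao without the version kwarg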

Findings

  • AITER override path verified: an explicit user env override still takes effect and is not clobbered by the setdefault defaults.
    • Run: temp/run_236_aiter_on_llama3_2_conv
    • Result: SUCCESS with sane inference output.
  • Trainer init guard remains defensive only; no regression observed in reruns.
  • TorchAO fallback remains compatible with current torchao and protects older versions.

Action

@danielhanchen
Contributor Author

Follow-up adjustment: HIP GPT-OSS routing is now capability-based, not blanket BF16.

What changed

  • Updated unsloth/models/loader.py in both loader paths (FastLanguageModel.from_pretrained and FastModel.from_pretrained).
  • Previous behavior forced BF16 for HIP GPT-OSS unconditionally.
  • New behavior checks device capability flags from device_type.py:
    • ALLOW_BITSANDBYTES
    • ALLOW_PREQUANTIZED_MODELS
  • BF16 fallback now applies only when prequantized GPT-OSS is not usable.

Logic

  • wants_prequantized = load_in_4bit or model_name has bnb-4bit suffix
  • can_use_prequantized = ALLOW_BITSANDBYTES and ALLOW_PREQUANTIZED_MODELS
  • Fallback to BF16 only if not (wants_prequantized and can_use_prequantized).
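
Expressed as code, that gating looks roughly like the following (a sketch; ALLOW_BITSANDBYTES and ALLOW_PREQUANTIZED_MODELS come from device_type.py, and the function name is illustrative):

def _resolve_hip_gpt_oss(model_name, load_in_4bit, allow_bitsandbytes, allow_prequantized):
    if "gpt-oss" not in model_name.lower():
        return model_name
    wants_prequantized = load_in_4bit or model_name.lower().endswith("bnb-4bit")
    can_use_prequantized = allow_bitsandbytes and allow_prequantized
    if wants_prequantized and can_use_prequantized:
        return model_name  # Radeon-class devices keep the prequantized checkpoint
    # Instinct/MI-class path: fall back to the BF16 checkpoints.
    if "120b" in model_name.lower():
        return "unsloth/gpt-oss-120b-BF16"
    return "unsloth/gpt-oss-20b-BF16"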

Why

  • Radeon setups can support prequantized GPT-OSS 4bit.
  • Instinct/MI-class setups can require BF16 fallback.
  • This keeps behavior aligned with device_type.py capability detection.

Validation (ROCm)

  • temp/run_242_postloadercap_gpt_oss_grpo: SUCCESS
    • losses [0.0, 0.0]
    • grad norms [0.264757..., 3.079521...]
    • reward columns populated and varying.
  • temp/run_243_postloadercap_llama3_2_conv: SUCCESS
    • losses [0.6471, 0.6471, 0.6443, 0.6324, 0.5953] (decreasing)
    • grad norms logged and stable.

Note

  • On this test machine, ALLOW_PREQUANTIZED_MODELS=False, so BF16 fallback remains active here as expected.
  • The new gating enables prequantized path automatically on HIP devices where capability flags permit it.

Commit

  • 24c7f2ee

@danielhanchen
Contributor Author

Recheck complete. I cleaned up the PR to remove unintended fallback behavior that made the diff look botched.

What I removed

  • HIP-specific fallback in loader.py that silently downgraded fast_inference=True when vllm is missing.
  • HIP-specific behavior in make_fast_generate_wrapper that auto-dropped sampling_params / lora_request and auto-tokenized string inputs.

What remains intentionally in this PR

  • device_type.py: ROCm AITER defaults via setdefault so explicit user env overrides still win.
  • models/loader.py: capability-gated GPT-OSS BF16 fallback using ALLOW_BITSANDBYTES + ALLOW_PREQUANTIZED_MODELS (Radeon can still use prequantized checkpoints where supported).
  • models/_utils.py: defensive Trainer init patch guard + TorchAO Int4WeightOnlyConfig(version=2) with TypeError fallback for older TorchAO.

Commit pushed

  • a211e8c7

I will continue the AMD notebook reruns and keep logging outcomes in RUN_DETAILS.csv and TRANSCRIPT.md.

@danielhanchen
Contributor Author

Applied a fresh cleanup/refactor for the duplicated HIP GPT-OSS routing block in loader.py.

What changed

  • Introduced shared helper: _route_hip_gpt_oss_model(...).
  • Replaced both inline copies with helper calls in:
    • FastLanguageModel.from_pretrained
    • FastModel.from_pretrained
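
In shape, the refactor means both entry points delegate to the one helper (a simplified sketch, not the actual loader code):

DEVICE_TYPE = "hip"  # assumed to come from unsloth.device_type in the real code

def _route_hip_gpt_oss_model(model_name, load_in_4bit):
    return model_name  # placeholder: the real helper applies the capability-gated routing

class FastLanguageModel:
    @staticmethod
    def from_pretrained(model_name, load_in_4bit=True, **kwargs):
        if DEVICE_TYPE == "hip":
            model_name = _route_hip_gpt_oss_model(model_name, load_in_4bit)
        # ... rest of the language-model loader path unchanged ...
        return model_name

class FastModel:
    @staticmethod
    def from_pretrained(model_name, load_in_4bit=True, **kwargs):
        if DEVICE_TYPE == "hip":
            model_name = _route_hip_gpt_oss_model(model_name, load_in_4bit)
        # ... rest of the generic loader path unchanged ...
        return model_name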

Behavior

  • No routing logic changes intended.
  • Same capability-gated behavior (ALLOW_BITSANDBYTES + ALLOW_PREQUANTIZED_MODELS) is preserved.
  • Kept both call sites because both loader entrypoints are valid and used.

Validation

  • python -m py_compile unsloth/models/loader.py passed in ROCm container.

Commit

  • b56ab6ae

@danielhanchen
Contributor Author

Follow-up cleanup applied:

  • Moved _route_hip_gpt_oss_model(...) to the footer of unsloth/models/loader.py for readability.
  • No behavior changes; call sites remain unchanged.

Safety note

  • This does not break runtime resolution in Python for this case because the helper is looked up when from_pretrained executes, after module import completes.
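
A tiny self-contained example of that name-resolution rule (the name is looked up when the caller runs, not when it is defined):

def caller():
    # `helper` is resolved at call time, so it may be defined further down the module.
    return helper()

def helper():
    return "defined after caller, still found"

print(caller())  # fine: the module finished importing before caller() executed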

Validation

  • python -m py_compile unsloth/models/loader.py passed in ROCm container.
  • Import check passed with helper present (helper_end_exists=True).

Commit

  • 4f138acb

@danielhanchen
Contributor Author

Quick status update from the latest ROCm notebook cycle.

  • No new unsloth code changes were required in this pass.
  • The newly identified blocker was in GRPO VLM log-prob handling and was fixed in unsloth-zoo PR #494 (c312e59).
  • Verification runs this cycle:
    • temp/run_261_Qwen3_VL_8B_Vision_nocompile_rerun: SUCCESS
    • temp/run_267_Gemma3_4B_Vision_GRPO_nocompile_no_flex_logitsfix: SUCCESS (30/30 GRPO steps)

I will continue the remaining notebook sweep and report any unsloth-side regressions if they appear.

@danielhanchen
Contributor Author

Follow-up AMD notebook stability fixes are now pushed.

  • Commit: 9956d1d4
  • Files:
    • unsloth/__init__.py
    • unsloth/models/vision.py
    • unsloth/save.py

What changed:

  • Added is_bf16_supported() definitions for HIP/XPU branches so downstream checks have a stable callable.
  • Hooked Deepseek OCR patch invocation in FastVisionModel.from_pretrained() with a guarded import/call path.
  • Added an offline GGUF guard path in save.py (UNSLOTH_GGUF_OFFLINE=1) so Ollama notebook flows do not fail hard when GGUF conversion dependencies are unavailable.
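
For the GGUF guard in the last bullet, a sketch of the pattern (the env var name UNSLOTH_GGUF_OFFLINE comes from this comment; the surrounding function is illustrative, not the actual save.py code):

import os

def maybe_convert_to_gguf(save_directory):
    # Skip, rather than hard-fail, when the user opted into offline mode or the
    # conversion tooling is unavailable.
    if os.environ.get("UNSLOTH_GGUF_OFFLINE", "0") == "1":
        print("Unsloth: UNSLOTH_GGUF_OFFLINE=1 set; skipping GGUF conversion.")
        return None
    try:
        import gguf  # llama.cpp conversion dependency; the import doubles as the availability check
    except ImportError:
        print("Unsloth: GGUF tooling not found; skipping conversion instead of raising.")
        return None
    # ... llama.cpp conversion would run here ...
    return save_directory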

Validation evidence (ROCm runs):

  • Deepseek OCR fix path: temp/run_247_Deepseek_OCR_3B (fail) -> temp/run_249_Deepseek_OCR_3B_rerun (success)
  • Ollama flow: temp/run_271_Llama3_8B_Ollama (success)
  • Full tracked notebook status is now green in RUN_DETAILS.csv (latest row per notebook = SUCCESS for all 25 notebooks)

@danielhanchen
Contributor Author

Additional follow-up pushed:

  • Commit: f0da8260
  • File: unsloth/kernels/utils.py

Change:

  • Reinitialize dequant global weight buffer when dtype changes (WEIGHT_BUFFER is None or WEIGHT_BUFFER.dtype != dtype).
  • This avoids mixed bf16/fp16 reuse mismatch in global dequant buffers.
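
A sketch of the buffer-reuse guard being described (illustrative; the actual change is in unsloth/kernels/utils.py):

import torch

WEIGHT_BUFFER = None  # module-level scratch buffer shared across dequant calls

def get_weight_buffer(num_elements, dtype, device="cuda"):
    global WEIGHT_BUFFER
    # Reallocate when the buffer is missing, holds the wrong dtype, or is too small,
    # so a bf16 pass never silently reuses an fp16 buffer (and vice versa).
    if (WEIGHT_BUFFER is None
            or WEIGHT_BUFFER.dtype != dtype
            or WEIGHT_BUFFER.numel() < num_elements):
        WEIGHT_BUFFER = torch.empty(num_elements, dtype=dtype, device=device)
    return WEIGHT_BUFFER[:num_elements]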

Why this was added:

Note on PR #4029:

  • I used --disable-compile in the Gemma3N notebook harness for targeted triage and validation.
  • I did not add a blanket HIP compile-disable in library code in this branch to avoid over-broad behavior changes.

@danielhanchen
Contributor Author

Superseding my prior malformed CLI comment with corrected content.

Re-review update with fresh A/B checks on ROCm:

  1. Removed redundant Deepseek OCR patch invocation in unsloth/models/vision.py (commit 734649e4).
     • The Deepseek fix is already applied via temporary patches initialization in unsloth-zoo, so this loader-level call was a duplicate.
  2. Validation evidence
     • temp/run_316_Deepseek_OCR_prunecheck (Deepseek masked_scatter patch removed): FAILED
       • RuntimeError: size mismatch in masked_scatter path.
     • temp/run_317_Deepseek_OCR_prunecheck_retry (Deepseek masked_scatter patch restored, loader duplicate call removed): SUCCESS.
     • temp/run_320_Llama3_2_conv_recheck: SUCCESS.
  3. Current assessment
     • Keep the Deepseek masked_scatter patch itself (required for the Deepseek OCR inference path).
     • Remove the duplicate invocation in vision.py (now done).
     • This cleanup commit does not introduce NVIDIA-specific behavior.

Additional note from sanity run:

  • temp/run_321_gpt_oss_20B_GRPO_recheck failed with CPU/CUDA embedding index device mismatch in a notebook generate path (embed_tokens offload interaction). This is separate from the duplicate-call cleanup and needs a dedicated fix path.

GoldenGrapeGentleman added a commit to GoldenGrapeGentleman/unsloth that referenced this pull request Feb 14, 2026
MI355X (gfx950) has the same 1024-thread workgroup limit as MI300X (gfx942),
but was missing from is_cdna(), causing all Triton kernels to use num_warps=32
(2048 threads) instead of 16 (1024 threads), resulting in OutOfResources crash.

Also includes ROCm GPT-OSS BF16 routing and dequant buffer dtype fix from PR unslothai#4021
by @danielhanchen, cherry-picked for MI355X validation.

Tested on: 8x AMD Instinct MI355X (gfx950), ROCm 7.1
- Vision RL GRPO (Qwen2.5-VL-7B): 5/5 steps
- Code RL GRPO (gpt-oss-20b BF16): 20/20 steps
- gpt-oss-120b GRPO: 5/5 steps (B200 OOM'd on this)
- MoE expert LoRA + save_pretrained_merged: success
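
For context, a sketch of the kind of arch check and warp cap the commit describes (CDNA wavefronts are 64 threads, so 16 warps = 1024 threads; the function below is illustrative, not the exact patch):

import torch

def is_cdna() -> bool:
    # CDNA datacenter parts: MI200 (gfx90a), MI300X (gfx942), MI355X (gfx950).
    if torch.version.hip is None or not torch.cuda.is_available():
        return False
    arch = torch.cuda.get_device_properties(0).gcnArchName
    return any(tag in arch for tag in ("gfx90a", "gfx942", "gfx950"))

# Cap Triton launches at the 1024-thread workgroup limit on CDNA (16 * 64 = 1024 threads).
MAX_NUM_WARPS = 16 if is_cdna() else 32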