
Conversation

@isharif168

SUMMARY:
Add code to convert the gpt_oss model and then quantize it to W4A8

TEST PLAN:
Tested with the gpt_oss model and ran it with vLLM

Sample output

=== Prompt 0 ===
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-12-04

Reasoning: medium

Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

You are a helpful assistant.

<|end|><|start|>user<|message|>Give 3 reasons to use AI.<|end|><|start|>assistant

--- Candidate 0 ---
analysisWe need to answer with three reasons to use AI. Should be short. Provide 3 reasons. Let's do it.assistantcommentaryHere are three reasons to use AI:

Personalized learning – AI can adapt to your learning style and pace, providing customized lessons that suit your learning style and progress.
Skill development – AI can help you develop skills that are in demand, such as coding, data analysis, and data science.
Supportive design
We should provide a short answer.assistantfinalHere are three reasons to use AI:

Personalized learning – AI adapts to your learning style and pace, giving you personalized lessons that fit your needs.
Skill development – AI helps you learn and practice skills that are in demand, like coding, data analysis, and data‑science skills.
Real‑world impact – AI helps you practice real-world skills in real life, such as learning new skills and applying them in real life.
finish_reason: stop
num_tokens: 226

@gemini-code-assist
Contributor

Summary of Changes

Hello @isharif168, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enables the quantization of GPT-OSS models by introducing a mechanism to linearize their Mixture-of-Experts (MoE) layers. By converting the fused expert weights into separate linear projections, the models become compatible with the LLM Compressor's W4A8 quantization scheme. A new example script is provided to illustrate this process, allowing users to quantize the openai/gpt-oss-20b model for reduced memory footprint and faster inference.

Highlights

  • GPT-OSS Model Linearization: Introduced new classes LinearExpert and LinearExperts in src/llmcompressor/modeling/gpt_oss.py to refactor the fused Mixture-of-Experts (MoE) layers found in GPT-OSS models. This converts the original fused expert weights into distinct nn.Linear projections (gate, up, and down), making them compatible with standard quantization techniques.
  • Conversion Utility for Quantization: Added convert_model_for_quantization_gptoss function, which performs an in-place conversion of a GPT-OSS model by replacing its fused MoE expert blocks with the newly defined LinearExperts structure. This prepares the model for efficient quantization.
  • W4A8 Quantization Example: Included a new example script examples/quantization_w4a8/gpt_oss_20b_example.py that demonstrates how to load the openai/gpt-oss-20b model, apply the linearization, and then quantize it to a W4A8 scheme (4-bit weights, 8-bit activations) using llmcompressor's oneshot functionality.
  • Quantization Scheme Details: The example script specifies a W4A8 quantization configuration: weights are 4-bit, channel-wise, symmetric, and static, while activations are 8-bit, per-token, asymmetric, and dynamic. The lm_head layer is ignored during quantization.
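Putting the highlights above together, the conversion-then-quantization flow would look roughly like the sketch below. The recipe mirrors the scheme described in the last bullet; exact import paths, argument names, and output locations are assumptions and may differ from the actual example script in this PR.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modeling.gpt_oss import convert_model_for_quantization_gptoss
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "openai/gpt-oss-20b"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# In-place conversion: fused MoE expert weights become per-expert
# gate/up/down nn.Linear projections that quantization can target.
convert_model_for_quantization_gptoss(model)

# W4A8: 4-bit channel-wise symmetric static weights,
# 8-bit per-token asymmetric dynamic activations; lm_head is skipped.
recipe = QuantizationModifier(
    ignore=["lm_head"],
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "strategy": "channel",
            },
            "input_activations": {
                "num_bits": 8,
                "type": "int",
                "symmetric": False,
                "strategy": "token",
                "dynamic": True,
            },
        }
    },
)

oneshot(model=model, recipe=recipe)

model.save_pretrained("gpt-oss-20b-W4A8", save_compressed=True)
tokenizer.save_pretrained("gpt-oss-20b-W4A8")
```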

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for linearizing the gpt_oss model to enable quantization, along with an example script demonstrating w4a8 quantization. The core logic for converting the fused MoE experts into separate linear layers is well-structured. However, I've found a critical bug in the forward pass of the new LinearExperts module that would cause a crash when processing tokens assigned to the 'no expert' bucket. The example script is clear and provides a good demonstration of the new functionality.
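For context on the failure mode flagged above: in a routed MoE forward pass, an expert (including the "no expert" bucket) can receive zero tokens, and naively indexing or multiplying an empty selection is a common source of crashes. The snippet below is an illustrative guard with hypothetical names, not the code from this PR:

```python
import torch

def moe_dispatch(hidden_states, router_logits, experts, top_k=4):
    """Toy MoE dispatch that skips experts which received no tokens.

    hidden_states: [num_tokens, hidden]; router_logits: [num_tokens, num_experts];
    experts: an nn.ModuleList of per-expert modules mapping hidden -> hidden.
    """
    routing_weights, selected = torch.topk(router_logits.softmax(dim=-1), top_k, dim=-1)
    output = torch.zeros_like(hidden_states)
    for expert_idx, expert in enumerate(experts):
        token_idx, slot_idx = (selected == expert_idx).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            # Empty bucket: nothing was routed here, so skip instead of
            # running the expert on a zero-length batch.
            continue
        expert_out = expert(hidden_states[token_idx])
        weighted = routing_weights[token_idx, slot_idx].unsqueeze(-1) * expert_out
        output.index_add_(0, token_idx, weighted.to(output.dtype))
    return output
```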

@isharif168 force-pushed the convert_to_linear_gpt_oss branch 2 times, most recently from 626807e to 8f5e79f on December 11, 2025 at 12:13
@isharif168
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces functionality to linearize the gpt_oss model's Mixture-of-Experts layers to enable quantization, along with an example script demonstrating its usage for W4A8 quantization. The implementation for converting the fused MoE experts into separate linear layers is well-structured and correct. The example script is clear and effectively showcases the new capability. I have one suggestion to improve the library design by removing a direct print statement from a utility function.

@isharif168 force-pushed the convert_to_linear_gpt_oss branch from 8f5e79f to 062ad15 on December 11, 2025 at 12:20
@isharif168
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for quantizing the GPT-OSS model by linearizing its Mixture-of-Experts (MoE) layers. A new module, src/llmcompressor/modeling/gpt_oss.py, is added to handle the in-place conversion of the model's fused expert layers into separate, quantizable linear layers. The logic for de-interleaving weights and handling the sparse routing seems correct. Additionally, a new example script, examples/quantization_w4a8/gpt_oss_20b_example.py, effectively demonstrates the W4A8 quantization process on the gpt-oss-20b model. My review includes a couple of suggestions to improve the robustness of the model conversion code, including a more comprehensive sanity check to prevent potential runtime errors.

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@isharif168 force-pushed the convert_to_linear_gpt_oss branch from 062ad15 to e521436 on December 11, 2025 at 12:38
@isharif168
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for linearizing and quantizing the gpt_oss model. It adds a new module src/llmcompressor/modeling/gpt_oss.py to handle the model conversion from fused MoE experts to a format with standard nn.Linear layers, which is a prerequisite for quantization. Additionally, an example script examples/quantization_w4a8/gpt_oss_20b_example.py is included to demonstrate the W4A8 quantization process. The implementation of the model conversion is robust, correctly de-interleaving weights and replicating the MoE forward pass. The example script is clear and effectively showcases the new functionality. I have identified a latent bug in a utility function and a minor formatting issue that should be addressed to improve code quality and robustness.

@isharif168 force-pushed the convert_to_linear_gpt_oss branch from e521436 to 05c808c on December 11, 2025 at 13:15
@isharif168
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces functionality to linearize the gpt_oss model's MoE layers, making them suitable for quantization, and includes an example script for w4a8 quantization. The core logic in src/llmcompressor/modeling/gpt_oss.py correctly de-fuses the expert layers and replaces them with quantization-friendly nn.Linear modules. The accompanying example, examples/quantization_w4a8/gpt_oss_20b_example.py, effectively demonstrates this new capability. The implementation is well-structured and documented. I have provided a couple of suggestions to enhance efficiency and error handling in the model conversion logic.

@dsikka
Collaborator

dsikka commented Dec 11, 2025

Thank you! Do you have a sample model you can share with us?

@isharif168
Author

Hi @dsikka, I have added the example as well, which will convert and quantize the gpt_oss_20b model:
python gpt_oss_20b_example.py
This should download the model and also quantize it.
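After quantization, the vLLM run mentioned in the test plan would look roughly like this; the checkpoint path is a placeholder for wherever the example script saves the compressed model:

```python
from vllm import LLM, SamplingParams

# "gpt-oss-20b-W4A8" stands in for the example script's output directory.
llm = LLM(model="gpt-oss-20b-W4A8")
params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Give 3 reasons to use AI."], params)
print(outputs[0].outputs[0].text)
```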
