[CPU] Linearize gpt_oss model and add example to quantize it to w4a8 #2113
base: main
Conversation
Summary of Changes
Hello @isharif168, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enables the quantization of GPT-OSS models by introducing a mechanism to linearize their Mixture-of-Experts (MoE) layers. By converting the fused expert weights into separate linear projections, the models become compatible with LLM Compressor's W4A8 quantization scheme. A new example script is provided to illustrate this process, allowing users to quantize the gpt-oss-20b model.
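For readers new to the conversion, the core idea is compact: GPT-OSS stores each layer's experts as one fused tensor with gate and up columns interleaved, and the PR splits that tensor into per-expert `nn.Linear` modules. A minimal sketch of the de-interleaving follows; the even/odd column layout matches the HF GPT-OSS forward pass, but the helper name is invented here and bias handling is omitted:

```python
import torch
import torch.nn as nn

def split_fused_experts(gate_up_proj: torch.Tensor) -> list[tuple[nn.Linear, nn.Linear]]:
    """Split a fused (num_experts, hidden, 2 * intermediate) weight into
    per-expert gate/up nn.Linear layers. Assumes even columns are the gate
    projection and odd columns the up projection."""
    num_experts, hidden, two_inter = gate_up_proj.shape
    inter = two_inter // 2
    pairs = []
    for e in range(num_experts):
        gate = nn.Linear(hidden, inter, bias=False)
        up = nn.Linear(hidden, inter, bias=False)
        # nn.Linear stores weight as (out_features, in_features), hence the .T
        gate.weight.data.copy_(gate_up_proj[e, :, 0::2].T)
        up.weight.data.copy_(gate_up_proj[e, :, 1::2].T)
        pairs.append((gate, up))
    return pairs
```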
Code Review
This pull request introduces support for linearizing the gpt_oss model to enable quantization, along with an example script demonstrating w4a8 quantization. The core logic for converting the fused MoE experts into separate linear layers is well-structured. However, I've found a critical bug in the forward pass of the new LinearExperts module that would cause a crash when processing tokens assigned to the 'no expert' bucket. The example script is clear and provides a good demonstration of the new functionality.
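The bug described above is a standard pitfall in sparse MoE dispatch: an expert (or a sentinel "no expert" slot) can receive zero tokens, and indexing with the resulting empty selection crashes. A hedged sketch of the usual guard; the names are illustrative, not the fields of the actual LinearExperts module:

```python
import torch

def dispatch_to_experts(hidden, router_logits, experts, top_k=4):
    # hidden: (num_tokens, hidden_size); experts: per-expert callables
    weights, indices = torch.topk(torch.softmax(router_logits, dim=-1), top_k)
    out = torch.zeros_like(hidden)
    for expert_idx, expert in enumerate(experts):
        token_pos, slot = (indices == expert_idx).nonzero(as_tuple=True)
        if token_pos.numel() == 0:
            continue  # an empty bucket must be skipped, not indexed
        out[token_pos] += weights[token_pos, slot].unsqueeze(-1) * expert(hidden[token_pos])
    return out
```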
Force-pushed: 626807e → 8f5e79f
/gemini review
Code Review
This pull request introduces functionality to linearize the gpt_oss model's Mixture-of-Experts layers to enable quantization, along with an example script demonstrating its usage for W4A8 quantization. The implementation for converting the fused MoE experts into separate linear layers is well-structured and correct. The example script is clear and effectively showcases the new capability. I have one suggestion to improve the library design by removing a direct print statement from a utility function.
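The print-statement point is standard library hygiene; since llm-compressor routes diagnostics through loguru, the fix presumably amounts to something like the following (the counter variable is illustrative):

```python
from loguru import logger

# Report conversion progress through the logger instead of print(), so
# downstream users of the library can filter or silence it.
logger.info(f"Replaced {num_replaced} fused GPT-OSS expert modules with linear layers")
```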
Force-pushed: 8f5e79f → 062ad15
/gemini review
Code Review
This pull request introduces support for quantizing the GPT-OSS model by linearizing its Mixture-of-Experts (MoE) layers. A new module, src/llmcompressor/modeling/gpt_oss.py, is added to handle the in-place conversion of the model's fused expert layers into separate, quantizable linear layers. The logic for de-interleaving weights and handling the sparse routing seems correct. Additionally, a new example script, examples/quantization_w4a8/gpt_oss_20b_example.py, effectively demonstrates the W4A8 quantization process on the gpt-oss-20b model. My review includes a couple of suggestions to improve the robustness of the model conversion code, including a more comprehensive sanity check to prevent potential runtime errors.
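The "more comprehensive sanity check" presumably verifies the fused tensor layout before touching any weights. A sketch, under the assumption that the config and attribute names mirror the HF GPT-OSS model:

```python
def check_expert_shapes(experts, config) -> None:
    """Fail fast with an actionable error if the fused weight does not have
    the layout the de-interleaving code assumes, rather than crashing
    partway through the copy."""
    expected = (
        config.num_local_experts,
        config.hidden_size,
        2 * config.intermediate_size,
    )
    actual = tuple(experts.gate_up_proj.shape)
    if actual != expected:
        raise ValueError(
            f"gate_up_proj has shape {actual}, expected {expected}; "
            "refusing to de-interleave an unexpected layout"
        )
```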
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: the label is required to run the full testing suite, so add it only once the PR is code complete and local testing has been performed.
Force-pushed: 062ad15 → e521436
/gemini review
Code Review
This pull request introduces support for linearizing and quantizing the gpt_oss model. It adds a new module src/llmcompressor/modeling/gpt_oss.py to handle the model conversion from fused MoE experts to a format with standard nn.Linear layers, which is a prerequisite for quantization. Additionally, an example script examples/quantization_w4a8/gpt_oss_20b_example.py is included to demonstrate the W4A8 quantization process. The implementation of the model conversion is robust, correctly de-interleaving weights and replicating the MoE forward pass. The example script is clear and effectively showcases the new functionality. I have identified a latent bug in a utility function and a minor formatting issue that should be addressed to improve code quality and robustness.
Commit: … to w4a8 (Signed-off-by: Sharif Inamdar <[email protected]>)
Force-pushed: e521436 → 05c808c
/gemini review
Code Review
This pull request introduces functionality to linearize the gpt_oss model's MoE layers, making them suitable for quantization, and includes an example script for w4a8 quantization. The core logic in src/llmcompressor/modeling/gpt_oss.py correctly de-fuses the expert layers and replaces them with quantization-friendly nn.Linear modules. The accompanying example, examples/quantization_w4a8/gpt_oss_20b_example.py, effectively demonstrates this new capability. The implementation is well-structured and documented. I have provided a couple of suggestions to enhance efficiency and error handling in the model conversion logic.
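On the error-handling side, the in-place conversion typically walks the module tree and swaps each fused expert block on its parent. A sketch, where only the GptOssExperts class name is taken from transformers and the rest is illustrative:

```python
import torch.nn as nn

def replace_moe_modules(model: nn.Module, convert_fn) -> int:
    """Swap every fused GPT-OSS expert module for its linearized equivalent
    and return the replacement count so callers can assert it is non-zero."""
    replaced = 0
    for name, module in list(model.named_modules()):
        if module.__class__.__name__ == "GptOssExperts":
            parent_name, _, child_name = name.rpartition(".")
            parent = model.get_submodule(parent_name) if parent_name else model
            setattr(parent, child_name, convert_fn(module))
            replaced += 1
    if replaced == 0:
        raise RuntimeError("no GptOssExperts modules found; is this a GPT-OSS checkpoint?")
    return replaced
```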
Thank you! Do you have a sample model you can share with us?
Hi @dsikka, I have added the example as well, which will convert and quantize the gpt-oss-20b model.
SUMMARY:
Add code to convert the gpt_oss model and then quantize the converted model to w4a8.
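For context, the example script presumably follows the repo's usual oneshot pattern; a minimal sketch, where the conversion helper's name and the calibration settings are assumptions rather than quotes from the diff:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "openai/gpt-oss-20b"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Hypothetical helper name: linearize the fused MoE experts so their
# projections become nn.Linear layers that the quantizer can target.
from llmcompressor.modeling.gpt_oss import convert_gpt_oss  # name assumed
convert_gpt_oss(model)

# W4A8: 4-bit weights, 8-bit activations; lm_head is left unquantized.
recipe = GPTQModifier(targets="Linear", scheme="W4A8", ignore=["lm_head"])
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

SAVE_DIR = "gpt-oss-20b-W4A8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```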
TEST PLAN:
Tested with the gpt_oss model; the quantized model was run using vLLM.
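The vLLM smoke test likely amounts to loading the saved directory and generating from a prompt; a sketch, with the directory name carried over from the snippet above:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="gpt-oss-20b-W4A8")
params = SamplingParams(temperature=0.8, max_tokens=256)
for output in llm.generate(["Give 3 reasons to use AI."], params):
    print(output.outputs[0].text)
```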
Sample output
=== Prompt 0 ===
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-12-04
Reasoning: medium
Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions
You are a helpful assistant.
<|end|><|start|>user<|message|>Give 3 reasons to use AI.<|end|><|start|>assistant
--- Candidate 0 ---
analysisWe need to answer with three reasons to use AI. Should be short. Provide 3 reasons. Let's do it.assistantcommentaryHere are three reasons to use AI:
Personalized learning – AI can adapt to your learning style and pace, providing customized lessons that suit your learning style and progress.
Skill development – AI can help you develop skills that are in demand, such as coding, data analysis, and data science.
Supportive design
We should provide a short answer.assistantfinalHere are three reasons to use AI:
Personalized learning – AI adapts to your learning style and pace, giving you personalized lessons that fit your needs.
Skill development – AI helps you learn and practice skills that are in demand, like coding, data analysis, and data‑science skills.
Real‑world impact – AI helps you practice real-world skills in real life, such as learning new skills and applying them in real life.
finish_reason: stop
num_tokens: 226