[GPT-OSS] support fp8 online quantization for gpt-oss bf16 #18988
zminglei wants to merge 2 commits into sgl-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces support for FP8 online quantization for GPT-OSS bf16 models. It resolves compatibility issues by ensuring that an appropriate MoE backend is selected when online quantization is enabled, specifically preventing the use of the triton_kernels backend in that case.
Code Review
This pull request successfully adds support for FP8 online quantization for GPT-OSS models, which is a valuable enhancement. The changes are well-implemented, particularly the addition of bias handling in Fp8MoEMethod and the adjustment of the MoE backend selection logic in server_args.py. The code is clear, follows existing patterns, and correctly uses getattr for safe access to optional bias parameters. The logic to prevent the use of the triton_kernel backend with quantization is also sound. Overall, this is a solid contribution.
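The review's note about `getattr` refers to reading bias tensors that may or may not exist on the layer. Below is a minimal, standalone sketch of that pattern; the attribute names (`w13_bias`, `w2_bias`) follow this PR, while the helper name is illustrative and the real `Fp8MoEMethod.apply()` signature is more involved.

```python
import torch.nn as nn

# Bias tensors are only registered when the model has biased MoE projections,
# so the quant method reads them defensively and falls back to None.
def read_optional_moe_biases(layer: nn.Module):
    w13_bias = getattr(layer, "w13_bias", None)  # None for bias-free models
    w2_bias = getattr(layer, "w2_bias", None)
    return w13_bias, w2_bias

layer = nn.Module()  # stand-in for a FusedMoE layer without bias parameters
assert read_optional_moe_biases(layer) == (None, None)
```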
/tag-and-rerun-ci again
GPT-OSS-120B has biased MoE layers (gate_up_proj_bias, down_proj_bias). When serving the BF16 model with `--quantization fp8`, the Fp8MoEMethod does not register bias parameters, causing weight loading failures.

This adds bias support to Fp8MoEMethod:
- Register w13_bias/w2_bias in create_weights() when moe.has_bias is set
- Pass biases through to fused_experts() in apply()
- Guard against unsupported FusedMoEModularKernel + bias combination

Tested on 4xH200 with GPT-OSS-120B BF16:
- vllm serve --quantization fp8 loads successfully with bias
- GSM8K accuracy maintained (0.834 FP8 vs 0.848 BF16)

Companion PR: sgl-project/sglang#18988

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
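A hedged sketch of the bias registration described above. The shapes follow GPT-OSS conventions (gate_up bias: 2 * intermediate_size per expert, down bias: hidden_size per expert); the helper name and argument list are illustrative, not the actual `Fp8MoEMethod.create_weights()` API, which also registers FP8 weights and scales.

```python
import torch
import torch.nn as nn

# Register per-expert bias parameters only when the MoE config has bias.
def register_moe_biases(layer: nn.Module, num_experts: int,
                        intermediate_size: int, hidden_size: int,
                        has_bias: bool, params_dtype=torch.bfloat16) -> None:
    if not has_bias:
        return  # nothing to register for bias-free checkpoints
    layer.register_parameter(
        "w13_bias",
        nn.Parameter(torch.zeros(num_experts, 2 * intermediate_size,
                                 dtype=params_dtype), requires_grad=False),
    )
    layer.register_parameter(
        "w2_bias",
        nn.Parameter(torch.zeros(num_experts, hidden_size,
                                 dtype=params_dtype), requires_grad=False),
    )
```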
GPT-OSS-120B has biased MoE layers (gate_up_proj_bias, down_proj_bias). When serving the BF16 model with `--quantization fp8`, Fp8MoEMethod and Fp8OnlineMoEMethod don't register bias parameters, causing weight loading failures.

This adds bias support to both FP8 MoE method classes:
- Register w13_bias/w2_bias in Fp8MoEMethod.create_weights() when moe.has_bias is set
- Inject biases into quant_config via get_fused_moe_quant_config()
- Register biases in Fp8OnlineMoEMethod.create_weights() using the original (unpatched) weight_loader

Tested on 4xH200 with GPT-OSS-120B BF16 + vllm 0.15.1:
- vllm serve --quantization fp8 loads and serves successfully
- TRITON Fp8 MoE backend selected correctly
- GSM8K accuracy: 0.834 (FP8) vs 0.848 (BF16)
- 1.5x throughput improvement with FP8

Companion PR: sgl-project/sglang#18988

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
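As a reference for where these biases enter the computation that the fused FP8 kernel performs, here is a naive, unfused per-expert forward. The plain SiLU gating is a simplification of GPT-OSS's clamped gated activation, and the function is illustrative rather than code from either repository.

```python
import torch
import torch.nn.functional as F

def expert_forward(x, w13, w13_bias, w2, w2_bias):
    # x: [tokens, hidden], w13: [2*inter, hidden], w2: [hidden, inter]
    gate_up = x @ w13.t() + w13_bias      # bias applied after the gate_up GEMM
    gate, up = gate_up.chunk(2, dim=-1)
    act = F.silu(gate) * up
    return act @ w2.t() + w2_bias         # bias applied after the down GEMM

# Shape check with hidden=8, intermediate=6.
x = torch.randn(4, 8)
out = expert_forward(x, torch.randn(12, 8), torch.zeros(12),
                     torch.randn(8, 6), torch.zeros(8))
assert out.shape == (4, 8)
```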
Motivation
- Set `moe_runner_backend` as `auto` when launching gpt-oss bf16 with online quantization (e.g. fp8) to pick up either the `deep_gemm` or `triton` moe backend, since the `triton_kernels` moe backend doesn't support online quantization like fp8 yet (see the sketch under Modifications below).
- Extend `Fp8MoEMethod` to accept `with_bias` to support models with bias in the moe projections, like GPT-OSS.

Modifications
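A hedged sketch of the backend-selection change from the first bullet, assuming it reduces to overriding `triton_kernels` with `auto` when online quantization is requested for a gpt-oss bf16 checkpoint. The helper name is hypothetical; the real check lives in sglang's server_args.py.

```python
def adjust_moe_runner_backend(moe_runner_backend: str,
                              quantization: str | None,
                              is_gpt_oss_bf16: bool) -> str:
    if (
        quantization is not None            # online quantization (e.g. fp8)
        and is_gpt_oss_bf16
        and moe_runner_backend == "triton_kernels"
    ):
        # triton_kernels doesn't support online quantization yet; let auto
        # resolve to deep_gemm or triton instead.
        return "auto"
    return moe_runner_backend


assert adjust_moe_runner_backend("triton_kernels", "fp8", True) == "auto"
assert adjust_moe_runner_backend("triton_kernels", None, True) == "triton_kernels"
```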
Accuracy Tests
Before:
After:
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci