Add pre-quantized FP4 MoE weight loading #1906
Open
bgchun-fs wants to merge 4 commits into vllm-project:main from
Conversation
6ad82ab to 1ea7a11
- Skip runtime FP8→FP4 MoE requantization when `MOE_SKIP_REQUANTIZE=1`
- Load FP4 weights stored as packed uint8 (2 values/byte)
- Add DSV3 converter script (FP8 2D→1D + optional FP4 MoE packing)

Signed-off-by: Byonggon Chun <byonggon@fluidstack.io>
1ea7a11 to eb793eb
bgchun-fs (Contributor, Author) commented Mar 11, 2026
@jrplatin Hi, I tried to figure out how you compiled the following models. Could you verify whether this is correct? Thanks.
- jrplatin/DeepSeek-R1-1D-Subchannel-256
- jrplatin/DeepSeek-R1-1D-Subchannel-256-Packed
Description
Add support for loading pre-quantized FP4 MoE weights, skipping the runtime FP8→FP4 dequant→requant cycle during model startup.
Currently, FP8 MoE models go through dequant→FP32→requant→FP4 at load time, which takes ~45 min on CPU for DeepSeek-V3 671B.
Pre-quantizing MoE experts to FP4 offline reduces total model size from ~650 GB (full FP8) to ~338 GB (~48% reduction), since MoE expert weights make up the majority of the model.
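The "2 values/byte" storage is what drives the size reduction above: each FP4 value occupies one nibble of a uint8. A minimal sketch of such nibble packing, assuming low-nibble-first order (the actual nibble order in this PR's checkpoints is an assumption here), and using integer codes as stand-ins for already-quantized FP4 values:

```python
import numpy as np

def pack_fp4(codes: np.ndarray) -> np.ndarray:
    """Pack an even-length array of 4-bit codes into uint8, 2 values/byte.

    Illustrative only: real FP4 quantization also produces per-block scales.
    """
    assert codes.size % 2 == 0
    lo = codes[0::2] & 0x0F          # even indices -> low nibble (assumed order)
    hi = (codes[1::2] & 0x0F) << 4   # odd indices  -> high nibble
    return (lo | hi).astype(np.uint8)

def unpack_fp4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_fp4: expand each byte back into two 4-bit codes."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

codes = np.array([1, 2, 3, 4, 15, 0], dtype=np.uint8)
packed = pack_fp4(codes)
assert packed.nbytes == codes.size // 2          # half the storage
assert np.array_equal(unpack_fp4(packed), codes)  # lossless round-trip
```

Packing halves the byte count relative to one-value-per-byte storage, which is where the ~48% checkpoint shrink for MoE-dominated models comes from.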
This PR adds:
- A `MOE_SKIP_REQUANTIZE` env var to skip runtime requantization
- A `create_weights` override to allocate uint8-packed FP4 weight buffers
- A `process_weights_after_loading` path that consumes the pre-packed FP4 weights directly
- A converter script (`scripts/convert/dsv3_converter.py`) that converts FP8 2D-subchannel [128,128] → 1D-subchannel [1,N], with an optional `--fp4` flag for MoE expert FP4 packing
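The env-var gate above can be sketched as follows. This is not the actual vLLM code, just a minimal illustration of the control flow; `requantize_fp8_to_fp4` is a hypothetical stand-in for the expensive dequant→FP32→requant path that the flag skips:

```python
import os
import numpy as np

def requantize_fp8_to_fp4(w: np.ndarray) -> np.ndarray:
    # Hypothetical placeholder for the slow dequant->FP32->requant cycle.
    return w

def process_weights_after_loading(w: np.ndarray) -> np.ndarray:
    """Illustrative gate: use pre-packed FP4 weights as-is when requested."""
    if os.environ.get("MOE_SKIP_REQUANTIZE") == "1":
        # Pre-quantized checkpoints store packed FP4 as uint8 (2 values/byte),
        # so the tensor can be consumed directly without requantization.
        assert w.dtype == np.uint8, "expected packed FP4 stored as uint8"
        return w
    return requantize_fp8_to_fp4(w)

os.environ["MOE_SKIP_REQUANTIZE"] = "1"
packed = np.zeros(8, dtype=np.uint8)
assert process_weights_after_loading(packed) is packed  # no requantize pass
```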
Tests
Tested end-to-end with DeepSeek-V3.1 671B (vllm path, DP attention):
- Converted the checkpoint with `dsv3_converter.py --fp4`
- Served with `MOE_SKIP_REQUANTIZE=1 MOE_REQUANTIZE_BLOCK_SIZE=512`
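A possible end-to-end invocation, for illustration only: the `--fp4` flag and both env vars come from this PR, but the `--input-dir`/`--output-dir` flag names, checkpoint paths, and the `vllm serve` entry point are assumptions to be checked against the script's actual CLI.

```shell
# Offline: convert FP8 2D-subchannel checkpoint to 1D-subchannel with FP4 MoE packing
python scripts/convert/dsv3_converter.py \
    --input-dir /models/DeepSeek-V3-FP8 \
    --output-dir /models/DeepSeek-V3-FP4 \
    --fp4

# Serve: skip the runtime FP8->FP4 requantization of MoE experts
MOE_SKIP_REQUANTIZE=1 MOE_REQUANTIZE_BLOCK_SIZE=512 \
    vllm serve /models/DeepSeek-V3-FP4
```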