
Add pre-quantized FP4 MoE weight loading#1906

Open
bgchun-fs wants to merge 4 commits into vllm-project:main from fluidstackio:dsv3-fp4-moe-prequant

Conversation

@bgchun-fs
Contributor

Description

Add support for loading pre-quantized FP4 MoE weights, skipping the runtime FP8→FP4 dequant→requant cycle during model startup.

Currently, FP8 MoE models are dequantized to FP32 and then requantized to FP4 at load time, which takes ~45 min on CPU for DeepSeek-V3 671B.

Pre-quantizing MoE experts to FP4 offline reduces total model size from ~650 GB (full FP8) to ~338 GB (~48% reduction), since MoE expert weights make up the majority of the model.
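A rough sanity check on the size arithmetic (the expert/non-expert split below is inferred from the stated totals, not a number from this PR): FP4 packs two values per byte, so pre-quantizing halves the expert bytes.

```python
# Rough size arithmetic for FP8 -> packed-FP4 MoE experts.
# The expert share is inferred from the stated totals (~650 GB -> ~338 GB),
# not measured from the checkpoint.
total_fp8_gb = 650.0
fp4_total_gb = 338.0

# FP4 stores two values per byte, so expert bytes halve:
# fp4_total = total - experts / 2  =>  experts = 2 * (total - fp4_total)
expert_fp8_gb = 2 * (total_fp8_gb - fp4_total_gb)
non_expert_gb = total_fp8_gb - expert_fp8_gb
reduction = 1 - fp4_total_gb / total_fp8_gb

print(expert_fp8_gb)        # 624.0 (inferred expert bytes at FP8)
print(non_expert_gb)        # 26.0  (inferred non-expert bytes)
print(round(reduction, 2))  # 0.48
```

This is consistent with the claim that MoE expert weights make up the majority of the model.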

This PR adds:

  • MOE_SKIP_REQUANTIZE env var to skip runtime requantization
  • create_weights override to allocate uint8-packed FP4 weight buffers
  • uint8→float4_e2m1fn unpacking in process_weights_after_loading
  • DSV3 converter script (scripts/convert/dsv3_converter.py) that converts FP8 2D-subchannel [128,128] → 1D-subchannel [1,N] with optional --fp4 flag for MoE expert FP4 packing
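The unpacking step above can be sketched with a lookup table (a NumPy sketch, not the PR's implementation, which targets float4_e2m1fn; the low-nibble-first order is an assumption):

```python
import numpy as np

# FP4 e2m1 code table: 1 sign bit, 2 exponent bits, 1 mantissa bit.
# Codes 0..7 are non-negative, 8..15 are their negated counterparts.
E2M1_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def unpack_fp4(packed: np.ndarray) -> np.ndarray:
    """Unpack a uint8 array holding 2 FP4 values per byte.

    Assumes the low nibble is the first logical element of each pair;
    the real packing order depends on the converter script.
    """
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    out = np.empty(packed.shape + (2,), dtype=np.float32)
    out[..., 0] = E2M1_VALUES[lo]
    out[..., 1] = E2M1_VALUES[hi]
    # Fold the pair axis into the last dimension: (..., N) -> (..., 2N).
    return out.reshape(*packed.shape[:-1], -1)

# 0x21: low nibble 0x1 -> 0.5, high nibble 0x2 -> 1.0
print(unpack_fp4(np.array([0x21], dtype=np.uint8)))  # [0.5 1. ]
```

After unpacking, each value would still need to be scaled by its per-block scale factor, which this sketch omits.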


Tests

Tested end-to-end with DeepSeek-V3.1 671B (vllm path, DP attention):

  1. Converted weights with dsv3_converter.py --fp4
  2. Served with MOE_SKIP_REQUANTIZE=1 MOE_REQUANTIZE_BLOCK_SIZE=512
  3. Verified correct inference output
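Consolidated, the steps above look roughly like the following (the converter's input/output arguments and the model paths are assumptions for illustration; only the --fp4 flag and the environment variables come from this PR):

```shell
# Step 1: convert FP8 2D-subchannel weights to 1D-subchannel, packing
# MoE experts to FP4 (paths and non---fp4 arguments are hypothetical).
python scripts/convert/dsv3_converter.py \
    --input-dir /models/DeepSeek-V3.1-671B-FP8 \
    --output-dir /models/DeepSeek-V3.1-671B-FP4 \
    --fp4

# Step 2: serve with runtime requantization skipped.
MOE_SKIP_REQUANTIZE=1 MOE_REQUANTIZE_BLOCK_SIZE=512 \
    vllm serve /models/DeepSeek-V3.1-671B-FP4
```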

Checklist

Before submitting this PR, please make sure:

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.

@bgchun-fs bgchun-fs force-pushed the dsv3-fp4-moe-prequant branch 3 times, most recently from 6ad82ab to 1ea7a11 on March 11, 2026 11:09
- Skip runtime FP8→FP4 MoE requantization when MOE_SKIP_REQUANTIZE=1
- Load FP4 weights stored as packed uint8 (2 values/byte)
- Add DSV3 converter script (FP8 2D→1D + optional FP4 MoE packing)

Signed-off-by: Byonggon Chun <byonggon@fluidstack.io>
@bgchun-fs bgchun-fs force-pushed the dsv3-fp4-moe-prequant branch from 1ea7a11 to eb793eb on March 11, 2026 11:13
Contributor Author


@jrplatin Hi, I tried to figure out how you compiled the following models. Could you verify whether this is correct? Thanks.

  • jrplatin/DeepSeek-R1-1D-Subchannel-256
  • jrplatin/DeepSeek-R1-1D-Subchannel-256-Packed
