Add pre-quantized FP4 MoE weight loading #1906
Open
bgchun-fs wants to merge 4 commits into vllm-project:main from
Conversation
6ad82ab to 1ea7a11
- Skip runtime FP8→FP4 MoE requantization when `MOE_SKIP_REQUANTIZE=1`
- Load FP4 weights stored as packed uint8 (2 values/byte)
- Add DSV3 converter script (FP8 2D→1D + optional FP4 MoE packing)

Signed-off-by: Byonggon Chun <byonggon@fluidstack.io>
1ea7a11 to eb793eb
bgchun-fs (Contributor, Author) commented Mar 11, 2026
@jrplatin Hi, I tried to figure out how you compiled the following models. Could you verify whether this is correct? Thanks.
- jrplatin/DeepSeek-R1-1D-Subchannel-256
- jrplatin/DeepSeek-R1-1D-Subchannel-256-Packed
Description
Add support for loading pre-quantized FP4 MoE weights, skipping the runtime FP8→FP4 dequant→requant cycle during model startup.
Currently, FP8 MoE models go through dequant→FP32→requant→FP4 at load time, which takes ~45 min on CPU for DeepSeek-V3 671B.
Pre-quantizing MoE experts to FP4 offline reduces total model size from ~650 GB (full FP8) to ~338 GB (~48% reduction), since MoE expert weights make up the majority of the model.
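The "2 values/byte" storage is what drives the size reduction above: each FP4 value occupies one nibble of a uint8. A minimal sketch of such nibble packing, assuming low-nibble-first order (the actual nibble order in this PR's checkpoints is an assumption here), and using integer codes as stand-ins for already-quantized FP4 values:

```python
import numpy as np

def pack_fp4(codes: np.ndarray) -> np.ndarray:
    """Pack an even-length array of 4-bit codes into uint8, 2 values/byte.

    Illustrative only: real FP4 quantization also produces per-block scales.
    """
    assert codes.size % 2 == 0
    lo = codes[0::2] & 0x0F          # even indices -> low nibble (assumed order)
    hi = (codes[1::2] & 0x0F) << 4   # odd indices  -> high nibble
    return (lo | hi).astype(np.uint8)

def unpack_fp4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_fp4: expand each byte back into two 4-bit codes."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

codes = np.array([1, 2, 3, 4, 15, 0], dtype=np.uint8)
packed = pack_fp4(codes)
assert packed.nbytes == codes.size // 2          # half the storage
assert np.array_equal(unpack_fp4(packed), codes)  # lossless round-trip
```

Packing halves the byte count relative to one-value-per-byte storage, which is where the ~48% checkpoint shrink for MoE-dominated models comes from.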
This PR adds:
- A `MOE_SKIP_REQUANTIZE` env var to skip runtime requantization
- A `create_weights` override to allocate uint8-packed FP4 weight buffers
- A `process_weights_after_loading` path that consumes the pre-packed FP4 weights directly
- A converter script (`scripts/convert/dsv3_converter.py`) that converts FP8 2D-subchannel [128,128] → 1D-subchannel [1,N], with an optional `--fp4` flag for MoE expert FP4 packing
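The env-var gate above can be sketched as follows. This is not the actual vLLM code, just a minimal illustration of the control flow; `requantize_fp8_to_fp4` is a hypothetical stand-in for the expensive dequant→FP32→requant path that the flag skips:

```python
import os
import numpy as np

def requantize_fp8_to_fp4(w: np.ndarray) -> np.ndarray:
    # Hypothetical placeholder for the slow dequant->FP32->requant cycle.
    return w

def process_weights_after_loading(w: np.ndarray) -> np.ndarray:
    """Illustrative gate: use pre-packed FP4 weights as-is when requested."""
    if os.environ.get("MOE_SKIP_REQUANTIZE") == "1":
        # Pre-quantized checkpoints store packed FP4 as uint8 (2 values/byte),
        # so the tensor can be consumed directly without requantization.
        assert w.dtype == np.uint8, "expected packed FP4 stored as uint8"
        return w
    return requantize_fp8_to_fp4(w)

os.environ["MOE_SKIP_REQUANTIZE"] = "1"
packed = np.zeros(8, dtype=np.uint8)
assert process_weights_after_loading(packed) is packed  # no requantize pass
```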
Tests
Tested end-to-end with DeepSeek-V3.1 671B (vllm path, DP attention):
- Converted the checkpoint with `dsv3_converter.py --fp4`
- Served with `MOE_SKIP_REQUANTIZE=1 MOE_REQUANTIZE_BLOCK_SIZE=512`
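A possible end-to-end invocation, for illustration only: the `--fp4` flag and both env vars come from this PR, but the `--input-dir`/`--output-dir` flag names, checkpoint paths, and the `vllm serve` entry point are assumptions to be checked against the script's actual CLI.

```shell
# Offline: convert FP8 2D-subchannel checkpoint to 1D-subchannel with FP4 MoE packing
python scripts/convert/dsv3_converter.py \
    --input-dir /models/DeepSeek-V3-FP8 \
    --output-dir /models/DeepSeek-V3-FP4 \
    --fp4

# Serve: skip the runtime FP8->FP4 requantization of MoE experts
MOE_SKIP_REQUANTIZE=1 MOE_REQUANTIZE_BLOCK_SIZE=512 \
    vllm serve /models/DeepSeek-V3-FP4
```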