
Commit ca7eb64

DSV4 PTQ example with dequant on the fly (#1341)
### What does this PR do?

Type of change: new example

Add a DeepSeek V4 official-modeling PTQ example.

### Usage

See the README. It requires the vLLM PR: vllm-project/vllm#42209

### Testing

Tested PTQ and export of DSV4 Flash, then served the result with vLLM.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

### Additional Information

## Summary by CodeRabbit

* **New Features**
  * Added DeepSeek-V4 routed-expert post-training quantization and an NVFP4 checkpoint conversion utility.
* **Documentation**
  * Expanded DeepSeek quantization guide with directory layout, updated V3/V3.2 workflows, and detailed V4 routed-expert calibration, single/multi-node examples, and export guidance.
* **Chores**
  * Made example quantization scripts location-independent.
  * Updated pre-commit license hook to skip DeepSeek example quantization files.

---------

Signed-off-by: Meng Xin <mxin@nvidia.com>
1 parent f9423c0 commit ca7eb64

8 files changed

Lines changed: 1271 additions & 9 deletions

File tree

.pre-commit-config.yaml

Lines changed: 2 additions & 2 deletions
@@ -103,8 +103,8 @@ repos:
 modelopt/torch/speculative/eagle/utils.py|
 modelopt/torch/speculative/plugins/hf_medusa.py|
 modelopt/torch/utils/plugins/megatron_mmlu.py|
-examples/deepseek/quantize_to_nvfp4.py|
-examples/deepseek/ptq.py|
+examples/deepseek/deepseek_v3/quantize_to_nvfp4.py|
+examples/deepseek/deepseek_v3/ptq.py|
 examples/diffusers/quantization/onnx_utils/export.py|
 examples/llm_eval/lm_eval_hf.py|
 examples/llm_eval/mmlu.py|
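
The renamed entries sit in a `|`-separated path list, which reads like a regular-expression alternation used to skip files in a hook. The sketch below illustrates why the entries have to follow the file move. It assumes the list is applied with `re.search` against repo-relative paths; the surrounding hook configuration is not visible in this hunk, and `is_excluded` is a hypothetical helper.

```python
import re

# Old and new entries from the hunk, joined as a regex alternation.
# Assumption: the hook matches this pattern against repo-relative paths.
OLD_PATTERN = "examples/deepseek/quantize_to_nvfp4.py|examples/deepseek/ptq.py"
NEW_PATTERN = (
    "examples/deepseek/deepseek_v3/quantize_to_nvfp4.py|"
    "examples/deepseek/deepseek_v3/ptq.py"
)


def is_excluded(pattern: str, path: str) -> bool:
    """Hypothetical helper: True if any alternative matches somewhere in path."""
    return re.search(pattern, path) is not None


moved_file = "examples/deepseek/deepseek_v3/ptq.py"
print("old pattern matches moved file:", is_excluded(OLD_PATTERN, moved_file))
print("new pattern matches moved file:", is_excluded(NEW_PATTERN, moved_file))
```

Under these assumptions the old entries stop matching once the files move into `deepseek_v3/`, so the pattern has to be updated in the same commit as the rename.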

examples/deepseek/README.md

Lines changed: 86 additions & 5 deletions
@@ -6,7 +6,14 @@ This example will demonstrate the steps to quantize DeepSeek models to FP4 and e
 
 Due to the model size, currently it requires 8xH200 or 16xH100 to quantize the FP8 model, we will use 8xH200 as example.
 
-## Convert the HF checkpoint for deepseek FP8 inference
+## Directory Layout
+
+- `deepseek_v3/`: DeepSeek V3, R1, V3.1, and V3.2 FP4 quantization.
+- `deepseek_v4/`: DeepSeek V4 routed-expert NVFP4 quantization.
+
+## DeepSeek V3 FP4
+
+### Convert the HF checkpoint for DeepSeek FP8 inference
 
 ```bash
 # set up variables to run the example
@@ -54,13 +61,13 @@ python inference/convert.py --hf-ckpt-path $HF_FP8_CKPT --save-path $DS_CKPT --n
 DeepSeek V3, R1, V3.1
 
 ```bash
-torchrun --nproc-per-node 8 --master_port=12346 ptq.py --model_path $DS_CKPT --config DeepSeek-V3/inference/configs/config_671B.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH
+torchrun --nproc-per-node 8 --master_port=12346 deepseek_v3/ptq.py --model_path $DS_CKPT --config DeepSeek-V3/inference/configs/config_671B.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH
 ```
 
 DeepSeek V3.2
 
 ```bash
-torchrun --nproc-per-node 8 --master_port=12346 ptq.py --model_path $DS_CKPT --config DeepSeek-V3.2-Exp/inference/config_671B_v3.2.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH
+torchrun --nproc-per-node 8 --master_port=12346 deepseek_v3/ptq.py --model_path $DS_CKPT --config DeepSeek-V3.2-Exp/inference/config_671B_v3.2.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH
 ```
 
 #### MoE expert calibration
@@ -78,7 +85,7 @@ during calibration (slower, ~2x forwards, no post-calibration sync) — pass
 `--calib_all_experts`:
 
 ```bash
-torchrun --nproc-per-node 8 --master_port=12346 ptq.py --model_path $DS_CKPT --config DeepSeek-V3.2-Exp/inference/config_671B_v3.2.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH --calib_all_experts
+torchrun --nproc-per-node 8 --master_port=12346 deepseek_v3/ptq.py --model_path $DS_CKPT --config DeepSeek-V3.2-Exp/inference/config_671B_v3.2.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $FP4_QUANT_PATH --calib_all_experts
 ```
 
 A summary of every TensorQuantizer is written to `$FP4_QUANT_PATH/.quant_summary.txt`.
@@ -91,5 +98,79 @@ We provide a one-step-script which will:
 - Copy miscellaneous files to the quantized checkpoint
 
 ```bash
-./quantize_fp8_to_nvfp4.sh --amax_path $FP4_QUANT_PATH --fp4_output_path $HF_FP4_PATH --fp8_hf_path $HF_FP8_CKPT --world_size 8
+./deepseek_v3/quantize_fp8_to_nvfp4.sh --amax_path $FP4_QUANT_PATH --fp4_output_path $HF_FP4_PATH --fp8_hf_path $HF_FP8_CKPT --world_size 8
 ```
+
+## DeepSeek V4 routed-expert NVFP4
+
+DeepSeek V4 uses a mixed native checkpoint layout. The V4 recipe quantizes
+only the routed experts to NVFP4 W4A4 and leaves attention projections, the
+router gate, shared experts, embeddings, and the LM head in their original
+formats.
+
+### Prepare the MP checkpoint
+
+Keep experts in MXFP4 when resharding with DeepSeek's own `convert.py`:
+
+```bash
+export DS_V4=/path/to/DeepSeek-V4-Pro
+export MP=8
+export MP_CKPT=/path/to/DeepSeek-V4-Pro-mp${MP}-mxfp4
+export AMAX=/path/to/amax-nvfp4-experts
+export HF_NVFP4_PATH=/path/to/DeepSeek-V4-Pro-nvfp4-experts
+
+python ${DS_V4}/inference/convert.py \
+    --hf-ckpt-path ${DS_V4} \
+    --save-path ${MP_CKPT} \
+    --n-experts 384 \
+    --model-parallel ${MP}
+```
+
+### Calibrate routed experts
+
+Single node:
+
+```bash
+torchrun --nproc-per-node ${MP} --master_port 12346 deepseek_v4/ptq.py \
+    --model_path ${MP_CKPT} \
+    --config ${DS_V4}/inference/config.json \
+    --dsv4_inference_dir ${DS_V4}/inference \
+    --output_path ${AMAX}
+```
+
+Two 4-GPU nodes for `MP=8`:
+
+```bash
+# node 0
+torchrun --nnodes=2 --node_rank=0 --master_addr=<ip> --master_port=12346 \
+    --nproc-per-node 4 deepseek_v4/ptq.py \
+    --model_path ${MP_CKPT} \
+    --config ${DS_V4}/inference/config.json \
+    --dsv4_inference_dir ${DS_V4}/inference \
+    --output_path ${AMAX}
+
+# node 1
+torchrun --nnodes=2 --node_rank=1 --master_addr=<ip> --master_port=12346 \
+    --nproc-per-node 4 deepseek_v4/ptq.py \
+    --model_path ${MP_CKPT} \
+    --config ${DS_V4}/inference/config.json \
+    --dsv4_inference_dir ${DS_V4}/inference \
+    --output_path ${AMAX}
+```
+
+### Export back to HF shard layout
+
+`deepseek_v4/quantize_to_nvfp4.py` operates on the original HF-style V4 checkpoint and
+produces a new HF-style checkpoint with routed expert weights replaced by
+NVFP4 tensors plus `weight_scale`, `weight_scale_2`, and `input_scale`.
+
+```bash
+python deepseek_v4/quantize_to_nvfp4.py \
+    --amax_path ${AMAX} \
+    --source_ckpt ${DS_V4} \
+    --output_ckpt ${HF_NVFP4_PATH}
+```
+
+The output includes an updated `model.safetensors.index.json`, a `config.json`
+with `quantization_config.moe_quant_algo = "NVFP4"`, and `hf_quant_config.json`
+describing the mixed NVFP4 expert layers.
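
The three output files named at the end of the new README section can be sanity-checked with a short script. This is only a sketch: it assumes the files sit at the top level of `${HF_NVFP4_PATH}` and that `moe_quant_algo` is nested under `quantization_config` in `config.json`, as the README wording suggests. `check_nvfp4_export` is a hypothetical helper and is not part of the example scripts.

```python
import json
from pathlib import Path

# File names taken from the README text; their location is an assumption.
EXPECTED_FILES = (
    "model.safetensors.index.json",
    "config.json",
    "hf_quant_config.json",
)


def check_nvfp4_export(output_dir: str) -> list[str]:
    """Hypothetical helper: return a list of problems, empty if none were found."""
    root = Path(output_dir)
    problems = [
        f"missing {name}" for name in EXPECTED_FILES if not (root / name).is_file()
    ]
    config_path = root / "config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text())
        quant_cfg = config.get("quantization_config") or {}
        algo = quant_cfg.get("moe_quant_algo")
        if algo != "NVFP4":
            problems.append(f"unexpected moe_quant_algo: {algo!r}")
    return problems


# Placeholder path from the README; replace with the real export directory.
print(check_nvfp4_export("/path/to/DeepSeek-V4-Pro-nvfp4-experts"))
```

The check only inspects file presence and one config key. It does not open the safetensors shards or validate the `weight_scale`, `weight_scale_2`, and `input_scale` tensors.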

examples/deepseek/quantize_fp8_to_nvfp4.sh renamed to examples/deepseek/deepseek_v3/quantize_fp8_to_nvfp4.sh

Lines changed: 3 additions & 1 deletion
@@ -16,6 +16,8 @@
 
 set -e # Exit immediately if any command fails
 
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
 usage() {
 echo "Usage: $0 --amax_path <path> --fp4_output_path <path> --fp8_hf_path <path> [--world_size <n>]"
 exit 1
@@ -84,7 +86,7 @@ cp -r $FP8_HF_PATH/assets $FP4_PATH/ || true
 
 # Run the quantization command
 echo "Running quantization..."
-python quantize_to_nvfp4.py \
+python "$SCRIPT_DIR/quantize_to_nvfp4.py" \
 --amax_path "$AMAX_PATH" \
 --fp4_path "$FP4_PATH" \
 --fp8_hf_path "$FP8_HF_PATH" \
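
The `SCRIPT_DIR` change makes the wrapper find `quantize_to_nvfp4.py` relative to its own location rather than the caller's working directory, which is what lets the README invoke it as `./deepseek_v3/quantize_fp8_to_nvfp4.sh`. The same idea sketched in Python, for comparison only: the commit itself uses Bash, and `sibling_script` and the example path are hypothetical.

```python
from pathlib import Path


def sibling_script(anchor_file: str, name: str) -> Path:
    """Hypothetical helper: resolve `name` next to `anchor_file`, ignoring the cwd."""
    return Path(anchor_file).resolve().parent / name


# In a real script the anchor would be __file__; a literal path is used here
# so the sketch runs on its own.
target = sibling_script(
    "/repo/examples/deepseek/deepseek_v3/wrapper.py", "quantize_to_nvfp4.py"
)
print(target)
```

Anchoring on the script's own path keeps the lookup stable no matter which directory the caller runs it from.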
File renamed without changes.
