[Feature] Support FP4 communication quantization#7488
lizexu123 wants to merge 3 commits into PaddlePaddle:develop
PaddlePaddle-bot left a comment
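🤖 AI Code Review

2026-04-19 23:37 CST

## 📋 Review Summary

**PR overview:** Adds FP4 communication quantization. In the EP prefill dispatch path, it cuts communication volume to roughly 28% of BF16, significantly easing the all-to-all communication bottleneck.

**Scope of changes:** `custom_ops/gpu_ops/moe/` (C++ operators), `model_executor/layers/quantization/nvfp4.py` (Python quantization logic), `envs.py` (environment variables), `forward_meta.py`, `utils.py`

**Impact tags:** `OP` `Quantization`

### 📝 PR Convention Check

The Modifications, Usage or Command, and Accuracy Tests sections of the PR description are all empty and should be filled in. This PR affects the model's forward computation (the MoE FFN path), so accuracy test results should be provided.

Description template (can be copied directly):

> ## Modifications
> 1. Add the environment variable `FD_USE_NVFP4_COMM_QUANT` to toggle FP4 communication quantization
> 2. Add an FP4 pre-quantized dispatch path to `apply_ep_prefill` in `nvfp4.py`
> 3. Add UINT8 data type support to the C++ operator `PrefillPermuteToMaskedGemm`
> 4. Add a defensive check for uninitialized weights in `utils.py`
>
> ## Usage or Command
> FD_USE_NVFP4_COMM_QUANT=1 python -m fastdeploy.entrypoints.openai.api_server ...
>
> ## Accuracy Tests
> (Please add accuracy comparison results for the EP scenario with FP4 communication quantization enabled and disabled)

### Issues

| Level | File | Summary |
|------|------|------|
| 🟡 Suggestion | `utils.py:136` | Debug log uses informal Chinese, which is unsuitable for production code |
| ❓ Question | `forward_meta.py:156` | The `audio_token_num` field is unrelated to this PR's functionality and is unused |

### Overall Assessment

The core FP4 communication quantization logic (pre-quantization → dispatch → scale swizzle → masked GEMM) is implemented sensibly, the comments are clear, and the FP4 and BF16 paths branch cleanly. Please add accuracy test results to the PR description and clean up the two minor issues.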
`utils.py`

    def process_weight_transpose(layer, weight_name):
        weight = getattr(layer, weight_name)
        if not weight._is_initialized():

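🟡 Suggestion: The debug log uses informal Chinese ("权重没初始化啊!", roughly "the weight isn't initialized!"), which is not suitable for production code.
- Log messages should use the same language as the project's other logs (English), and the tone should be formal;
- Prefer `logger.warning` over `logger.info` for the log level: skipping the transpose because a weight is uninitialized may mask an upstream problem, and the warning level makes that easier to track down.

Suggested change:

    if not weight._is_initialized():
        logger.warning("Weight '%s' is not initialized, skipping transpose.", weight_name)
        return

`forward_meta.py`

    moe_num_chunk: int = 1
    max_moe_num_chunk: int = 1

    audio_token_num: int = 0
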
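❓ Question: The `audio_token_num` field appears to be unrelated to this PR's FP4 communication quantization feature, and a search of the entire codebase found no code that reads or writes it.
Please confirm whether this field should be submitted in a separate PR. If it is groundwork for a later feature, please document its purpose in a comment.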
Motivation
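Support FP4 communication quantization. Taking `hidden_size = 7168` as an example:
FP4 communication volume = 28% of BF16, a reduction of about 3.5×. In EP scenarios (8-GPU all-to-all), communication is often the bottleneck.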
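
The 28% figure is consistent with a simple byte count, assuming an NVFP4-style layout in which two 4-bit values are packed per byte and every block of 16 elements carries one 8-bit scale. That layout, the function name `comm_bytes_per_token`, and the `block_size` parameter are assumptions made for illustration and are not taken from the PR; per-tensor global scales and any dispatch metadata are ignored in this sketch.

```python
def comm_bytes_per_token(hidden_size: int, block_size: int = 16) -> dict:
    """Estimate bytes sent per token for BF16 vs. block-scaled FP4.

    Assumption (not confirmed by the PR): FP4 values are packed two per
    byte, and each block of `block_size` elements has one 1-byte scale.
    """
    if hidden_size % block_size != 0 or hidden_size % 2 != 0:
        raise ValueError("hidden_size must be even and divisible by block_size")

    bf16_bytes = hidden_size * 2                # 16 bits per element
    fp4_payload_bytes = hidden_size // 2        # 4 bits per element, packed
    scale_bytes = hidden_size // block_size     # one 8-bit scale per block
    fp4_bytes = fp4_payload_bytes + scale_bytes

    return {
        "bf16_bytes": bf16_bytes,
        "fp4_bytes": fp4_bytes,
        "ratio": fp4_bytes / bf16_bytes,
    }


if __name__ == "__main__":
    stats = comm_bytes_per_token(7168)
    print(stats)
```

Under these assumptions, `hidden_size = 7168` gives 14336 bytes per token for BF16 and 4032 bytes for FP4 plus scales, a ratio of about 0.28, which matches the 28% quoted above.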
Modifications
Usage or Command
Accuracy Tests
Checklist
- [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- … `pre-commit` before commit.
- … `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.