
[Feature] Support FP4 communication quantization#7488

Open
lizexu123 wants to merge 3 commits into PaddlePaddle:develop from lizexu123:kc


lizexu123 (Collaborator) commented Apr 19, 2026

Motivation

Adds support for FP4 communication quantization. Taking hidden_size = 7168 as an example:

| Path | Data | Scale | Total (per token) |
|------|------|-------|-------------------|
| BF16 | 7168 × 2 B = 14336 B | none | 14336 B |
| FP4 | 3584 × 1 B = 3584 B | 112 × 4 B = 448 B | 4032 B |

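FP4 communication volume is about 28% of BF16 (4032 B / 14336 B), a reduction of roughly 3.5×. In EP scenarios (8-GPU all-to-all), communication is often the bottleneck.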
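The per-token byte counts in the table follow from simple arithmetic. The sketch below reproduces them; the function names are illustrative, and the defaults of 64 elements per scale and 4 bytes per scale are inferred from the table (7168 / 112 = 64), not taken from the PR's code.

```python
def bf16_bytes_per_token(hidden_size: int) -> int:
    # BF16 stores every element in 2 bytes.
    return hidden_size * 2


def fp4_bytes_per_token(hidden_size: int, elems_per_scale: int = 64, scale_bytes: int = 4) -> int:
    # Two 4-bit values share one byte, plus one scale per group of elements.
    # Assumption: elems_per_scale=64 and scale_bytes=4 are read off the table (7168 / 112 = 64).
    data_bytes = hidden_size // 2
    scale_count = hidden_size // elems_per_scale
    return data_bytes + scale_count * scale_bytes


if __name__ == "__main__":
    hidden = 7168
    bf16 = bf16_bytes_per_token(hidden)
    fp4 = fp4_bytes_per_token(hidden)
    print(bf16, fp4)  # 14336 4032
    print(f"FP4 / BF16 = {fp4 / bf16:.1%}, reduction = {bf16 / fp4:.2f}x")  # FP4 / BF16 = 28.1%, reduction = 3.56x
```

The scale metadata accounts for 448 of the 4032 FP4 bytes (about 11%), which is why the saving is about 3.56× rather than the 4× that the 16-bit to 4-bit change alone would give. Note that 112 × 4 B is the same 448 B as one 1-byte scale per 16 elements, so the table alone does not determine the scale layout.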

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot (bot) commented Apr 19, 2026

Thanks for your contribution!

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ lizexu123
❌ root


root seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

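🤖 AI Code Review | 2026-04-19 23:37 CST

📋 Review Summary

PR overview: Adds FP4 communication quantization, which reduces communication volume in the EP prefill dispatch path to about 28% of BF16 and substantially eases the all-to-all communication bottleneck.
Scope: custom_ops/gpu_ops/moe/ (C++ operators), model_executor/layers/quantization/nvfp4.py (Python quantization logic), envs.py (environment variables), forward_meta.py, utils.py
Affected tags: OP, Quantization

📝 PR Convention Check

The Modifications, Usage or Command, and Accuracy Tests sections of the PR description are all empty; please fill them in. This PR affects the model forward pass (the MoE FFN path), so accuracy test results should be provided.

Description template (can be copied directly):

## Modifications
1. Add the environment variable FD_USE_NVFP4_COMM_QUANT to toggle FP4 communication quantization
2. Add an FP4 pre-quantized dispatch path to apply_ep_prefill in nvfp4.py
3. Add UINT8 data type support to the C++ operator PrefillPermuteToMaskedGemm
4. Add a defensive check for uninitialized weights in utils.py

## Usage or Command
FD_USE_NVFP4_COMM_QUANT=1 python -m fastdeploy.entrypoints.openai.api_server ...

## Accuracy Tests
(Please add an accuracy comparison in the EP scenario with FP4 communication quantization enabled vs. disabled)

Issues

| Level | File | Summary |
|------|------|------|
| 🟡 Suggestion | utils.py:136 | Debug log uses informal Chinese, unsuitable for production code |
| ❓ Question | forward_meta.py:156 | The audio_token_num field is unrelated to this PR and unused |

Overall Assessment

The core logic of FP4 communication quantization (pre-quantize → dispatch → scale swizzle → masked GEMM) is implemented sensibly, with clear comments and clearly separated FP4 and BF16 branches. Please add the accuracy test results to the PR description and clean up the two minor issues.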
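The review notes that PrefillPermuteToMaskedGemm gains UINT8 support, which is consistent with two 4-bit codes being packed into each byte before dispatch. The pure-Python sketch below illustrates block-wise FP4 (E2M1) quantization and nibble packing. It rests on assumptions that are not taken from the PR: one floating-point scale per 64-element block (chosen only to match the table's 112 scales per token), round-to-nearest onto the E2M1 grid, and the even-indexed element in the low nibble. All names are hypothetical, and the real nvfp4.py path, its scale format, and its scale swizzle may differ.

```python
# FP4 E2M1 magnitudes; the 3-bit magnitude code is the index into this list,
# and bit 3 of each 4-bit code is the sign.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 64  # assumption: 7168 / 64 = 112 scales per token, as in the table


def _encode(v):
    """Map a scaled value to the nearest E2M1 code (ties go to the smaller magnitude)."""
    mag = abs(v)
    idx = min(range(len(E2M1_GRID)), key=lambda i: abs(E2M1_GRID[i] - mag))
    sign = 0x8 if (v < 0 and idx != 0) else 0x0
    return sign | idx


def quantize_fp4(x):
    """Quantize a flat list of floats into (packed uint8 bytes, per-block scales)."""
    if len(x) % BLOCK != 0:
        raise ValueError("length must be a multiple of BLOCK")
    packed = bytearray()
    scales = []
    for start in range(0, len(x), BLOCK):
        block = x[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        # Scale so the block's largest magnitude lands on 6.0, the top of the grid.
        scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
        scales.append(scale)
        codes = [_encode(v / scale) for v in block]
        # Pack two 4-bit codes per byte: even element in the low nibble.
        for lo, hi in zip(codes[0::2], codes[1::2]):
            packed.append(lo | (hi << 4))
    return bytes(packed), scales


def dequantize_fp4(packed, scales):
    """Inverse of quantize_fp4, used here only to check the round trip."""
    out = []
    bytes_per_block = BLOCK // 2
    for i, byte in enumerate(packed):
        scale = scales[i // bytes_per_block]
        for code in (byte & 0xF, byte >> 4):
            mag = E2M1_GRID[code & 0x7] * scale
            out.append(-mag if code & 0x8 else mag)
    return out
```

For a 7168-element token this produces 3584 packed bytes and 112 scales, the same counts as the FP4 row of the table. NVFP4 as commonly described uses FP8 (E4M3) scales per 16-element block, and GPU kernels also rearrange the scales for the GEMM; this sketch models neither.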


def process_weight_transpose(layer, weight_name):
    weight = getattr(layer, weight_name)
    if not weight._is_initialized():

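🟡 Suggestion: The debug log uses an informal Chinese message, "权重没初始化啊!" (roughly "the weight isn't initialized!"), which is not appropriate for production code.

  1. Log messages should match the language used by the project's other logs (English) and use a formal tone.
  2. Use logger.warning rather than logger.info for this log: skipping the transpose because a weight is uninitialized may mask an upstream problem, and a warning makes it easier to diagnose.

Suggested change: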

if not weight._is_initialized():
    logger.warning("Weight '%s' is not initialized, skipping transpose.", weight_name)
    return
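Since the checklist asks for unit tests, the suggested change can be exercised with a small test. This is a sketch only: StubWeight, StubLayer, the logger, and the transpose step after the guard are hypothetical stand-ins, because the review shows only the first lines of process_weight_transpose.

```python
import logging

logger = logging.getLogger(__name__)  # stand-in for the project's own logger


class StubWeight:
    """Hypothetical stand-in for a framework parameter, holding a 2-D list."""

    def __init__(self, rows, initialized=True):
        self.rows = rows
        self._initialized = initialized

    def _is_initialized(self):
        return self._initialized

    def transpose(self):
        return StubWeight([list(col) for col in zip(*self.rows)], self._initialized)


class StubLayer:
    """Hypothetical layer that carries weights as plain attributes."""


def process_weight_transpose(layer, weight_name):
    weight = getattr(layer, weight_name)
    if not weight._is_initialized():
        # Warn instead of silently skipping: an uninitialized weight here
        # usually points to a loading problem further upstream.
        logger.warning("Weight '%s' is not initialized, skipping transpose.", weight_name)
        return
    # Hypothetical body: the real function's remaining lines are not shown in the review.
    setattr(layer, weight_name, weight.transpose())
```

A test can attach a capturing handler (or use unittest's assertLogs(logger, level="WARNING")) and assert that exactly one WARNING naming the weight is emitted while the uninitialized weight object is left untouched. Passing the name as a logging argument rather than pre-formatting the string defers formatting until the record is actually emitted.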

moe_num_chunk: int = 1
max_moe_num_chunk: int = 1

audio_token_num: int = 0

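❓ Question: The audio_token_num field appears unrelated to this PR's FP4 communication quantization feature, and a search of the entire codebase found no code that reads or writes it.

Please confirm whether this field should be submitted in a separate PR. If it is groundwork for a follow-up feature, consider adding a comment that explains its purpose.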


Labels

None yet

Projects

None yet


3 participants