
[rollout]{feat}Ascend 950 hardware mxfp8 rollout quantization #5569

Open
zhijie-os wants to merge 3 commits into verl-project:main from zhijie-os:A5-MXFP8

Conversation

@zhijie-os

What does this PR do?

Supports the latest Ascend hardware, DV100 and DV120, for MXFP8 quantization.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching
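The title convention above can be checked mechanically. A minimal sketch, assuming the rule is exactly as stated in the checklist; the regex and the `is_valid_title` helper are hypothetical illustrations, not the actual CI check:

```python
import re

# Module and type lists copied from the checklist above.
MODULES = {
    "fsdp", "megatron", "veomni", "sglang", "vllm", "rollout", "trainer", "ci",
    "training_utils", "recipe", "hardware", "deployment", "ray", "worker",
    "single_controller", "misc", "perf", "model", "algo", "env", "tool",
    "ckpt", "doc", "data", "cfg", "reward", "fully_async", "one_step_off",
}
TYPES = {"feat", "fix", "refactor", "chore", "test"}

# "[{modules}] {type}: {description}" with an optional [BREAKING] prefix.
TITLE_RE = re.compile(r"^(\[BREAKING\])?\[([^\]]+)\] (\w+): (.+)$")

def is_valid_title(title: str) -> bool:
    m = TITLE_RE.match(title)
    if m is None:
        return False
    modules = [s.strip() for s in m.group(2).split(",")]
    return all(mod in MODULES for mod in modules) and m.group(3) in TYPES

print(is_valid_title("[BREAKING][fsdp, megatron] feat: dynamic batching"))  # True
print(is_valid_title("[rollout] feat: Ascend MXFP8 rollout quantization"))  # True
```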

Test

[TODO] Tests will be added.

API and Usage Example

Specify the quantization option as ascend to enable MXFP8 on Ascend hardware.
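A minimal sketch of what enabling this could look like. The dict-style config fragment and the `resolve_quant_backend` helper are illustrative assumptions based only on the sentence above ("specify the quantization to ascend"), not on verl's actual config schema:

```python
# Hypothetical rollout config fragment; the key name "quantization" comes from
# the PR description, everything else here is an assumption.
rollout_config = {
    "quantization": "ascend",  # enables MXFP8 on Ascend DV100/DV120
}

def resolve_quant_backend(cfg: dict) -> str:
    # Sketch: map the user-facing setting to a backend label.
    return "mxfp8-npu" if cfg.get("quantization") == "ascend" else "default"

print(resolve_quant_backend(rollout_config))  # mxfp8-npu
```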

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@CLAassistant

CLAassistant commented Mar 12, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for MXFP8 quantization on Ascend hardware. While the new functionality for Ascend devices appears sound, the implementation introduces several critical regressions that break existing FP8 quantization on non-Ascend hardware. Key issues include flawed logic for handling weight_block_size, scale parameter naming that disregards vLLM version differences, and the disabling of an environment variable needed to apply the FP8 patches in worker subprocesses. Furthermore, a change forces all users onto a less performant shared memory data transfer method even when faster IPC is available. These issues must be resolved to maintain backward compatibility and avoid performance degradation for other users.

Comment on lines +241 to +249
is_mxfp8_npu = is_mxfp8_vllm_ascend(quant_config)

weight_block_size = None
# if quant_config.weight_block_size is None:
#     raise ValueError("Currently only support blockwise quantization, please set weight_block_size in quant_config")
if hasattr(quant_config, "weight_block_size"):
    weight_block_size = quant_config.weight_block_size
elif is_mxfp8_npu:
    weight_block_size = MXFP8_BLOCK_QUANT_KWARGS["weight_block_size"]

critical

The logic for determining weight_block_size is flawed. By commenting out the check for quant_config.weight_block_size, the non-NPU FP8 path will raise an AttributeError if quant_config does not have this attribute, as it's accessed directly later in the quantization loop. This breaks existing functionality. The check must be preserved for the non-NPU path.

Suggested change

is_mxfp8_npu = is_mxfp8_vllm_ascend(quant_config)
weight_block_size = None
if is_mxfp8_npu:
    weight_block_size = MXFP8_BLOCK_QUANT_KWARGS["weight_block_size"]
else:
    if not hasattr(quant_config, "weight_block_size") or quant_config.weight_block_size is None:
        raise ValueError("Currently only support blockwise quantization, please set weight_block_size in quant_config")
    weight_block_size = quant_config.weight_block_size
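The corrected branching can be exercised in isolation. A minimal sketch with a stand-in quant_config; SimpleNamespace and the [32, 32] block size are illustrative assumptions, since the real MXFP8_BLOCK_QUANT_KWARGS value is not shown in this diff:

```python
from types import SimpleNamespace

# Illustrative stand-in; the actual value in the PR is not shown here.
MXFP8_BLOCK_QUANT_KWARGS = {"weight_block_size": [32, 32]}

def resolve_weight_block_size(quant_config, is_mxfp8_npu: bool):
    # NPU MXFP8 path: use the fixed MXFP8 block size.
    if is_mxfp8_npu:
        return MXFP8_BLOCK_QUANT_KWARGS["weight_block_size"]
    # Non-NPU FP8 path: blockwise quantization is required, so a missing
    # weight_block_size must fail loudly here instead of surfacing as an
    # AttributeError later in the quantization loop.
    if getattr(quant_config, "weight_block_size", None) is None:
        raise ValueError(
            "Currently only support blockwise quantization, "
            "please set weight_block_size in quant_config"
        )
    return quant_config.weight_block_size

print(resolve_weight_block_size(SimpleNamespace(weight_block_size=[128, 128]), False))  # [128, 128]
print(resolve_weight_block_size(SimpleNamespace(), True))  # [32, 32]
```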

Comment on lines +276 to +284
# Yield the scale with appropriate naming based on vLLM version
yield (k + "_scale", param_scale)
# if is_vllm_11_or_later:
#     if "expert" in k:
#         yield (k + "_scale_inv", param_scale)
#     else:
#         yield (k + "_scale", param_scale)
# else:
#     yield (k + "_scale_inv", param_scale)

critical

The logic for yielding the scale parameter has been simplified to always yield k + "_scale". This removes the previous logic that handled different vLLM versions and yielded k + "_scale_inv" when appropriate. This change breaks FP8 quantization for non-Ascend (NVIDIA) hardware, which may expect _scale_inv. The change should be made conditional to apply only for the new Ascend MXFP8 path.

Suggested change

# Yield the scale with appropriate naming based on vLLM version
if is_mxfp8_npu:
    yield (k + "_scale", param_scale)
elif is_vllm_11_or_later:
    if "expert" in k:
        yield (k + "_scale_inv", param_scale)
    else:
        yield (k + "_scale", param_scale)
else:
    yield (k + "_scale_inv", param_scale)
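The version-aware naming in the suggestion above can be factored into a small pure function for testing. A sketch only; is_vllm_11_or_later is passed in as a plain flag here, whereas in the PR it is presumably derived from the installed vLLM version:

```python
def scale_param_name(k: str, is_mxfp8_npu: bool, is_vllm_11_or_later: bool) -> str:
    # Ascend MXFP8 always uses the plain "_scale" suffix.
    if is_mxfp8_npu:
        return k + "_scale"
    # vLLM >= 0.11: only expert weights keep the inverse-scale name.
    if is_vllm_11_or_later:
        return k + "_scale_inv" if "expert" in k else k + "_scale"
    # Older vLLM expects "_scale_inv" everywhere.
    return k + "_scale_inv"

print(scale_param_name("model.layers.0.mlp.experts.w13_weight", False, True))   # ..._scale_inv
print(scale_param_name("model.layers.0.self_attn.qkv_proj.weight", False, True))  # ..._scale
```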

apply_vllm_fp8_patches()
# for subprocesses patching
# os.environ["VERL_VLLM_FP8_QUANT_ENABLED"] = "1"

critical

The line os.environ["VERL_VLLM_FP8_QUANT_ENABLED"] = "1" has been commented out. This environment variable is necessary to enable the vLLM FP8 patches in worker subprocesses. Without it, the FP8 quantization for non-Ascend hardware will not work correctly. This line should be restored to ensure existing FP8 functionality is not broken.

Suggested change

os.environ["VERL_VLLM_FP8_QUANT_ENABLED"] = "1"
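The environment variable works as a flag that worker subprocesses inherit, so spawned vLLM workers can decide whether to re-apply the patches at startup. A minimal sketch of the gate; apply_vllm_fp8_patches is stubbed out here and fp8_patches_enabled is a hypothetical helper, not code from the PR:

```python
import os

def apply_vllm_fp8_patches():
    # Stub for illustration; the real function lives in verl's vLLM utilities.
    pass

def enable_fp8_patching():
    apply_vllm_fp8_patches()
    # Child processes inherit os.environ, so subprocesses see this flag.
    os.environ["VERL_VLLM_FP8_QUANT_ENABLED"] = "1"

def fp8_patches_enabled() -> bool:
    # Hypothetical helper a subprocess could call before patching.
    return os.environ.get("VERL_VLLM_FP8_QUANT_ENABLED") == "1"

enable_fp8_patching()
print(fp8_patches_enabled())  # True
```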

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
