[rollout]{feat}Ascend 950 hardware mxfp8 rollout quantization #5569
zhijie-os wants to merge 3 commits into verl-project:main from
Conversation
Code Review
This pull request introduces support for MXFP8 quantization on Ascend hardware. While the new functionality for Ascend devices appears sound, the implementation has introduced several critical regressions that break existing FP8 quantization for non-Ascend hardware. Key issues include a flawed logic for handling weight_block_size, incorrect scale parameter naming that disregards vLLM version differences, and the disabling of a crucial environment variable needed for FP8 patches. Furthermore, a change forces all users onto a less performant shared memory data transfer method, even when faster IPC is available. These issues must be resolved to maintain backward compatibility and prevent performance degradation for other users.
```python
is_mxfp8_npu = is_mxfp8_vllm_ascend(quant_config)

weight_block_size = None
# if quant_config.weight_block_size is None:
#     raise ValueError("Currently only support blockwise quantization, please set weight_block_size in quant_config")
if hasattr(quant_config, "weight_block_size"):
    weight_block_size = quant_config.weight_block_size
elif is_mxfp8_npu:
    weight_block_size = MXFP8_BLOCK_QUANT_KWARGS["weight_block_size"]
```
The logic for determining weight_block_size is flawed. By commenting out the check for quant_config.weight_block_size, the non-NPU FP8 path will raise an AttributeError if quant_config does not have this attribute, as it's accessed directly later in the quantization loop. This breaks existing functionality. The check must be preserved for the non-NPU path.
Suggested change:

```diff
-is_mxfp8_npu = is_mxfp8_vllm_ascend(quant_config)
-weight_block_size = None
-# if quant_config.weight_block_size is None:
-#     raise ValueError("Currently only support blockwise quantization, please set weight_block_size in quant_config")
-if hasattr(quant_config, "weight_block_size"):
-    weight_block_size = quant_config.weight_block_size
-elif is_mxfp8_npu:
-    weight_block_size = MXFP8_BLOCK_QUANT_KWARGS["weight_block_size"]
+is_mxfp8_npu = is_mxfp8_vllm_ascend(quant_config)
+weight_block_size = None
+if is_mxfp8_npu:
+    weight_block_size = MXFP8_BLOCK_QUANT_KWARGS["weight_block_size"]
+else:
+    if not hasattr(quant_config, "weight_block_size") or quant_config.weight_block_size is None:
+        raise ValueError("Currently only support blockwise quantization, please set weight_block_size in quant_config")
```
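The regression the review describes can be reproduced in isolation. A minimal sketch, where `QuantConfig` is a hypothetical stand-in for vLLM's quantization config object (not the real class):

```python
# Minimal repro of the AttributeError the review warns about.
# QuantConfig is a hypothetical stand-in; the real config comes from vLLM.
class QuantConfig:
    pass  # deliberately has no weight_block_size attribute


quant_config = QuantConfig()

# With the guard commented out, the non-NPU path leaves weight_block_size
# unset, and a later direct attribute access in the quantization loop raises:
try:
    block_n, block_k = quant_config.weight_block_size
    failed = False
except AttributeError as exc:
    failed = True
    print(f"non-NPU path breaks: {exc}")
```

Restoring the explicit `ValueError` check for the non-NPU path, as in the suggestion above, turns this silent crash into a clear configuration error.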
```python
# Yield the scale with appropriate naming based on vLLM versio
yield (k + "_scale", param_scale)
# if is_vllm_11_or_later:
#     if "expert" in k:
#         yield (k + "_scale_inv", param_scale)
#     else:
#         yield (k + "_scale", param_scale)
# else:
#     yield (k + "_scale_inv", param_scale)
```
The logic for yielding the scale parameter has been simplified to always yield k + "_scale". This removes the previous logic that handled different vLLM versions and yielded k + "_scale_inv" when appropriate. This change breaks FP8 quantization for non-Ascend (NVIDIA) hardware, which may expect _scale_inv. The change should be made conditional to apply only for the new Ascend MXFP8 path.
Suggested change:

```diff
-# Yield the scale with appropriate naming based on vLLM versio
-yield (k + "_scale", param_scale)
-# if is_vllm_11_or_later:
-#     if "expert" in k:
-#         yield (k + "_scale_inv", param_scale)
-#     else:
-#         yield (k + "_scale", param_scale)
-# else:
-#     yield (k + "_scale_inv", param_scale)
+# Yield the scale with appropriate naming based on vLLM version
+if is_mxfp8_npu:
+    yield (k + "_scale", param_scale)
+elif is_vllm_11_or_later:
+    if "expert" in k:
+        yield (k + "_scale_inv", param_scale)
+    else:
+        yield (k + "_scale", param_scale)
+else:
+    yield (k + "_scale_inv", param_scale)
```
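The suggested branching can be sketched as a standalone generator. The flag names (`is_mxfp8_npu`, `is_vllm_11_or_later`) are taken from the diff; the dict of dummy float scales stands in for real tensors:

```python
def yield_scales(params, is_mxfp8_npu, is_vllm_11_or_later):
    """Yield (name, scale) pairs with the suffix each backend expects.

    Per the suggestion above: Ascend MXFP8 always uses the plain "_scale"
    suffix; on the CUDA FP8 path, vLLM >= 0.11 expects "_scale_inv" for
    expert weights and "_scale" otherwise, while older vLLM expects
    "_scale_inv" everywhere.
    """
    for k, param_scale in params.items():
        if is_mxfp8_npu:
            yield k + "_scale", param_scale
        elif is_vllm_11_or_later:
            if "expert" in k:
                yield k + "_scale_inv", param_scale
            else:
                yield k + "_scale", param_scale
        else:
            yield k + "_scale_inv", param_scale


# Dummy scales standing in for real per-block scale tensors.
params = {
    "layers.0.mlp.experts.w13_weight": 0.5,
    "layers.0.attn.qkv_weight": 0.25,
}
print(dict(yield_scales(params, is_mxfp8_npu=False, is_vllm_11_or_later=True)))
```

Keeping the version/hardware dispatch in one place like this makes it harder to accidentally drop the `_scale_inv` branch again.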
```diff
 apply_vllm_fp8_patches()
 # for subprocesses patching
-os.environ["VERL_VLLM_FP8_QUANT_ENABLED"] = "1"
+# os.environ["VERL_VLLM_FP8_QUANT_ENABLED"] = "1"
```
The line os.environ["VERL_VLLM_FP8_QUANT_ENABLED"] = "1" has been commented out. This environment variable is necessary to enable the vLLM FP8 patches in worker subprocesses. Without it, the FP8 quantization for non-Ascend hardware will not work correctly. This line should be restored to ensure existing FP8 functionality is not broken.
Suggested change:

```diff
-# os.environ["VERL_VLLM_FP8_QUANT_ENABLED"] = "1"
+os.environ["VERL_VLLM_FP8_QUANT_ENABLED"] = "1"
```
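The variable matters because of a common pattern: in-process monkey patches are not inherited by spawned worker subprocesses, but environment variables are, so the flag is how a child knows it must re-apply the patches itself. A minimal sketch, with a placeholder patch body standing in for verl's actual patches:

```python
import os


def apply_vllm_fp8_patches():
    # Placeholder for verl's actual vLLM FP8 monkey patches.
    print("fp8 patches applied")


def maybe_apply_patches_in_subprocess():
    # Spawned workers start with a fresh interpreter, so they don't see
    # the parent's monkey patches -- only its environment. The env flag
    # tells a child to re-apply the patches on startup.
    if os.environ.get("VERL_VLLM_FP8_QUANT_ENABLED") == "1":
        apply_vllm_fp8_patches()
        return True
    return False


# Parent process: patch locally, then set the flag so children patch too.
apply_vllm_fp8_patches()
os.environ["VERL_VLLM_FP8_QUANT_ENABLED"] = "1"
print(maybe_apply_patches_in_subprocess())
```

With the assignment commented out, `maybe_apply_patches_in_subprocess` would silently return `False` in every worker, which is exactly the breakage the review describes.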
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
What does this PR do?
Support MXFP8 quantization on the latest Ascend hardware (DV100 and DV120).
Checklist Before Starting
- Title follows the format `[{modules}] {type}: {description}` (this will be checked by the CI).
- `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`; multiple modules look like `[megatron, fsdp, doc]`.
- `{type}` is one of `feat`, `fix`, `refactor`, `chore`, `test`.
- For breaking API changes, prepend `[BREAKING]` to the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`.

Test
[TODO] tests will be done
API and Usage Example
Simply set the quantization option to `ascend` to enable MXFP8 on Ascend hardware.

Design & Code Changes
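A hypothetical usage sketch of the selection described above; the exact config keys and the helper below are assumptions drawn from the PR description, not verl's verified schema:

```python
# Hypothetical rollout kwargs; the real config path in verl may differ.
rollout_kwargs = {
    "quantization": "ascend",  # per this PR, "ascend" selects MXFP8 rollout quantization
}


def uses_ascend_mxfp8(kwargs: dict) -> bool:
    # Hypothetical helper mirroring the PR's dispatch: Ascend MXFP8 is
    # chosen purely by the quantization string.
    return kwargs.get("quantization") == "ascend"


print(uses_ascend_mxfp8(rollout_kwargs))
```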
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- Request CI in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If you modified the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.