[Performance] Reduce sampling #8308
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces an optimization to the sampling process within a distributed (Tensor Parallel) environment, termed 'reduce sampling'. The primary goal is to enhance performance by minimizing communication overhead during token selection. This is achieved through specialized implementations for greedy, top-k, and top-p sampling that aggregate and process vocabulary information across parallel ranks. The changes also adapt the speculative decoding rejection sampling mechanism to work with the new distributed sampling approach, and integrate these optimizations into the model's logits computation and overall sampling workflow.
Code Review
Suggested PR Title:
[Ops][Feature] Implement optimized reduce sampling scheme for Ascend NPUs
Suggested PR Summary:
### What this PR does / why we need it?
This PR implements an optimized "reduce sampling" scheme for Ascend NPUs to enhance performance in distributed speculative decoding. It introduces the `AscendRejectionSampler` and updates the sampling pipeline to use a compressed logit flow (top-k -> allgather -> top-p), significantly reducing inter-node communication by avoiding full vocabulary gathers. Key changes include updates to Triton kernels for rejection sampling and bincount to handle tensor parallel ranks, and a new configuration flag `enable_reduce_sample`.
Feedback: Several critical issues were identified during review. The Triton kernels in `reject_sample.py` incorrectly use the batch `BLOCK_SIZE` for vocabulary indexing, which will fail when the compressed vocabulary size exceeds the batch size. Additionally, `AscendVocabParallelEmbedding` contains an `AttributeError` and incorrect slicing logic that ignores padding. Finally, the patch for `Eagle3LlamaForCausalLM` introduces a breaking change to the `compute_logits` API return type.
### Does this PR introduce _any_ user-facing change?
Yes, it introduces the `enable_reduce_sample` configuration option and modifies the internal return signature of `compute_logits` for Eagle3 models, which may affect custom integrations.
### How was this patch tested?
CI tests should be performed to ensure the new sampling flow maintains accuracy. Specific attention should be paid to verifying the Triton kernels with compressed vocabulary sizes larger than the batch size.

This pull request has conflicts, please resolve those before we can evaluate the pull request.
We need to avoid modifying the related functions in reject_sample.
draft_token_ids = logits.argmax(dim=-1)
logits = self.model.compute_logits(sample_hidden_states, get_ascend_config().enable_reduce_sample)
if not get_ascend_config().enable_reduce_sample:

I prefer to enable this feature by default; I think it brings a general performance gain.

I have set this optimization to be enabled by default, so no manual activation is required in the settings.
else:
last_hidden_states, hidden_states = ret_hidden_states

if self.method != "dflash":

Please check whether this DFlash condition is needed.

Fixed. This optimization also applies to DFlash.
multi_steps_attn_metadata = [MagicMock(), MagicMock(), MagicMock()]

mock_ascend_config = MagicMock()
mock_ascend_config.enable_reduce_sample = False

Please add a scenario where reduce_sample is true, and modify or add assertions accordingly.

Now that this optimization is enabled by default, the incomplete UT coverage has been resolved.
Check whether reduce_sample and lmhead can be enabled at the same time. The condition on this line is inconsistent with the one on line 930.

Fixed. This optimization can still be enabled in the lmhead_tp_enable scenario by simply delaying the all-to-all communication following the lmhead layer.
Signed-off-by: hzx55906 <513464215@qq.com>
vocab_size, # vocab_size or selected_vocab_size if ENABLE_REDUCE_SAMPLING
global_vocab_size, # global vocab size for draft_probs indexing (only used if ENABLE_REDUCE_SAMPLING)

It seems the suggestions from @weijinqian0 have not been addressed yet?
This reverts commit 526141c.
### What this PR does / why we need it?
Fix the issue in reduce_sampling where enabling speculative sampling causes an error with a single curl request. #8308
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.20.2
- vLLM main: vllm-project/vllm@1ac10f1

---------

Signed-off-by: hzx55906 <513464215@qq.com>
What this PR does / why we need it?
This PR introduces a significant optimization to the sampling process in a distributed (Tensor Parallel) environment, named "reduce sampling". The core goal is to minimize communication overhead during token selection, thereby improving the performance of vLLM Ascend in distributed deployment scenarios.
In traditional distributed sampling, gathering the full vocabulary across tensor parallel ranks leads to high communication costs, especially in large-scale parallel settings. This optimization instead has each rank reduce its vocabulary shard to a small set of candidates (the local maximum for greedy sampling, the local top-k for top-k/top-p sampling) and gathers only those candidates together with their global indices, which cuts unnecessary data transmission while preserving sampling accuracy.
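To give a rough sense of the saving, the short calculation below compares how many values one token contributes to a full-vocabulary gather with how many it contributes when each rank sends only its local top-k candidates. It is a back-of-the-envelope sketch, not a measurement from this PR: the function name, the vocabulary size, the tensor-parallel size, and `top_k` are all hypothetical, and vocabulary padding is ignored.

```python
def gathered_elements_per_token(vocab_size: int, tp_size: int, top_k: int) -> tuple[int, int]:
    """Return (full_gather, reduced_gather) element counts for one token.

    Hypothetical helper for illustration only; it is not part of vLLM Ascend.
    """
    # Full-vocabulary gather: every rank contributes its whole vocab shard,
    # so each rank ends up holding all vocab_size logits.
    full_gather = vocab_size
    # Reduced gather: every rank contributes only its local top-k candidates,
    # each as a (logit, global index) pair, hence the factor of 2.
    reduced_gather = tp_size * top_k * 2
    return full_gather, reduced_gather


# Illustrative sizes only; they are assumptions, not values taken from the PR.
full, reduced = gathered_elements_per_token(vocab_size=151_936, tp_size=8, top_k=64)
print(f"full gather: {full} values, reduced gather: {reduced} values")
```

The reduced count grows with the tensor-parallel size and with k but not with the vocabulary size, which is why the gap widens for large vocabularies.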
Does this PR introduce any user-facing change?
No breaking user-facing changes. A new configuration option enable_reduce_sample is added to AscendConfig, which users can optionally enable to activate the optimized distributed sampling scheme. Existing sampling behaviors remain unchanged by default.
How was this patch tested?
Tested with vLLM version based on v0.20.1.
- Validation of distributed greedy sampling logic, ensuring correct global argmax selection across tensor parallel ranks.
- Verification of distributed top-k/top-p sampling, confirming accurate candidate aggregation and filtering.
- Functional testing of speculative decoding rejection sampling with the new compressed vocabulary mode.
- End-to-end testing of the integrated sampling pipeline in ModelRunnerV1, covering both regular and speculative decoding workflows.
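The greedy check in the list above can be reproduced on a single machine. The sketch below simulates tensor-parallel ranks with plain Python lists: each rank reports its local maximum and the matching global index, and the winner is compared with an ordinary argmax over the full row. All function names are hypothetical, the list comprehension stands in for the real all-gather, and the vocabulary is assumed to split evenly across ranks.

```python
def local_max(shard, rank, shard_size):
    """Per-rank step: best local logit and its *global* vocab index."""
    local_idx = max(range(len(shard)), key=shard.__getitem__)
    return shard[local_idx], rank * shard_size + local_idx


def distributed_greedy(logits, tp_size):
    """Simulate greedy sampling over contiguous, equally sized vocab shards."""
    shard_size = len(logits) // tp_size
    shards = [logits[r * shard_size:(r + 1) * shard_size] for r in range(tp_size)]
    # Stand-in for the all-gather: one (value, global_index) pair per rank.
    gathered = [local_max(s, r, shard_size) for r, s in enumerate(shards)]
    # Highest value wins; ties go to the lowest global index.
    _, best_index = max(gathered, key=lambda pair: (pair[0], -pair[1]))
    return best_index


logits = [0.1, 2.5, -1.0, 0.7, 3.2, 0.0, 3.1, -0.4]
# The distributed result must match a plain argmax over the whole row.
assert distributed_greedy(logits, tp_size=4) == logits.index(max(logits))
```

Only `tp_size` pairs cross rank boundaries per token here, regardless of how large the vocabulary is.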
Summary of Changes
- Optimized Distributed Sampling Scheme: Added enable_reduce_sample in AscendConfig to toggle the optimized sampling mode for Tensor Parallelism, reducing communication overhead.
- Distributed Greedy Sampling: Implemented a new greedy_sample function that gathers local maximum logits and their global indices across ranks to determine the global argmax, avoiding full vocabulary gathering.
- Distributed Top-K/Top-P Sampling: Enhanced top-k/top-p logic to perform local top-k selection first, then gather candidates across ranks and apply top-p filtering on the combined global candidates.
- Updated Rejection Sampling: Modified core rejection sampling functions (rejection_sample, rejection_random_sample_kernel, etc.) to support compressed vocabulary mode, working with reduced candidate tokens and global indices for efficiency.
- Model-Specific Logits Handling: Patched Eagle3LlamaForCausalLM.compute_logits to return only selected draft token IDs (and bias) when reduce sampling is enabled, optimizing draft generation.
- Pipeline Integration: Integrated AscendRejectionSampler and AscendTopKTopPSampler into ModelRunnerV1’s sampling pipeline, adding prepare_sampling calls to pass max_topk dynamically.
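The local top-k, gather, top-p flow from the list above can be sketched in plain Python. This is an illustration under stated assumptions rather than the PR's kernels: all function names are hypothetical, the loop over ranks stands in for the all-gather, the vocabulary is assumed to split evenly, and the global top-k re-selection before top-p is an assumption made so that the result matches single-device top-k followed by top-p.

```python
import math


def local_top_k(shard, rank, shard_size, k):
    """Per-rank step: top-k local logits as (logit, global_index) pairs."""
    order = sorted(range(len(shard)), key=lambda i: shard[i], reverse=True)[:k]
    return [(shard[i], rank * shard_size + i) for i in order]


def top_p_filter(candidates, p):
    """Keep the shortest highest-probability prefix whose mass reaches p."""
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    peak = ranked[0][0]
    # Softmax over the candidate set only, shifted by the peak for stability.
    weights = [math.exp(value - peak) for value, _ in ranked]
    total = sum(weights)
    kept, mass = [], 0.0
    for candidate, weight in zip(ranked, weights):
        kept.append(candidate)
        mass += weight / total
        if mass >= p:
            break
    return kept


def reduce_top_k_top_p(logits, tp_size, k, p):
    """Local top-k on each shard, simulated gather, then global top-k and top-p."""
    shard_size = len(logits) // tp_size
    gathered = []
    for rank in range(tp_size):  # stand-in for the all-gather
        shard = logits[rank * shard_size:(rank + 1) * shard_size]
        gathered.extend(local_top_k(shard, rank, shard_size, k))
    # The global top-k is always contained in the union of the local top-k sets.
    global_top_k = sorted(gathered, key=lambda c: c[0], reverse=True)[:k]
    return top_p_filter(global_top_k, p)


logits = [2.0, 0.5, 1.0, -1.0, 3.0, 0.0, 2.5, 1.5]
survivors = reduce_top_k_top_p(logits, tp_size=2, k=3, p=0.8)
print([global_index for _, global_index in survivors])
```

Because each rank forwards k candidates, the gathered buffer holds `tp_size * k` pairs instead of the full vocabulary, yet no token that belongs to the global top-k can be lost.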
vLLM version: v0.20.1
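For the compressed vocabulary mode of rejection sampling, the sketch below shows one way the usual speculative-decoding acceptance test could work when the target distribution covers only the surviving candidates. It is an assumption-laden illustration, not the PR's Triton kernel: the function and argument names are hypothetical, draft probabilities are assumed to be indexed by global vocab id, and a draft token missing from the candidate set is treated as having zero target probability.

```python
def accept_draft_token(draft_token_id, draft_probs, candidate_ids, target_probs, uniform_sample):
    """Acceptance test min(1, p / q) against a compressed target row.

    draft_probs    : draft distribution, indexed by global vocab id
    candidate_ids  : global vocab ids of the surviving target candidates
    target_probs   : target probabilities aligned with candidate_ids
    uniform_sample : a number drawn uniformly from [0, 1)
    """
    q = draft_probs[draft_token_id]
    if draft_token_id in candidate_ids:
        # Map the global id to its slot in the compressed row.
        p = target_probs[candidate_ids.index(draft_token_id)]
    else:
        # Assumption: tokens filtered out of the candidate set have p = 0.
        p = 0.0
    # Accept when uniform_sample <= p / q, written without a division.
    return q > 0.0 and uniform_sample * q <= p


draft_probs = [0.1, 0.6, 0.3, 0.0]   # global vocab of 4 tokens (illustrative)
candidate_ids = [2, 1]               # compressed row keeps tokens 2 and 1
target_probs = [0.7, 0.3]            # aligned with candidate_ids

print(accept_draft_token(1, draft_probs, candidate_ids, target_probs, uniform_sample=0.4))
print(accept_draft_token(0, draft_probs, candidate_ids, target_probs, uniform_sample=0.5))
```

The point of the sketch is the two index spaces: the compressed row is addressed by candidate slot, while the draft distribution is addressed by global vocab id, so a lookup has to translate between them.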