[Performance]Reduce sampling #8308

Merged: weijinqian0 merged 11 commits into vllm-project:main from hzx55906:reduce_sampling on May 25, 2026

Conversation

@hzx55906 hzx55906 commented Apr 15, 2026

Contributor

What this PR does / why we need it?
This PR introduces a significant optimization to the sampling process in a distributed (Tensor Parallel) environment, named "reduce sampling". The core goal is to minimize communication overhead during token selection, thereby improving the performance of vLLM Ascend in distributed deployment scenarios.

In traditional distributed sampling, every tensor parallel rank gathers the full vocabulary, which is costly in communication, especially at large parallel sizes. This optimization instead has each rank reduce its vocabulary shard to a small set of local candidates (the local maximum for greedy sampling, the local top-k for top-k/top-p sampling) and gathers only those candidates together with their global indices, cutting unnecessary data transmission while preserving sampling accuracy.
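
To make the saving concrete, the sketch below counts how many values land on each rank per sampled token under the two schemes. The vocabulary size, tensor parallel size, and candidate count are illustrative assumptions, not figures from this PR, and the function name is hypothetical. Padding and dtype differences are ignored.

```python
def gathered_elements_per_token(vocab_size: int, tp_size: int, local_k: int) -> dict:
    """Count values present on each rank after the gather, per sampled token.

    Assumes vocab_size is evenly divisible by tp_size (no padding).
    """
    shard = vocab_size // tp_size
    # Full gather: every rank ends up holding all tp_size shards of logits.
    full = shard * tp_size
    # Reduced gather: each rank contributes local_k (logit, global index) pairs.
    reduced = 2 * local_k * tp_size
    return {"full": full, "reduced": reduced, "ratio": full / reduced}


# Hypothetical configuration: 128k vocabulary, TP=8, 50 candidates per rank.
stats = gathered_elements_per_token(vocab_size=128_000, tp_size=8, local_k=50)
```

With these assumed numbers the reduced scheme moves 800 values instead of 128,000. For greedy sampling `local_k` is 1, so the gap is larger still.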

Does this PR introduce any user-facing change?

No breaking user-facing changes. A new configuration option enable_reduce_sample is added to AscendConfig, which users can optionally enable to activate the optimized distributed sampling scheme. Existing sampling behaviors remain unchanged by default.

How was this patch tested?

Tested with vLLM version based on v0.20.1.

  • Validation of distributed greedy sampling logic, ensuring correct global argmax selection across tensor parallel ranks.

  • Verification of distributed top-k/top-p sampling, confirming accurate candidate aggregation and filtering.

  • Functional testing of speculative decoding rejection sampling with the new compressed vocabulary mode.

  • End-to-end testing of the integrated sampling pipeline in ModelRunnerV1, covering both regular and speculative decoding workflows.
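
The compressed vocabulary mode exercised by the rejection sampling test can be pictured as follows: the target probabilities exist only for a small candidate set, and a parallel list records each candidate's global token ID. The sketch applies the standard speculative-decoding acceptance rule (accept with probability min(1, p_target / p_draft)) on such a set. All names here are hypothetical and do not reflect the signatures of the PR's kernels.

```python
def accept_draft_token(draft_token_id, draft_prob, cand_global_ids,
                       cand_target_probs, u):
    """Decide whether to accept one draft token on a compressed candidate set.

    cand_global_ids[i] is the global vocabulary ID of candidate i and
    cand_target_probs[i] its target-model probability. A draft token that is
    not among the candidates is treated as having target probability 0.
    u is a uniform random number in [0, 1).
    """
    position = {gid: i for i, gid in enumerate(cand_global_ids)}
    idx = position.get(draft_token_id)
    target_prob = cand_target_probs[idx] if idx is not None else 0.0
    if draft_prob <= 0.0:
        return False
    # Accept with probability min(1, target_prob / draft_prob).
    return u < target_prob / draft_prob


cand_ids = [7, 42, 1001]          # global token IDs of the candidates
cand_probs = [0.6, 0.3, 0.1]      # target probabilities over the candidates
accepted = accept_draft_token(42, 0.5, cand_ids, cand_probs, u=0.5)
```

The point of the global-ID list is that the draft token arrives as a global ID, so it has to be mapped to a position in the compressed set before any probability lookup.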

Summary of Changes

  • Optimized Distributed Sampling Scheme: Added enable_reduce_sample in AscendConfig to toggle the optimized sampling mode for Tensor Parallelism, reducing communication overhead.

  • Distributed Greedy Sampling: Implemented a new greedy_sample function that gathers local maximum logits and their global indices across ranks to determine the global argmax, avoiding full vocabulary gathering.

  • Distributed Top-K/Top-P Sampling: Enhanced top-k/top-p logic to perform local top-k selection first, then gather candidates across ranks and apply top-p filtering on the combined global candidates.

  • Updated Rejection Sampling: Modified core rejection sampling functions (rejection_sample, rejection_random_sample_kernel, etc.) to support compressed vocabulary mode, working with reduced candidate tokens and global indices for efficiency.

  • Model-Specific Logits Handling: Patched Eagle3LlamaForCausalLM.compute_logits to return only selected draft token IDs (and bias) when reduce sampling is enabled, optimizing draft generation.

  • Pipeline Integration: Integrated AscendRejectionSampler and AscendTopKTopPSampler into ModelRunnerV1’s sampling pipeline, adding prepare_sampling calls to pass max_topk dynamically.

  • vLLM version: v0.20.1
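
The distributed greedy sampling item above can be sketched in plain Python, simulating the tensor parallel ranks as a list of vocabulary shards. This is a minimal illustration of the idea, not the PR's greedy_sample implementation; the function names and the tie-breaking rule are assumptions.

```python
def local_max(shard, vocab_start):
    """Return (max_logit, global_index) for one rank's vocabulary shard."""
    best = max(range(len(shard)), key=lambda i: shard[i])
    return shard[best], vocab_start + best


def distributed_greedy(shards):
    """Pick the global argmax from one (value, global index) pair per rank."""
    gathered = []
    offset = 0
    for shard in shards:
        # In a real deployment this list would be filled by an all-gather.
        gathered.append(local_max(shard, offset))
        offset += len(shard)
    # Break ties toward the lowest global index (an assumption of this sketch).
    _, index = max(gathered, key=lambda pair: (pair[0], -pair[1]))
    return index


# Three simulated ranks, three vocabulary entries each.
shards = [[0.1, 2.5, -1.0], [3.2, 0.0, 3.2], [1.1, 0.4, 2.9]]
token = distributed_greedy(shards)
```

Only one value and one index per rank cross the network, yet the result equals the argmax over the concatenated vocabulary.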

@github-actions

Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant optimization to the sampling process within a distributed (Tensor Parallel) environment, termed 'reduce sampling'. The primary goal is to enhance performance by minimizing communication overhead during token selection. This is achieved through specialized implementations for greedy, top-k, and top-p sampling that intelligently aggregate and process vocabulary information across parallel ranks. The changes also adapt the speculative decoding rejection sampling mechanism to work seamlessly with this new, more efficient distributed sampling approach, and integrate these optimizations into the model's logits computation and overall sampling workflow.

Highlights

  • Optimized Distributed Sampling Scheme: Introduced a new configuration option, enable_reduce_sample, in AscendConfig to activate an optimized sampling scheme designed for distributed environments (Tensor Parallelism). This scheme aims to reduce communication overhead during sampling.
  • Distributed Greedy Sampling: Implemented a new greedy_sample function that performs distributed greedy sampling by gathering local maximum logits and their global indices across tensor parallel ranks to determine the global argmax, thereby reducing communication compared to full vocabulary gathering.
  • Distributed Top-K/Top-P Sampling: Enhanced top-k and top-p sampling logic to operate efficiently in a distributed setting. This involves performing local top-k selection, gathering these candidates and their global indices across all tensor parallel ranks, and then applying top-p filtering on the globally combined set of candidates.
  • Updated Rejection Sampling for Compressed Vocabulary: Modified the core rejection sampling functions (rejection_sample, rejection_random_sample_kernel, sample_recovered_tokens_kernel, and their PyTorch counterparts) to support a 'compressed vocabulary' mode. This allows the sampling process to work with a reduced set of candidate tokens and their corresponding global indices, improving efficiency.
  • Model-Specific Logits Handling: Patched the compute_logits method for Eagle3LlamaForCausalLM to conditionally return only the selected draft token IDs (and a bias) when the reduce sampling scheme is enabled, instead of full logits, further optimizing the draft generation process.
  • Integration into Sampling Pipeline: Integrated the new AscendRejectionSampler and AscendTopKTopPSampler into the ModelRunnerV1's sampling pipeline. This includes adding prepare_sampling calls to dynamically pass max_topk values, ensuring the distributed sampling logic is correctly applied during both regular and speculative decoding.
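
The top-k/top-p flow described in the highlights (local top-k, gather, then top-p on the combined candidates) can be sketched as below, again simulating ranks as list shards. The global top-k re-selection after the gather is an assumption of this sketch, as are all function names; the PR's AscendTopKTopPSampler may differ in detail.

```python
import math


def local_topk(shard, vocab_start, k):
    """Return the k largest (logit, global_index) pairs of one shard."""
    pairs = [(logit, vocab_start + i) for i, logit in enumerate(shard)]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs[:k]


def distributed_topk_topp(shards, k, p):
    """Return [(global_index, prob)] that survive top-k and then top-p."""
    candidates = []
    offset = 0
    for shard in shards:
        # Extending the list stands in for an all-gather of k pairs per rank.
        candidates.extend(local_topk(shard, offset, k))
        offset += len(shard)
    # The global top-k is always contained in the union of the local top-k sets.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    candidates = candidates[:k]
    # Softmax over the surviving candidates only.
    top = candidates[0][0]
    weights = [math.exp(logit - top) for logit, _ in candidates]
    total = sum(weights)
    kept, cumulative = [], 0.0
    for (_, index), weight in zip(candidates, weights):
        prob = weight / total
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:  # smallest prefix whose mass reaches p
            break
    norm = sum(prob for _, prob in kept)
    return [(index, prob / norm) for index, prob in kept]


shards = [[1.0, 5.0, 0.0], [4.0, -2.0, 3.0]]
result = distributed_topk_topp(shards, k=2, p=0.9)
```

Each rank contributes k pairs rather than its whole shard, so the gathered tensor has tp_size × k entries per token regardless of vocabulary size.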

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

Suggested PR Title:

[Ops][Feature] Implement optimized reduce sampling scheme for Ascend NPUs

Suggested PR Summary:

### What this PR does / why we need it?
This PR implements an optimized "reduce sampling" scheme for Ascend NPUs to enhance performance in distributed speculative decoding. It introduces the `AscendRejectionSampler` and updates the sampling pipeline to use a compressed logit flow (top-k -> allgather -> top-p), significantly reducing inter-node communication by avoiding full vocabulary gathers. Key changes include updates to Triton kernels for rejection sampling and bincount to handle tensor parallel ranks, and a new configuration flag `enable_reduce_sample`.

Feedback: Several critical issues were identified during review. The Triton kernels in `reject_sample.py` incorrectly use the batch `BLOCK_SIZE` for vocabulary indexing, which will fail when the compressed vocabulary size exceeds the batch size. Additionally, `AscendVocabParallelEmbedding` contains an `AttributeError` and incorrect slicing logic that ignores padding. Finally, the patch for `Eagle3LlamaForCausalLM` introduces a breaking change to the `compute_logits` API return type.

### Does this PR introduce _any_ user-facing change?
Yes, it introduces the `enable_reduce_sample` configuration option and modifies the internal return signature of `compute_logits` for Eagle3 models, which may affect custom integrations.

### How was this patch tested?
CI tests should be performed to ensure the new sampling flow maintains accuracy. Specific attention should be paid to verifying the Triton kernels with compressed vocabulary sizes larger than the batch size.
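
The block-size concern raised in this review can be illustrated with a plain Python loop. The sketch assumes the issue is that a single block sized for the batch dimension was used to cover the vocabulary dimension; iterating over the vocabulary in blocks of an independent size handles any vocabulary length. The function name and block size parameter are hypothetical and this is not the Triton kernel from the PR.

```python
def blocked_argmax(row, vocab_block_size):
    """Argmax over `row`, visiting it in fixed-size blocks like a kernel loop."""
    best_val = float("-inf")
    best_idx = -1
    for start in range(0, len(row), vocab_block_size):
        # Slicing clips the final partial block, playing the role of a mask.
        block = row[start:start + vocab_block_size]
        for off, val in enumerate(block):
            if val > best_val:
                best_val, best_idx = val, start + off
    return best_idx


# A row longer than the block size: a single block would miss index 4.
row = [0.2, 0.9, 0.1, 0.4, 1.7, 0.3, 0.8]
winner = blocked_argmax(row, vocab_block_size=3)
```

A test in the spirit of the review would use a compressed vocabulary longer than the batch size, as `row` is longer than the block here.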

Comment thread vllm_ascend/ops/triton/reject_sample.py Outdated
Comment thread vllm_ascend/ops/triton/reject_sample.py Outdated
Comment thread vllm_ascend/ops/vocab_parallel_embedding.py Outdated
Comment thread vllm_ascend/patch/worker/patch_llama_eagle3.py Outdated
@github-actions

Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@hzx55906 hzx55906 changed the title from "Reduce sampling" to "[Performance]Reduce sampling" on Apr 16, 2026
@github-actions

Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@weijinqian0

Collaborator

We need to avoid amending the related functions in reject_sample.


draft_token_ids = logits.argmax(dim=-1)
logits = self.model.compute_logits(sample_hidden_states, get_ascend_config().enable_reduce_sample)
if not get_ascend_config().enable_reduce_sample:

Collaborator

I prefer to enable this feature by default, I think this brings a general performance gain.

Contributor Author

I have set this optimization to be enabled by default, so no manual activation is required in the settings.

else:
last_hidden_states, hidden_states = ret_hidden_states

if self.method != "dflash":

Collaborator

Please check whether this DFlash condition is actually needed.

Contributor Author

Fixed. This optimization also applies to DFlash.

multi_steps_attn_metadata = [MagicMock(), MagicMock(), MagicMock()]

mock_ascend_config = MagicMock()
mock_ascend_config.enable_reduce_sample = False

Collaborator

Please add a scenario where enable_reduce_sample is true, and modify or add the assertions accordingly.

Contributor Author

Now that this optimization is enabled by default, the incomplete unit test (UT) coverage has been resolved.

Collaborator

Please check whether reduce_sample and lmhead can be enabled at the same time. The condition on this line is inconsistent with the one on line 930.

Contributor Author

Fixed. This optimization can still be enabled in the lmhead_tp_enable scenario by simply delaying the all-to-all communication following the lmhead layer.

@github-actions

Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@linfeng-yuan linfeng-yuan left a comment

Collaborator

LGTM, with 3 requests:

  1. Add an e2e case with MTP and sampling metadata.
  2. Answer Jinqian's question about the implementation of rej_sampler.
  3. Update the reason for the patch and its elimination plan in vllm_ascend/patch/__init__.py.

Signed-off-by: hzx55906 <513464215@qq.com>
hzx55906 added 10 commits May 22, 2026 10:43
Comment on lines +155 to +156
vocab_size, # vocab_size or selected_vocab_size if ENABLE_REDUCE_SAMPLING
global_vocab_size, # global vocab size for draft_probs indexing (only used if ENABLE_REDUCE_SAMPLING)

Collaborator

It seems the suggestions from @weijinqian0 have not been addressed yet?

@weijinqian0 weijinqian0 merged commit 526141c into vllm-project:main May 25, 2026
55 checks passed
zhao-stack pushed a commit to zhao-stack/vllm-ascend that referenced this pull request May 25, 2026
zhao-stack pushed a commit to zhao-stack/vllm-ascend that referenced this pull request May 26, 2026
wangxiyuan pushed a commit that referenced this pull request May 26, 2026
### What this PR does / why we need it?

Fix the issue in reduce_sampling where enabling speculative sampling
causes an error with a single curl request.
#8308

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.20.2
- vLLM main:
vllm-project/vllm@1ac10f1

---------

Signed-off-by: hzx55906 <513464215@qq.com>
zzzzzmeng pushed a commit to zzzzzmeng/vllm-ascend that referenced this pull request May 28, 2026
Biuapha pushed a commit to Biuapha/vllm-ascend that referenced this pull request May 30, 2026
Signed-off-by: XhgAtHuawei <guoxiaohui7@huawei.com>
anysources pushed a commit to anysources/vllm-ascend that referenced this pull request May 30, 2026
Signed-off-by: wenke <tangzhengzhi@huawei.com>
yilunh998 pushed a commit to yilunh998/vllm-ascend that referenced this pull request Jun 2, 2026
Signed-off-by: yilunh <hanyilun1@huawei.com>
yilunh998 pushed a commit to yilunh998/vllm-ascend that referenced this pull request Jun 2, 2026
Signed-off-by: yilunh <hanyilun1@huawei.com>
LostFox11 pushed a commit to LostFox11/vllm-ascend that referenced this pull request Jun 15, 2026
LostFox11 pushed a commit to LostFox11/vllm-ascend that referenced this pull request Jun 15, 2026
LostFox11 pushed a commit to LostFox11/vllm-ascend that referenced this pull request Jun 15, 2026
LostFox11 pushed a commit to LostFox11/vllm-ascend that referenced this pull request Jun 15, 2026
6 participants