[Performance]Reduce sampling by hzx55906 · Pull Request #8308 · vllm-project/vllm-ascend

hzx55906 · 2026-04-15T10:35:49Z

What this PR does / why we need it?
This PR introduces a significant optimization to the sampling process in a distributed (Tensor Parallel) environment, named "reduce sampling". The core goal is to minimize communication overhead during token selection, thereby improving the performance of vLLM Ascend in distributed deployment scenarios.

In traditional distributed sampling, full vocabulary gathering across tensor parallel ranks leads to high communication costs, especially in large-scale parallel settings. This optimization addresses the issue by designing specialized sampling logic that intelligently aggregates and processes vocabulary information, reducing unnecessary data transmission while ensuring sampling accuracy.
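As a rough illustration of the saving described above, the sketch below compares how many numbers each rank ends up holding after a full-vocabulary all-gather with how many it holds when every rank contributes only its local top-k candidates. The batch size, vocabulary size, tensor-parallel degree, and k are illustrative assumptions, not figures from this PR, and the function names are hypothetical.

```python
def gathered_values_full(batch: int, vocab_size: int) -> int:
    """Numbers held per rank after all-gathering full logits (batch x vocab)."""
    return batch * vocab_size


def gathered_values_reduced(batch: int, tp_size: int, k: int) -> int:
    """Numbers held per rank after all-gathering local top-k candidates.

    Each rank contributes k (logit, global index) pairs per request, so the
    gathered result holds 2 * tp_size * k numbers per request.
    """
    return batch * tp_size * k * 2


# Hypothetical sizes, chosen only for illustration.
batch, vocab_size, tp_size, k = 32, 128_000, 8, 64
full = gathered_values_full(batch, vocab_size)
reduced = gathered_values_reduced(batch, tp_size, k)
print(f"full gather: {full}, reduced gather: {reduced}, ratio: {full // reduced}x")
```

The ratio depends only on `vocab_size / (2 * tp_size * k)`, so the benefit shrinks as k or the tensor-parallel degree grows.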

Does this PR introduce any user-facing change?

No breaking user-facing changes. A new configuration option enable_reduce_sample is added to AscendConfig, which users can optionally enable to activate the optimized distributed sampling scheme. Existing sampling behaviors remain unchanged by default.

How was this patch tested?

Tested with vLLM version based on v0.20.1.

- Validation of distributed greedy sampling logic, ensuring correct global argmax selection across tensor parallel ranks.
- Verification of distributed top-k/top-p sampling, confirming accurate candidate aggregation and filtering.
- Functional testing of speculative decoding rejection sampling with the new compressed vocabulary mode.
- End-to-end testing of the integrated sampling pipeline in ModelRunnerV1, covering both regular and speculative decoding workflows.

Summary of Changes

- Optimized Distributed Sampling Scheme: Added enable_reduce_sample in AscendConfig to toggle the optimized sampling mode for Tensor Parallelism, reducing communication overhead.
- Distributed Greedy Sampling: Implemented a new greedy_sample function that gathers local maximum logits and their global indices across ranks to determine the global argmax, avoiding full vocabulary gathering.
- Distributed Top-K/Top-P Sampling: Enhanced top-k/top-p logic to perform local top-k selection first, then gather candidates across ranks and apply top-p filtering on the combined global candidates.
- Updated Rejection Sampling: Modified core rejection sampling functions (rejection_sample, rejection_random_sample_kernel, etc.) to support compressed vocabulary mode, working with reduced candidate tokens and global indices for efficiency.
- Model-Specific Logits Handling: Patched Eagle3LlamaForCausalLM.compute_logits to return only selected draft token IDs (and bias) when reduce sampling is enabled, optimizing draft generation.
- Pipeline Integration: Integrated AscendRejectionSampler and AscendTopKTopPSampler into ModelRunnerV1’s sampling pipeline, adding prepare_sampling calls to pass max_topk dynamically.
- vLLM version: v0.20.1
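The distributed greedy step summarized above can be sketched without any communication library by treating each tensor-parallel rank as one vocabulary shard in a list. This is a minimal single-process simulation under that assumption, not the PR's `greedy_sample` implementation; `local_max` and `distributed_greedy` are hypothetical names.

```python
from typing import List, Tuple


def local_max(shard: List[float], vocab_start: int) -> Tuple[float, int]:
    """Return (max logit, global token id) for one rank's vocabulary shard."""
    local_idx = max(range(len(shard)), key=shard.__getitem__)
    return shard[local_idx], vocab_start + local_idx


def distributed_greedy(shards: List[List[float]]) -> int:
    """Simulate greedy sampling over vocabulary-sharded logits.

    `shards[r]` stands in for the logits held by tensor-parallel rank r.
    The loop below plays the role of the all-gather: only one
    (value, global index) pair per rank is exchanged, not the full shard.
    """
    candidates = []
    offset = 0
    for shard in shards:
        candidates.append(local_max(shard, offset))
        offset += len(shard)
    # Pick the largest value; ties go to the smallest global index, i.e. the
    # first occurrence of the maximum in the concatenated logits.
    _, best_index = max(candidates, key=lambda c: (c[0], -c[1]))
    return best_index


shards = [[0.1, 2.5, -1.0], [0.3, 0.2, 3.7], [3.7, -0.5, 1.1]]
print(distributed_greedy(shards))
```

In a real deployment the per-rank pairs would be exchanged with a collective operation; the selection logic after the gather stays the same.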

github-actions · 2026-04-15T10:36:04Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

- A PR should do only one thing; smaller PRs enable faster reviews.
- Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
- Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

gemini-code-assist · 2026-04-15T10:37:17Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant optimization to the sampling process within a distributed (Tensor Parallel) environment, termed 'reduce sampling'. The primary goal is to enhance performance by minimizing communication overhead during token selection. This is achieved through specialized implementations for greedy, top-k, and top-p sampling that intelligently aggregate and process vocabulary information across parallel ranks. The changes also adapt the speculative decoding rejection sampling mechanism to work seamlessly with this new, more efficient distributed sampling approach, and integrate these optimizations into the model's logits computation and overall sampling workflow.

Highlights

- Optimized Distributed Sampling Scheme: Introduced a new configuration option, enable_reduce_sample, in AscendConfig to activate an optimized sampling scheme designed for distributed environments (Tensor Parallelism). This scheme aims to reduce communication overhead during sampling.
- Distributed Greedy Sampling: Implemented a new greedy_sample function that performs distributed greedy sampling by gathering local maximum logits and their global indices across tensor parallel ranks to determine the global argmax, thereby reducing communication compared to full vocabulary gathering.
- Distributed Top-K/Top-P Sampling: Enhanced top-k and top-p sampling logic to operate efficiently in a distributed setting. This involves performing local top-k selection, gathering these candidates and their global indices across all tensor parallel ranks, and then applying top-p filtering on the globally combined set of candidates.
- Updated Rejection Sampling for Compressed Vocabulary: Modified the core rejection sampling functions (rejection_sample, rejection_random_sample_kernel, sample_recovered_tokens_kernel, and their PyTorch counterparts) to support a 'compressed vocabulary' mode. This allows the sampling process to work with a reduced set of candidate tokens and their corresponding global indices, improving efficiency.
- Model-Specific Logits Handling: Patched the compute_logits method for Eagle3LlamaForCausalLM to conditionally return only the selected draft token IDs (and a bias) when the reduce sampling scheme is enabled, instead of full logits, further optimizing the draft generation process.
- Integration into Sampling Pipeline: Integrated the new AscendRejectionSampler and AscendTopKTopPSampler into the ModelRunnerV1's sampling pipeline. This includes adding prepare_sampling calls to dynamically pass max_topk values, ensuring the distributed sampling logic is correctly applied during both regular and speculative decoding.
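The top-k/top-p flow in the highlights above (local top-k, gather, then top-p on the combined candidates) can be sketched in plain Python by simulating the ranks as vocabulary shards. This is an illustrative single-process model, not the PR's `AscendTopKTopPSampler`; the names are hypothetical, and applying a second, global top-k cut after the gather is an assumption of this sketch.

```python
import math
from typing import Dict, List, Tuple

Candidate = Tuple[float, int]  # (logit, global token id)


def local_top_k(shard: List[float], vocab_start: int, k: int) -> List[Candidate]:
    """Top-k (logit, global id) pairs from one rank's vocabulary shard."""
    pairs = [(logit, vocab_start + i) for i, logit in enumerate(shard)]
    return sorted(pairs, key=lambda c: c[0], reverse=True)[:k]


def reduce_top_k_top_p(shards: List[List[float]], k: int, p: float) -> Dict[int, float]:
    """Return {global token id: probability} after top-k and top-p filtering."""
    gathered: List[Candidate] = []
    offset = 0
    for shard in shards:  # stands in for the all-gather across ranks
        gathered.extend(local_top_k(shard, offset, k))
        offset += len(shard)
    # The global top-k is always contained in the union of the local top-k sets.
    top = sorted(gathered, key=lambda c: c[0], reverse=True)[:k]
    # Softmax over the surviving candidates only (shifted for stability).
    peak = top[0][0]
    weights = [math.exp(logit - peak) for logit, _ in top]
    total = sum(weights)
    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for (_, token_id), weight in zip(top, weights):
        prob = weight / total
        kept.append((token_id, prob))
        cumulative += prob
        if cumulative >= p:
            break
    norm = sum(prob for _, prob in kept)
    return {token_id: prob / norm for token_id, prob in kept}


print(reduce_top_k_top_p([[1.0, 4.0, 0.5], [3.0, -2.0, 2.0]], k=2, p=0.9))
```

Because each rank sends at most k candidates, the gathered set has at most `tp_size * k` entries regardless of the vocabulary size.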

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

Suggested PR Title:

[Ops][Feature] Implement optimized reduce sampling scheme for Ascend NPUs

Suggested PR Summary:

### What this PR does / why we need it?
This PR implements an optimized "reduce sampling" scheme for Ascend NPUs to enhance performance in distributed speculative decoding. It introduces the `AscendRejectionSampler` and updates the sampling pipeline to use a compressed logit flow (top-k -> allgather -> top-p), significantly reducing inter-node communication by avoiding full vocabulary gathers. Key changes include updates to Triton kernels for rejection sampling and bincount to handle tensor parallel ranks, and a new configuration flag `enable_reduce_sample`.

Feedback: Several critical issues were identified during review. The Triton kernels in `reject_sample.py` incorrectly use the batch `BLOCK_SIZE` for vocabulary indexing, which will fail when the compressed vocabulary size exceeds the batch size. Additionally, `AscendVocabParallelEmbedding` contains an `AttributeError` and incorrect slicing logic that ignores padding. Finally, the patch for `Eagle3LlamaForCausalLM` introduces a breaking change to the `compute_logits` API return type.

### Does this PR introduce _any_ user-facing change?
Yes, it introduces the `enable_reduce_sample` configuration option and modifies the internal return signature of `compute_logits` for Eagle3 models, which may affect custom integrations.

### How was this patch tested?
CI tests should be performed to ensure the new sampling flow maintains accuracy. Specific attention should be paid to verifying the Triton kernels with compressed vocabulary sizes larger than the batch size.
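The review feedback above notes that a kernel sized to the batch `BLOCK_SIZE` cannot index a compressed vocabulary larger than that block. One common way to avoid this kind of limit is to loop over the vocabulary dimension in fixed-size tiles and ignore lanes past the end. The function below is a pure-Python analogue of that pattern, not the PR's Triton kernel; `tiled_argmax` and its parameters are hypothetical names.

```python
from typing import List


def tiled_argmax(row: List[float], block_size: int) -> int:
    """Argmax over `row`, processed in tiles of `block_size` elements.

    The result does not depend on `block_size`, so the vocabulary length may
    be smaller or larger than the tile width.
    """
    best_value, best_index = float("-inf"), -1
    for start in range(0, len(row), block_size):
        # A kernel would load `block_size` lanes and mask those past the end
        # of the row; slicing has the same effect here.
        tile = row[start:start + block_size]
        for lane, value in enumerate(tile):
            if value > best_value:
                best_value, best_index = value, start + lane
    return best_index


row = [0.0, 1.0, 5.0, 2.0, 3.0, 4.0, -1.0, 0.5, 0.2, 9.0]
print(tiled_argmax(row, block_size=4), tiled_argmax(row, block_size=16))
```

A test that uses a vocabulary longer than the tile width, as the feedback suggests, would exercise the multi-tile path.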

github-actions · 2026-04-16T01:12:55Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions · 2026-04-21T01:02:26Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

weijinqian0 · 2026-05-14T08:57:35Z

We need to avoid modifying the related functions in reject_sample.

MengqingCao · 2026-05-14T08:59:41Z

-
-        draft_token_ids = logits.argmax(dim=-1)
+        logits = self.model.compute_logits(sample_hidden_states, get_ascend_config().enable_reduce_sample)
+        if not get_ascend_config().enable_reduce_sample:


I prefer to enable this feature by default, I think this brings a general performance gain.

I have set this optimization to be enabled by default, so no manual activation is required in the settings.

lilinsiman · 2026-05-14T09:04:33Z

        else:
            last_hidden_states, hidden_states = ret_hidden_states

-        if self.method != "dflash":


Check whether the DFlash condition still needs to be checked here.

Fixed. This optimization also applies to DFlash.

lilinsiman · 2026-05-14T09:08:08Z

        multi_steps_attn_metadata = [MagicMock(), MagicMock(), MagicMock()]

+        mock_ascend_config = MagicMock()
+        mock_ascend_config.enable_reduce_sample = False


Please add a scenario where reduce_sample is true, and modify or add the corresponding assertions.

After enabling this optimization by default, the issue of incomplete UT test coverage has been resolved.

lilinsiman · 2026-05-14T09:31:06Z

Check whether reduce_sample and lmhead can be enabled at the same time. The logic of the judgment in this line is inconsistent with that in line 930.

Fixed. This optimization can still be enabled in the lmhead_tp_enable scenario by simply delaying the all-to-all communication following the lmhead layer.

github-actions · 2026-05-14T13:49:27Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

linfeng-yuan

LGTM.
with 3 questions:

- Add an e2e case with MTP and sampling metadata.
- Answer Jinqian's question about the implementation of rej_sampler.
- Update the reason for the patch and its elimination plan in vllm_ascend/patch/__init__.py.

Signed-off-by: hzx55906 <513464215@qq.com>

MengqingCao · 2026-05-25T10:44:33Z

+    vocab_size,  # vocab_size or selected_vocab_size if ENABLE_REDUCE_SAMPLING
+    global_vocab_size,  # global vocab size for draft_probs indexing (only used if ENABLE_REDUCE_SAMPLING)


It seems the suggestion from @weijinqian0 has not been addressed yet?

This reverts commit 526141c.

### What this PR does / why we need it?
Fix the issue in reduce_sampling where enabling speculative sampling causes an error with a single curl request. #8308

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: v0.20.2
- vLLM main: vllm-project/vllm@1ac10f1

---------

Signed-off-by: hzx55906 <513464215@qq.com>


hzx55906 requested review from MengqingCao, realliujiaxu, wangxiyuan, whx-sjtu and zzzzwwjj as code owners April 15, 2026 10:35

github-actions Bot added module:ops module:core labels Apr 15, 2026

gemini-code-assist Bot reviewed Apr 15, 2026

Comment thread vllm_ascend/ops/triton/reject_sample.py Outdated

Comment thread vllm_ascend/ops/triton/reject_sample.py Outdated

Comment thread vllm_ascend/ops/vocab_parallel_embedding.py Outdated

Comment thread vllm_ascend/patch/worker/patch_llama_eagle3.py Outdated

github-actions Bot added the merge-conflicts label Apr 16, 2026

hzx55906 changed the title ~~Reduce sampling~~ [Performance]Reduce sampling Apr 16, 2026

hzx55906 force-pushed the reduce_sampling branch from 29d3256 to 24ce79d Compare April 16, 2026 02:27

github-actions Bot added merge-conflicts and removed merge-conflicts labels Apr 16, 2026

hzx55906 force-pushed the reduce_sampling branch from c8c82ad to bb3f563 Compare May 12, 2026 02:54

github-actions Bot removed the merge-conflicts label May 12, 2026

weijinqian0 approved these changes May 14, 2026

MengqingCao reviewed May 14, 2026

lilinsiman reviewed May 14, 2026

github-actions Bot added the merge-conflicts label May 14, 2026

hzx55906 force-pushed the reduce_sampling branch from 7cfcb44 to 2a7e47e Compare May 18, 2026 03:06

github-actions Bot removed the merge-conflicts label May 18, 2026

linfeng-yuan approved these changes May 18, 2026

github-actions Bot added the merge-conflicts label May 21, 2026

hzx55906 force-pushed the reduce_sampling branch from 6f77638 to 4cc677e Compare May 21, 2026 10:41

github-actions Bot removed the merge-conflicts label May 21, 2026

recover_reduce_sampling

5de3534

Signed-off-by: hzx55906 <513464215@qq.com>

hzx55906 force-pushed the reduce_sampling branch from 10cb0d3 to 5de3534 Compare May 22, 2026 01:21

hzx55906 added 10 commits May 22, 2026 10:43

fix_llm_base_proposer

91a6a47

Signed-off-by: hzx55906 <513464215@qq.com>

remove compressed

9a3874f

Signed-off-by: hzx55906 <513464215@qq.com>

remove compressed

11d5f81

Signed-off-by: hzx55906 <513464215@qq.com>

lmheadtp

d517928

Signed-off-by: hzx55906 <513464215@qq.com>

remove_patch

6dada4c

Signed-off-by: hzx55906 <513464215@qq.com>

remove compressed

1a072e1

Signed-off-by: hzx55906 <513464215@qq.com>

fix lmheadtp

743e6ff

Signed-off-by: hzx55906 <513464215@qq.com>

fix ut

93f0d36

Signed-off-by: hzx55906 <513464215@qq.com>

fix ut

9068292

Signed-off-by: hzx55906 <513464215@qq.com>

fix ut

ae11224

Signed-off-by: hzx55906 <513464215@qq.com>

MengqingCao reviewed May 25, 2026

weijinqian0 merged commit 526141c into vllm-project:main May 25, 2026
55 checks passed

zhao-stack pushed a commit to zhao-stack/vllm-ascend that referenced this pull request May 25, 2026

Revert "[Performance]Reduce sampling (vllm-project#8308)"

6bdf84b

This reverts commit 526141c.

hzx55906 mentioned this pull request May 25, 2026

[BugFix] fix reduce_sampling #9545

Merged

zhao-stack pushed a commit to zhao-stack/vllm-ascend that referenced this pull request May 26, 2026

Revert "[Performance]Reduce sampling (vllm-project#8308)"

e9a32d9

This reverts commit 526141c.
