fix: align sampler batch size with decode batch size #474
Open
rebel-daeyang wants to merge 6 commits into dev from sampler_batch_size_to_decode_batch_size
Conversation
rebel-ykchoi requested changes on Mar 26, 2026
Comment on lines 1926 to 1956:
```python
# compile sampler for all possible decode batches
max_decode_batch = self.bucketing_manager.decode_batch_buckets[-1]
for decode_batch in range(1, max_decode_batch + 1):
    # build a dummy batch of `decode_batch` fully-computed requests
    dummy_decode_requests = []
    dummy_decode_num_scheduled_tokens = {}
    for _ in range(decode_batch):
        self._add_dummy_requests(
            requests=dummy_decode_requests,
            num_scheduled_tokens=dummy_decode_num_scheduled_tokens,
            total_tokens=decode_max_seq_len,
            num_computed_tokens=decode_max_seq_len,
            num_kv_cache_groups=num_kv_cache_groups,
            sampling_params=None
            if self.is_pooling_model
            else SamplingParams(temperature=0.0),
            pooling_params=PoolingParams(
                task=self.get_supported_pooling_tasks()[0]
            )
            if self.is_pooling_model
            else None,
        )
    so, cso = self._make_dummy_scheduler_outputs(
        dummy_decode_requests,
        dummy_decode_num_scheduled_tokens,
        num_kv_cache_groups,
    )
    current_intermediate_tensors = self.decode_intermediate_tensors.get(
        decode_batch
    )
    assert current_intermediate_tensors is not None
    # run a dummy step so the sampler is compiled for this batch size
    self._execute_dummy_requests(so, cso, current_intermediate_tensors)
```
Contributor
Please remove the code that compiles the sampler for every num_tokens in the decode stage, as it is no longer necessary.
Author
I noticed that applying this PR removes logits slicing for CPU tensor inputs, which leads to a decrease in sampler performance.
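For context, a minimal sketch of the CPU-side fast path being referred to (hypothetical function and parameter names, not the actual vllm-rbln code):

```python
import torch

def maybe_slice_logits(logits: torch.Tensor, num_reqs: int) -> torch.Tensor:
    # On CPU tensors, slicing is a cheap view that shrinks the sampler's work
    # to the requests actually scheduled, so dropping it costs performance.
    if logits.device.type == "cpu":
        return logits[:num_reqs]
    # On rbln device tensors the same slice would trigger d2h/h2d transfers,
    # so the full padded logits are returned unchanged.
    return logits
```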
🚀 Summary of Changes
Previously, the decoder output logits were sliced before being passed to the sampler. However, performing slicing operations on rbln device tensors triggers unnecessary device-to-host (d2h) and host-to-device (h2d) transfers, leading to a significant increase in latency.
To resolve this performance issue, this PR removes the logits slicing logic and updates the sampler to accept the full logits, matching the decode batch size.
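As an illustration, a minimal sketch of the before/after sampler call (hypothetical names such as `run_sampler` and `num_reqs`; it assumes padding rows are dropped downstream when per-request outputs are gathered):

```python
import torch

def run_sampler(sampler, logits: torch.Tensor, num_reqs: int, metadata):
    # Before this PR: slice the padded logits down to the scheduled requests.
    # On rbln device tensors this slice triggers d2h/h2d round trips:
    #     return sampler(logits[:num_reqs], metadata)

    # After this PR: pass the full logits, padded to the decode batch bucket,
    # to a sampler compiled for that same batch size; rows belonging to
    # padding slots are discarded when outputs are gathered on the host.
    return sampler(logits, metadata)
```

Because the sampler is compiled for every decode batch bucket (see the loop quoted in the review above), the unsliced logits always match a compiled sampler shape.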
📌 Related Issues / Tickets
✅ Type of Change
- [ ] release
- [ ] feature
- [ ] model
- [ ] core
- [x] fix
- [ ] perf
- [ ] refactor
- [ ] docs
- [ ] other (please describe)

🧪 How to Test

📸 Screenshots / Logs (if applicable)
📋 Checklist
💬 Notes
- Verified outputs for batch sizes 1–64, temperature/top_p/top_k sampling, and chunked prefill with the llama3.2 model.
- Verified outputs for medusa and lora (eagle3 fails as before).
- rbln device tensor input does not yet support speculative decoding or LoRA.
- Request for Review: please check for potential issues regarding Speculative Decoding, LoRA, or MoE models.