fix: align sampler batch size with decode batch size#474

Open
rebel-daeyang wants to merge 6 commits into dev from sampler_batch_size_to_decode_batch_size
Conversation

@rebel-daeyang

@rebel-daeyang rebel-daeyang commented Mar 24, 2026

🚀 Summary of Changes

Previously, the decoder output logits were sliced before being passed to the sampler. However, slicing rbln device tensors triggers unnecessary device-to-host (d2h) and host-to-device (h2d) transfers, which significantly increases latency.

To resolve this performance issue, this PR removes the logits slicing logic and updates the sampler to accept the full logits tensor, whose batch dimension matches the decode batch size.
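The idea can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name, shapes, and the post-sampling row trim are all hypothetical, and plain CPU torch tensors stand in for rbln device tensors.

```python
import torch

def sample_full_batch(logits: torch.Tensor, num_running: int,
                      temperature: float = 0.0) -> torch.Tensor:
    # logits: [decode_batch_size, vocab] -- the full decoder output,
    # passed to the sampler without slicing off padded rows first.
    if temperature == 0.0:
        tokens = torch.argmax(logits, dim=-1)  # greedy over the padded batch
    else:
        probs = torch.softmax(logits / temperature, dim=-1)
        tokens = torch.multinomial(probs, num_samples=1).squeeze(-1)
    # Only the first `num_running` rows belong to live requests; the
    # rest are padding and are discarded after sampling.
    return tokens[:num_running]

full_logits = torch.randn(8, 128)  # decode batch 8, vocab 128
out = sample_full_batch(full_logits, num_running=3)
```

Sampling over a few padded rows is cheap compared to the d2h/h2d round trip that slicing the device tensor would cost.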


📌 Related Issues / Tickets

  • Resolves #
  • Related to #

✅ Type of Change

  • 🚀 Release (release)
  • ✨ Feature (feature)
  • 🧠 Model support (model)
  • 🧬 Core engine changes (core)
  • 🛠 Bug fix (fix)
  • ⚙️ Performance improvement (perf)
  • 🔁 Refactor or code cleanup (refactor)
  • 📄 Documentation (docs)
  • ❓ Other (other): please describe

🧪 How to Test

  1. Run ...
  2. Verify output: ...
  3. Edge case tested: ...

📸 Screenshots / Logs (if applicable)


📋 Checklist

  • PR title follows Conventional Commits format
  • This PR is linked to an existing issue
  • The test method is described, and the expected result is clearly stated
  • Relevant documentation has been updated (if applicable)

💬 Notes

Verified outputs for batch sizes 1–64, temperature/top_p/top_k sampling, and chunked prefill with the llama3.2 model.
Verified outputs for medusa and lora (eagle3 fails, as before).
rbln device tensor input does not yet support speculative decoding or LoRA.
Request for review: please check for potential issues regarding speculative decoding, LoRA, or MoE models.


@rebel-daeyang rebel-daeyang changed the title fix: align sampler batch size with decode batch size [WIP] fix: align sampler batch size with decode batch size Mar 24, 2026
@rebel-daeyang rebel-daeyang changed the title [WIP] fix: align sampler batch size with decode batch size (WIP) fix: align sampler batch size with decode batch size Mar 24, 2026
@rebel-daeyang rebel-daeyang changed the title (WIP) fix: align sampler batch size with decode batch size fix: align sampler batch size with decode batch size Mar 25, 2026
Comment on lines 1926 to 1956
# compile sampler for all possible decode batches
max_decode_batch = self.bucketing_manager.decode_batch_buckets[-1]
for decode_batch in range(1, max_decode_batch + 1):
    dummy_decode_requests = []
    dummy_decode_num_scheduled_tokens = {}
    for _ in range(decode_batch):
        self._add_dummy_requests(
            requests=dummy_decode_requests,
            num_scheduled_tokens=dummy_decode_num_scheduled_tokens,
            total_tokens=decode_max_seq_len,
            num_computed_tokens=decode_max_seq_len,
            num_kv_cache_groups=num_kv_cache_groups,
            sampling_params=None
            if self.is_pooling_model
            else SamplingParams(temperature=0.0),
            pooling_params=PoolingParams(
                task=self.get_supported_pooling_tasks()[0]
            )
            if self.is_pooling_model
            else None,
        )
    so, cso = self._make_dummy_scheduler_outputs(
        dummy_decode_requests,
        dummy_decode_num_scheduled_tokens,
        num_kv_cache_groups,
    )
    current_intermediate_tensors = self.decode_intermediate_tensors.get(
        decode_batch
    )
    assert current_intermediate_tensors is not None
    self._execute_dummy_requests(so, cso, current_intermediate_tensors)
Contributor


Please remove the code that compiles the sampler for every num_tokens in the decode stage, as it is no longer necessary.

@rebel-daeyang
Author

I noticed that applying this change would also remove logits slicing for CPU tensor inputs, which decreases sampler performance in that path.
I will revisit this once device tensor input performance surpasses the current CPU tensor input performance.
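The asymmetry above can be illustrated with plain torch on CPU: basic slicing is a zero-copy view, so slicing CPU logits before sampling costs nothing, whereas on an accelerator backend the same operation can require a d2h copy plus an h2d upload. This sketch only demonstrates the CPU side; the device-transfer behavior is the rbln-specific cost described in the PR summary.

```python
import torch

full = torch.randn(8, 128)   # full decode-batch logits on CPU
sliced = full[:3]            # basic slicing returns a zero-copy view

# The view shares the same underlying storage as the full tensor,
# so no data is moved -- this is why the sliced CPU path stays fast.
same_storage = sliced.data_ptr() == full.data_ptr()
```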
