
Conversation

@kkimmk kkimmk commented Sep 24, 2025

🚀 Summary of Changes

  • This PR allows multiple prefill requests to be scheduled at a time.
  • This scheduling policy can improve compute utilization by filling the chunk's input positions with valid tokens instead of padding tokens (a rough sketch follows this list).
  • The feature is only partially implemented; the kernel implementations need to be updated accordingly.
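For illustration only, here is a minimal sketch of the packing idea. The names (`PrefillRequest`, `schedule_prefills`, `token_budget`) are hypothetical and not taken from this PR's scheduler code; the sketch just shows admitting pending prefill requests into one chunk until the token budget is exhausted instead of padding after a single request.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PrefillRequest:
    request_id: str
    prompt_len: int  # number of prompt tokens still waiting to be prefilled


def schedule_prefills(waiting: List[PrefillRequest],
                      token_budget: int) -> List[PrefillRequest]:
    """Pack as many pending prefill requests as fit into one chunk.

    Instead of scheduling a single prefill and padding the remaining
    positions, keep admitting requests until the chunk's token budget
    is exhausted, so the input tensor is filled with valid tokens.
    """
    scheduled: List[PrefillRequest] = []
    remaining = token_budget
    for req in waiting:
        if req.prompt_len > remaining:
            break  # or split into a partial chunk, depending on policy
        scheduled.append(req)
        remaining -= req.prompt_len
    return scheduled


# Example: a 1024-token chunk now holds three prompts instead of one + padding.
reqs = [PrefillRequest("a", 300), PrefillRequest("b", 500), PrefillRequest("c", 200)]
print([r.request_id for r in schedule_prefills(reqs, token_budget=1024)])
```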

📌 Related Issues / Tickets


✅ Type of Change

  • ✨ Feature (feature)
  • 🧠 Model support (model)
  • 🧬 Core engine changes (core)
  • 🛠 Bug fix (bug-fix)
  • ⚙️ Performance improvement (perf)
  • 🔁 Refactor or code cleanup (refactor)
  • 📄 Documentation (docs)
  • ❓ Other (other): please describe

🧪 How to Test

With this branch, the overall flow of multiple-prefills-in-a-chunk, including the inputs for the prefill attention kernel, can be analyzed. The values for slot_mapping may need to be adjusted depending on the kernel implementation (see the illustrative sketch below).
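As a rough illustration of what the packed inputs could look like, the sketch below builds a slot_mapping for several prefills concatenated into one chunk, mapping each prompt token to a KV-cache slot (block_id * block_size + offset) from that request's own block table. The function name and signature are assumptions for illustration, not the kernel's actual interface.

```python
from typing import List


def build_slot_mapping(prompt_lens: List[int],
                       block_tables: List[List[int]],
                       block_size: int) -> List[int]:
    """Illustrative slot_mapping for several prefills packed into one chunk.

    Token positions of all scheduled prompts are concatenated; each token
    is mapped to a physical KV-cache slot taken from its request's block table.
    """
    slot_mapping: List[int] = []
    for prompt_len, blocks in zip(prompt_lens, block_tables):
        for pos in range(prompt_len):
            block_id = blocks[pos // block_size]
            offset = pos % block_size
            slot_mapping.append(block_id * block_size + offset)
    return slot_mapping


# Two prompts (5 and 3 tokens) packed back to back, block_size = 4.
print(build_slot_mapping([5, 3], [[7, 2], [9]], block_size=4))
# -> [28, 29, 30, 31, 8, 36, 37, 38]
```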

  1. Run RBLN_KERNEL_MODE=triton USE_VLLM_MODEL=1 VLLM_DISABLE_COMPILE_CACHE=1 USE_VLLM_V1=0 FLASH_CAUSAL_ATTN=0 python examples/experimental/offline_inference_basic.py
  2. Verify output: With the latest compiler/kernels, an error occurs rather than generating wrong outputs. Please refer to the Notes below.
CS_GEN 
 --> L__self___model_model_layers__modules__0___input_layernorm__forward_method___self___weight_0_0


CS_GEN 
 --> L__self___model_model_layers__modules__0___self_attn_qkv_proj_weight_0_0_0

terminate called after throwing an instance of 'rbln::RuntimeError'
  what():  RBLNRuntimeError: CS_GEN 
Aborted (core dumped)

Running with FLASH_CAUSAL_ATTN enabled (the default) falls back to the original scheduling policy, where only one prefill request is scheduled at a time.
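A hedged sketch of that fallback, purely for illustration; the env-flag handling shown here is an assumption and may not match the actual code path.

```python
import os

# Assumed: FLASH_CAUSAL_ATTN read as an environment flag, default on.
flash_causal_attn = os.environ.get("FLASH_CAUSAL_ATTN", "1") == "1"

# With the flag on (the default), only one prefill is admitted per step;
# with it off, the per-step token budget decides how many prefills are packed.
max_prefills_per_step = 1 if flash_causal_attn else None
print(f"max prefills per step: {max_prefills_per_step or 'token-budget limited'}")
```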


📸 Screenshots / Logs (if applicable)


📋 Checklist

  • PR title follows Conventional Commits format
  • This PR is linked to an existing issue
  • The test method is described, and the expected result is clearly stated
  • Relevant documentation has been updated (if applicable)

💬 Notes

In a previous meeting, I shared some outputs generated by this feature implementation: it ran without errors, but the generated results were incorrect. Those results can be reproduced with earlier versions of the compiler/kernels. With recent versions, the prefill attention kernel simply raises an error instead. Please refer to this note for the results obtained with the earlier compiler/kernel versions.


@kkimmk kkimmk assigned kkimmk and unassigned kkimmk Sep 24, 2025
@rebel-jiwoopark rebel-jiwoopark added the torch.compile torch.compile based implementation label Sep 26, 2025
@rebel-jiwoopark rebel-jiwoopark changed the base branch from main to dev October 21, 2025 05:00
