
Conversation

@kkimmk kkimmk commented Sep 24, 2025

🚀 Summary of Changes

  • This PR allows multiple prefill requests to be scheduled at a time.
  • This scheduling policy can improve compute utilization by filling the chunk's input positions with valid tokens instead of padding tokens (a rough sketch follows this list).
  • The feature is only partially implemented; the kernel implementations need to be updated accordingly.
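For illustration only, here is a minimal sketch of the packing idea. The names (`PrefillRequest`, `schedule_prefills`, `token_budget`) are hypothetical and not taken from this PR's scheduler code; the sketch just shows admitting pending prefill requests into one chunk until the token budget is exhausted instead of padding after a single request.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PrefillRequest:
    request_id: str
    prompt_len: int  # number of prompt tokens still waiting to be prefilled


def schedule_prefills(waiting: List[PrefillRequest],
                      token_budget: int) -> List[PrefillRequest]:
    """Pack as many pending prefill requests as fit into one chunk.

    Instead of scheduling a single prefill and padding the remaining
    positions, keep admitting requests until the chunk's token budget
    is exhausted, so the input tensor is filled with valid tokens.
    """
    scheduled: List[PrefillRequest] = []
    remaining = token_budget
    for req in waiting:
        if req.prompt_len > remaining:
            break  # or split into a partial chunk, depending on policy
        scheduled.append(req)
        remaining -= req.prompt_len
    return scheduled


# Example: a 1024-token chunk now holds three prompts instead of one + padding.
reqs = [PrefillRequest("a", 300), PrefillRequest("b", 500), PrefillRequest("c", 200)]
print([r.request_id for r in schedule_prefills(reqs, token_budget=1024)])
```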

📌 Related Issues / Tickets


✅ Type of Change

  • ✨ Feature (feature)
  • 🧠 Model support (model)
  • 🧬 Core engine changes (core)
  • 🛠 Bug fix (bug-fix)
  • ⚙️ Performance improvement (perf)
  • 🔁 Refactor or code cleanup (refactor)
  • 📄 Documentation (docs)
  • ❓ Other (other): please describe

🧪 How to Test

With this branch, the overall flow of multiple-prefills-in-a-chunk, including the inputs for the prefill attention kernel, can be analyzed. The values for slot_mapping may need to be adjusted depending on the kernel implementation (see the illustrative sketch below).
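As a rough illustration of what the packed inputs could look like, the sketch below builds a slot_mapping for several prefills concatenated into one chunk, mapping each prompt token to a KV-cache slot (block_id * block_size + offset) from that request's own block table. The function name and signature are assumptions for illustration, not the kernel's actual interface.

```python
from typing import List


def build_slot_mapping(prompt_lens: List[int],
                       block_tables: List[List[int]],
                       block_size: int) -> List[int]:
    """Illustrative slot_mapping for several prefills packed into one chunk.

    Token positions of all scheduled prompts are concatenated; each token
    is mapped to a physical KV-cache slot taken from its request's block table.
    """
    slot_mapping: List[int] = []
    for prompt_len, blocks in zip(prompt_lens, block_tables):
        for pos in range(prompt_len):
            block_id = blocks[pos // block_size]
            offset = pos % block_size
            slot_mapping.append(block_id * block_size + offset)
    return slot_mapping


# Two prompts (5 and 3 tokens) packed back to back, block_size = 4.
print(build_slot_mapping([5, 3], [[7, 2], [9]], block_size=4))
# -> [28, 29, 30, 31, 8, 36, 37, 38]
```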

  1. Run RBLN_KERNEL_MODE=triton USE_VLLM_MODEL=1 VLLM_DISABLE_COMPILE_CACHE=1 USE_VLLM_V1=0 FLASH_CAUSAL_ATTN=0 python examples/experimental/offline_inference_basic.py
  2. Verify output: With the latest compiler/kernels, an error occurs rather than generating wrong outputs. Please refer to the Notes below.
CS_GEN 
 --> L__self___model_model_layers__modules__0___input_layernorm__forward_method___self___weight_0_0


CS_GEN 
 --> L__self___model_model_layers__modules__0___self_attn_qkv_proj_weight_0_0_0

terminate called after throwing an instance of 'rbln::RuntimeError'
  what():  RBLNRuntimeError: CS_GEN 
Aborted (core dumped)

Running with FLASH_CAUSAL_ATTN enabled (the default) falls back to the original scheduling policy, where only one prefill request is scheduled at a time.
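A hedged sketch of that fallback, purely for illustration; the env-flag handling shown here is an assumption and may not match the actual code path.

```python
import os

# Assumed: FLASH_CAUSAL_ATTN read as an environment flag, default on.
flash_causal_attn = os.environ.get("FLASH_CAUSAL_ATTN", "1") == "1"

# With the flag on (the default), only one prefill is admitted per step;
# with it off, the per-step token budget decides how many prefills are packed.
max_prefills_per_step = 1 if flash_causal_attn else None
print(f"max prefills per step: {max_prefills_per_step or 'token-budget limited'}")
```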


📸 Screenshots / Logs (if applicable)


📋 Checklist

  • PR title follows Conventional Commits format
  • This PR is linked to an existing issue
  • The test method is described, and the expected result is clearly stated
  • Relevant documentation has been updated (if applicable)

💬 Notes

In a previous meeting, I shared some outputs generated by this feature implementation: it ran without errors, but the generated results were incorrect. Those results can be reproduced with earlier versions of the compiler/kernels. With recent versions, the prefill attention kernel simply raises an error instead. Please refer to this note for the results obtained with the earlier compiler/kernel versions.


@kkimmk kkimmk assigned kkimmk and unassigned kkimmk Sep 24, 2025
@rebel-jiwoopark rebel-jiwoopark added the torch.compile torch.compile based implementation label Sep 26, 2025
@rebel-jiwoopark rebel-jiwoopark changed the base branch from main to dev October 21, 2025 05:00
