fix(lora): sync with vLLM 0.18.0 and update LoRA tests (#504)
Status: Open
Conversation
…appings through execute model pipeline for ngram specdec
* fix bug related to sampler warm-up for pp
* apply formatting and add FIXME comment

Signed-off-by: wonsub kim <subang0@rebellions.ai>
Co-authored-by: wonsub kim <subang0@rebellions.ai>
…499)

Problem: When a new prefill request is scheduled, the RBLN scheduler kicks out all running decode requests and restores the full token budget. However, `num_new_tokens` was clipped using the already-reduced `token_budget` from before the kick-out, so the first prefill chunk was short by the number of tokens consumed by the decode requests (e.g. 127 instead of 128). This off-by-one misaligned all subsequent chunk positions, eventually triggering a device runtime abort (SYS_TASK_ABORTED).

Solution: Use `prefill_token_budget` instead of `token_budget` when computing `num_new_tokens` for new prefill requests.
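The budget fix described above can be sketched as follows. This is a minimal illustration, not the actual vLLM-RBLN source: the function name `clip_new_prefill_tokens` and its parameters are hypothetical, chosen to mirror the names in the commit message.

```python
# Hypothetical sketch of the scheduling fix described in the commit message.
# Names (clip_new_prefill_tokens, prefill_token_budget, ...) are illustrative,
# not the real vLLM-RBLN identifiers.

def clip_new_prefill_tokens(num_prompt_tokens: int,
                            token_budget: int,
                            prefill_token_budget: int,
                            kicks_out_decodes: bool) -> int:
    """Return how many tokens the first prefill chunk may schedule."""
    if kicks_out_decodes:
        # Running decode requests are evicted and their budget restored, so
        # the full prefill budget applies. The bug was clipping against
        # token_budget here, which the decodes had already reduced
        # (e.g. 127 instead of 128), misaligning every later chunk.
        budget = prefill_token_budget
    else:
        budget = token_budget
    return min(num_prompt_tokens, budget)

# Before the fix: min(128, 127) == 127 -> misaligned chunk positions.
# After the fix:  min(128, 128) == 128.
print(clip_new_prefill_tokens(128, 127, 128, True))   # 128
print(clip_new_prefill_tokens(128, 127, 128, False))  # 127
```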
…h tests (#491) Co-authored-by: Huijong JEONG <huijong.jeong@squeezebits.com>
Signed-off-by: Jinseok Lee <jindol21@rebellions.ai> Co-authored-by: Jaehwang Jung <jaehwang.jung@rebellions.ai>
Re-applies logic removed in 3495a41; pipeline parallel depends on it.
…e unit test directory
Collaborator: @rebel-jinhwan Could you check if any additional review is needed?
Force-pushed from 1c6f65a to 0c81ab6.
🚀 Summary of Changes
This PR focuses on LoRA test coverage and validation for the RBLN torch-compile path.
* `vllm_rbln/lora/layer.py`: realigned the RBLN LoRA embedding path with the current upstream implementation while preserving the RBLN-specific input-shape handling.
* `tests/torch_compile/e2e/v1/lora/test_basic_lora.py`: basic LoRA e2e test using a single SQL prompt.
* `examples/experimental/run_lora_test.py`: updated to use:
  * base model: `meta-llama/Llama-3.2-3B-Instruct`
  * LoRA adapter: `jeeejeee/llama32-3b-text2sql-spider`
  * `VLLM_RBLN_ENFORCE_MODEL_FP32=1`
  * `VLLM_RBLN_ENABLE_WARM_UP=0`
  * `VLLM_RBLN_USE_VLLM_MODEL=1`
  * `VLLM_DISABLE_COMPILE_CACHE=0`

📌 Related Issues / Tickets
✅ Type of Change
* [ ] release
* [ ] feature
* [ ] model
* [ ] core
* [x] fix
* [ ] perf
* [ ] refactor
* [ ] docs
* [ ] other: please describe

🧪 How to Test
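The updated example relies on the environment settings listed in the summary above. A minimal shell setup might look like this; the actual invocation is commented out because it requires RBLN hardware, and the assumption that `run_lora_test.py` needs no extra CLI flags is mine, not stated in the PR:

```shell
# Environment settings taken from the PR summary for run_lora_test.py.
export VLLM_RBLN_ENFORCE_MODEL_FP32=1
export VLLM_RBLN_ENABLE_WARM_UP=0
export VLLM_RBLN_USE_VLLM_MODEL=1
export VLLM_DISABLE_COMPILE_CACHE=0

# Then run the example on an RBLN host (assumed to take no extra flags):
# python examples/experimental/run_lora_test.py
echo "warm-up disabled: VLLM_RBLN_ENABLE_WARM_UP=$VLLM_RBLN_ENABLE_WARM_UP"
```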
* `pytest tests/torch_compile/unit/v1/lora`
* `pytest tests/torch_compile/e2e/v1/lora/test_basic_lora.py`
* `python examples/experimental/run_lora_test.py`

Verify output:
`25 failed, 111 passed`. Known failures:

* `test_embeddings` fails in `test_layers.py`, related to `torch.compile(..., backend="rbln")`.
* `test_lora_functions.py` fails locally with `RuntimeError: RBLN_DEVICES environment variable changed at runtime. Initial value: , Current value: 0`.

📸 Screenshots / Logs (if applicable)
📋 Checklist
💬 Notes
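One note on the local `test_lora_functions.py` failure: a plausible, unverified workaround is to pin `RBLN_DEVICES` before pytest starts, so the runtime never sees its value change. This sketch assumes the RBLN runtime captures the variable once at startup; that behavior is inferred from the error message, not confirmed.

```shell
# Pin RBLN_DEVICES before any RBLN runtime initialization (assumption:
# the runtime reads this value once at startup and aborts if it changes).
export RBLN_DEVICES=0
echo "RBLN_DEVICES=$RBLN_DEVICES"

# Then re-run the failing test on an RBLN host, e.g.:
# pytest tests/torch_compile/unit/v1/lora/test_lora_functions.py
```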