fix(lora): sync with vLLM 0.18.0 and update LoRA tests#504

Open
junstar92 wants to merge 24 commits into dev-0.18 from fix-lora-test
Conversation


@junstar92 junstar92 commented Apr 2, 2026

🚀 Summary of Changes

This PR focuses on LoRA test coverage and validation for the RBLN torch-compile path.

  • Updated vllm_rbln/lora/layer.py to realign the RBLN LoRA embedding path with the current upstream implementation while preserving the RBLN-specific input-shape handling.
  • Added a basic LoRA E2E smoke test at tests/torch_compile/e2e/v1/lora/test_basic_lora.py using a single SQL prompt.
  • Updated the experimental LoRA example at examples/experimental/run_lora_test.py to use:
    • meta-llama/Llama-3.2-3B-Instruct
    • jeeejeee/llama32-3b-text2sql-spider
  • Applied the same runtime configuration in tests via monkey patching:
    • VLLM_RBLN_ENFORCE_MODEL_FP32=1
    • VLLM_RBLN_ENABLE_WARM_UP=0
    • VLLM_RBLN_USE_VLLM_MODEL=1
    • VLLM_DISABLE_COMPILE_CACHE=0
  • Relaxed TorchDynamo recompile limits in the LoRA unit tests to reduce excessive recompilation and remove fallbacks during test runs.
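The runtime configuration applied via monkey patching could look roughly like the sketch below. The environment-variable names come from this PR description; the helper function itself and its name are illustrative, not the actual test code. Relaxing the TorchDynamo recompile limit, also mentioned above, is shown as a comment since it needs a live `torch` install.

```python
# Illustrative sketch of the runtime configuration the LoRA tests apply.
# The env var names are from the PR description; apply_lora_test_env is
# a hypothetical helper, not the actual test fixture.
import os

RBLN_LORA_TEST_ENV = {
    "VLLM_RBLN_ENFORCE_MODEL_FP32": "1",
    "VLLM_RBLN_ENABLE_WARM_UP": "0",
    "VLLM_RBLN_USE_VLLM_MODEL": "1",
    "VLLM_DISABLE_COMPILE_CACHE": "0",
}

def apply_lora_test_env(monkeypatch=None):
    """Set the RBLN LoRA test environment.

    When pytest's monkeypatch fixture is passed, the changes are undone
    automatically after the test; otherwise os.environ is set directly.
    """
    for key, value in RBLN_LORA_TEST_ENV.items():
        if monkeypatch is not None:
            monkeypatch.setenv(key, value)
        else:
            os.environ[key] = value
    return dict(RBLN_LORA_TEST_ENV)

# Relaxing TorchDynamo's recompile limit (as the unit tests do) would be
# along the lines of:
#   torch._dynamo.config.cache_size_limit = 64
```

In a pytest module this would typically be wired up as an autouse fixture so every LoRA test runs under the same configuration.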

📌 Related Issues / Tickets

  • Resolves #
  • Related to #

✅ Type of Change

  • 🚀 Release (release)
  • ✨ Feature (feature)
  • 🧠 Model support (model)
  • 🧬 Core engine changes (core)
  • 🛠 Bug fix (fix)
  • ⚙️ Performance improvement (perf)
  • 🔁 Refactor or code cleanup (refactor)
  • 📄 Documentation (docs)
  • ❓ Other (other): please describe

🧪 How to Test

  1. Run unit tests: pytest tests/torch_compile/unit/v1/lora
  2. Run the new E2E smoke test: pytest tests/torch_compile/e2e/v1/lora/test_basic_lora.py
  3. Run the example script: python examples/experimental/run_lora_test.py

Verify output:

  • The basic E2E test should produce non-empty outputs for both base and LoRA generation, and the outputs should differ.
  • The example script should run with the updated model/LoRA pair.
  • Known local unit-test result on this branch: 25 failed, 111 passed.
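The expected-output condition for the E2E smoke test (non-empty base and LoRA generations that differ from each other) can be expressed as a small checker. This is a hedged sketch of the assertion logic only, not the actual test in `test_basic_lora.py`; the function name is hypothetical.

```python
# Hypothetical checker for the basic LoRA E2E expectation: both the base
# and the LoRA generations must be non-empty, and applying the LoRA
# adapter must actually change the output.
def check_lora_outputs(base_text: str, lora_text: str) -> None:
    assert base_text.strip(), "base generation is empty"
    assert lora_text.strip(), "LoRA generation is empty"
    assert base_text != lora_text, "LoRA output identical to base output"
```

In the real test, `base_text` and `lora_text` would come from two `llm.generate(...)` calls on the same SQL prompt, one without and one with the LoRA adapter.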

Known failures:

  • test_embeddings in test_layers.py fails.
    • The same computation passes in eager mode.
    • The failure appears only after torch.compile(..., backend="rbln").
    • This points to a compile/lowering issue rather than a pure PyTorch math issue.
  • test_lora_functions.py fails locally with:
    • RuntimeError: RBLNRuntimeError: RBLN_DEVICES environment variable changed at runtime. Initial value: , Current value: 0
    • This does not appear to be a test-logic issue.
    • It may be environment-specific and only reproducible in my environment.
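The eager-vs-compiled discrepancy described above can be isolated with a small parity check: run the layer once in eager mode as the reference, once through `torch.compile`, and compare. This is a generic debugging sketch, not the failing test itself; the backend string and tolerances are assumptions.

```python
# Generic parity check between an eager module and its torch.compile'd
# version. backend="rbln" reproduces the reported failure mode; any
# registered backend can be substituted. Tolerances are assumptions.
import torch

def compare_eager_vs_compiled(layer, inputs, backend="rbln",
                              rtol=1e-5, atol=1e-5):
    expected = layer(*inputs)                        # eager reference
    compiled = torch.compile(layer, backend=backend)
    actual = compiled(*inputs)                       # compiled path
    torch.testing.assert_close(actual, expected, rtol=rtol, atol=atol)
```

Narrowing the `layer` argument down from the full embedding path to individual ops is one way to locate where the compiled output first diverges.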

📸 Screenshots / Logs (if applicable)


📋 Checklist

  • PR title follows Conventional Commits format
  • This PR is linked to an existing issue
  • The test method is described, and the expected result is clearly stated
  • Relevant documentation has been updated (if applicable)

💬 Notes

  • The example still shows broken or corrupted output in compiled mode.
  • The same example behaves normally in eager mode, so this also appears to be a compile-specific issue.
  • The exact failure point in the compiled path is still unclear and has been difficult to debug from our side.
  • Review and investigation from the Rebellions compiler/runtime side is likely required.

junstar92 and others added 15 commits March 24, 2026 04:26
…appings through execute model pipeline for ngram specdec
* fix bug related to sampler warm-up for pp

Signed-off-by: wonsub kim <subang0@rebellions.ai>

* apply formatting and add FIXME comment

---------

Signed-off-by: wonsub kim <subang0@rebellions.ai>
Co-authored-by: wonsub kim <subang0@rebellions.ai>
…499)

Problem:
When a new prefill request is scheduled, the RBLN scheduler kicks out
all running decode requests and restores the full token budget. However,
num_new_tokens was clipped using the already-reduced token_budget before
the kick-out, causing the first prefill chunk to be short by the number
of tokens consumed by decode requests (e.g. 127 instead of 128).

This off-by-one misaligned all subsequent chunk positions, eventually
triggering a device runtime abort (SYS_TASK_ABORTED).

Solution:
Use prefill_token_budget instead of token_budget when computing
num_new_tokens for new prefill requests.
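The fix described in this commit message amounts to clipping against the restored prefill budget rather than the already-reduced one. The sketch below is reconstructed from the message; the function and parameter names are assumptions, not the actual vLLM-RBLN scheduler code.

```python
# Illustrative sketch of the scheduler fix (names are assumptions based
# on the commit message, not the real scheduler internals).
def clip_prefill_tokens(num_new_tokens: int,
                        prefill_token_budget: int) -> int:
    # Before the fix, the already-reduced token_budget (full budget minus
    # tokens consumed by running decode requests) was used here, so the
    # first prefill chunk could come out short (e.g. 127 instead of 128),
    # misaligning every subsequent chunk position and eventually
    # triggering SYS_TASK_ABORTED on the device.
    return min(num_new_tokens, prefill_token_budget)
```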
…h tests (#491)

Co-authored-by: Huijong JEONG <huijong.jeong@squeezebits.com>
@junstar92 junstar92 self-assigned this Apr 2, 2026
@junstar92 junstar92 added the torch.compile torch.compile based implementation label Apr 2, 2026

rebel-jinhwan commented Apr 2, 2026

rebel-jindol21 and others added 2 commits April 2, 2026 15:27
Signed-off-by: Jinseok Lee <jindol21@rebellions.ai>
Co-authored-by: Jaehwang Jung <jaehwang.jung@rebellions.ai>

@rebel-jiwoopark rebel-jiwoopark left a comment


lgtm

@rebel-jiwoopark

@rebel-jinhwan Could you check if any additional review is needed?

@rebel-jiwoopark rebel-jiwoopark self-requested a review April 8, 2026 07:30
@rebel-jiwoopark rebel-jiwoopark force-pushed the dev-0.18 branch 2 times, most recently from 1c6f65a to 0c81ab6 Compare April 9, 2026 12:16