feature: enable P/D disaggregation with NIXL host KV transfer#477
Draft
rebel-ykchoi wants to merge 2 commits intodevfrom
Draft
feature: enable P/D disaggregation with NIXL host KV transfer#477rebel-ykchoi wants to merge 2 commits intodevfrom
rebel-ykchoi wants to merge 2 commits intodevfrom
Conversation
wire vLLM KV transfer to a RBLN-specific NIXL connector and host-side
buffers so prefill/decode can run on separate engines with H2H transfer.
KV connector / registration
- add RblnNixlConnector (scheduler/worker) extending upstream NixlConnector:
- register connector name "RblnNixlConnector" in kv_connector factory.
Platform
- expose NIXL hints: get_nixl_supported_devices (rbln -> cpu) and
get_nixl_memory_type ("DRAM").
Scheduler (rbln_scheduler.py)
- handle kv_consumer request to be scheduled with other requests in decode
stage
Model runner (rbln_model_runner.py)
- override maybe_get_kv_connector_output(..., wait_for_save)
using last prefill chunk.
- replace generic copy_kv_blocks with rbln_copy_kv_blocks using runtime
_update_kv_cache / _fetch_kv_cache
- bind_kv_cache_name + per-layer names for mark_static_address when compiling.
Attention backend (flash_attention.py)
- Report backend name as FLASH_ATTN for upstream compatibility.
Examples
- add experimental examples/experimental/pd_disaggregation/toy_proxy_server.py
(FastAPI proxy routing chat completions to prefill vs decode HTTP backends).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
🚀 Summary of Changes
wire vLLM KV transfer to a RBLN-specific NIXL connector and host-side buffers so prefill/decode can run on separate engines with H2H transfer.
KV connector / registration
Platform
Scheduler (rbln_scheduler.py)
Model runner (rbln_model_runner.py)
Attention backend (flash_attention.py)
Examples
📌 Related Issues / Tickets
✅ Type of Change
release)feature)model)core)fix)perf)refactor)docs)other): please describe🧪 How to Test
.........📸 Screenshots / Logs (if applicable)
📋 Checklist
💬 Notes