
feature: enable P/D disaggregation with NIXL host KV transfer #477

Draft
rebel-ykchoi wants to merge 2 commits into dev from feat_pd_disag

Conversation

@rebel-ykchoi (Contributor)

🚀 Summary of Changes

Wire vLLM KV transfer to an RBLN-specific NIXL connector and host-side buffers so prefill and decode can run on separate engines with host-to-host (H2H) KV transfer.

KV connector / registration

  • add RblnNixlConnector (scheduler and worker sides), extending the upstream NixlConnector.
  • register the connector under the name "RblnNixlConnector" in the kv_connector factory.
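
The registration pattern above can be sketched with a toy factory. Everything here is a stand-in for illustration: the base class, the registry, and the method shown are hypothetical simplifications, not vLLM's actual KVConnectorFactory API; only the connector names mirror the PR.

```python
# Toy sketch of name-based connector registration. NixlConnector here is a
# stand-in for the upstream connector class that RblnNixlConnector extends.
class NixlConnector:
    """Stand-in for the upstream NIXL KV connector."""

    def get_num_new_matched_tokens(self, request) -> int:
        # Placeholder behavior for the sketch.
        return 0


class RblnNixlConnector(NixlConnector):
    """RBLN-specific connector extending the upstream one (as in the PR)."""


_CONNECTOR_REGISTRY: dict[str, type] = {}


def register_connector(name: str, cls: type) -> None:
    """Register a connector class under a string name."""
    _CONNECTOR_REGISTRY[name] = cls


def create_connector(name: str):
    """Instantiate a registered connector by name."""
    return _CONNECTOR_REGISTRY[name]()


# Mirrors the registration of "RblnNixlConnector" in the kv_connector factory.
register_connector("RblnNixlConnector", RblnNixlConnector)
```

The point of the name-keyed registry is that configuration can select the RBLN connector purely by string, without the engine importing the class directly.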

Platform

  • expose NIXL hints: get_nixl_supported_devices (rbln -> cpu) and get_nixl_memory_type ("DRAM").
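
The two platform hints might look like the following. The class and the exact return shapes are assumptions for illustration; only the method names and the rbln -> cpu / "DRAM" values come from the PR text.

```python
# Hypothetical sketch of the RBLN platform's NIXL hints: transfers for "rbln"
# devices are staged through host ("cpu") buffers, and the NIXL memory type
# reported is "DRAM". Return shapes are guesses, not the real Platform API.
class RblnPlatform:
    @classmethod
    def get_nixl_supported_devices(cls) -> dict[str, tuple[str, ...]]:
        # Map device type -> peer device types reachable via NIXL.
        return {"rbln": ("cpu",)}

    @classmethod
    def get_nixl_memory_type(cls) -> str:
        return "DRAM"
```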

Scheduler (rbln_scheduler.py)

  • allow kv_consumer requests to be scheduled together with other requests in the decode stage.
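
A minimal sketch of that scheduling behavior, under the assumption that a kv_consumer request (one receiving its KV from a prefill engine) simply joins the ordinary decode batch rather than being held in a separate phase. The Request shape and function are hypothetical.

```python
# Toy decode-stage scheduler: kv_consumer requests are batched together with
# regular decode requests instead of being deferred. All names are illustrative.
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    is_kv_consumer: bool = False  # True if KV arrives from a prefill engine


def schedule_decode_batch(waiting: list[Request], max_batch: int) -> list[Request]:
    batch: list[Request] = []
    for req in waiting:
        if len(batch) >= max_batch:
            break
        # No special-casing: kv_consumer requests are admitted like any other.
        batch.append(req)
    return batch
```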

Model runner (rbln_model_runner.py)

  • override maybe_get_kv_connector_output(..., wait_for_save) using the last prefill chunk.
  • replace the generic copy_kv_blocks with rbln_copy_kv_blocks, which goes through the runtime's _update_kv_cache / _fetch_kv_cache.
  • add bind_kv_cache_name plus per-layer names for mark_static_address when compiling.
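
The block-copy replacement can be sketched as follows. The runtime object here is a dict-backed stand-in; only the hook names _update_kv_cache / _fetch_kv_cache come from the PR, and the src -> dst mapping signature is an assumption.

```python
# Toy runtime exposing the two hooks named in the PR. In the real system these
# would move KV data between device and host-side NIXL buffers.
class FakeRuntime:
    def __init__(self):
        self._kv: dict[int, list] = {}

    def _update_kv_cache(self, block_id: int, data: list) -> None:
        self._kv[block_id] = list(data)

    def _fetch_kv_cache(self, block_id: int) -> list:
        return self._kv[block_id]


def rbln_copy_kv_blocks(runtime, src_to_dst: dict[int, int]) -> None:
    """Copy KV blocks via the runtime's fetch/update hooks (hypothetical shape)."""
    for src, dst in src_to_dst.items():
        # Fetch the source block, then write it back under the destination id.
        runtime._update_kv_cache(dst, runtime._fetch_kv_cache(src))
```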

Attention backend (flash_attention.py)

  • report backend name as FLASH_ATTN for upstream compatibility.
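
This change is small enough to show in full spirit: the backend advertises the upstream name so code that special-cases "FLASH_ATTN" keeps working. The base class is a stand-in for vLLM's attention backend interface.

```python
# Minimal sketch: the RBLN flash-attention backend reports the upstream name
# "FLASH_ATTN" for compatibility. AttentionBackend here is a toy base class.
class AttentionBackend:
    @staticmethod
    def get_name() -> str:
        raise NotImplementedError


class RblnFlashAttentionBackend(AttentionBackend):
    @staticmethod
    def get_name() -> str:
        return "FLASH_ATTN"
```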

Examples

  • add an experimental examples/experimental/pd_disaggregation/toy_proxy_server.py (a FastAPI proxy that routes chat completions to the prefill and decode HTTP backends).
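
The routing decision in such a proxy can be sketched without FastAPI. The backend URLs and the max_tokens=1 trick (forcing the prefill backend to stop after prefill) are assumptions about how P/D disaggregation proxies are commonly built, not a description of toy_proxy_server.py's exact code.

```python
# Toy sketch of P/D request splitting: the request is first sent to the prefill
# backend with max_tokens forced to 1, then to the decode backend unmodified.
# URLs are hypothetical placeholders.
PREFILL_URL = "http://localhost:8100/v1/chat/completions"  # assumed
DECODE_URL = "http://localhost:8200/v1/chat/completions"   # assumed


def split_request(body: dict) -> tuple[tuple[str, dict], tuple[str, dict]]:
    """Return (prefill target, decode target) for one chat-completion body."""
    prefill_body = dict(body)
    prefill_body["max_tokens"] = 1   # run prefill only, no real generation
    prefill_body["stream"] = False   # prefill response is discarded
    return (PREFILL_URL, prefill_body), (DECODE_URL, body)
```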

What does this PR do? What feature, fix, or improvement does it bring?


📌 Related Issues / Tickets

  • Resolves #
  • Related to #

✅ Type of Change

  • 🚀 Release (release)
  • ✨ Feature (feature)
  • 🧠 Model support (model)
  • 🧬 Core engine changes (core)
  • 🛠 Bug fix (fix)
  • ⚙️ Performance improvement (perf)
  • 🔁 Refactor or code cleanup (refactor)
  • 📄 Documentation (docs)
  • ❓ Other (other): please describe

🧪 How to Test

  1. Run ...
  2. Verify output: ...
  3. Edge case tested: ...

📸 Screenshots / Logs (if applicable)


📋 Checklist

  • PR title follows Conventional Commits format
  • This PR is linked to an existing issue
  • The test method is described, and the expected result is clearly stated
  • Relevant documentation has been updated (if applicable)

💬 Notes


@rebel-jiwoopark added the torch.compile (torch.compile based implementation) label on Apr 1, 2026.