feature(nixl): support nixl-rbln by rebel-yskim · Pull Request #640 · RBLN-SW/vllm-rbln

rebel-yskim · 2026-06-02T06:48:15Z

Add a sibling of RblnNixlConnector that registers KV cache tensors directly with NIXL through the nixl-rbln adapter, skipping the CPU host-staging buffer and the D2H/H2D bounce copies that the current connector pays per block transfer.

Layout:
- RblnNixlDirectConnector inherits from RblnNixlConnector, swaps the worker for RblnNixlDirectConnectorWorker.
- The worker overrides initialize_host_xfer_buffer (no-op), set_host_xfer_buffer_ops (no-op), register_kv_caches (calls nixl_rbln.register per layer), and deregister_kv_caches.
- Not registered in kv_connector/factory.py on purpose: opt-in only, so existing deployments keep the host-bounce path until a consumer flips kv_transfer_config.kv_connector to "RblnNixlDirectConnector".

Requires:
- nixl-rbln package installed in the worker's venv (late-import check in __init__ produces an actionable error if missing).
- NIC capable of ibv_reg_dmabuf_mr (Mellanox CX-5+ on tested firmwares). nixl-rbln/docs/dmabuf-fd-handoff.md tracks the follow-up that would remove this requirement.

End-to-end transfer verification against a P/D disagg scenario is out of scope of this commit: the dev host's NICs (Intel E810, Broadcom Thor) decline ibv_reg_dmabuf_mr with EOPNOTSUPP, so a runtime smoke needs to land on a Mellanox host. The class structure and import surface are exercised today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
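The late-import check mentioned in this commit could look roughly like the sketch below. It is not the PR's code: `require_optional_module` and `DirectConnectorSketch` are hypothetical names, and the only thing taken from the commit message is the idea of importing the optional package inside `__init__` and failing with an actionable message.

```python
import importlib
from types import ModuleType


def require_optional_module(module_name: str, install_hint: str) -> ModuleType:
    """Import module_name lazily, or raise an ImportError that says how to fix it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this connector but is not "
            f"importable in the current environment. {install_hint}"
        ) from exc


class DirectConnectorSketch:
    """Hypothetical stand-in for a connector with an optional dependency."""

    def __init__(self, module_name: str = "nixl_rbln") -> None:
        # Late import: the check runs only when the connector is constructed,
        # so deployments that never select it do not need the package.
        self.backend = require_optional_module(
            module_name,
            "Install the nixl-rbln package into the worker's virtual environment.",
        )
```

Keeping the import out of module scope means that importing the connector module itself never fails; only constructing the connector does.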

…ctor

Two interim changes toward the device-to-device connector (SR-122):

1. platform.get_nixl_supported_devices -> {"cpu": ("cpu", "rbln")}. RBLN reports device_type="cpu", so NIXL looks up _NIXL_SUPPORTED_DEVICE["cpu"]; the old {"rbln": ("cpu",)} was a dead key (never matched device_type) and left "rbln" kv_buffer rejected with "cpu with rbln kv_buffer is not supported". Now the device-to-device path can pass that gate. "cpu" kv_buffer stays allowed for the host-bounce path.
2. RblnNixlConnectorWorker asserts kv_buffer_device != "rbln": this connector is host-bounce only; rbln must go through RblnNixlDirectConnector. (See nixl-rbln/docs/direct-connector-impl-map.md; the assert may move to connector __init__ when the Direct connector lands, so it doesn't collide with a Direct worker that subclasses this one.)

The host-bounce path (kv_buffer=cpu) is unchanged and gsm8k-verified. The Direct connector itself (rbln_nixl_direct_connector.py) is still a design template; register_kv_caches override is the remaining work, mapped out in the impl doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
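To make the "dead key" argument concrete, here is a small model of the gate. It assumes a default table that already allows cpu -> cpu and is merged with the platform's entries; `BASE_TABLE`, `merged_table`, and `check_kv_buffer_device` are illustrative names, not the upstream implementation.

```python
from typing import Dict, Tuple

# Assumed default: a "cpu" device type may use a "cpu" kv_buffer.
BASE_TABLE: Dict[str, Tuple[str, ...]] = {"cpu": ("cpu",)}


def merged_table(platform_entries: Dict[str, Tuple[str, ...]]) -> Dict[str, Tuple[str, ...]]:
    """Overlay the platform's supported-device entries on the default table."""
    table = dict(BASE_TABLE)
    table.update(platform_entries)
    return table


def check_kv_buffer_device(
    table: Dict[str, Tuple[str, ...]], device_type: str, kv_buffer_device: str
) -> None:
    """Raise if kv_buffer_device is not allowed for the reported device_type."""
    if kv_buffer_device not in table.get(device_type, ()):
        raise RuntimeError(
            f"{device_type} with {kv_buffer_device} kv_buffer is not supported"
        )


old = merged_table({"rbln": ("cpu",)})        # key never matches device_type="cpu"
new = merged_table({"cpu": ("cpu", "rbln")})  # key matches what RBLN reports

check_kv_buffer_device(new, "cpu", "rbln")    # passes the gate
check_kv_buffer_device(new, "cpu", "cpu")     # host-bounce path still allowed
```

With the old entry, the lookup for `device_type="cpu"` never sees the `"rbln"` key, so a `"rbln"` kv_buffer is rejected no matter what that key contains.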

Make the device-to-device NIXL connector work end-to-end: the prefill NPU KV cache is read directly by decode over RoCE/RDMA, no CPU host-staging bounce.

- factory: register RblnNixlDirectConnector (opt-in)
- platform: add "rbln" device_type key (device-tensor mode reports device_type="rbln", not "cpu")
- rbln_nixl_connector: move the kv_buffer_device guard to the connector __init__ so the direct worker can subclass the host-bounce worker without inheriting its assert
- rbln_nixl_direct_connector: register KV-cache vmem directly. Resolve the owning RblnContext device id, defer registration until after warm-up (the physical view is allocated by the warm-up forward), register one whole-entry MR per layer (RBLN dma-buf export only accepts allocation-base dvas) while keeping the K/V split in the transfer descriptors
- rbln_worker: finalize the deferred KV registration after warm_up_model()

Requires VLLM_RBLN_USE_DEVICE_TENSOR=1.

Verified: gsm8k P/D exact_match ~0.41 (== cpu baseline), Qwen3-0.6B, TP=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
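The deferred-registration lifecycle described in this commit can be sketched as a two-phase object: remember the caches when asked to register, and do the real work only after warm-up. The class and method names below (`DeferredRegistration`, `finalize_registration`) are hypothetical and the registration callback is a placeholder for whatever the worker actually calls.

```python
from typing import Callable, Optional


class DeferredRegistration:
    """Hypothetical sketch: stash the KV caches now, register them later."""

    def __init__(self, register_impl: Callable[[dict], None]) -> None:
        self._register_impl = register_impl
        self._pending: Optional[dict] = None
        self.registered = False

    def register_kv_caches(self, kv_caches: dict) -> None:
        # Called before warm-up: the physical view may not exist yet,
        # so only remember what has to be registered.
        self._pending = kv_caches

    def finalize_registration(self) -> None:
        # Called after warm-up. Idempotent: a second call does nothing.
        if self.registered or self._pending is None:
            return
        self._register_impl(self._pending)
        self._pending = None
        self.registered = True


calls = []
worker = DeferredRegistration(register_impl=calls.append)
worker.register_kv_caches({"layer0": "tensor-placeholder"})
# ... warm-up forward pass would run here ...
worker.finalize_registration()
```

Making the finalize step idempotent keeps the worker safe if the warm-up hook is invoked more than once.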

…tConnector

On a multi-chiplet device (e.g. RBLN-CR03, 4 chiplets) the KV cache is a WITH_TRANSFORM vmem entry sharded across chiplets, so get_device_addrs returns N>1 dvas and the single-device register path raised NotImplementedError.

Each chiplet area is an equal-size, structure-preserving 1/N slice (sharded head dim, dim order preserved): a logical region at byte offset `off` with per-block stride `blk` maps, within chiplet c, to area_dva[c] + off//N with stride blk//N. So register one whole-entry MR per area dva and expand each K/V cache_list region into N per-chiplet transfer regions. block_len_per_layer drives the base-class xfer descriptor math unchanged, so chiplet c reads from remote chiplet c (P/D run identical code). N==1 is the original single-device path. Added asserts on divisibility / equal area sizes.

Needs nixl_rbln.vaddr_to_dvas() (nixl-rbln) and rebel._C.Context.rbln_ctx_ptr.

Verified on Qwen3-0.6B P/D: 112 whole-entry MRs / 224 regions (28 layers x4 chiplets), Num successful transfers=1, coherent greedy output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
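The address mapping in this commit message (area_dva[c] + off//N with stride blk//N) can be written out as a short function. This is only an illustration of the arithmetic: `expand_region` and `block_addr` are hypothetical helpers, and the addresses in the example are made up.

```python
from typing import List, Tuple


def expand_region(area_dvas: List[int], off: int, blk: int) -> List[Tuple[int, int]]:
    """Expand one logical region into per-chiplet (base_address, block_stride) pairs.

    area_dvas: base device address of each chiplet's area (N entries).
    off:       byte offset of the logical region within the whole entry.
    blk:       per-block stride of the logical region in bytes.
    """
    n = len(area_dvas)
    assert n >= 1, "need at least one chiplet area"
    # Each chiplet holds an equal 1/N slice, so both values must divide evenly.
    assert off % n == 0, "offset must be divisible by the chiplet count"
    assert blk % n == 0, "block stride must be divisible by the chiplet count"
    return [(dva + off // n, blk // n) for dva in area_dvas]


def block_addr(base: int, stride: int, block_id: int) -> int:
    """Address of block block_id inside one per-chiplet region."""
    return base + block_id * stride


# Made-up example: 4 chiplets, logical offset 4096 bytes, 1024-byte blocks.
regions = expand_region([0x1000, 0x2000, 0x3000, 0x4000], off=4096, blk=1024)
# Each chiplet sees offset 4096 // 4 = 1024 and stride 1024 // 4 = 256.
```

With N == 1 the function degenerates to `(area_dva[0] + off, blk)`, which is the single-device path. The reported counts are consistent with this expansion if each layer contributes one K and one V region: 28 layers x 4 chiplets gives 112 MRs, and 28 x 2 x 4 gives 224 transfer regions.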

…connector dva-free

Move the vmem-vaddr -> NPU-dva translation and the multi-chiplet shard expansion out of RblnNixlDirectConnectorWorker into nixl-rbln. The connector now only builds logical K/V regions ((tensor, byte_offset, full_block_len)) and calls nixl_rbln.register_kv_regions(), which does vaddr_to_dvas, registers one whole-entry MR per chiplet shard, and returns the chiplet-expanded transfer tables (kv_caches_base_addr / block_len_per_layer). The connector code no longer references NPU dvas or the chiplet count; the RBLN backend plugin is unchanged and the base-class transfer path is already shard-transparent given the tables.

Verified on Qwen3-0.6B P/D: "registered 224 transfer region(s) across 4 chiplet shard(s)", Num successful transfers, coherent greedy output (same as before the refactor).

Requires nixl-rbln with register_kv_regions().

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

In device-tensor mode (which RblnNixlDirectConnector requires) the KV cache is a real `rbln` tensor, so `tensor.get_device()` returns the in-process device id the global RblnContext is keyed on, even under RBLN_DEVICES masking, where the rebel runtime remaps the worker's masked physical id to a local 0-based index (the $RBLN_DEVICES env keeps the *physical* id, which is NOT the context id).

Drop the multi-candidate probe (tensor index / $RBLN_DEVICES[0] / 0 / 0..31 scan): it was built on a stale assumption that the cache is a CPU tensor with get_device()==-1, and it even tried the physical $RBLN_DEVICES id, which is the wrong basis. Read the local id off the tensor and just validate a live RblnContext exists there.

Verified on Qwen3-0.6B P/D: decode with RBLN_DEVICES=1 resolves to device_id=0 (t.device=rbln:0), 224 regions across 4 shards, Num successful transfers=2, coherent output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
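The difference between the physical id kept in $RBLN_DEVICES and the local id the context is keyed on can be illustrated with a toy remapping. The sketch assumes the variable is a comma-separated list of physical ids and that local indices follow list order; both are assumptions for illustration, and `local_index_map` is a hypothetical helper, not part of the rebel runtime.

```python
from typing import Dict


def local_index_map(rbln_devices: str) -> Dict[int, int]:
    """Map each masked physical device id to its assumed local 0-based index."""
    physical_ids = [int(tok) for tok in rbln_devices.split(",") if tok.strip()]
    return {phys: local for local, phys in enumerate(physical_ids)}


# The commit's verified case: RBLN_DEVICES=1 resolves to local device_id 0.
mapping = local_index_map("1")
local_id = mapping[1]
```

Under this model, using the physical id 1 as a context key would point at a device the process cannot see, which is why the probe that tried $RBLN_DEVICES was on the wrong basis.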

…evice-keyed Context lookup

The RBLN backend's RblnContext pointer now comes straight from the live model runtime: RBLNModelRunner propagates its runtime_holder via the new set_runtime_holder(), and registration reads runtime_holder[0]._runtime_handle.get_context().rbln_ctx_ptr. This removes the device-id -> global_key_at_device -> from_key Context lookup from the connector entirely (device_id is now just tensor.get_device(), used only for agent metadata). No fallback: if the runtime holder isn't set we hard-fail with a clear assert rather than silently reaching for the global registry.

Also inline the trivial _resolve_ctx_device_id (now just get_device()) and shorten the register_kv_caches deferral docstring.

Verified on Qwen3-0.6B P/D across device pairs 4/5 and 3/6: both sides log rbln_ctx_ptr source=runtime.get_context(), register 224 regions across 4 chiplet shards, Num successful transfers>0, coherent output.

Needs nixl-rbln ensure_rbln_backend/register_kv_regions accepting rbln_ctx_ptr.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
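A minimal sketch of the "no fallback" lookup described above, using stand-in objects instead of the real runtime. The attribute chain `runtime_holder[0]._runtime_handle.get_context().rbln_ctx_ptr` and the name `set_runtime_holder` come from the commit message; `WorkerSketch`, `resolve_ctx_ptr`, and the fake runtime are hypothetical.

```python
from types import SimpleNamespace
from typing import Optional


class WorkerSketch:
    """Hypothetical worker that reads the context pointer from a runtime holder."""

    def __init__(self) -> None:
        self._runtime_holder: Optional[list] = None

    def set_runtime_holder(self, holder: list) -> None:
        self._runtime_holder = holder

    def resolve_ctx_ptr(self) -> int:
        # Hard-fail instead of falling back to any global registry.
        assert self._runtime_holder, (
            "runtime holder is not set; call set_runtime_holder() before "
            "registering KV caches"
        )
        runtime = self._runtime_holder[0]
        return runtime._runtime_handle.get_context().rbln_ctx_ptr


# Fake runtime with the same attribute shape, for illustration only.
fake_context = SimpleNamespace(rbln_ctx_ptr=0xDEAD)
fake_runtime = SimpleNamespace(
    _runtime_handle=SimpleNamespace(get_context=lambda: fake_context)
)

worker = WorkerSketch()
worker.set_runtime_holder([fake_runtime])
ctx_ptr = worker.resolve_ctx_ptr()
```

Failing loudly here keeps a misconfigured worker from registering memory against the wrong context.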

Comment-only: trim the module docstring, _register_kv_caches_impl docstring, the materialize-physical-view and logical-regions comments. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…up alias

Pure rename: `nixl_connector as _up` -> `nixl_connector`; all `_up.` references updated. No behavior change (import smoke-tested).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-02T07:23:03Z

Codecov Report

❌ Patch coverage is 27.96610% with 85 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...sfer/kv_connector/v1/rbln_nixl_direct_connector.py	28.82%	78 Missing and 1 partial ⚠️
vllm_rbln/v1/worker/rbln_worker.py	0.00%	4 Missing and 1 partial ⚠️
...kv_transfer/kv_connector/v1/rbln_nixl_connector.py	0.00%	1 Missing ⚠️

…otation)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Cover the worker's deferred-registration lifecycle, host-bounce removal, connector role delegation, platform NIXL device support, and factory registration. Follows the object.__new__ + mock pattern from test_pd_disaggregation.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
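For readers unfamiliar with the object.__new__ + mock pattern, the sketch below shows the general idea: create the instance without running a constructor that needs real hardware, then attach mocks for its collaborators. `HeavyWorker` and its methods are invented for illustration and are not the classes tested in this PR.

```python
from unittest.mock import MagicMock


class HeavyWorker:
    """Invented class whose constructor cannot run in a unit test."""

    def __init__(self) -> None:
        raise RuntimeError("needs real hardware")

    def register_kv_caches(self, kv_caches: dict) -> None:
        # Deferred: only remember the caches.
        self._pending = kv_caches

    def finalize(self) -> None:
        if self._pending is not None:
            self.backend.register(self._pending)
            self._pending = None


def make_worker_without_init() -> HeavyWorker:
    # object.__new__ allocates the instance but skips __init__ entirely.
    worker = object.__new__(HeavyWorker)
    # Provide only the attributes the methods under test rely on.
    worker.backend = MagicMock()
    worker._pending = None
    return worker


worker = make_worker_without_init()
worker.register_kv_caches({"layer0": "tensor-placeholder"})
worker.backend.register.assert_not_called()   # nothing registered yet
worker.finalize()
worker.backend.register.assert_called_once_with({"layer0": "tensor-placeholder"})
```

The trade-off is that the test must set every attribute the exercised methods touch, so it stays coupled to the class's internal attribute names.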

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

rebel-yskim and others added 11 commits May 26, 2026 18:28

style(kv-connector): import upstream module as nixl_connector, drop _…

6260194

Fix comments & minor fix

8f78006

minor

67960ec

rebel-yskim self-assigned this Jun 2, 2026

fix(kv-connector): satisfy pre-commit (ruff format, mypy Optional ann…

84b36a7

rebel-yskim changed the title ~~ex_feat(nixl): support nixl-rbln~~ feature(nixl): support nixl-rbln Jun 5, 2026

RBLN-SW deleted a comment from github-actions Bot Jun 5, 2026

rebel-yskim and others added 2 commits June 5, 2026 16:01

fix(test): annotate impl_calls for mypy; isort import order

36cdad8

rebel-yskim wants to merge 14 commits into dev from yunseong/sr-122-add-direct-vmem-connector
