feature(nixl): support nixl-rbln #640
Draft
rebel-yskim wants to merge 14 commits into
Add a sibling of RblnNixlConnector that registers KV cache tensors
directly with NIXL through the nixl-rbln adapter, skipping the CPU
host-staging buffer and the D2H/H2D bounce copies that the current
connector pays per block transfer.
Layout:
- RblnNixlDirectConnector inherits from RblnNixlConnector, swaps
the worker for RblnNixlDirectConnectorWorker.
- The worker overrides initialize_host_xfer_buffer (no-op),
set_host_xfer_buffer_ops (no-op), register_kv_caches (calls
nixl_rbln.register per layer), and deregister_kv_caches.
- Not registered in kv_connector/factory.py on purpose — opt-in
only, so existing deployments keep the host-bounce path until a
consumer flips kv_transfer_config.kv_connector to
"RblnNixlDirectConnector".
Requires:
- nixl-rbln package installed in the worker's venv (late-import
check in __init__ produces an actionable error if missing).
- NIC capable of ibv_reg_dmabuf_mr (Mellanox CX-5+ on tested
firmwares). nixl-rbln/docs/dmabuf-fd-handoff.md tracks the
follow-up that would remove this requirement.
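The late-import check mentioned in the first requirement could look roughly like the sketch below. It is not the PR's code: `require_module` and the wording of the error are hypothetical, and only the import name `nixl_rbln` is taken from this commit message.

```python
import importlib


def require_module(name: str, hint: str):
    """Import `name` lazily and turn a missing package into an actionable error.

    Hypothetical helper: the PR performs this check inside the connector's
    __init__; the exact message it prints is not shown in the source.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise RuntimeError(
            f"{name!r} could not be imported. {hint}"
        ) from exc


def load_nixl_rbln():
    # Called at construction time, so importing the connector module itself
    # never fails on hosts that do not have the adapter installed.
    return require_module(
        "nixl_rbln",
        "Install the nixl-rbln package in the worker's venv.",
    )
```

Deferring the import this way keeps the module importable everywhere and moves the failure to the point where a deployment actually selects the direct connector.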
End-to-end transfer verification against a P/D disaggregation scenario
is out of scope for this commit: the dev host's NICs (Intel E810,
Broadcom Thor) reject ibv_reg_dmabuf_mr with EOPNOTSUPP, so a
runtime smoke test has to run on a Mellanox host. Only the class
structure and import surface are exercised today.
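The class layout described in this commit can be sketched with stand-in classes. Everything here is illustrative: `HostBounceWorker` and `DirectWorker` stand in for RblnNixlConnectorWorker and RblnNixlDirectConnectorWorker, and `register_fn` / `deregister_fn` are hypothetical callables standing in for the nixl-rbln registration calls, whose real signatures are not shown in this PR text.

```python
from typing import Any, Callable, Dict, List


class HostBounceWorker:
    """Stand-in for the host-bounce worker (RblnNixlConnectorWorker)."""

    def __init__(self) -> None:
        self.host_xfer_buffers: Dict[str, bytearray] = {}
        self.copy_blocks = None

    def initialize_host_xfer_buffer(self, kv_caches: Dict[str, Any]) -> None:
        # Host-bounce path: one CPU staging buffer per layer.
        self.host_xfer_buffers = {name: bytearray(16) for name in kv_caches}

    def set_host_xfer_buffer_ops(self, copy_operation: Callable) -> None:
        # Host-bounce path: remember the D2H/H2D copy routine.
        self.copy_blocks = copy_operation


class DirectWorker(HostBounceWorker):
    """Stand-in for the direct worker (RblnNixlDirectConnectorWorker)."""

    def __init__(self, register_fn: Callable, deregister_fn: Callable) -> None:
        super().__init__()
        self._register_fn = register_fn      # assumed: registers one layer
        self._deregister_fn = deregister_fn  # assumed: releases one handle
        self._handles: List[Any] = []

    def initialize_host_xfer_buffer(self, kv_caches: Dict[str, Any]) -> None:
        return None  # no-op: no CPU staging buffer on the direct path

    def set_host_xfer_buffer_ops(self, copy_operation: Callable) -> None:
        return None  # no-op: no bounce copies on the direct path

    def register_kv_caches(self, kv_caches: Dict[str, Any]) -> None:
        # One registration call per layer, as in the commit description.
        for name, tensor in kv_caches.items():
            self._handles.append(self._register_fn(name, tensor))

    def deregister_kv_caches(self) -> None:
        while self._handles:
            self._deregister_fn(self._handles.pop())
```

Because the direct worker only overrides the four hooks, the rest of the transfer logic in the base worker is inherited unchanged.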
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ctor
Two interim changes toward the device-to-device connector (SR-122):
1. platform.get_nixl_supported_devices -> {"cpu": ("cpu", "rbln")}.
RBLN reports device_type="cpu", so NIXL looks up
_NIXL_SUPPORTED_DEVICE["cpu"]; the old {"rbln": ("cpu",)} was a dead
key (never matched device_type) and left "rbln" kv_buffer rejected
with "cpu with rbln kv_buffer is not supported". Now the
device-to-device path can pass that gate. "cpu" kv_buffer stays
allowed for the host-bounce path.
2. RblnNixlConnectorWorker asserts kv_buffer_device != "rbln" — this
connector is host-bounce only; rbln must go through
RblnNixlDirectConnector. (See nixl-rbln/docs/direct-connector-impl-map.md
-- the assert may move to connector __init__ when the Direct
connector lands, so it doesn't collide with a Direct worker that
subclasses this one.)
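The effect of change 1 can be illustrated with a small lookup sketch. It assumes that the platform mapping is merged over a default table which already allows "cpu" -> ("cpu",); that default, and the helper names `merged_table` and `is_supported`, are assumptions made for illustration and are not taken from this PR.

```python
from typing import Dict, Tuple

# Assumed default table (illustrative only).
DEFAULT_SUPPORTED: Dict[str, Tuple[str, ...]] = {"cpu": ("cpu",)}


def merged_table(platform_devices: Dict[str, Tuple[str, ...]]) -> Dict[str, Tuple[str, ...]]:
    """Overlay the platform's mapping on the assumed default table."""
    table = dict(DEFAULT_SUPPORTED)
    table.update(platform_devices)
    return table


def is_supported(table: Dict[str, Tuple[str, ...]], device_type: str, kv_buffer_device: str) -> bool:
    """The gate: the lookup key is the reported device_type."""
    return kv_buffer_device in table.get(device_type, ())


# RBLN reports device_type="cpu" at this point in the series.
old = merged_table({"rbln": ("cpu",)})        # dead key, never looked up
new = merged_table({"cpu": ("cpu", "rbln")})  # key matches device_type

print(is_supported(old, "cpu", "rbln"))  # the rejected combination
print(is_supported(new, "cpu", "rbln"))  # passes the gate after the change
print(is_supported(new, "cpu", "cpu"))   # host-bounce path still allowed
```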
The host-bounce path (kv_buffer=cpu) is unchanged and gsm8k-verified.
The Direct connector itself (rbln_nixl_direct_connector.py) is still a
design template; register_kv_caches override is the remaining work,
mapped out in the impl doc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Make the device-to-device NIXL connector work end-to-end: the prefill NPU KV cache is read directly by decode over RoCE/RDMA, no CPU host-staging bounce.
- factory: register RblnNixlDirectConnector (opt-in)
- platform: add "rbln" device_type key (device-tensor mode reports device_type="rbln", not "cpu")
- rbln_nixl_connector: move the kv_buffer_device guard to the connector __init__ so the direct worker can subclass the host-bounce worker without inheriting its assert
- rbln_nixl_direct_connector: register KV-cache vmem directly: resolve the owning RblnContext device id, defer registration until after warm-up (the physical view is allocated by the warm-up forward), register one whole-entry MR per layer (RBLN dma-buf export only accepts allocation-base dvas) while keeping the K/V split in the transfer descriptors
- rbln_worker: finalize the deferred KV registration after warm_up_model()
Requires VLLM_RBLN_USE_DEVICE_TENSOR=1.
Verified: gsm8k P/D exact_match ~0.41 (== cpu baseline), Qwen3-0.6B, TP=1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tConnector
On a multi-chiplet device (e.g. RBLN-CR03, 4 chiplets) the KV cache is a WITH_TRANSFORM vmem entry sharded across chiplets, so get_device_addrs returns N>1 dvas and the single-device register path raised NotImplementedError.
Each chiplet area is an equal-size, structure-preserving 1/N slice (sharded head dim, dim order preserved): a logical region at byte offset `off` with per-block stride `blk` maps, within chiplet c, to area_dva[c] + off//N with stride blk//N. So register one whole-entry MR per area dva and expand each K/V cache_list region into N per-chiplet transfer regions. block_len_per_layer drives the base-class xfer descriptor math unchanged, so chiplet c reads from remote chiplet c (P/D run identical code). N==1 is the original single-device path. Added asserts on divisibility / equal area sizes.
Needs nixl_rbln.vaddr_to_dvas() (nixl-rbln) and rebel._C.Context.rbln_ctx_ptr.
Verified on Qwen3-0.6B P/D: 112 whole-entry MRs / 224 regions (28 layers x4 chiplets), Num successful transfers=1, coherent greedy output.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
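The per-chiplet address mapping described in this commit can be worked through in a few lines. This is a sketch of the arithmetic only: `expand_region` and `block_addr` are hypothetical names, the dva values are made up, and the count at the end assumes two logical regions (K and V) per layer, which is consistent with the reported 112 MRs / 224 regions.

```python
from typing import List, Tuple


def expand_region(area_dvas: List[int], off: int, blk: int) -> List[Tuple[int, int]]:
    """Expand one logical region into N per-chiplet (base_addr, stride) pairs.

    area_dvas: base device address of each chiplet's 1/N slice
    off:       byte offset of the logical region inside the whole entry
    blk:       per-block stride of the logical region, in bytes
    """
    n = len(area_dvas)
    # Each chiplet holds an equal 1/N slice, so both values must divide evenly.
    assert off % n == 0, "region offset must be divisible by chiplet count"
    assert blk % n == 0, "block stride must be divisible by chiplet count"
    return [(dva + off // n, blk // n) for dva in area_dvas]


def block_addr(base: int, stride: int, block_id: int) -> int:
    """Address of block `block_id` inside one per-chiplet region."""
    return base + block_id * stride


# Made-up addresses for a 4-chiplet device.
regions = expand_region([0x1000, 0x2000, 0x3000, 0x4000], off=4096, blk=8192)
for chiplet, (base, stride) in enumerate(regions):
    print(f"chiplet {chiplet}: base={base:#x} stride={stride}")

# Region count under the stated assumption (K and V per layer).
layers, kv_regions_per_layer, chiplets = 28, 2, 4
print(layers * chiplets, "MRs,", layers * kv_regions_per_layer * chiplets, "regions")
```

With a single area dva the function returns the original offset and stride unchanged, which matches the N==1 single-device path.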
…connector dva-free
Move the vmem-vaddr -> NPU-dva translation and the multi-chiplet shard expansion out of RblnNixlDirectConnectorWorker into nixl-rbln. The connector now only builds logical K/V regions ((tensor, byte_offset, full_block_len)) and calls nixl_rbln.register_kv_regions(), which does vaddr_to_dvas, registers one whole-entry MR per chiplet shard, and returns the chiplet-expanded transfer tables (kv_caches_base_addr / block_len_per_layer).
The connector code no longer references NPU dvas or the chiplet count; the RBLN backend plugin is unchanged and the base-class transfer path is already shard-transparent given the tables.
Verified on Qwen3-0.6B P/D: "registered 224 transfer region(s) across 4 chiplet shard(s)", Num successful transfers, coherent greedy output (same as before the refactor).
Requires nixl-rbln with register_kv_regions().
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
In device-tensor mode (which RblnNixlDirectConnector requires) the KV cache is a real `rbln` tensor, so `tensor.get_device()` returns the in-process device id the global RblnContext is keyed on, even under RBLN_DEVICES masking, where the rebel runtime remaps the worker's masked physical id to a local 0-based index (the $RBLN_DEVICES env keeps the *physical* id, which is NOT the context id).
Drop the multi-candidate probe (tensor index / $RBLN_DEVICES[0] / 0 / 0..31 scan): it was built on a stale assumption that the cache is a CPU tensor with get_device()==-1, and it even tried the physical $RBLN_DEVICES id, which is the wrong basis. Read the local id off the tensor and just validate a live RblnContext exists there.
Verified on Qwen3-0.6B P/D: decode with RBLN_DEVICES=1 resolves to device_id=0 (t.device=rbln:0), 224 regions across 4 shards, Num successful transfers=2, coherent output.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…evice-keyed Context lookup
The RBLN backend's RblnContext pointer now comes straight from the live model runtime: RBLNModelRunner propagates its runtime_holder via the new set_runtime_holder(), and registration reads runtime_holder[0]._runtime_handle.get_context().rbln_ctx_ptr. This removes the device-id -> global_key_at_device -> from_key Context lookup from the connector entirely (device_id is now just tensor.get_device(), used only for agent metadata).
No fallback: if the runtime holder isn't set we hard-fail with a clear assert rather than silently reaching for the global registry.
Also inline the trivial _resolve_ctx_device_id (now just get_device()) and shorten the register_kv_caches deferral docstring.
Verified on Qwen3-0.6B P/D across device pairs 4/5 and 3/6: both sides log rbln_ctx_ptr source=runtime.get_context(), register 224 regions across 4 chiplet shards, Num successful transfers>0, coherent output.
Needs nixl-rbln ensure_rbln_backend/register_kv_regions accepting rbln_ctx_ptr.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
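The "no fallback" lookup described in this commit might be shaped like the sketch below. The attribute chain is copied from the commit message; `resolve_rbln_ctx_ptr`, the assert text, and the fake runtime built from `SimpleNamespace` are hypothetical and only exist to make the example runnable without the RBLN runtime.

```python
from types import SimpleNamespace


def resolve_rbln_ctx_ptr(runtime_holder):
    """Read the context pointer from the live runtime, or fail loudly.

    Hypothetical helper: there is deliberately no fallback to a global,
    device-keyed registry when the holder has not been set.
    """
    assert runtime_holder, (
        "runtime holder is not set; set_runtime_holder() must be called "
        "before KV caches are registered"
    )
    return runtime_holder[0]._runtime_handle.get_context().rbln_ctx_ptr


# Fake runtime with the same attribute chain, for illustration only.
fake_context = SimpleNamespace(rbln_ctx_ptr=0xDEADBEEF)
fake_runtime = SimpleNamespace(
    _runtime_handle=SimpleNamespace(get_context=lambda: fake_context)
)

print(hex(resolve_rbln_ctx_ptr([fake_runtime])))
```

Failing at this point surfaces a wiring mistake at registration time instead of letting a transfer run against the wrong context.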
Comment-only: trim the module docstring, _register_kv_caches_impl docstring, the materialize-physical-view and logical-regions comments. No behavior change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…up alias
Pure rename: `nixl_connector as _up` -> `nixl_connector`; all `_up.` references updated. No behavior change (import smoke-tested).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…otation)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cover the worker's deferred-registration lifecycle, host-bounce removal, connector role delegation, platform NIXL device support, and factory registration. Follows the object.__new__ + mock pattern from test_pd_disaggregation.py.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
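The object.__new__ + mock pattern mentioned for these tests can be sketched as follows. `DeferredWorker`, `finalize_kv_registration`, and the `_backend` attribute are hypothetical stand-ins; the actual test file and worker method names are not shown in this PR text.

```python
from unittest.mock import MagicMock


class DeferredWorker:
    """Stand-in worker whose real constructor would need NIXL and an NPU."""

    def __init__(self):
        raise RuntimeError("real constructor needs hardware")

    def register_kv_caches(self, kv_caches):
        # Deferred: remember the caches, register after warm-up.
        self._pending = kv_caches

    def finalize_kv_registration(self):
        if self._pending is None:
            return  # nothing deferred, or already finalized
        self._backend.register(self._pending)
        self._pending = None


def make_worker():
    # object.__new__ skips __init__, so no hardware is touched;
    # the attributes the methods rely on are filled in by hand.
    worker = object.__new__(DeferredWorker)
    worker._pending = None
    worker._backend = MagicMock()
    return worker


def test_registration_is_deferred_until_finalize():
    worker = make_worker()
    worker.register_kv_caches({"layer0": "kv"})
    worker._backend.register.assert_not_called()

    worker.finalize_kv_registration()
    worker._backend.register.assert_called_once_with({"layer0": "kv"})

    # A second finalize must not register again.
    worker.finalize_kv_registration()
    assert worker._backend.register.call_count == 1
```

The pattern keeps unit tests independent of the NPU runtime: only the attributes a method reads need to exist on the bare instance.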
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>