feature(nixl): support nixl-rbln #640

Draft
rebel-yskim wants to merge 14 commits into dev from yunseong/sr-122-add-direct-vmem-connector

@rebel-yskim

Contributor

🚀 Summary of Changes

What does this PR do? What feature, fix, or improvement does it bring?


📌 Related Issues / Tickets

  • Resolves #
  • Related to #

✅ Type of Change

  • 🚀 Release (release)
  • ✨ Feature (feature)
  • 🧠 Model support (model)
  • 🧬 Core engine changes (core)
  • 🛠 Bug fix (fix)
  • ⚙️ Performance improvement (perf)
  • 🔁 Refactor or code cleanup (refactor)
  • 📄 Documentation (docs)
  • ❓ Other (other): please describe

🧪 How to Test

  1. Run ...
  2. Verify output: ...
  3. Edge case tested: ...

📸 Screenshots / Logs (if applicable)


📋 Checklist

  • PR title follows Conventional Commits format
  • This PR is linked to an existing issue
  • The test method is described, and the expected result is clearly stated
  • Relevant documentation has been updated (if applicable)

💬 Notes


rebel-yskim and others added 11 commits May 26, 2026 18:28
Add a sibling of RblnNixlConnector that registers KV cache tensors
directly with NIXL through the nixl-rbln adapter, skipping the CPU
host-staging buffer and the D2H/H2D bounce copies that the current
connector pays per block transfer.

Layout:
  - RblnNixlDirectConnector inherits from RblnNixlConnector, swaps
    the worker for RblnNixlDirectConnectorWorker.
  - The worker overrides initialize_host_xfer_buffer (no-op),
    set_host_xfer_buffer_ops (no-op), register_kv_caches (calls
    nixl_rbln.register per layer), and deregister_kv_caches.
  - Not registered in kv_connector/factory.py on purpose — opt-in
    only, so existing deployments keep the host-bounce path until a
    consumer flips kv_transfer_config.kv_connector to
    "RblnNixlDirectConnector".

Requires:
  - nixl-rbln package installed in the worker's venv (late-import
    check in __init__ produces an actionable error if missing).
  - NIC capable of ibv_reg_dmabuf_mr (Mellanox CX-5+ on tested
    firmwares). nixl-rbln/docs/dmabuf-fd-handoff.md tracks the
    follow-up that would remove this requirement.

End-to-end transfer verification against a P/D disagg scenario is
out of scope of this commit — the dev host's NICs (Intel E810,
Broadcom Thor) decline ibv_reg_dmabuf_mr with EOPNOTSUPP, so a
runtime smoke needs to land on a Mellanox host. The class structure
and import surface are exercised today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
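
The layout described in the commit above can be sketched roughly as follows. This is only an illustration, not the PR's code: the base-class body is a hypothetical stand-in, and `register_region` is an assumed callable standing in for the nixl-rbln registration call, whose real signature is not shown in this conversation.

```python
class RblnNixlConnectorWorker:
    """Hypothetical stand-in for the existing host-bounce worker."""

    def __init__(self):
        self.host_xfer_buffers = None
        self.registered = []

    def initialize_host_xfer_buffer(self, kv_caches):
        # Host-bounce path: one CPU staging buffer per layer.
        self.host_xfer_buffers = {
            name: bytearray(len(buf)) for name, buf in kv_caches.items()
        }

    def register_kv_caches(self, kv_caches):
        self.initialize_host_xfer_buffer(kv_caches)
        self.registered = [("host", name) for name in kv_caches]


class RblnNixlDirectConnectorWorker(RblnNixlConnectorWorker):
    """Direct variant: no staging buffer, per-layer device registration."""

    def __init__(self, register_region):
        super().__init__()
        # Assumed injection point for the nixl-rbln registration call.
        self._register_region = register_region

    def initialize_host_xfer_buffer(self, kv_caches):
        return None  # no-op: nothing is staged on the host

    def set_host_xfer_buffer_ops(self, copy_operation=None):
        return None  # no-op: no D2H/H2D bounce copies to configure

    def register_kv_caches(self, kv_caches):
        # One registration per layer, directly on the KV cache buffer.
        self.registered = [
            self._register_region(name, buf) for name, buf in kv_caches.items()
        ]

    def deregister_kv_caches(self):
        self.registered.clear()
```

Overriding the two host-buffer hooks as no-ops lets the direct worker reuse the rest of the base worker unchanged, which appears to be the motivation for subclassing rather than writing a separate worker.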
…ctor

Two interim changes toward the device-to-device connector (SR-122):

1. platform.get_nixl_supported_devices -> {"cpu": ("cpu", "rbln")}.
   RBLN reports device_type="cpu", so NIXL looks up
   _NIXL_SUPPORTED_DEVICE["cpu"]; the old {"rbln": ("cpu",)} was a dead
   key (never matched device_type) and left "rbln" kv_buffer rejected
   with "cpu with rbln kv_buffer is not supported". Now the
   device-to-device path can pass that gate. "cpu" kv_buffer stays
   allowed for the host-bounce path.

2. RblnNixlConnectorWorker asserts kv_buffer_device != "rbln" — this
   connector is host-bounce only; rbln must go through
   RblnNixlDirectConnector. (See nixl-rbln/docs/direct-connector-impl-map.md
   -- the assert may move to connector __init__ when the Direct
   connector lands, so it doesn't collide with a Direct worker that
   subclasses this one.)

The host-bounce path (kv_buffer=cpu) is unchanged and gsm8k-verified.
The Direct connector itself (rbln_nixl_direct_connector.py) is still a
design template; register_kv_caches override is the remaining work,
mapped out in the impl doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
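
The dead-key problem from item 1 above can be reproduced with a small lookup sketch. The helper names and the `DEFAULTS` table are assumptions made for illustration (the sketch assumes a default table that already allows a `cpu` device with a `cpu` buffer, which the platform's entries are merged into); only the two platform mappings come from the commit message.

```python
def merged_table(defaults, platform_entries):
    """Merge platform-provided entries over an assumed default table."""
    table = dict(defaults)
    table.update(platform_entries)
    return table


def is_supported(table, device_type, kv_buffer_device):
    """True if kv_buffer_device is allowed for the reported device_type."""
    return kv_buffer_device in table.get(device_type, ())


# Assumed default entry; the real default table is not shown in this PR.
DEFAULTS = {"cpu": ("cpu",)}

# Old platform mapping: keyed on "rbln", which never matches device_type="cpu".
old = merged_table(DEFAULTS, {"rbln": ("cpu",)})
# New platform mapping from this commit.
new = merged_table(DEFAULTS, {"cpu": ("cpu", "rbln")})

print(is_supported(old, "cpu", "rbln"))  # rejected by the old mapping
print(is_supported(new, "cpu", "rbln"))  # accepted by the new mapping
```

Under these assumptions the host-bounce combination (`cpu` device, `cpu` buffer) passes in both tables, which matches the statement that the host-bounce path is unchanged.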
Make the device-to-device NIXL connector work end-to-end: the prefill NPU
KV cache is read directly by decode over RoCE/RDMA, no CPU host-staging bounce.

- factory: register RblnNixlDirectConnector (opt-in)
- platform: add "rbln" device_type key (device-tensor mode reports
  device_type="rbln", not "cpu")
- rbln_nixl_connector: move the kv_buffer_device guard to the connector
  __init__ so the direct worker can subclass the host-bounce worker without
  inheriting its assert
- rbln_nixl_direct_connector: register KV-cache vmem directly — resolve the
  owning RblnContext device id, defer registration until after warm-up (the
  physical view is allocated by the warm-up forward), register one
  whole-entry MR per layer (RBLN dma-buf export only accepts allocation-base
  dvas) while keeping the K/V split in the transfer descriptors
- rbln_worker: finalize the deferred KV registration after warm_up_model()

Requires VLLM_RBLN_USE_DEVICE_TENSOR=1. Verified: gsm8k P/D exact_match
~0.41 (== cpu baseline), Qwen3-0.6B, TP=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
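
The deferred registration described above (register after warm-up, because the physical view only exists once the warm-up forward has run) might look like the following sketch. The class name, `finalize_kv_registration`, and the injected `register_fn` are hypothetical names chosen for illustration; the PR's actual method names are not shown in this conversation.

```python
class DeferredKVRegistration:
    """Remember KV caches at register time; do the real work after warm-up."""

    def __init__(self, register_fn):
        self._register_fn = register_fn  # assumed stand-in for NIXL registration
        self._pending = None
        self.registered = False

    def register_kv_caches(self, kv_caches):
        # Called before warm-up: the physical view is not allocated yet,
        # so only record what needs to be registered.
        self._pending = kv_caches

    def finalize_kv_registration(self):
        # Called by the worker after warm_up_model(); safe to call twice.
        if self._pending is None or self.registered:
            return
        self._register_fn(self._pending)
        self._pending = None
        self.registered = True


calls = []
worker = DeferredKVRegistration(calls.append)
worker.register_kv_caches({"layer0": "kv-tensor"})
# ... warm-up forward would run here ...
worker.finalize_kv_registration()
```

Making the finalize step idempotent is a design choice of this sketch, not something the commit message states.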
…tConnector

On a multi-chiplet device (e.g. RBLN-CR03, 4 chiplets) the KV cache is a
WITH_TRANSFORM vmem entry sharded across chiplets, so get_device_addrs returns
N>1 dvas and the single-device register path raised NotImplementedError.

Each chiplet area is an equal-size, structure-preserving 1/N slice (sharded
head dim, dim order preserved): a logical region at byte offset `off` with
per-block stride `blk` maps, within chiplet c, to area_dva[c] + off//N with
stride blk//N. So register one whole-entry MR per area dva and expand each
K/V cache_list region into N per-chiplet transfer regions. block_len_per_layer
drives the base-class xfer descriptor math unchanged, so chiplet c reads from
remote chiplet c (P/D run identical code). N==1 is the original single-device
path. Added asserts on divisibility / equal area sizes.

Needs nixl_rbln.vaddr_to_dvas() (nixl-rbln) and rebel._C.Context.rbln_ctx_ptr.
Verified on Qwen3-0.6B P/D: 112 whole-entry MRs / 224 regions (28 layers x 4
chiplets), Num successful transfers=1, coherent greedy output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
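
The per-chiplet address mapping stated above (a logical region at byte offset `off` with per-block stride `blk` maps, within chiplet c, to `area_dva[c] + off//N` with stride `blk//N`) can be written out as a small sketch. The function names are hypothetical and the addresses below are made-up example values, not real NPU dvas.

```python
def expand_region(area_dvas, off, blk):
    """Expand one logical K/V region into N per-chiplet (base, stride) pairs."""
    n = len(area_dvas)
    # Mirrors the divisibility asserts mentioned in the commit message.
    assert off % n == 0, "region offset must divide evenly across chiplets"
    assert blk % n == 0, "block length must divide evenly across chiplets"
    return [(dva + off // n, blk // n) for dva in area_dvas]


def block_addr(base, stride, block_id):
    """Address of one KV block inside a per-chiplet region."""
    return base + block_id * stride


# Example values only: 4 chiplet areas, region at byte 4096, 1024-byte blocks.
areas = [0x10000, 0x20000, 0x30000, 0x40000]
regions = expand_region(areas, off=4096, blk=1024)
for c, (base, stride) in enumerate(regions):
    print(c, hex(base), stride, hex(block_addr(base, stride, block_id=3)))
```

With N == 1 the expansion returns the original base and stride, which corresponds to the single-device path. The reported counts are consistent with this scheme if each of the 28 layers contributes one K and one V region: 28 x 4 = 112 MRs and 28 x 2 x 4 = 224 regions.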
…connector dva-free

Move the vmem-vaddr -> NPU-dva translation and the multi-chiplet shard
expansion out of RblnNixlDirectConnectorWorker into nixl-rbln. The connector
now only builds logical K/V regions ((tensor, byte_offset, full_block_len)) and
calls nixl_rbln.register_kv_regions(), which does vaddr_to_dvas, registers one
whole-entry MR per chiplet shard, and returns the chiplet-expanded transfer
tables (kv_caches_base_addr / block_len_per_layer). The connector code no longer
references NPU dvas or the chiplet count; the RBLN backend plugin is unchanged
and the base-class transfer path is already shard-transparent given the tables.

Verified on Qwen3-0.6B P/D: "registered 224 transfer region(s) across 4 chiplet
shard(s)", Num successful transfers, coherent greedy output (same as before the
refactor). Requires nixl-rbln with register_kv_regions().

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
In device-tensor mode (which RblnNixlDirectConnector requires) the KV cache is
a real `rbln` tensor, so `tensor.get_device()` returns the in-process device id
the global RblnContext is keyed on — even under RBLN_DEVICES masking, where the
rebel runtime remaps the worker's masked physical id to a local 0-based index
(the $RBLN_DEVICES env keeps the *physical* id, which is NOT the context id).

Drop the multi-candidate probe (tensor index / $RBLN_DEVICES[0] / 0 / 0..31
scan): it was built on a stale assumption that the cache is a CPU tensor with
get_device()==-1, and it even tried the physical $RBLN_DEVICES id, which is the
wrong basis. Read the local id off the tensor and just validate a live
RblnContext exists there.

Verified on Qwen3-0.6B P/D: decode with RBLN_DEVICES=1 resolves to device_id=0
(t.device=rbln:0), 224 regions across 4 shards, Num successful transfers=2,
coherent output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
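
The difference between the physical id in `$RBLN_DEVICES` and the local context id can be illustrated with a short sketch. It assumes `RBLN_DEVICES` is a comma-separated list of physical ids and that the runtime numbers the visible devices 0, 1, ... in list order; both are assumptions for illustration, and `local_index` is a hypothetical helper, not part of the rebel runtime.

```python
def local_index(rbln_devices_env, physical_id):
    """Map a masked physical device id to its assumed local 0-based index."""
    visible = [int(tok) for tok in rbln_devices_env.split(",") if tok.strip()]
    return visible.index(physical_id)


# Decode worker masked to physical device 1: the local (context) id is 0,
# so using the physical id 1 as the context key would be the wrong basis.
print(local_index("1", 1))
```

This matches the verification note above, where decode with RBLN_DEVICES=1 resolves to device_id=0. In the connector itself no such computation is needed, since the commit reads the local id directly from `tensor.get_device()`.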
…evice-keyed Context lookup

The RBLN backend's RblnContext pointer now comes straight from the live model
runtime: RBLNModelRunner propagates its runtime_holder via the new
set_runtime_holder(), and registration reads
runtime_holder[0]._runtime_handle.get_context().rbln_ctx_ptr. This removes the
device-id -> global_key_at_device -> from_key Context lookup from the connector
entirely (device_id is now just tensor.get_device(), used only for agent
metadata). No fallback: if the runtime holder isn't set we hard-fail with a
clear assert rather than silently reaching for the global registry.

Also inline the trivial _resolve_ctx_device_id (now just get_device()) and
shorten the register_kv_caches deferral docstring.

Verified on Qwen3-0.6B P/D across device pairs 4/5 and 3/6: both sides log
rbln_ctx_ptr source=runtime.get_context(), register 224 regions across 4 chiplet
shards, Num successful transfers>0, coherent output.

Needs nixl-rbln ensure_rbln_backend/register_kv_regions accepting rbln_ctx_ptr.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
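
A minimal sketch of the runtime-holder lookup with a hard failure, as described above. The attribute chain `runtime_holder[0]._runtime_handle.get_context().rbln_ctx_ptr` and the `set_runtime_holder` name come from the commit message; the surrounding class, the fake runtime built from `SimpleNamespace`, and the pointer value are stand-ins for illustration.

```python
from types import SimpleNamespace


class CtxPtrSource:
    """Hypothetical holder that resolves rbln_ctx_ptr from the live runtime."""

    def __init__(self):
        self._runtime_holder = None

    def set_runtime_holder(self, runtime_holder):
        self._runtime_holder = runtime_holder

    def rbln_ctx_ptr(self):
        # No fallback to a global registry: fail loudly if the holder is unset.
        assert self._runtime_holder, (
            "runtime holder not set; call set_runtime_holder() before "
            "registering KV caches"
        )
        runtime = self._runtime_holder[0]
        return runtime._runtime_handle.get_context().rbln_ctx_ptr


# Fake runtime object exposing the same attribute chain (example value only).
fake_ctx = SimpleNamespace(rbln_ctx_ptr=0xDEADBEEF)
fake_runtime = SimpleNamespace(
    _runtime_handle=SimpleNamespace(get_context=lambda: fake_ctx)
)

source = CtxPtrSource()
source.set_runtime_holder([fake_runtime])
print(hex(source.rbln_ctx_ptr()))
```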
Comment-only: trim the module docstring, _register_kv_caches_impl docstring, the
materialize-physical-view and logical-regions comments. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…up alias

Pure rename: `nixl_connector as _up` -> `nixl_connector`; all `_up.` references
updated. No behavior change (import smoke-tested).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rebel-yskim rebel-yskim self-assigned this Jun 2, 2026
@codecov

codecov Bot commented Jun 2, 2026

Codecov Report

❌ Patch coverage is 27.96610% with 85 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...sfer/kv_connector/v1/rbln_nixl_direct_connector.py | 28.82% | 78 Missing and 1 partial ⚠️ |
| vllm_rbln/v1/worker/rbln_worker.py | 0.00% | 4 Missing and 1 partial ⚠️ |
| ...kv_transfer/kv_connector/v1/rbln_nixl_connector.py | 0.00% | 1 Missing ⚠️ |

…otation)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rebel-yskim rebel-yskim changed the title from "ex_feat(nixl): support nixl-rbln" to "feature(nixl): support nixl-rbln" Jun 5, 2026
@RBLN-SW RBLN-SW deleted a comment from github-actions Bot Jun 5, 2026
rebel-yskim and others added 2 commits June 5, 2026 16:01
Cover the worker's deferred-registration lifecycle, host-bounce
removal, connector role delegation, platform NIXL device support,
and factory registration. Follows the object.__new__ + mock pattern
from test_pd_disaggregation.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
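
The `object.__new__` + mock pattern mentioned in the test commit above can be sketched as follows. The `Worker` class, its attributes, and the test name are hypothetical; the real tests and `test_pd_disaggregation.py` are not shown in this conversation. The idea is to build an instance without running a hardware-dependent `__init__`, then attach mocks for the collaborators.

```python
from unittest.mock import MagicMock


class Worker:
    """Hypothetical worker whose constructor needs real hardware."""

    def __init__(self):
        raise RuntimeError("requires an NPU and a NIXL agent")

    def register_kv_caches(self, kv_caches):
        self._pending = kv_caches  # deferred until after warm-up

    def finalize(self):
        self.backend.register(self._pending)


def make_worker():
    # Bypass __init__ entirely, then set only the attributes under test.
    worker = object.__new__(Worker)
    worker.backend = MagicMock()
    worker._pending = None
    return worker


def test_registration_is_deferred():
    worker = make_worker()
    worker.register_kv_caches({"layer0": "kv-tensor"})
    worker.backend.register.assert_not_called()
    worker.finalize()
    worker.backend.register.assert_called_once_with({"layer0": "kv-tensor"})


test_registration_is_deferred()
```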