[serve.llm] Delegate P/D orchestration to the KV-connector backend by kouroshHakha · Pull Request #63950 · ray-project/ray

kouroshHakha · 2026-06-09T06:03:29Z

Why

Make the KV-transfer connector backend the single place that owns prefill/decode (P/D)
request shaping, peer addressing, and handoff discipline — so connectors plug into the P/D
orchestrator without bespoke orchestrator branches. This generalizes the orchestrator beyond a
single hard-coded policy and is the foundation for connectors (e.g. a request-id-addressed,
push-based one) that need pre-dispatch peer binding and/or a concurrent handoff.

What

BaseConnectorBackend defines an abstract P/D protocol:

- prepare_prefill_request(*, request, peer) and prepare_decode_request(*, request, peer, prefill_response)
  (keyword-only, typed RequestType = ChatCompletionRequest | CompletionRequest).
- Two independent policy flags:
  - requires_peer_binding: when True, the orchestrator selects the prefill replica first
    (choose_replica) and passes its replica_metadata to the backend as peer (pre-dispatch
    addressing); when False, prefill is dispatched via the standard handle path.
  - concurrent_handoff: when True, remote prefill and local decode run concurrently; when
    False, prefill runs to its first chunk before decode starts (sequential).

The flags are independent. Written as (requires_peer_binding, concurrent_handoff), standard
connectors are (False, False); a push-based, request-id-addressed connector is (True, True);
a pull-based one is (True, False).
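As a rough illustration of the protocol described above, the sketch below models the two abstract prepare_* methods and the two policy flags. It is not the code from this PR: the class names BaseConnectorBackendSketch and PushAddressedBackendSketch, the describe_policy helper, and the use of plain dicts in place of ChatCompletionRequest / CompletionRequest are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional

# Assumption: plain dicts stand in for ChatCompletionRequest | CompletionRequest.
Request = Dict[str, Any]


class BaseConnectorBackendSketch(ABC):
    """Hypothetical stand-in for BaseConnectorBackend, not the real Ray class."""

    # Policy flags; the standard policy is (False, False).
    requires_peer_binding: bool = False
    concurrent_handoff: bool = False

    @abstractmethod
    def prepare_prefill_request(self, *, request: Request, peer: Optional[Any]) -> Request:
        """Shape the request sent to the prefill replica."""

    @abstractmethod
    def prepare_decode_request(
        self, *, request: Request, peer: Optional[Any], prefill_response: Optional[Request]
    ) -> Request:
        """Shape the request sent to the decode replica."""


class PushAddressedBackendSketch(BaseConnectorBackendSketch):
    """Hypothetical push-based, request-id-addressed connector: (True, True)."""

    requires_peer_binding = True
    concurrent_handoff = True

    def prepare_prefill_request(self, *, request, peer):
        # Pre-dispatch addressing: the chosen peer is known before dispatch.
        return {**request, "peer": peer}

    def prepare_decode_request(self, *, request, peer, prefill_response):
        # In concurrent mode, prefill_response may still be None here.
        return {**request, "peer": peer}


def describe_policy(backend: BaseConnectorBackendSketch) -> str:
    """Summarize which orchestration path the flags would select."""
    addressing = (
        "pre-dispatch peer binding" if backend.requires_peer_binding
        else "standard handle path"
    )
    handoff = "concurrent" if backend.concurrent_handoff else "sequential"
    return f"{addressing}, {handoff}"
```

Because both prepare_* methods are abstract, instantiating the base class directly raises TypeError, which matches the "abstract base can't be instantiated" check listed under Testing.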

The standard (no-peer, sequential) policy lives in a shared DefaultPDProtocolMixin:

- NIXL and LMCache backends inherit it (migrated onto the interface).
- Multi connector backend delegates prepare_* and the policy flags to its top-most
  sub-connector, so that sub-connector's policy governs the group.
- A connector with a different protocol overrides prepare_* and opts into the flags.
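A minimal sketch of how a shared default mixin with the prefill_response=None guard might look. The class names and the specific shaping fields (max_tokens, kv_transfer_params, do_remote_decode) are assumptions chosen for illustration; the PR description does not spell out the exact shaping, and plain dicts stand in for the request types.

```python
from typing import Any, Dict, Optional

# Assumption: plain dicts stand in for the real request and response types.
Request = Dict[str, Any]


class DefaultPDProtocolMixinSketch:
    """Hypothetical stand-in for the standard (no-peer, sequential) policy."""

    requires_peer_binding = False
    concurrent_handoff = False

    def prepare_prefill_request(self, *, request: Request, peer: Optional[Any]) -> Request:
        # Assumed shaping: copy the request and limit it to a prefill-only pass.
        shaped = dict(request)
        shaped["max_tokens"] = 1
        shaped["kv_transfer_params"] = {"do_remote_decode": True}
        return shaped

    def prepare_decode_request(
        self, *, request: Request, peer: Optional[Any], prefill_response: Optional[Request]
    ) -> Request:
        shaped = dict(request)
        # None guard: in concurrent mode no prefill response exists yet,
        # so there are no transfer parameters to forward.
        if prefill_response is not None:
            shaped["kv_transfer_params"] = prefill_response.get("kv_transfer_params")
        return shaped


class NixlBackendSketch(DefaultPDProtocolMixinSketch):
    """A standard connector inherits the default shaping unchanged."""
```

A connector with a different protocol would skip the mixin, override both prepare_* methods, and set the flags it needs.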

PDOrchestratorMixin resolves the backend once (cached) via _get_connector_backend() and routes
all connectors' request shaping + handoff through it. The concurrent handoff is a single
_concurrent_decode helper that always drains the remote prefill and cancels it if local decode
doesn't complete (no leaked background prefill).

Stack

Part of a series enabling new KV connectors for Ray Serve LLM P/D; pairs with the per-replica
metadata hook (exposes ReplicaSelection.replica_metadata for pre-dispatch peer binding) and the
engine request-id PR. A MoRIIO connector backend (request-id-addressed) builds on this in a follow-up.

Testing

test_pd_protocol.py: abstract base can't be instantiated; NIXL/LMCache/default expose the
default-mixin shaping (incl. the prefill_response=None guard for concurrent mode); Multi
delegates prepare_*/flags to its top-most sub-connector.
test_prefill_decode_disagg.py / test_factory.py: orchestration routes prefill/decode shaping
through the resolved backend; concurrent-handoff cancels the background prefill on decode failure;
factory resolution + fallback.

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request refactors the Prefill/Decode (P/D) orchestration flow to delegate request shaping, peer addressing, and handoff discipline to a resolved KV-connector backend, supporting concurrent handoffs and pre-dispatch peer binding. The review feedback highlights a potential AttributeError when prefill_response is None during concurrent handoffs, resource leaks due to uncancelled background prefill tasks when local decode fails or is cancelled, and fragile static calls to backend instance methods with a None self-reference.

Refactors PDOrchestratorMixin to delegate request shaping, peer addressing, and handoff discipline to BaseConnectorBackend (requires_peer_binding, concurrent_handoff, prepare_prefill_request, prepare_decode_request). The defaults reproduce the existing NIXL/default flow exactly; connectors that need pre-dispatch peer binding (e.g. request-id-addressed transfers) can opt into choose_replica + concurrent handoff without new orchestrator concepts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

kouroshHakha

This PR only half-delivers what the description outlines. We need to migrate the existing NIXL connector implementation to the new interface in the same PR and add end-to-end tests. Basically, the NIXL, LMCache, and Multi connectors should all use the new interface to define the prefill and decode requests, with handler routing going through the connector backend implementation.

…he/Multi

- Make BaseConnectorBackend.prepare_prefill/decode_request abstract; add a shared DefaultPDProtocolMixin with the standard (no-peer, sequential) policy.
- NIXL, LMCache, and Multi connector backends now use the interface via the mixin.
- Guard prepare_decode_request against a None prefill_response (concurrent mode).
- Cancel the background prefill task if local decode fails/cancels (no leak).
- Keyword-only, typed prepare_* signatures; Optional[BaseConnectorBackend] return type.
- Connector-agnostic protocol docs; keep requires_peer_binding/concurrent_handoff flags (independent: standard=F,F; push-addressed=T,T; pull-addressed=T,F).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

kouroshHakha · 2026-06-09T23:54:31Z

Addressed in b2fd6e5 — the existing connectors are now migrated onto the interface in this PR:

BaseConnectorBackend.prepare_prefill_request / prepare_decode_request are now abstract, and the standard (no-peer, sequential) policy lives in a shared DefaultPDProtocolMixin.
NIXL, LMCache, and Multi connector backends all inherit DefaultPDProtocolMixin, so request shaping for every connector now flows through the connector-backend interface (the orchestrator resolves the backend via _get_connector_backend() and routes prefill/decode shaping through it). Multi uses the default policy, with a note that a future custom-shaping sub-connector should delegate to that sub-connector's backend.
A request-id-addressed connector (the follow-up MoRIIO PR) overrides prepare_* and opts into requires_peer_binding/concurrent_handoff.

Plus the gemini-flagged robustness fixes (None-guard on prefill_response, cancel the background prefill task if decode fails/cancels, no more self=None calls) and the typing/kwargs/comment cleanups.

Tests (test_pd_protocol.py + updated test_prefill_decode_disagg.py / test_factory.py): assert the abstract base can't be instantiated, the mixin's shaping (incl. the None-guard), that all four backends (NIXL/LMCache/Multi/Default) route through the interface, the NIXL-backed prefill→decode shaping end-to-end on a mock engine, and the concurrent cancel-on-failure path. CI (premerge go) will exercise the full prefill-decode-disagg suite on a clean checkout.

kouroshHakha

leaving some comments.

…ng, helper

- MultiConnectorBackend delegates prepare_*/flags to its top-most sub-connector (rather than inheriting the default mixin), so a sub-connector's policy governs.
- Cache the resolved connector backend on the server (no per-request factory call).
- Extract the concurrent prefill+decode handoff into a _concurrent_decode helper (dedupes the two paths; cancels the background prefill if decode doesn't finish).
- Inline peer=None in the default path; drop the git-history comment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

Reviewed by Cursor Bugbot for commit e3ffc5a.

cursor · 2026-06-10T01:20:00Z

+                    "on the LLMConfig and no kv_transfer_config.kv_connector is "
+                    "configured."
+                )
+            backend = KVConnectorBackendFactory.create_backend(kv_connector, llm_config)


Factory fallback skips backend setup

Medium Severity

When _get_connector_backend falls back to KVConnectorBackendFactory.create_backend, it never calls setup() on the new instance. For MultiConnectorBackend, prepare_* delegates to sub-connectors populated only in setup(), so the first P/D request on that path raises ValueError instead of shaping traffic.
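One possible shape for a fix, sketched under assumptions: the resolver below caches the backend once and calls setup() only on instances it created through the fallback factory. BackendResolverSketch, BackendSketch, and the engine_backend parameter (a backend assumed to be already set up elsewhere) are hypothetical names; only setup() and the idea of a factory fallback come from the comment above.

```python
from typing import Callable, Optional


class BackendSketch:
    """Hypothetical backend whose sub-connectors are populated only in setup()."""

    def __init__(self) -> None:
        self.is_set_up = False

    def setup(self) -> None:
        self.is_set_up = True


class BackendResolverSketch:
    """Resolve the connector backend once and cache it."""

    def __init__(
        self,
        create_backend: Callable[[], BackendSketch],
        engine_backend: Optional[BackendSketch] = None,
    ) -> None:
        self._create_backend = create_backend
        # Assumption: a backend handed in here has already been set up.
        self._engine_backend = engine_backend
        self._cached: Optional[BackendSketch] = None

    def get(self) -> BackendSketch:
        if self._cached is None:
            backend = self._engine_backend
            if backend is None:
                # Fallback path: a freshly created backend must be set up
                # before its prepare_* methods are used.
                backend = self._create_backend()
                backend.setup()
            self._cached = backend
        return self._cached
```

Calling setup() inside the fallback branch keeps the "resolve once, cached" behavior while ensuring the first P/D request never sees a backend without sub-connectors.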

cursor · 2026-06-10T01:20:00Z

+        return bool(self._connector_backends) and self._primary.concurrent_handoff
+
+    def prepare_prefill_request(self, *, request, peer):
+        return self._primary.prepare_prefill_request(request=request, peer=peer)


Empty MultiConnector crashes prepare calls

Medium Severity

If setup() runs with an empty connectors list, requires_peer_binding and concurrent_handoff read as false, yet prepare_prefill_request / prepare_decode_request still call _primary and raise ValueError. Previously the orchestrator applied default P/D shaping regardless of Multi configuration.
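One way to address this, sketched under assumptions: let the Multi backend fall back to default shaping when it has no sub-connectors, instead of raising. The names _primary and _connector_backends appear in the diff above; everything else (the Sketch classes, the "shaped_by" marker, treating index 0 as the top-most sub-connector, and the fallback itself) is hypothetical and not necessarily what the PR ended up doing.

```python
from typing import Any, Dict, List, Optional

Request = Dict[str, Any]  # assumption: dicts stand in for the real request types


class DefaultShapingSketch:
    """Hypothetical default (no-peer, sequential) policy."""

    requires_peer_binding = False
    concurrent_handoff = False

    def prepare_prefill_request(self, *, request: Request, peer: Optional[Any]) -> Request:
        return {**request, "shaped_by": "default"}

    def prepare_decode_request(self, *, request, peer, prefill_response) -> Request:
        return {**request, "shaped_by": "default"}


class PushShapingSketch(DefaultShapingSketch):
    """Hypothetical sub-connector with its own policy: (True, True)."""

    requires_peer_binding = True
    concurrent_handoff = True

    def prepare_prefill_request(self, *, request, peer):
        return {**request, "shaped_by": "push", "peer": peer}


class MultiConnectorBackendSketch:
    def __init__(self, connector_backends: List[Any]) -> None:
        self._connector_backends = list(connector_backends)
        self._default = DefaultShapingSketch()

    @property
    def _primary(self):
        # Assumption: the top-most sub-connector is the first in the list.
        # With no sub-connectors, fall back to default shaping rather than raise.
        if self._connector_backends:
            return self._connector_backends[0]
        return self._default

    @property
    def requires_peer_binding(self) -> bool:
        return self._primary.requires_peer_binding

    @property
    def concurrent_handoff(self) -> bool:
        return self._primary.concurrent_handoff

    def prepare_prefill_request(self, *, request, peer):
        return self._primary.prepare_prefill_request(request=request, peer=peer)

    def prepare_decode_request(self, *, request, peer, prefill_response):
        return self._primary.prepare_decode_request(
            request=request, peer=peer, prefill_response=prefill_response
        )
```

With this arrangement the flags and the prepare_* methods always consult the same object, so an empty configuration cannot report (False, False) and then fail during shaping.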

kouroshHakha added the "go" label (add ONLY when ready to merge, run all tests) Jun 9, 2026

gemini-code-assist Bot reviewed Jun 9, 2026

kouroshHakha force-pushed the mori/03-pd-connector-protocol branch 2 times, most recently from ff515c3 to e658122 Compare June 9, 2026 06:14

kouroshHakha force-pushed the mori/03-pd-connector-protocol branch from e658122 to beb5436 Compare June 9, 2026 06:16

kouroshHakha mentioned this pull request Jun 9, 2026

[serve.llm] Add MoRIIO KV-connector backend for prefill/decode #63951

Draft

kouroshHakha commented Jun 9, 2026

kouroshHakha commented Jun 10, 2026

kouroshHakha force-pushed the mori/03-pd-connector-protocol branch from ab4e759 to e3ffc5a Compare June 10, 2026 01:13

kouroshHakha marked this pull request as ready for review June 10, 2026 01:17

kouroshHakha requested a review from a team as a code owner June 10, 2026 01:17

cursor Bot reviewed Jun 10, 2026

ray-gardener Bot added the "serve" label (Ray Serve Related Issue) Jun 10, 2026
