
[serve.llm] Delegate P/D orchestration to the KV-connector backend #63950

Open
kouroshHakha wants to merge 3 commits into ray-project:master from kouroshHakha:mori/03-pd-connector-protocol


@kouroshHakha kouroshHakha commented Jun 9, 2026

Contributor

Why

Make the KV-transfer connector backend the single place that owns prefill/decode (P/D)
request shaping, peer addressing, and handoff discipline — so connectors plug into the P/D
orchestrator without bespoke orchestrator branches. This generalizes the orchestrator beyond a
single hard-coded policy and is the foundation for connectors (e.g. a request-id-addressed,
push-based one) that need pre-dispatch peer binding and/or a concurrent handoff.

What

BaseConnectorBackend defines an abstract P/D protocol:

  • prepare_prefill_request(*, request, peer) and prepare_decode_request(*, request, peer, prefill_response)
    (keyword-only, typed RequestType = ChatCompletionRequest | CompletionRequest).
  • Two independent policy flags:
    • requires_peer_binding — when True, the orchestrator selects the prefill replica first
      (choose_replica) and passes its replica_metadata to the backend as peer (pre-dispatch
      addressing); when False, prefill is dispatched via the standard handle path.
    • concurrent_handoff — when True, remote prefill and local decode run concurrently; when
      False, prefill runs to its first chunk before decode starts (sequential).
    • These are independent: standard connectors are (False, False); a push-based,
      request-id-addressed connector is (True, True); a pull-based one is (True, False).
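The protocol above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ConnectorBackendSketch`, `PushAddressedSketch`, and `describe_policy` are hypothetical names, and plain dicts stand in for `ChatCompletionRequest | CompletionRequest`.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional

# Assumption: a plain dict stands in for the real request types.
Request = Dict[str, Any]


class ConnectorBackendSketch(ABC):
    """Hypothetical stand-in for the abstract P/D protocol."""

    # Two independent policy flags; the standard policy is (False, False).
    requires_peer_binding: bool = False
    concurrent_handoff: bool = False

    @abstractmethod
    def prepare_prefill_request(
        self, *, request: Request, peer: Optional[Any]
    ) -> Request:
        ...

    @abstractmethod
    def prepare_decode_request(
        self,
        *,
        request: Request,
        peer: Optional[Any],
        prefill_response: Optional[Request],
    ) -> Request:
        ...


class PushAddressedSketch(ConnectorBackendSketch):
    """A (True, True) connector: peer-bound and concurrent."""

    requires_peer_binding = True
    concurrent_handoff = True

    def prepare_prefill_request(self, *, request, peer):
        # Illustrative shaping only: attach the pre-selected peer.
        return {**request, "peer": peer}

    def prepare_decode_request(self, *, request, peer, prefill_response):
        return {**request, "peer": peer}


def describe_policy(backend: ConnectorBackendSketch) -> str:
    """Summarize which of the four flag combinations a backend uses."""
    addressing = "peer-bound" if backend.requires_peer_binding else "handle-dispatched"
    handoff = "concurrent" if backend.concurrent_handoff else "sequential"
    return f"{addressing}, {handoff}"
```

Because both `prepare_*` methods are abstract, instantiating `ConnectorBackendSketch` directly raises `TypeError`, which mirrors the "abstract base can't be instantiated" test described under Testing.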

The standard (no-peer, sequential) policy lives in a shared DefaultPDProtocolMixin:

  • NIXL and LMCache backends inherit it (migrated onto the interface).
  • Multi connector backend delegates prepare_* and the policy flags to its top-most
    sub-connector, so that sub-connector's policy governs the group.
  • A connector with a different protocol overrides prepare_* and opts into the flags.
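A rough sketch of how the default policy and the Multi delegation could fit together. All class names here are hypothetical, and the shaping details (`max_tokens`, `kv_transfer_params`) and the choice of "first in the list" as the top-most sub-connector are assumptions made for illustration, not taken from the PR.

```python
from typing import Any, Dict, List, Optional

Request = Dict[str, Any]  # assumption: dicts stand in for the request types


class DefaultPolicySketch:
    """Hypothetical stand-in for the shared (no-peer, sequential) mixin."""

    requires_peer_binding = False
    concurrent_handoff = False

    def prepare_prefill_request(self, *, request: Request, peer: Any = None) -> Request:
        # Assumption: prefill only needs to produce a single token.
        return {**request, "max_tokens": 1}

    def prepare_decode_request(
        self,
        *,
        request: Request,
        peer: Any = None,
        prefill_response: Optional[Request] = None,
    ) -> Request:
        shaped = dict(request)
        # Guard: in concurrent mode there is no prefill response yet.
        if prefill_response is not None:
            shaped["kv_transfer_params"] = prefill_response.get("kv_transfer_params")
        return shaped


class PeerBoundSketch(DefaultPolicySketch):
    """A connector that overrides shaping and opts into both flags."""

    requires_peer_binding = True
    concurrent_handoff = True

    def prepare_prefill_request(self, *, request, peer=None):
        return {**request, "peer": peer}


class MultiSketch:
    """Delegates shaping and flags to its top-most sub-connector."""

    def __init__(self, sub_connectors: List[Any]):
        self._subs = list(sub_connectors)

    @property
    def _primary(self):
        if not self._subs:
            raise ValueError("no sub-connectors configured")
        return self._subs[0]  # assumption: top-most means first in the list

    @property
    def requires_peer_binding(self) -> bool:
        return bool(self._subs) and self._primary.requires_peer_binding

    @property
    def concurrent_handoff(self) -> bool:
        return bool(self._subs) and self._primary.concurrent_handoff

    def prepare_prefill_request(self, *, request, peer):
        return self._primary.prepare_prefill_request(request=request, peer=peer)

    def prepare_decode_request(self, *, request, peer, prefill_response):
        return self._primary.prepare_decode_request(
            request=request, peer=peer, prefill_response=prefill_response
        )
```

With `MultiSketch([PeerBoundSketch(), DefaultPolicySketch()])`, the group reports the first sub-connector's flags and uses its shaping, so that sub-connector's policy governs the group.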

PDOrchestratorMixin resolves the backend once (cached) via _get_connector_backend() and routes
all connectors' request shaping + handoff through it. The concurrent handoff is a single
_concurrent_decode helper that always drains the remote prefill and cancels it if local decode
doesn't complete (no leaked background prefill).
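The drain-or-cancel discipline can be sketched with plain `asyncio`. This is only an illustration of the idea under stated assumptions: `concurrent_decode_sketch` and `_drain` are hypothetical names, and both the remote prefill and the local decode are modeled as async iterators rather than Serve handles.

```python
import asyncio
import contextlib
from typing import Any, AsyncIterator


async def _drain(stream: AsyncIterator[Any]) -> None:
    """Consume a stream to completion, discarding its chunks."""
    async for _ in stream:
        pass


async def concurrent_decode_sketch(
    prefill_stream: AsyncIterator[Any],
    decode_stream: AsyncIterator[Any],
) -> AsyncIterator[Any]:
    """Yield decode chunks while the prefill drains in the background."""
    prefill_task = asyncio.create_task(_drain(prefill_stream))
    completed = False
    try:
        async for chunk in decode_stream:
            yield chunk
        completed = True
    finally:
        if completed:
            # Decode finished: wait for the prefill to finish draining.
            await prefill_task
        else:
            # Decode failed or was abandoned: do not leak the prefill.
            prefill_task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await prefill_task
```

The `finally` block runs whether decode raises, is cancelled, or the consumer stops iterating early, so the background task is either awaited or cancelled on every path. A production version would likely also need to distinguish cancellation of the caller from cancellation of the prefill task, which this sketch does not attempt.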

Stack

Part of a series enabling new KV connectors for Ray Serve LLM P/D; pairs with the per-replica
metadata hook (exposes ReplicaSelection.replica_metadata for pre-dispatch peer binding) and the
engine request-id PR. A MoRIIO connector backend (request-id-addressed) builds on this in a follow-up.

Testing

  • test_pd_protocol.py: abstract base can't be instantiated; NIXL/LMCache/default expose the
    default-mixin shaping (incl. the prefill_response=None guard for concurrent mode); Multi
    delegates prepare_*/flags to its top-most sub-connector.
  • test_prefill_decode_disagg.py / test_factory.py: orchestration routes prefill/decode shaping
    through the resolved backend; concurrent-handoff cancels the background prefill on decode failure;
    factory resolution + fallback.

🤖 Generated with Claude Code

@kouroshHakha kouroshHakha added the go label (add ONLY when ready to merge, run all tests) Jun 9, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request refactors the Prefill/Decode (P/D) orchestration flow to delegate request shaping, peer addressing, and handoff discipline to a resolved KV-connector backend, supporting concurrent handoffs and pre-dispatch peer binding. The review feedback highlights a potential AttributeError when prefill_response is None during concurrent handoffs, resource leaks due to uncancelled background prefill tasks when local decode fails or is cancelled, and fragile static calls to backend instance methods with a None self-reference.


Comment thread python/ray/llm/_internal/serve/engines/vllm/kv_transfer/base.py
Comment thread python/ray/llm/_internal/serve/serving_patterns/prefill_decode/pd_server.py Outdated
Comment thread python/ray/llm/_internal/serve/serving_patterns/prefill_decode/pd_server.py Outdated
Comment thread python/ray/llm/_internal/serve/serving_patterns/prefill_decode/pd_server.py Outdated
@kouroshHakha kouroshHakha force-pushed the mori/03-pd-connector-protocol branch 2 times, most recently from ff515c3 to e658122, June 9, 2026 06:14
Refactors PDOrchestratorMixin to delegate request shaping, peer addressing,
and handoff discipline to BaseConnectorBackend (requires_peer_binding,
concurrent_handoff, prepare_prefill_request, prepare_decode_request). The
defaults reproduce the existing NIXL/default flow exactly; connectors that
need pre-dispatch peer binding (e.g. request-id-addressed transfers) can opt
into choose_replica + concurrent handoff without new orchestrator concepts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

@kouroshHakha kouroshHakha left a comment

Contributor Author

This PR does a half-baked job of what is outlined in the description. We need to migrate the existing NIXL connector implementation to this new interface in the same PR and add end-to-end tests. Basically, we want the NIXL, LMCache, and Multi connectors to all use the new interface for defining the prefill and decode requests and for routing handlers through the connector backend implementation.

Comment thread python/ray/llm/_internal/serve/core/configs/llm_config.py Outdated
Comment thread python/ray/llm/_internal/serve/engines/vllm/kv_transfer/base.py
Comment thread python/ray/llm/_internal/serve/engines/vllm/kv_transfer/base.py Outdated
Comment thread python/ray/llm/_internal/serve/engines/vllm/kv_transfer/base.py Outdated
Comment thread python/ray/llm/_internal/serve/engines/vllm/kv_transfer/base.py Outdated
Comment thread python/ray/llm/_internal/serve/engines/vllm/kv_transfer/base.py
…he/Multi

- Make BaseConnectorBackend.prepare_prefill/decode_request abstract; add a
  shared DefaultPDProtocolMixin with the standard (no-peer, sequential) policy.
- NIXL, LMCache, and Multi connector backends now use the interface via the mixin.
- Guard prepare_decode_request against a None prefill_response (concurrent mode).
- Cancel the background prefill task if local decode fails/cancels (no leak).
- Keyword-only, typed prepare_* signatures; Optional[BaseConnectorBackend] return type.
- Connector-agnostic protocol docs; keep requires_peer_binding/concurrent_handoff
  flags (independent: standard=F,F; push-addressed=T,T; pull-addressed=T,F).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha

Contributor Author

Addressed in b2fd6e5 — the existing connectors are now migrated onto the interface in this PR:

  • BaseConnectorBackend.prepare_prefill_request / prepare_decode_request are now abstract, and the standard (no-peer, sequential) policy lives in a shared DefaultPDProtocolMixin.
  • NIXL, LMCache, and Multi connector backends all inherit DefaultPDProtocolMixin, so request shaping for every connector now flows through the connector-backend interface (the orchestrator resolves the backend via _get_connector_backend() and routes prefill/decode shaping through it). Multi uses the default policy, with a note that a future custom-shaping sub-connector should delegate to that sub-connector's backend.
  • A request-id-addressed connector (the follow-up MoRIIO PR) overrides prepare_* and opts into requires_peer_binding/concurrent_handoff.

Plus the gemini-flagged robustness fixes (None-guard on prefill_response, cancel the background prefill task if decode fails/cancels, no more self=None calls) and the typing/kwargs/comment cleanups.

Tests (test_pd_protocol.py + updated test_prefill_decode_disagg.py / test_factory.py): assert the abstract base can't be instantiated, the mixin's shaping (incl. the None-guard), that all four backends (NIXL/LMCache/Multi/Default) route through the interface, the NIXL-backed prefill→decode shaping end-to-end on a mock engine, and the concurrent cancel-on-failure path. CI (premerge go) will exercise the full prefill-decode-disagg suite on a clean checkout.

@kouroshHakha kouroshHakha left a comment

Contributor Author

Leaving some comments.

Comment thread python/ray/llm/_internal/serve/engines/vllm/kv_transfer/multi_connector.py Outdated
Comment thread python/ray/llm/_internal/serve/serving_patterns/prefill_decode/pd_server.py Outdated
Comment thread python/ray/llm/_internal/serve/serving_patterns/prefill_decode/pd_server.py Outdated
Comment thread python/ray/llm/_internal/serve/serving_patterns/prefill_decode/pd_server.py Outdated
…ng, helper

- MultiConnectorBackend delegates prepare_*/flags to its top-most sub-connector
  (rather than inheriting the default mixin), so a sub-connector's policy governs.
- Cache the resolved connector backend on the server (no per-request factory call).
- Extract the concurrent prefill+decode handoff into a _concurrent_decode helper
  (dedupes the two paths; cancels the background prefill if decode doesn't finish).
- Inline peer=None in the default path; drop the git-history comment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha kouroshHakha force-pushed the mori/03-pd-connector-protocol branch from ab4e759 to e3ffc5a, June 10, 2026 01:13
@kouroshHakha kouroshHakha marked this pull request as ready for review June 10, 2026 01:17
@kouroshHakha kouroshHakha requested a review from a team as a code owner June 10, 2026 01:17

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.


Reviewed by Cursor Bugbot for commit e3ffc5a.

"on the LLMConfig and no kv_transfer_config.kv_connector is "
"configured."
)
backend = KVConnectorBackendFactory.create_backend(kv_connector, llm_config)


Factory fallback skips backend setup

Medium Severity

When _get_connector_backend falls back to KVConnectorBackendFactory.create_backend, it never calls setup() on the new instance. For MultiConnectorBackend, prepare_* delegates to sub-connectors populated only in setup(), so the first P/D request on that path raises ValueError instead of shaping traffic.
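One way to address this could look like the sketch below, assuming it is safe to call `setup()` on a freshly created backend from the serving side (the thread does not confirm this). `BackendSketch`, `ServerSketch`, and the `factory` callable are hypothetical stand-ins, not the real `KVConnectorBackendFactory` API.

```python
from typing import Callable, Optional


class BackendSketch:
    """Hypothetical backend whose sub-state is only populated in setup()."""

    def __init__(self) -> None:
        self.is_set_up = False

    def setup(self) -> None:
        self.is_set_up = True


class ServerSketch:
    """Resolves the backend once and caches it for later requests."""

    def __init__(
        self,
        engine_backend: Optional[BackendSketch],
        factory: Callable[[], BackendSketch],
    ) -> None:
        self._engine_backend = engine_backend
        self._factory = factory
        self._cached: Optional[BackendSketch] = None

    def get_connector_backend(self) -> BackendSketch:
        if self._cached is not None:
            return self._cached
        backend = self._engine_backend
        if backend is None:
            # Fallback path: build a new instance and initialize it
            # before anything calls its prepare_* methods.
            backend = self._factory()
            backend.setup()
        self._cached = backend
        return backend
```

The point of the sketch is only the ordering: the fallback instance is set up before it is cached, so the first P/D request never sees a half-initialized backend.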


    return bool(self._connector_backends) and self._primary.concurrent_handoff

def prepare_prefill_request(self, *, request, peer):
    return self._primary.prepare_prefill_request(request=request, peer=peer)


Empty MultiConnector crashes prepare calls

Medium Severity

If setup() runs with an empty connectors list, requires_peer_binding and concurrent_handoff read as false, yet prepare_prefill_request / prepare_decode_request still call _primary and raise ValueError. Previously the orchestrator applied default P/D shaping regardless of Multi configuration.
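A possible shape for restoring the earlier behavior is to fall back to default shaping when no sub-connectors are configured. This is a sketch under assumptions: `DefaultShapingSketch` and `MultiWithFallbackSketch` are hypothetical names, and the `shaped_by` key exists only to make the chosen path visible.

```python
from typing import Any, Dict, List

Request = Dict[str, Any]  # assumption: dicts stand in for the request types


class DefaultShapingSketch:
    """Hypothetical default (no-peer, sequential) shaping."""

    requires_peer_binding = False
    concurrent_handoff = False

    def prepare_prefill_request(self, *, request: Request, peer: Any = None) -> Request:
        return {**request, "shaped_by": "default"}


class MultiWithFallbackSketch:
    """Delegates to the first sub-connector, or to the default if empty."""

    def __init__(self, sub_connectors: List[Any]) -> None:
        self._subs = list(sub_connectors)
        self._fallback = DefaultShapingSketch()

    @property
    def _primary(self) -> Any:
        # Never raises: an empty list resolves to the default policy.
        return self._subs[0] if self._subs else self._fallback

    @property
    def requires_peer_binding(self) -> bool:
        return self._primary.requires_peer_binding

    @property
    def concurrent_handoff(self) -> bool:
        return self._primary.concurrent_handoff

    def prepare_prefill_request(self, *, request: Request, peer: Any = None) -> Request:
        return self._primary.prepare_prefill_request(request=request, peer=peer)
```

With this arrangement the flags and the `prepare_*` calls agree for an empty configuration: both resolve to the default policy instead of the flags reading false while shaping raises.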


@ray-gardener ray-gardener Bot added the serve label (Ray Serve Related Issue) Jun 10, 2026