[serve.llm] Delegate P/D orchestration to the KV-connector backend#63950
[serve.llm] Delegate P/D orchestration to the KV-connector backend#63950kouroshHakha wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the Prefill/Decode (P/D) orchestration flow to delegate request shaping, peer addressing, and handoff discipline to a resolved KV-connector backend, supporting concurrent handoffs and pre-dispatch peer binding. The review feedback highlights a potential AttributeError when prefill_response is None during concurrent handoffs, resource leaks due to uncancelled background prefill tasks when local decode fails or is cancelled, and fragile static calls to backend instance methods with a None self-reference.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
ff515c3 to
e658122
Compare
Refactors PDOrchestratorMixin to delegate request shaping, peer addressing, and handoff discipline to BaseConnectorBackend (requires_peer_binding, concurrent_handoff, prepare_prefill_request, prepare_decode_request). The defaults reproduce the existing NIXL/default flow exactly; connectors that need pre-dispatch peer binding (e.g. request-id-addressed transfers) can opt into choose_replica + concurrent handoff without new orchestrator concepts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
e658122 to
beb5436
Compare
kouroshHakha
left a comment
There was a problem hiding this comment.
this pr does a half baked job in what is outlined in the description. We need to migrate the existing nixl connector implementation to this new interface in the same pr and do end to end tests. Basically we want the nixl, lmcache and multi connector to all use the new interface for defining the prefill and decode request and routing handlers through the connector backend implementation.
…he/Multi - Make BaseConnectorBackend.prepare_prefill/decode_request abstract; add a shared DefaultPDProtocolMixin with the standard (no-peer, sequential) policy. - NIXL, LMCache, and Multi connector backends now use the interface via the mixin. - Guard prepare_decode_request against a None prefill_response (concurrent mode). - Cancel the background prefill task if local decode fails/cancels (no leak). - Keyword-only, typed prepare_* signatures; Optional[BaseConnectorBackend] return type. - Connector-agnostic protocol docs; keep requires_peer_binding/concurrent_handoff flags (independent: standard=F,F; push-addressed=T,T; pull-addressed=T,F). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
|
Addressed in b2fd6e5 — the existing connectors are now migrated onto the interface in this PR:
Plus the gemini-flagged robustness fixes (None-guard on Tests ( |
kouroshHakha
left a comment
There was a problem hiding this comment.
leaving some comments.
…ng, helper - MultiConnectorBackend delegates prepare_*/flags to its top-most sub-connector (rather than inheriting the default mixin), so a sub-connector's policy governs. - Cache the resolved connector backend on the server (no per-request factory call). - Extract the concurrent prefill+decode handoff into a _concurrent_decode helper (dedupes the two paths; cancels the background prefill if decode doesn't finish). - Inline peer=None in the default path; drop the git-history comment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
ab4e759 to
e3ffc5a
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
Reviewed by Cursor Bugbot for commit e3ffc5a. Configure here.
| "on the LLMConfig and no kv_transfer_config.kv_connector is " | ||
| "configured." | ||
| ) | ||
| backend = KVConnectorBackendFactory.create_backend(kv_connector, llm_config) |
There was a problem hiding this comment.
Factory fallback skips backend setup
Medium Severity
When _get_connector_backend falls back to KVConnectorBackendFactory.create_backend, it never calls setup() on the new instance. For MultiConnectorBackend, prepare_* delegates to sub-connectors populated only in setup(), so the first P/D request on that path raises ValueError instead of shaping traffic.
Reviewed by Cursor Bugbot for commit e3ffc5a. Configure here.
| return bool(self._connector_backends) and self._primary.concurrent_handoff | ||
|
|
||
| def prepare_prefill_request(self, *, request, peer): | ||
| return self._primary.prepare_prefill_request(request=request, peer=peer) |
There was a problem hiding this comment.
Empty MultiConnector crashes prepare calls
Medium Severity
If setup() runs with an empty connectors list, requires_peer_binding and concurrent_handoff read as false, yet prepare_prefill_request / prepare_decode_request still call _primary and raise ValueError. Previously the orchestrator applied default P/D shaping regardless of Multi configuration.
Reviewed by Cursor Bugbot for commit e3ffc5a. Configure here.


Why
Make the KV-transfer connector backend the single place that owns prefill/decode (P/D)
request shaping, peer addressing, and handoff discipline — so connectors plug into the P/D
orchestrator without bespoke orchestrator branches. This generalizes the orchestrator beyond a
single hard-coded policy and is the foundation for connectors (e.g. a request-id-addressed,
push-based one) that need pre-dispatch peer binding and/or a concurrent handoff.
What
BaseConnectorBackenddefines an abstract P/D protocol:prepare_prefill_request(*, request, peer)andprepare_decode_request(*, request, peer, prefill_response)(keyword-only, typed
RequestType = ChatCompletionRequest | CompletionRequest).requires_peer_binding— when True, the orchestrator selects the prefill replica first(
choose_replica) and passes itsreplica_metadatato the backend aspeer(pre-dispatchaddressing); when False, prefill is dispatched via the standard handle path.
concurrent_handoff— when True, remote prefill and local decode run concurrently; whenFalse, prefill runs to its first chunk before decode starts (sequential).
(False, False); a push-based,request-id-addressed connector is
(True, True); a pull-based one is(True, False).The standard (no-peer, sequential) policy lives in a shared
DefaultPDProtocolMixin:prepare_*and the policy flags to its top-mostsub-connector, so that sub-connector's policy governs the group.
prepare_*and opts into the flags.PDOrchestratorMixinresolves the backend once (cached) via_get_connector_backend()and routesall connectors' request shaping + handoff through it. The concurrent handoff is a single
_concurrent_decodehelper that always drains the remote prefill and cancels it if local decodedoesn't complete (no leaked background prefill).
Stack
Part of a series enabling new KV connectors for Ray Serve LLM P/D; pairs with the per-replica
metadata hook (exposes
ReplicaSelection.replica_metadatafor pre-dispatch peer binding) and theengine request-id PR. A MoRIIO connector backend (request-id-addressed) builds on this in a follow-up.
Testing
test_pd_protocol.py: abstract base can't be instantiated; NIXL/LMCache/default expose thedefault-mixin shaping (incl. the
prefill_response=Noneguard for concurrent mode); Multidelegates
prepare_*/flags to its top-most sub-connector.test_prefill_decode_disagg.py/test_factory.py: orchestration routes prefill/decode shapingthrough the resolved backend; concurrent-handoff cancels the background prefill on decode failure;
factory resolution + fallback.
🤖 Generated with Claude Code