Summary
Add a ECCPUConnector that allows a consumer vLLM instance (PD or P node) to fetch multimodal encoder outputs (e.g. vision embeddings) from a remote producer vLLM instance (E node) over NIXL, so the consumer never runs the vision encoder locally.
Motivation
The llm-d disaggregated serving design calls for encoder outputs (vision embeddings) to be transferred directly from an encode node to a prefill/decode node, keyed by mm_hash.
Today no such transfer mechanism exists — the only reference implementation is ExampleECConnector, which uses a shared filesystem and is not suitable for production.
This issue tracks the vLLM-side work to make that design real:
- Consumer never downloads the image. The encode node runs --mm-encoder-only, processes the image once, and stores the embedding in a CPU mmap region. The prefill/decode node receives only the embedding tensor — it never fetches the original image.
- NIXL for the transfer. Instead of a shared filesystem, embeddings are transferred peer-to-peer via NIXL directly between the encoder's and consumer's mmap regions. Similar mechanism used by PD-Disaggregation.
- Minimal integration surface. The connector plugs into the existing ECConnectorBase hook; minimal changes to the model runner or scheduler core.
Design
- Shared: NIXL agent + registered ECSharedRegion mmap on both sides; XferReq/XferAck msgpack wire types; compatibility hash guards version/model/dtype mismatches.
- Producer: binds a ZMQ ROUTER; a background router thread ingests XferReq messages from consumers, pins the requested mmap blocks, posts async NIXL WRITEs from its CPU mmap
region, and relays XferAcks back. Block eviction via FIFO policy (_producer_fifo_alloc) skips blocks pinned by in-flight transfers.
- Consumer: lazy ZMQ DEALER pool per (peer_host, peer_port); a _loaded mmap cache keeps completed transfer blocks alive for local re-copy on subsequent requests, avoiding
producer round-trips. FIFO eviction via _consumer_fifo_alloc under allocation pressure, with _pending_reload protecting blocks committed to the current step.
End-to-end architecture
┌──────────────────────────────────────┐
│ External caller (orchestrator + │
│ request router, out of scope here) │
│ │
│ Duties (see §3.6): │
│ • download URL images to b64 │
│ • dispatch encode to producer │
│ with max_tokens=1 │
│ • parse response's │
│ ec_transfer_params │
│ • populate consumer request's │
│ sampling_params.extra_args │
│ ["ec_transfer_params"] │
└─┬──▲────────────────────────┬────────┘
│ │ │
HTTP │ │ HTTP response body │ HTTP req w/
req │ │ carries top-level │ ec_transfer_params
(OpenAI│ │ ec_transfer_params │ in extra_args
+ b64 │ │ │
+ │ │ │
max=1)│ │ │
▼ │ ▼
┌──────────────┬──┴──────────┐ ┌──────────────────────────┐
│ Producer vLLM │ │ Consumer vLLM │
│ ec_role=ec_producer │ │ ec_role=ec_consumer │
│ mm_encoder_only=true │ │ │
│ │ │ │
│ Scheduler │ │ Scheduler │
│ ├─ ECSharedRegion │ │ ├─ ECSharedRegion │
│ │ (mmap, NIXL-reg, │ │ │ (mmap, NIXL-reg, │
│ │ alloc/free/pin) │ │ │ alloc/free) │
│ ├─ _local_encodings: │ │ ├─ _remote_encodings: │
│ │ mm_hash → │ │ │ mm_hash → │
│ │ block_indices │ │ │ block_indices │
│ ├─ NIXL agent │ │ ├─ _ready: arrived │
│ ├─ ZMQ ROUTER │ │ │ mm_hashes │
│ │ (VLLM_EC_SIDE_CHANNEL) │ │ ├─ NIXL agent │
│ ├─ router thread │ │ ├─ peer pool: │
│ └─ in-flight xfers │ │ │ (host,port) → │
│ │ │ │ (DEALER,agent,md) │
│ Worker │ │ └─ ensure_cache_ │
│ └─ save_caches: │ │ available(request) │
│ GPU → mmap │ │ │
│ at block_indices │ │ Worker │
│ from metadata │ │ └─ start_load_caches: │
│ │ │ mmap → GPU │
│ request_finished() emits │ │ at block_indices │
│ ec_transfer_params │ │ from metadata │
│ in response body │ │ │
└──────────┬─────────────────┘ └──────────┬───────────────┘
│ │
│ ROUTER ← ZMQ DEALER │
│ XferReq / XferAck │
│ ◄───────────────────────────── │
│ │
│ NIXL WRITE │
│ mmap → remote mmap │
│ ──────────────────────────────►│
└────────────────────────────────►│
Cold-path lifecycle (first-time encoding)
- Client sends an OpenAI
/v1/chat/completions with image_url content to the external caller.
- The caller downloads URL images (if any) and inlines as base64 — consumer vLLM never fetches.
- The caller dispatches a normal
POST /v1/chat/completions to a producer vLLM pod with the image and max_tokens: 1. The producer runs with --mm-encoder-only so only the vision encoder executes (§4.7).
- Producer scheduler's
build_connector_meta derives the encoding size from mm_position.length × hidden_dim × element_size, allocates the matching block count in its ECSharedRegion, and passes (mm_hash → block_indices) to the worker in ECCPUConnectorMetadata.
- Producer worker runs the encoder. Inside
save_caches it copies encoder_cache[mm_hash] → mmap[block_indices] and calls torch.cuda.current_stream().synchronize() before returning, so the bytes are coherent for cross-process reads.
- Producer's response body includes top-level
ec_transfer_params = {mm_hash: {peer_host, peer_port, size_bytes, nixl_agent_metadata_b64}}. The caller parses it.
- The caller forwards the original OpenAI request to a consumer vLLM pod with
sampling_params.extra_args["ec_transfer_params"] populated from step 6.
- Consumer's scheduler calls
ensure_cache_available(request): allocates blocks in the local ECSharedRegion, resolves the peer via the pool, sends XferReq over DEALER, returns True (request deferred).
- Producer scheduler's router thread receives the
XferReq, validates mm_hash presence in _local_encodings, mints xfer_id = uuid4(), pins the mm_hash's block indices in the region, and posts a NIXL WRITE from mmap[block_indices] → consumer's destination blocks.
- When
check_xfer_state(handle) == DONE, producer scheduler sends XferAck(ok=True) back over the ROUTER, releases the handle, and unpins the blocks.
- Consumer scheduler's
_drain_acks runs on-demand from both ensure_cache_available and build_connector_meta — consumes the ack, moves the mm_hash to _ready. ensure_cache_available stops deferring the request; build_connector_meta adds the mm_hash to ECCPUConnectorMetadata.loads and moves the blocks into _loaded (a local mmap cache). The consumer worker's start_load_caches reads mmap[block_indices] → encoder_cache[mm_hash] at the start of its step. Subsequent requests for the same mm_hash are re-served with a local mmap→GPU re-copy via _pending_reload, bypassing the producer. Blocks are only freed when evicted under allocation pressure by _consumer_lru_alloc.
Dependencies (vLLM-side PRs)
Summary
Add a ECCPUConnector that allows a consumer vLLM instance (PD or P node) to fetch multimodal encoder outputs (e.g. vision embeddings) from a remote producer vLLM instance (E node) over NIXL, so the consumer never runs the vision encoder locally.
Motivation
The llm-d disaggregated serving design calls for encoder outputs (vision embeddings) to be transferred directly from an encode node to a prefill/decode node, keyed by mm_hash.
Today no such transfer mechanism exists — the only reference implementation is ExampleECConnector, which uses a shared filesystem and is not suitable for production.
This issue tracks the vLLM-side work to make that design real:
Design
region, and relays XferAcks back. Block eviction via FIFO policy (_producer_fifo_alloc) skips blocks pinned by in-flight transfers.
producer round-trips. FIFO eviction via _consumer_fifo_alloc under allocation pressure, with _pending_reload protecting blocks committed to the current step.
End-to-end architecture
Cold-path lifecycle (first-time encoding)
/v1/chat/completionswithimage_urlcontent to the external caller.POST /v1/chat/completionsto a producer vLLM pod with the image andmax_tokens: 1. The producer runs with--mm-encoder-onlyso only the vision encoder executes (§4.7).build_connector_metaderives the encoding size frommm_position.length × hidden_dim × element_size, allocates the matching block count in itsECSharedRegion, and passes(mm_hash → block_indices)to the worker inECCPUConnectorMetadata.save_cachesit copiesencoder_cache[mm_hash]→mmap[block_indices]and callstorch.cuda.current_stream().synchronize()before returning, so the bytes are coherent for cross-process reads.ec_transfer_params = {mm_hash: {peer_host, peer_port, size_bytes, nixl_agent_metadata_b64}}. The caller parses it.sampling_params.extra_args["ec_transfer_params"]populated from step 6.ensure_cache_available(request): allocates blocks in the localECSharedRegion, resolves the peer via the pool, sendsXferReqover DEALER, returnsTrue(request deferred).XferReq, validatesmm_hashpresence in_local_encodings, mintsxfer_id = uuid4(), pins the mm_hash's block indices in the region, and posts a NIXL WRITE frommmap[block_indices]→ consumer's destination blocks.check_xfer_state(handle) == DONE, producer scheduler sendsXferAck(ok=True)back over the ROUTER, releases the handle, and unpins the blocks._drain_acksruns on-demand from bothensure_cache_availableandbuild_connector_meta— consumes the ack, moves themm_hashto_ready.ensure_cache_availablestops deferring the request;build_connector_metaadds themm_hashtoECCPUConnectorMetadata.loadsand moves the blocks into_loaded(a local mmap cache). The consumer worker'sstart_load_cachesreadsmmap[block_indices]→encoder_cache[mm_hash]at the start of its step. Subsequent requests for the samemm_hashare re-served with a local mmap→GPU re-copy via_pending_reload, bypassing the producer. Blocks are only freed when evicted under allocation pressure by_consumer_lru_alloc.Dependencies (vLLM-side PRs)