EPD Disaggregation: CPU NIXL EC Connector

 ## Summary

Add a ECCPUConnector that allows a consumer vLLM instance (PD or P node) to fetch multimodal encoder outputs (e.g. vision embeddings) from a remote producer vLLM instance (E node) over NIXL, so the consumer never runs the vision encoder locally.

##  Motivation

The llm-d disaggregated serving design calls for encoder outputs (vision embeddings) to be transferred directly from an encode node to a prefill/decode node, keyed by mm_hash.
Today no such transfer mechanism exists — the only reference implementation is ExampleECConnector, which uses a shared filesystem and is not suitable for production.

  This issue tracks the vLLM-side work to make that design real:
 - Consumer never downloads the image. The encode node runs --mm-encoder-only, processes the image once, and stores the embedding in a  CPU mmap region. The prefill/decode node receives only the embedding tensor — it never fetches the original image.
  -  NIXL for the transfer. Instead of a shared filesystem, embeddings are transferred peer-to-peer via NIXL directly between the encoder's and consumer's mmap regions. Similar mechanism used by PD-Disaggregation.
  - Minimal integration surface. The connector plugs into the existing ECConnectorBase hook; minimal changes to the model runner or scheduler core. 

##  Design

  - Shared: NIXL agent + registered ECSharedRegion mmap on both sides; XferReq/XferAck msgpack wire types; compatibility hash guards version/model/dtype mismatches.
- Producer: binds a ZMQ ROUTER; a background router thread ingests XferReq messages from consumers, pins the requested mmap blocks, posts async NIXL WRITEs from its CPU mmap
  region, and relays XferAcks back. Block eviction via FIFO policy (_producer_fifo_alloc) skips blocks pinned by in-flight transfers.
- Consumer: lazy ZMQ DEALER pool per (peer_host, peer_port); a _loaded mmap cache keeps completed transfer blocks alive for local re-copy on subsequent requests, avoiding
  producer round-trips. FIFO eviction via _consumer_fifo_alloc under allocation pressure, with _pending_reload protecting blocks committed to the current step.

### End-to-end architecture

```
                      ┌──────────────────────────────────────┐
                      │  External caller (orchestrator +     │
                      │  request router, out of scope here)  │
                      │                                      │
                      │  Duties (see §3.6):                  │
                      │    • download URL images to b64      │
                      │    • dispatch encode to producer     │
                      │      with max_tokens=1               │
                      │    • parse response's                │
                      │      ec_transfer_params              │
                      │    • populate consumer request's     │
                      │      sampling_params.extra_args      │
                      │      ["ec_transfer_params"]          │
                      └─┬──▲────────────────────────┬────────┘
                        │  │                        │
                   HTTP │  │ HTTP response body     │  HTTP req w/
                    req │  │ carries top-level      │  ec_transfer_params
                 (OpenAI│  │ ec_transfer_params     │  in extra_args
                  + b64 │  │                        │
                  +     │  │                        │
                  max=1)│  │                        │
                        ▼  │                        ▼
         ┌──────────────┬──┴──────────┐    ┌──────────────────────────┐
         │ Producer vLLM              │    │ Consumer vLLM            │
         │ ec_role=ec_producer        │    │ ec_role=ec_consumer      │
         │ mm_encoder_only=true       │    │                          │
         │                            │    │                          │
         │ Scheduler                  │    │ Scheduler                │
         │ ├─ ECSharedRegion          │    │ ├─ ECSharedRegion        │
         │ │    (mmap, NIXL-reg,      │    │ │    (mmap, NIXL-reg,    │
         │ │     alloc/free/pin)      │    │ │     alloc/free)        │
         │ ├─ _local_encodings:       │    │ ├─ _remote_encodings:    │
         │ │    mm_hash →             │    │ │    mm_hash →           │
         │ │    block_indices         │    │ │    block_indices       │
         │ ├─ NIXL agent              │    │ ├─ _ready: arrived       │
         │ ├─ ZMQ ROUTER              │    │ │    mm_hashes           │
         │ │   (VLLM_EC_SIDE_CHANNEL) │    │ ├─ NIXL agent            │
         │ ├─ router thread           │    │ ├─ peer pool:            │
         │ └─ in-flight xfers         │    │ │    (host,port) →       │
         │                            │    │ │    (DEALER,agent,md)   │
         │ Worker                     │    │ └─ ensure_cache_         │
         │ └─ save_caches:            │    │    available(request)    │
         │    GPU → mmap              │    │                          │
         │    at block_indices        │    │ Worker                   │
         │    from metadata           │    │ └─ start_load_caches:    │
         │                            │    │    mmap → GPU            │
         │ request_finished() emits   │    │    at block_indices      │
         │ ec_transfer_params         │    │    from metadata         │
         │ in response body           │    │                          │
         └──────────┬─────────────────┘    └──────────┬───────────────┘
                    │                                 │
                    │   ROUTER ← ZMQ DEALER           │
                    │   XferReq / XferAck             │
                    │  ◄───────────────────────────── │
                    │                                 │
                    │   NIXL WRITE                    │
                    │   mmap → remote mmap            │
                    │  ──────────────────────────────►│
                    └────────────────────────────────►│
```

### Cold-path lifecycle (first-time encoding)

1. Client sends an OpenAI `/v1/chat/completions` with `image_url` content to the external caller.
2. The caller downloads URL images (if any) and inlines as base64 — consumer vLLM never fetches.
3. The caller dispatches a normal `POST /v1/chat/completions` to a producer vLLM pod with the image and `max_tokens: 1`. The producer runs with `--mm-encoder-only` so only the vision encoder executes (§4.7).
4. Producer scheduler's `build_connector_meta` derives the encoding size from `mm_position.length × hidden_dim × element_size`, allocates the matching block count in its `ECSharedRegion`, and passes `(mm_hash → block_indices)` to the worker in `ECCPUConnectorMetadata`.
5. Producer worker runs the encoder. Inside `save_caches` it copies `encoder_cache[mm_hash]` → `mmap[block_indices]` and calls `torch.cuda.current_stream().synchronize()` before returning, so the bytes are coherent for cross-process reads.
6. Producer's response body includes top-level `ec_transfer_params = {mm_hash: {peer_host, peer_port, size_bytes, nixl_agent_metadata_b64}}`. The caller parses it.
7. The caller forwards the original OpenAI request to a consumer vLLM pod with `sampling_params.extra_args["ec_transfer_params"]` populated from step 6.
8. Consumer's scheduler calls `ensure_cache_available(request)`: allocates blocks in the local `ECSharedRegion`, resolves the peer via the pool, sends `XferReq` over DEALER, returns `True` (request deferred).
9. Producer scheduler's router thread receives the `XferReq`, validates `mm_hash` presence in `_local_encodings`, mints `xfer_id = uuid4()`, pins the mm_hash's block indices in the region, and posts a NIXL WRITE from `mmap[block_indices]` → consumer's destination blocks.
10. When `check_xfer_state(handle) == DONE`, producer scheduler sends `XferAck(ok=True)` back over the ROUTER, releases the handle, and unpins the blocks.
11. Consumer scheduler's `_drain_acks` runs on-demand from both `ensure_cache_available` and `build_connector_meta` — consumes the ack, moves the `mm_hash` to `_ready`. `ensure_cache_available` stops deferring the request; `build_connector_meta` adds the `mm_hash` to `ECCPUConnectorMetadata.loads` and moves the blocks into `_loaded` (a local mmap cache). The consumer worker's `start_load_caches` reads `mmap[block_indices]` → `encoder_cache[mm_hash]` at the start of its step. Subsequent requests for the same `mm_hash` are re-served with a local mmap→GPU re-copy via `_pending_reload`, bypassing the producer. Blocks are only freed when evicted under allocation pressure by `_consumer_lru_alloc`.

##  Dependencies (vLLM-side PRs)

  - [ec_connector_shutdown — shutdown API on ECConnectorBase](https://github.com/vllm-project/vllm/pull/42423)
  - [non_blocking_ec_connector_lookup](https://github.com/vllm-project/vllm/pull/41627)
  - [add_ec_transfer_params — ec_transfer_params field on request/response types](https://github.com/vllm-project/vllm/pull/42433)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EPD Disaggregation: CPU NIXL EC Connector #602

Summary

Motivation

Design

End-to-end architecture

Cold-path lifecycle (first-time encoding)

Dependencies (vLLM-side PRs)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

EPD Disaggregation: CPU NIXL EC Connector #602

Description

Summary

Motivation

Design

End-to-end architecture

Cold-path lifecycle (first-time encoding)

Dependencies (vLLM-side PRs)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions