Skip to content

EPD Disaggregation: CPU NIXL EC Connector #602

@omerpaz95

Description

@omerpaz95

Summary

Add a ECCPUConnector that allows a consumer vLLM instance (PD or P node) to fetch multimodal encoder outputs (e.g. vision embeddings) from a remote producer vLLM instance (E node) over NIXL, so the consumer never runs the vision encoder locally.

Motivation

The llm-d disaggregated serving design calls for encoder outputs (vision embeddings) to be transferred directly from an encode node to a prefill/decode node, keyed by mm_hash.
Today no such transfer mechanism exists — the only reference implementation is ExampleECConnector, which uses a shared filesystem and is not suitable for production.

This issue tracks the vLLM-side work to make that design real:

  • Consumer never downloads the image. The encode node runs --mm-encoder-only, processes the image once, and stores the embedding in a CPU mmap region. The prefill/decode node receives only the embedding tensor — it never fetches the original image.
  • NIXL for the transfer. Instead of a shared filesystem, embeddings are transferred peer-to-peer via NIXL directly between the encoder's and consumer's mmap regions. Similar mechanism used by PD-Disaggregation.
  • Minimal integration surface. The connector plugs into the existing ECConnectorBase hook; minimal changes to the model runner or scheduler core.

Design

  • Shared: NIXL agent + registered ECSharedRegion mmap on both sides; XferReq/XferAck msgpack wire types; compatibility hash guards version/model/dtype mismatches.
  • Producer: binds a ZMQ ROUTER; a background router thread ingests XferReq messages from consumers, pins the requested mmap blocks, posts async NIXL WRITEs from its CPU mmap
    region, and relays XferAcks back. Block eviction via FIFO policy (_producer_fifo_alloc) skips blocks pinned by in-flight transfers.
  • Consumer: lazy ZMQ DEALER pool per (peer_host, peer_port); a _loaded mmap cache keeps completed transfer blocks alive for local re-copy on subsequent requests, avoiding
    producer round-trips. FIFO eviction via _consumer_fifo_alloc under allocation pressure, with _pending_reload protecting blocks committed to the current step.

End-to-end architecture

                      ┌──────────────────────────────────────┐
                      │  External caller (orchestrator +     │
                      │  request router, out of scope here)  │
                      │                                      │
                      │  Duties (see §3.6):                  │
                      │    • download URL images to b64      │
                      │    • dispatch encode to producer     │
                      │      with max_tokens=1               │
                      │    • parse response's                │
                      │      ec_transfer_params              │
                      │    • populate consumer request's     │
                      │      sampling_params.extra_args      │
                      │      ["ec_transfer_params"]          │
                      └─┬──▲────────────────────────┬────────┘
                        │  │                        │
                   HTTP │  │ HTTP response body     │  HTTP req w/
                    req │  │ carries top-level      │  ec_transfer_params
                 (OpenAI│  │ ec_transfer_params     │  in extra_args
                  + b64 │  │                        │
                  +     │  │                        │
                  max=1)│  │                        │
                        ▼  │                        ▼
         ┌──────────────┬──┴──────────┐    ┌──────────────────────────┐
         │ Producer vLLM              │    │ Consumer vLLM            │
         │ ec_role=ec_producer        │    │ ec_role=ec_consumer      │
         │ mm_encoder_only=true       │    │                          │
         │                            │    │                          │
         │ Scheduler                  │    │ Scheduler                │
         │ ├─ ECSharedRegion          │    │ ├─ ECSharedRegion        │
         │ │    (mmap, NIXL-reg,      │    │ │    (mmap, NIXL-reg,    │
         │ │     alloc/free/pin)      │    │ │     alloc/free)        │
         │ ├─ _local_encodings:       │    │ ├─ _remote_encodings:    │
         │ │    mm_hash →             │    │ │    mm_hash →           │
         │ │    block_indices         │    │ │    block_indices       │
         │ ├─ NIXL agent              │    │ ├─ _ready: arrived       │
         │ ├─ ZMQ ROUTER              │    │ │    mm_hashes           │
         │ │   (VLLM_EC_SIDE_CHANNEL) │    │ ├─ NIXL agent            │
         │ ├─ router thread           │    │ ├─ peer pool:            │
         │ └─ in-flight xfers         │    │ │    (host,port) →       │
         │                            │    │ │    (DEALER,agent,md)   │
         │ Worker                     │    │ └─ ensure_cache_         │
         │ └─ save_caches:            │    │    available(request)    │
         │    GPU → mmap              │    │                          │
         │    at block_indices        │    │ Worker                   │
         │    from metadata           │    │ └─ start_load_caches:    │
         │                            │    │    mmap → GPU            │
         │ request_finished() emits   │    │    at block_indices      │
         │ ec_transfer_params         │    │    from metadata         │
         │ in response body           │    │                          │
         └──────────┬─────────────────┘    └──────────┬───────────────┘
                    │                                 │
                    │   ROUTER ← ZMQ DEALER           │
                    │   XferReq / XferAck             │
                    │  ◄───────────────────────────── │
                    │                                 │
                    │   NIXL WRITE                    │
                    │   mmap → remote mmap            │
                    │  ──────────────────────────────►│
                    └────────────────────────────────►│

Cold-path lifecycle (first-time encoding)

  1. Client sends an OpenAI /v1/chat/completions with image_url content to the external caller.
  2. The caller downloads URL images (if any) and inlines as base64 — consumer vLLM never fetches.
  3. The caller dispatches a normal POST /v1/chat/completions to a producer vLLM pod with the image and max_tokens: 1. The producer runs with --mm-encoder-only so only the vision encoder executes (§4.7).
  4. Producer scheduler's build_connector_meta derives the encoding size from mm_position.length × hidden_dim × element_size, allocates the matching block count in its ECSharedRegion, and passes (mm_hash → block_indices) to the worker in ECCPUConnectorMetadata.
  5. Producer worker runs the encoder. Inside save_caches it copies encoder_cache[mm_hash]mmap[block_indices] and calls torch.cuda.current_stream().synchronize() before returning, so the bytes are coherent for cross-process reads.
  6. Producer's response body includes top-level ec_transfer_params = {mm_hash: {peer_host, peer_port, size_bytes, nixl_agent_metadata_b64}}. The caller parses it.
  7. The caller forwards the original OpenAI request to a consumer vLLM pod with sampling_params.extra_args["ec_transfer_params"] populated from step 6.
  8. Consumer's scheduler calls ensure_cache_available(request): allocates blocks in the local ECSharedRegion, resolves the peer via the pool, sends XferReq over DEALER, returns True (request deferred).
  9. Producer scheduler's router thread receives the XferReq, validates mm_hash presence in _local_encodings, mints xfer_id = uuid4(), pins the mm_hash's block indices in the region, and posts a NIXL WRITE from mmap[block_indices] → consumer's destination blocks.
  10. When check_xfer_state(handle) == DONE, producer scheduler sends XferAck(ok=True) back over the ROUTER, releases the handle, and unpins the blocks.
  11. Consumer scheduler's _drain_acks runs on-demand from both ensure_cache_available and build_connector_meta — consumes the ack, moves the mm_hash to _ready. ensure_cache_available stops deferring the request; build_connector_meta adds the mm_hash to ECCPUConnectorMetadata.loads and moves the blocks into _loaded (a local mmap cache). The consumer worker's start_load_caches reads mmap[block_indices]encoder_cache[mm_hash] at the start of its step. Subsequent requests for the same mm_hash are re-served with a local mmap→GPU re-copy via _pending_reload, bypassing the producer. Blocks are only freed when evicted under allocation pressure by _consumer_lru_alloc.

Dependencies (vLLM-side PRs)

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions