
[sidecar] Implement data parallel based routing using port hints #380

@elevran

Description


Upstream gateway support for multi-port inference pools is not ready yet (e.g., this).

Until the above is resolved, we can use the sidecar as the entrypoint to multiple vLLM endpoints on the same Pod:

  • InferencePool is defined with multiple ports => the EPP uses data parallel mode.
  • Each of the vLLM endpoints (i.e., multiple ports on the same Pod IP) is tracked independently by the inference scheduler. This aligns with the current implementation: each vLLM process has its own /metrics endpoint, etc.
  • All inference requests are sent to the sidecar, which listens on the primary IP:port (i.e., the first port only). See below for how this is accomplished.
  • The sidecar receives a new header, set by the EPP, that instructs it to forward to a specific port of the relevant vLLM. The sidecar can run sanity checks on the header (e.g., same IP, port within the range defined at its startup), and the EPP can also strip it from incoming client requests (the same as it should do for the prefill header injected by clients).
  • The EPP processes the request normally (including disaggregated P/D, I think). We add a new PreRequest plugin that takes the target pod, places it in the new sidecar header, and inserts the primary IP:port (i.e., the sidecar) into the request so Envoy can always route to the "default primary cluster". If the PreRequest hook is insufficient, the same behavior can be achieved by emulating separate cycles as in P/D, with the second cycle having only a Picker.
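The sidecar-side sanity check described above could look something like the sketch below. The header name (`x-target-port`) and the flag for the port range are assumptions for illustration, not the actual implementation:

```go
package main

import (
	"fmt"
	"strconv"
)

// validateTargetPort checks the port hint carried in the (hypothetical)
// x-target-port header set by the EPP: it must parse as an integer and
// fall inside the set of ports the sidecar was configured with at startup.
func validateTargetPort(hdr string, allowed map[int]bool) (int, error) {
	port, err := strconv.Atoi(hdr)
	if err != nil {
		return 0, fmt.Errorf("invalid port hint %q: %w", hdr, err)
	}
	if !allowed[port] {
		return 0, fmt.Errorf("port %d not in configured range", port)
	}
	return port, nil
}

func main() {
	// Ports given to the sidecar at startup (assumption: e.g. a --ports=8000,8001 flag).
	allowed := map[int]bool{8000: true, 8001: true}
	if port, err := validateTargetPort("8001", allowed); err == nil {
		fmt.Printf("forwarding to localhost:%d\n", port)
	}
}
```

A request with a hint outside the configured range (or no hint at all) would be rejected or fall back to the primary port, keeping the sidecar safe against clients injecting the header directly.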

Changes are needed in the EPP and sidecar.
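On the EPP side, the PreRequest step reduces to a small rewrite: keep the picked vLLM port in the hint header while pointing the request destination at the sidecar on the primary port. The function and header names below are illustrative and not the actual EPP plugin API:

```go
package main

import "fmt"

// setSidecarHint rewrites the routing decision for data parallel mode:
// the request destination is always the sidecar on the primary port (the
// first port of the InferencePool), while the picked vLLM port travels in
// a header (assumed here to be x-target-port) for the sidecar to act on.
func setSidecarHint(podIP string, pickedPort, primaryPort int) (destination, hintHeader string) {
	destination = fmt.Sprintf("%s:%d", podIP, primaryPort)
	hintHeader = fmt.Sprintf("%d", pickedPort)
	return
}

func main() {
	// Scheduler picked the vLLM replica on port 8002; sidecar listens on 8000.
	dest, hint := setSidecarHint("10.0.0.5", 8002, 8000)
	fmt.Println(dest, hint) // 10.0.0.5:8000 8002
}
```

Because the destination is always the primary IP:port, Envoy only ever needs the single "default primary cluster", which sidesteps the missing upstream multi-port support.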

Status: Done
