
[sidecar] Implement data parallel based routing using port hints #380

@elevran

Description


Upstream gateway support for multi-port inference pools is not ready yet (e.g., this).

Until the above is resolved, we can use the sidecar as the entrypoint to multiple vLLM endpoints on the same Pod:

  • InferencePool is defined with multiple ports => the EPP uses data parallel mode.
  • Each of the vLLM endpoints (i.e., multiple ports on the same Pod IP) is tracked independently by the inference scheduler. This aligns with the current implementation: each vLLM process has its own /metrics endpoint, etc.
  • All inference requests are sent to the sidecar, which listens on the primary IP:port (i.e., the first port only). See below for how this is accomplished.
  • The sidecar receives a new header, set by the EPP, that instructs it to forward to a specific port of the relevant vLLM. The sidecar can run sanity checks on the header (e.g., same IP, port within the range defined at its startup), and the EPP can also strip it from incoming client requests (the same as it should do for the prefill header injected by clients).
  • The EPP processes the request normally (including disaggregated P/D, I think). We add a new PreRequest plugin that takes the target pod, places it in the new sidecar header, and inserts the primary IP:port (i.e., the sidecar) into the request so Envoy can always route to the "default primary cluster". If the PreRequest hook is insufficient, the same behavior can be achieved by emulating separate cycles as in P/D, with the second cycle having only a Picker.
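The sidecar-side sanity check described above could look something like the sketch below. The header name (`x-target-port`) and the flag for the port range are assumptions for illustration, not the actual implementation:

```go
package main

import (
	"fmt"
	"strconv"
)

// validateTargetPort checks the port hint carried in the (hypothetical)
// x-target-port header set by the EPP: it must parse as an integer and
// fall inside the set of ports the sidecar was configured with at startup.
func validateTargetPort(hdr string, allowed map[int]bool) (int, error) {
	port, err := strconv.Atoi(hdr)
	if err != nil {
		return 0, fmt.Errorf("invalid port hint %q: %w", hdr, err)
	}
	if !allowed[port] {
		return 0, fmt.Errorf("port %d not in configured range", port)
	}
	return port, nil
}

func main() {
	// Ports given to the sidecar at startup (assumption: e.g. a --ports=8000,8001 flag).
	allowed := map[int]bool{8000: true, 8001: true}
	if port, err := validateTargetPort("8001", allowed); err == nil {
		fmt.Printf("forwarding to localhost:%d\n", port)
	}
}
```

A request with a hint outside the configured range (or no hint at all) would be rejected or fall back to the primary port, keeping the sidecar safe against clients injecting the header directly.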

Changes are needed in the EPP and sidecar.
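On the EPP side, the PreRequest step reduces to a small rewrite: keep the picked vLLM port in the hint header while pointing the request destination at the sidecar on the primary port. The function and header names below are illustrative and not the actual EPP plugin API:

```go
package main

import "fmt"

// setSidecarHint rewrites the routing decision for data parallel mode:
// the request destination is always the sidecar on the primary port (the
// first port of the InferencePool), while the picked vLLM port travels in
// a header (assumed here to be x-target-port) for the sidecar to act on.
func setSidecarHint(podIP string, pickedPort, primaryPort int) (destination, hintHeader string) {
	destination = fmt.Sprintf("%s:%d", podIP, primaryPort)
	hintHeader = fmt.Sprintf("%d", pickedPort)
	return
}

func main() {
	// Scheduler picked the vLLM replica on port 8002; sidecar listens on 8000.
	dest, hint := setSidecarHint("10.0.0.5", 8002, 8000)
	fmt.Println(dest, hint) // 10.0.0.5:8000 8002
}
```

Because the destination is always the primary IP:port, Envoy only ever needs the single "default primary cluster", which sidesteps the missing upstream multi-port support.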

Status: Done
