Closed
Labels
component/sidecar, triage/accepted (indicates an issue or PR is ready to be actively worked on)
Description
Upstream gateway support of multiport inference pool is not ready yet (e.g., this).
Until the above is resolved, we can use the sidecar as the entrypoint to multiple vLLM endpoints on the same Pod:
- InferencePool defined with multiple ports => EPP uses data parallel mode
- Each of the vLLM endpoints (i.e., multiple ports on the same Pod IP) is tracked by the inference scheduler independently. This is aligned with the current implementation: each vLLM process has its own /metrics endpoint, etc.
- All inference requests are sent to the sidecar, which listens on the primary IP:port (i.e., the first port only). See below on how this is accomplished.
- The sidecar receives a new header, set by the EPP, that instructs it to forward to a specific port of the relevant vLLM. The sidecar can perform sanity checks on the header (e.g., same IP, port within the range defined at its startup), and the EPP can also strip it from incoming client requests (just as it should for the prefill header being injected by clients).
- EPP processes the request normally (including disaggregated P/D, I think). We add a new PreRequest plugin that takes the target pod, places it in the new sidecar header, and inserts the primary IP:port (i.e., the sidecar) into the request so Envoy can always route to the "default primary cluster". If the PreRequest hook is insufficient, the same behavior can be achieved by emulating separate cycles as in P/D, with the second cycle having only a Picker.
Changes are needed in the EPP and sidecar.
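On the sidecar side, the header sanity check described above might look like the following sketch; the function name is hypothetical, and the set of allowed ports is passed in directly, whereas the real sidecar would build it from the ports declared at startup.

```go
package main

import (
	"fmt"
	"strconv"
)

// validateTargetPort applies the sanity checks described above: the header
// value must parse as a port and match one of the vLLM ports the sidecar
// was configured with at startup (passed in here as a set for illustration).
func validateTargetPort(headerValue string, allowed map[uint16]bool) (uint16, error) {
	v, err := strconv.ParseUint(headerValue, 10, 16)
	if err != nil {
		return 0, fmt.Errorf("invalid port header %q: %w", headerValue, err)
	}
	port := uint16(v)
	if !allowed[port] {
		return 0, fmt.Errorf("port %d not among configured vLLM ports", port)
	}
	return port, nil
}

func main() {
	allowed := map[uint16]bool{8000: true, 8001: true}
	if port, err := validateTargetPort("8001", allowed); err == nil {
		fmt.Println("forwarding to port", port)
	}
	if _, err := validateTargetPort("9999", allowed); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

Requests whose header fails validation would be rejected rather than forwarded, which keeps a misbehaving or unstripped client header from steering traffic to an arbitrary port.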