feat: Add vLLM Data Parallel support to llm-d-inference-scheduler #392
github-actions[bot] merged 19 commits into llm-d:main
Conversation
```diff
  resources:
-   - https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=v1.0.0
+   - https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=v1.1.0-rc.1
```
v1.1.0 is out. Let's use the formal release:
```diff
-   - https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=v1.1.0-rc.1
+   - https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=v1.1.0
```
The go.mod file still refers to v1.1.0-rc.1; I'd prefer to keep things in sync. The move to GIE v1.1.0 should be in a separate PR.
```yaml
    - "watch"
    - "list"
  - apiGroups:
    - "inference.networking.k8s.io"
```
Can we remove InferencePools from the inference.networking.x-k8s.io group?
I left it in, in case someone wants to use the v1alpha2 version...
The kgateway deployment files still use the v1alpha2 InferencePool
```yaml
  verbs:
    - update
    - patch
  - apiGroups:
```
Ditto: remove it from the inference.networking.x-k8s.io group.
The kgateway deployment files still use the v1alpha2 InferencePool
```yaml
  - "--zmq-endpoint=tcp://${EPP_NAME}.default.svc.cluster.local:5557"
  - "--event-batch-size=16"
  - "--tokenizers-cache-dir=/tokenizer-cache"
  - "--data-parallel-size=${VLLM_DATA_PARALLEL_SIZE}"
```
I'm a bit confused. The --data-parallel-size flag reads from the VLLM_DATA_PARALLEL_SIZE environment variable, which makes it configurable (which is great), but the ports list below is hardcoded.
The ports just need to be exposed; there can't be more than eight of them. The flag simply says how many we want to use.
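A minimal sketch of what the hardcoded ports list amounts to. The base port 8200 comes from the `--vllm-port=8200` flag elsewhere in this PR; the exact extra port numbers and field layout are assumptions for illustration:

```yaml
# Hypothetical container spec fragment: all eight DP ports are always
# exposed, and --data-parallel-size controls how many are actually used.
ports:
  - containerPort: 8200   # rank 0 (base port)
  - containerPort: 8201   # rank 1
  - containerPort: 8202   # rank 2
  - containerPort: 8203   # rank 3
  - containerPort: 8204   # rank 4
  - containerPort: 8205   # rank 5
  - containerPort: 8206   # rank 6
  - containerPort: 8207   # rank 7
```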
```yaml
  - "--vllm-port=8200"
  - "--connector=lmcache"
  - "--secure-proxy=false"
  - "--data-parallel-size=${VLLM_DATA_PARALLEL_SIZE}"
```
I thought DP is not used together with PD (this file is vllm-sim-pd). What am I missing?
I have updated the non-PD deployment files to include the sidecar. The script has been changed to launch using the non-PD deployment files even if VLLM_DATA_PARALLEL_SIZE is greater than one. A subsequent PR will add DP support for PD deployments; for now we'll leave this change here.
The primary use for DP will be with PD deployments; I think supporting it is a feature blocker. Today's hot deployments are TP+DP+EP with PD 😆.
```go
for _, target := range profileResult.TargetPods {
	newPodInfo := target.GetPod().Clone()
	newPodInfo.Port = h.targetPort
	targetPod := &types.PodMetrics{Pod: newPodInfo, MetricsState: target.GetMetrics().Clone()}
	newResult.TargetPods = append(newResult.TargetPods, targetPod)
}
modifiedResults := map[string]*types.ProfileRunResult{singleProfileName: &newResult}
```
I'm not sure I get why we need newResult or modifiedResults. Can't we iterate over profileResult.TargetPods and update only the port to PrimaryPort?
I don't want to depend on the fact that, somewhere up the stack, the PodInfo the plugin received was a clone...
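To make the point concrete, here is a self-contained sketch of the clone-before-mutate pattern the snippet above follows. `PodInfo`, `Clone`, and `rewritePorts` are simplified stand-ins for the scheduler's real types, not the actual API:

```go
package main

import "fmt"

// PodInfo is a hypothetical stand-in for the scheduler's pod type.
type PodInfo struct {
	Name string
	Port int
}

// Clone returns a copy so mutations never leak into shared state.
func (p *PodInfo) Clone() *PodInfo {
	c := *p
	return &c
}

// rewritePorts builds a new slice whose pods all use basePort,
// cloning each pod rather than mutating the caller's data in place.
func rewritePorts(targets []*PodInfo, basePort int) []*PodInfo {
	out := make([]*PodInfo, 0, len(targets))
	for _, t := range targets {
		c := t.Clone()
		c.Port = basePort
		out = append(out, c)
	}
	return out
}

func main() {
	orig := []*PodInfo{{Name: "vllm-decode-0", Port: 8203}}
	rewritten := rewritePorts(orig, 8200)
	// The original still carries its old port; only the clone was changed.
	fmt.Println(orig[0].Port, rewritten[0].Port)
}
```

Mutating `profileResult.TargetPods` in place would only be safe if every caller up the stack had already handed the plugin a clone, which is exactly the assumption the author wants to avoid.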
```yaml
kind: EndpointPickerConfig
plugins:
  - type: prefix-cache-scorer
  - type: decode-filter
```
This is a bit confusing, because DP is currently used without PD. I understand why it's used, but maybe there is a less confusing alternative.
That filter is in all of the configuration files here. I actually don't know why....
For proxies that support multi-port, will we be able to use the canonical approach of connecting to the individual ports directly?
@ahg-g wrote:
Yes. The code in this PR is basically a workaround for Gateways, like Istio, that for now don't support routing to multiple ports in the same vLLM pod. If your Gateway, like kgateway, supports multi-port, then simply don't use the new data-parallel-profile-handler plugin. There will be a follow-up PR that adds PD support for Data Parallel in the sidecar and in the profile handler. I will make sure that any changes there are optional.
Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>
/lgtm
This PR adds support to the llm-d-inference-scheduler for vLLM's Data Parallel (DP) feature. This vLLM feature enables running multiple HTTP servers in the same container.

In particular, this PR attempts to work around Istio issue 57638.

This PR adds a new profile handler plugin, data-parallel-profile-handler, which needs to be used to enable DP when Disaggregated Prefill/Decode (PD) isn't used. The plugin changes the scheduling result to always schedule the request on the "base" port of the vLLM decode pod, and adds a header for the sidecar.

Additionally, the sidecar has been extended to work with the header added by the new data-parallel-profile-handler plugin. This is a workaround for Istio issue 57638, in which we can only route to the "base" port of vLLM pods. It also proxies all direct HTTP calls to the "extra" HTTP servers started by vLLM.

Ref #380
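A sketch of how the new plugin might be enabled in an EndpointPickerConfig. The kind and the prefix-cache-scorer/decode-filter plugin types appear in this PR's configuration snippets; listing data-parallel-profile-handler alongside them, and the exact schema, are assumptions:

```yaml
# Hypothetical EPP configuration enabling DP without PD.
kind: EndpointPickerConfig
plugins:
  - type: prefix-cache-scorer
  - type: decode-filter
  # Rewrites the scheduling result to the "base" vLLM port and adds
  # the header the sidecar uses to proxy to the chosen DP rank.
  - type: data-parallel-profile-handler
```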