Background of the Problem
In FMA, vLLM processes are dynamically started and put to sleep within the launcher. This results in the following two scenarios:
- A group of vLLM instances for the same model may listen on different ports across different launchers.
- The same launcher may dynamically serve different models.
However, in GAIE's InferencePool API, inference servers are discovered using fixed label selectors and predefined port declarations.
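For context, a conventional InferencePool binds statically, roughly like the sketch below (the pool name and label are illustrative; the field layout follows the InferencePool example later in this issue):

```yaml
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: static-pool        # illustrative name
spec:
  selector:
    matchLabels:
      app: vllm-server     # fixed label, assumed for illustration
  targetPorts:
    - number: 8000         # single predefined port
```

Neither the fixed label set nor the predefined port can track vLLM instances that start, sleep, and move between launchers at runtime.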
Proposed Solution
Through the following two tasks, the inference gateway can smoothly discover and connect to the vLLM instances managed by FMA:
Task 1
Enable the dual-pod-controller to dynamically add, remove, or modify the InferenceServerConfig-related labels on the corresponding launcher whenever a vLLM instance is created, put to sleep, or deleted.
For example:
When we have the following InferenceServerConfig:

```yaml
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceServerConfig
metadata:
  name: qwen
spec:
  launcherConfigName: qwen-launcher
  modelServerConfig:
    options: "--model /models/Qwen2.5-1.5B-Instruct --port 8007 --trust-remote-code --served-model-name qwen --max-model-len 8192 --gpu-memory-utilization 0.8 --enforce-eager"
    env_vars:
      VLLM_LOGGING_LEVEL: DEBUG
```

The dual-pod-controller adds the following labels to the launcher to establish the association between the InferencePool and the specific vLLM inference server:

```yaml
labels:
  inference-server-config: qwen
```

Next, the dual-pod-controller adds the following annotation to the launcher to help the EPP determine the specific service port:
```yaml
annotations:
  inference.networking.x-k8s.io/port-discovery: '8007'
```

This requires the dual-pod-controller to recognize the new InferenceServerConfig API, create a vLLM instance in the launcher based on the InferenceServerConfig, and assign a listening port on the launcher for the vLLM instance.
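Taken together, a launcher Pod serving this model would end up with metadata like the following sketch (the Pod name is illustrative; the label and annotation values come from the example above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qwen-launcher-0   # illustrative launcher Pod name
  labels:
    # added by the dual-pod-controller when the vLLM instance is created
    inference-server-config: qwen
  annotations:
    # tells the EPP which port currently serves traffic on this launcher
    inference.networking.x-k8s.io/port-discovery: '8007'
```

When the vLLM instance is put to sleep or deleted, the dual-pod-controller removes or updates these fields, so the metadata always reflects the launcher's current serving state.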
Task 2
We need to propose a change to the upstream gateway-api-inference-extension community so that the EPP (Endpoint Picker) can dynamically discover the different ports serving traffic on the Pod.
We can reuse the multi-port support provided by the InferencePool API. Below is an example of an InferencePool configuration:
```yaml
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: qwen-pool
spec:
  selector:
    matchLabels:
      inference-server-config: qwen
  targetPorts:
    - number: 8007
    - number: 8008
    - number: 8009
    - number: 8010
    - number: 8011
    - number: 8012
    - number: 8013
    - number: 8014
```

Key Points:
- The label selector of the InferencePool identifies the launchers providing service, based on the labels dynamically added to the Pods by the dual-pod-controller.
- The targetPorts in the InferencePool provide the list of possible vLLM service ports:
  - Case A: When the backend Pod does not have the `inference.networking.x-k8s.io/port-discovery` annotation, everything remains as per the existing logic: the targetPorts are treated with "AND" logic, and each port is assumed to be capable of serving traffic.
  - Case B: When the backend Pod has the `inference.networking.x-k8s.io/port-discovery` annotation, this annotation filters the list of ports declared in the InferencePool. For example, the annotation `inference.networking.x-k8s.io/port-discovery: 8007,8008` indicates that only ports 8007 and 8008 are actively serving traffic; see the sketch after this list. In this case, the targetPorts in the InferencePool serve only as a range definition for the service ports (as per the comment on Feature Request: Allow EPP to dynamically discover ports based on matching Pods, kubernetes-sigs/gateway-api-inference-extension#1965).
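As a concrete illustration of Case B, a launcher currently serving two vLLM instances might carry metadata like the sketch below (the Pod name is illustrative; the label and annotation values follow the examples above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qwen-launcher-1              # illustrative launcher Pod name
  labels:
    inference-server-config: qwen    # matched by the InferencePool selector
  annotations:
    # only these two of the eight declared targetPorts serve traffic right now
    inference.networking.x-k8s.io/port-discovery: '8007,8008'
```

With this metadata, the EPP would route traffic only to ports 8007 and 8008 on this Pod, while other launchers matched by the same selector may advertise different subsets of the declared targetPorts.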
This approach should address the issue we are currently discussing with the upstream community (kubernetes-sigs/gateway-api-inference-extension#1965) while keeping the upstream changes minimal, so that the proposal can be accepted as quickly as possible.