Background of the Problem
In FMA, vLLM processes are dynamically started and put to sleep within the launcher. This results in the following two scenarios:
- A group of vLLM instances for the same model may listen on different ports across different launchers.
- The same launcher may dynamically serve different models.
However, in GAIE's InferencePool API, inference servers are discovered using fixed label selectors and predefined port declarations.
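For context, a conventional InferencePool binds statically, roughly like the sketch below (the pool name and label are illustrative; the field layout follows the InferencePool example later in this issue):

```yaml
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: static-pool        # illustrative name
spec:
  selector:
    matchLabels:
      app: vllm-server     # fixed label, assumed for illustration
  targetPorts:
    - number: 8000         # single predefined port
```

Neither the fixed label set nor the predefined port can track vLLM instances that start, sleep, and move between launchers at runtime.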
Proposed Solution
Through the following two tasks, the inference gateway can smoothly discover and connect to the vLLM instances managed by FMA:
Task 1
Enable the dual-pod-controller to dynamically add, remove, or modify the InferenceServerConfig-related labels on the corresponding launcher whenever a vLLM instance is created, put to sleep, or deleted.
For example:
When we have the following InferenceServerConfig:

```yaml
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceServerConfig
metadata:
  name: qwen
spec:
  launcherConfigName: qwen-launcher
  modelServerConfig:
    options: "--model /models/Qwen2.5-1.5B-Instruct --port 8007 --trust-remote-code --served-model-name qwen --max-model-len 8192 --gpu-memory-utilization 0.8 --enforce-eager"
    env_vars:
      VLLM_LOGGING_LEVEL: DEBUG
```

The dual-pod-controller adds the following labels to the launcher to establish the association between the InferencePool and the specific vLLM inference server:

```yaml
labels:
  inference-server-config: qwen
```

Next, the dual-pod-controller adds the following annotation to the launcher to help the EPP determine the specific service port:
```yaml
annotations:
  inference.networking.x-k8s.io/port-discovery: '8007'
```

This requires the dual-pod-controller to recognize the new InferenceServerConfig API, create a vLLM instance in the launcher based on the InferenceServerConfig, and assign a listening port on the launcher for the vLLM instance.
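Taken together, a launcher Pod serving this model would end up with metadata like the following sketch (the Pod name is illustrative; the label and annotation values come from the example above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qwen-launcher-0   # illustrative launcher Pod name
  labels:
    # added by the dual-pod-controller when the vLLM instance is created
    inference-server-config: qwen
  annotations:
    # tells the EPP which port currently serves traffic on this launcher
    inference.networking.x-k8s.io/port-discovery: '8007'
```

When the vLLM instance is put to sleep or deleted, the dual-pod-controller removes or updates these fields, so the metadata always reflects the launcher's current serving state.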
Task 2
We need to propose a change to the upstream gateway-api-inference-extension community so that the EPP (Endpoint Picker) can dynamically discover the different ports serving traffic on the Pod.
We can reuse the multi-port support provided by the InferencePool API. Below is an example of an InferencePool configuration:
```yaml
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: qwen-pool
spec:
  selector:
    matchLabels:
      inference-server-config: qwen
  targetPorts:
    - number: 8007
    - number: 8008
    - number: 8009
    - number: 8010
    - number: 8011
    - number: 8012
    - number: 8013
    - number: 8014
```

Key Points:
- The label selector of the InferencePool identifies the launchers providing service, based on the labels dynamically added to the Pods by the dual-pod-controller.
- The targetPorts in the InferencePool provide the list of possible vLLM service ports:
  - Case A: When the backend Pod does not have the `inference.networking.x-k8s.io/port-discovery` annotation, everything remains as per the existing logic: the targetPorts are treated with "AND" logic, and each port is assumed to be capable of serving traffic.
  - Case B: When the backend Pod has the `inference.networking.x-k8s.io/port-discovery` annotation, this annotation filters the list of ports declared in the InferencePool. For example, the annotation `inference.networking.x-k8s.io/port-discovery: 8007,8008` indicates that only ports 8007 and 8008 are actively serving traffic; see the sketch after this list. In this case, the targetPorts in the InferencePool serve only as a range definition for the service ports (as per the comment on Feature Request: Allow EPP to dynamically discover ports based on matching Pods, kubernetes-sigs/gateway-api-inference-extension#1965).
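As a concrete illustration of Case B, a launcher currently serving two vLLM instances might carry metadata like the sketch below (the Pod name is illustrative; the label and annotation values follow the examples above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qwen-launcher-1              # illustrative launcher Pod name
  labels:
    inference-server-config: qwen    # matched by the InferencePool selector
  annotations:
    # only these two of the eight declared targetPorts serve traffic right now
    inference.networking.x-k8s.io/port-discovery: '8007,8008'
```

With this metadata, the EPP would route traffic only to ports 8007 and 8008 on this Pod, while other launchers matched by the same selector may advertise different subsets of the declared targetPorts.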
This approach should address the issue we are currently discussing with the upstream community (kubernetes-sigs/gateway-api-inference-extension#1965) while keeping the upstream changes minimal, so that the proposal can be accepted as quickly as possible.