This folder contains example values files and their rendered templates. It assumes you have added the
llm-d-modelservice repository to Helm:

```shell
helm repo add llm-d-modelservice https://llm-d-incubation.github.io/llm-d-modelservice/
helm repo update
```

| Example | Description | Hardware Requirements |
|---|---|---|
| values-cpu.yaml | CPU-only inference example | Single node, no GPU required |
| values-pd.yaml | Prefill/decode disaggregation example | Multi-GPU, demonstrates P/D splitting |
| values-xpu.yaml | Intel XPU single-node example | Intel Data Center GPU Max |
| pvc/ | Persistent volume examples | Shows different storage options |
| dra/ | Dynamic Resource Allocation (DRA) examples | Shows different DRA use cases |

All the examples assume that a Gateway and a GAIE (Gateway API Inference Extension) configuration have been deployed. See the llm-d guides for examples. In addition, an HTTPRoute must be deployed. Some example HTTPRoutes are provided below.
Dry run:

```shell
helm template cpu-sim llm-d-modelservice/llm-d-modelservice -f https://raw.githubusercontent.com/llm-d-incubation/llm-d-modelservice/refs/heads/main/examples/values-cpu.yaml
```

To install, use `helm install` instead of `helm template`:

```shell
helm install cpu-sim llm-d-modelservice/llm-d-modelservice -f https://raw.githubusercontent.com/llm-d-incubation/llm-d-modelservice/refs/heads/main/examples/values-cpu.yaml
```

Dry run:

```shell
helm template pd llm-d-modelservice/llm-d-modelservice -f https://raw.githubusercontent.com/llm-d-incubation/llm-d-modelservice/refs/heads/main/examples/values-pd.yaml
```

Or install in a cluster:

```shell
helm install pd llm-d-modelservice/llm-d-modelservice -f https://raw.githubusercontent.com/llm-d-incubation/llm-d-modelservice/refs/heads/main/examples/values-pd.yaml
```

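Before installing, it can help to save the rendered manifests to a file and read through them. The sketch below shows one way to do that. The helper names `values_url` and `render_example` and the output file name are hypothetical; only the chart name and the URL pattern are taken from the commands above.

```shell
# Hypothetical helper: build the raw GitHub URL of an example values file,
# following the URL pattern used in the commands above.
values_url() {
  printf '%s/%s\n' \
    'https://raw.githubusercontent.com/llm-d-incubation/llm-d-modelservice/refs/heads/main/examples' \
    "$1"
}

# Hypothetical helper: render an example into a local file without installing it.
# $1 = release name, $2 = example values file name
render_example() {
  helm template "$1" llm-d-modelservice/llm-d-modelservice \
    -f "$(values_url "$2")" > "rendered-$1.yaml"
}

# Usage (requires helm and network access):
#   render_example cpu-sim values-cpu.yaml
#   render_example pd values-pd.yaml
```

Rendering to a file leaves the cluster untouched, so the output can be compared across examples or chart versions before anything is installed.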
See this README.
For Intel XPU (Data Center GPU Max) deployments:

Deploy the intel-gpu-plugin DaemonSet:

```shell
kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin?ref=v0.30.0'
```

Install the single-node XPU deployment:

```shell
helm install llm-xpu llm-d-modelservice/llm-d-modelservice -f values-xpu.yaml --namespace llm-d --create-namespace
```

Get the name of the decode pod:

```shell
kubectl get pods -n llm-d -l llm-d.ai/role=decode
```

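Once the decode pod is listed, a common next step is to look at its logs. The following is a minimal sketch, assuming the `llm-d` namespace and the label selector shown above; the function name `decode_pod` is hypothetical.

```shell
# Hypothetical helper: print the name of the first decode pod in a namespace.
# $1 = namespace. The label selector is the one used in the command above.
decode_pod() {
  kubectl get pods -n "$1" -l llm-d.ai/role=decode \
    -o jsonpath='{.items[0].metadata.name}'
}

# Usage (requires access to the cluster; add -c <container> if the pod
# runs more than one container):
#   kubectl logs -n llm-d "$(decode_pod llm-d)" --tail=50
```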
An HTTPRoute maps requests through a Gateway to an InferencePool which is, in turn, tied (via match labels) to a particular set of model servers. Here are two examples.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mymodel-httproute
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: INSERT_GATEWAY_NAME
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: INSERT_INFERENCEPOOL_NAME
      port: 8000
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /
```

For example, to call the OpenAI completions API, use mymodel/v1/completions.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: myhttproute
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: INSERT_GATEWAY_NAME
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: INSERT_INFERENCEPOOL_NAME
      port: 8000
      weight: 1
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          replacePrefixMatch: /
          type: ReplacePrefixMatch
    matches:
    - path:
        type: PathPrefix
        value: /mymodel/
```

This route supports requests with the prefix mymodel/; for example, to call the OpenAI completions API, requests would be sent to mymodel/v1/completions. The HTTPRoute rewrites such requests to v1/completions for the target model server.

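The prefix rewrite described above can be illustrated with a small sketch. `rewrite_path` only mimics the effect of the URLRewrite filter on a request path and is not how a gateway implements it. `completions_url`, `GATEWAY_ADDRESS`, and `MODEL_NAME` are hypothetical names and placeholders that depend on your deployment.

```shell
# Illustration only: mimic the effect of the URLRewrite filter on a request path.
rewrite_path() {
  case "$1" in
    /mymodel/*) printf '/%s\n' "${1#/mymodel/}" ;;
    # No prefix match: this rule would not apply, so echo the path unchanged.
    *)          printf '%s\n' "$1" ;;
  esac
}

# Hypothetical helper: build the client-facing completions URL.
# $1 = gateway base URL (an assumption; depends on your Gateway)
completions_url() {
  printf '%s/mymodel/v1/completions\n' "${1%/}"
}

# Usage (requires a reachable gateway):
#   curl -s "$(completions_url http://GATEWAY_ADDRESS)" \
#     -H 'Content-Type: application/json' \
#     -d '{"model": "MODEL_NAME", "prompt": "Hello", "max_tokens": 16}'
```

The client keeps the `mymodel/` prefix in the URL, while the model server only ever sees `/v1/completions`.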
When `accelerator.dra` is true, accelerator (GPU) resource requirements are specified using Dynamic Resource Allocation. In particular, `accelerator.type` identifies the ResourceClaimTemplate to create (from `accelerator.resourceClaimTemplates`). The vllm containers then use `resources.claims` instead of `resources.limits` to request the necessary resources. For an example, see values-dra.yaml.
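After installing a DRA-based example, the resulting objects can be inspected with kubectl. This is a minimal sketch, assuming a cluster with Dynamic Resource Allocation enabled and a DRA driver installed; the function name `show_dra` is hypothetical.

```shell
# Hypothetical helper: list the DRA objects relevant to a release.
# $1 = namespace of the release
show_dra() {
  # DeviceClass objects are cluster-scoped.
  kubectl get deviceclasses
  # ResourceClaimTemplate and ResourceClaim objects are namespaced.
  kubectl get resourceclaimtemplates,resourceclaims -n "$1"
}

# Usage (requires access to the cluster):
#   show_dra llm-d
```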
Differences between your environment and the one in which the above examples were tested may require you to modify the input values files. Some common issues we have seen are:
- Is the inference gateway listed in `routing.parentRefs` correct?
- Do the labels/values in `acceleratorTypes` match those assigned to nodes in your cluster?
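
Both checks can be made against a live cluster. The sketch below uses standard kubectl queries; the function names are hypothetical, and the output still has to be compared by hand against your values file.

```shell
# Hypothetical helper: list Gateways in all namespaces,
# to compare against the entries in routing.parentRefs.
list_gateways() {
  kubectl get gateways.gateway.networking.k8s.io --all-namespaces
}

# Hypothetical helper: show node labels,
# to compare against the labels/values in acceleratorTypes.
list_node_labels() {
  kubectl get nodes --show-labels
}

# Usage (requires access to the cluster):
#   list_gateways
#   list_node_labels
```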