Name	Name	Last commit message	Last commit date
parent directory ..
pvc	pvc
README.md	README.md
output-cpu.yaml	output-cpu.yaml
output-dra.yaml	output-dra.yaml
output-gaudi.yaml	output-gaudi.yaml
output-heterogeneous-pd.yaml	output-heterogeneous-pd.yaml
output-pd-mnnvl.yaml	output-pd-mnnvl.yaml
output-pd.yaml	output-pd.yaml
output-pvc-hf.yaml	output-pvc-hf.yaml
output-pvc.yaml	output-pvc.yaml
output-rebellions-atom.yaml	output-rebellions-atom.yaml
output-requester.yaml	output-requester.yaml
output-xpu-pd.yaml	output-xpu-pd.yaml
output-xpu.yaml	output-xpu.yaml
values-cpu.yaml	values-cpu.yaml
values-dra.yaml	values-dra.yaml
values-gaudi.yaml	values-gaudi.yaml
values-heterogeneous-pd.yaml	values-heterogeneous-pd.yaml
values-pd-mnnvl.yaml	values-pd-mnnvl.yaml
values-pd.yaml	values-pd.yaml
values-rebellions-atom.yaml	values-rebellions-atom.yaml
values-requester.yaml	values-requester.yaml
values-xpu-pd.yaml	values-xpu-pd.yaml
values-xpu.yaml	values-xpu.yaml

Examples

This folder contains example values file and their rendered templates. It assumes you have added the llm-d-modelservice repository to Helm:

helm repo add llm-d-modelservice https://llm-d-incubation.github.io/llm-d-modelservice/
helm repo update

Available Examples

Example	Description	Hardware Requirements
`values-cpu.yaml`	CPU-only inference example	Single node, no GPU required
`values-pd.yaml`	Prefill/decode disaggregation example	Multi-GPU, demonstrates P/D splitting
`values-xpu.yaml`	Intel XPU single-node example	Intel Data Center GPU Max
`pvc/`	Persistent volume examples	Shows different storage options
`dra/`	Dynamic Resource Allocation (DRA) examples	Shows different DRA use cases

All the examples assume a Gateway and GAIE configuration have been deployed. See the llm-d guides for examples. Further, an HTTPRoute must be deployed. Some examples of HTTPRoute is provided below.

Usage Examples

1. CPU-only

Dry run:

helm template cpu-sim llm-d-modelservice/llm-d-modelservice -f https://raw.githubusercontent.com/llm-d-incubation/llm-d-modelservice/refs/heads/main/examples/values-cpu.yaml

To install, use helm install instead of helm template:

helm install cpu-sim llm-d-modelservice/llm-d-modelservice -f https://raw.githubusercontent.com/llm-d-incubation/llm-d-modelservice/refs/heads/main/examples/values-cpu.yaml

2. P/D disaggregation

Dry-run:

helm template pd llm-d-modelservice/llm-d-modelservice -f https://raw.githubusercontent.com/llm-d-incubation/llm-d-modelservice/refs/heads/main/examples/values-pd.yaml

or install in a cluster

helm install pd llm-d-modelservice/llm-d-modelservice -f https://raw.githubusercontent.com/llm-d-incubation/llm-d-modelservice/refs/heads/main/examples/values-pd.yaml

3. Loading a model from a PVC

See this README.

4. Intel XPU Examples

For Intel XPU (Data Center GPU Max) deployments:

Deploy the intel-gpu-plugin daemonset.

kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin?ref=v0.30.0'

Single-node XPU deployment.

helm install llm-xpu llm-d-modelservice/llm-d-modelservice -f values-xpu.yaml --namespace llm-d --create-namespace

Get the name of decode pod.

kubectl get pods -n llm-d -l llm-d.ai/role=decode

HTTPRoute Examples

An HTTPRoute maps requests through a Gateway to an InferencePool which is, in turn, tied (via match labels) to a particular set of model servers. Here are two examples.

Example: Route all requests to the same model

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mymodel-httproute
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: INSERT_GATEWAY_NAME
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: INSERT_INFERENCEPOOL_NAME
      port: 8000
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /

For example, to call the OpenAI completions API, use mymodel/v1/completions

Example: Route requests with modified path

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: myhttproute
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: INSERT_GATEWAY_NAME
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: INSERT_INFERENCEPOOL_NAME
      port: 8000
      weight: 1
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          replacePrefixMatch: /
          type: ReplacePrefixMatch
    matches:
    - path:
        type: PathPrefix
        value: /mymodel/

This route supports requests with the prefix mymodel/; for example, to call the OpenAI completions API, requests would be sent to: mymodel/v1/completions. The HTTPRoute maps rewrites such requests to v1/completions for the target model server.

6. Dynamic Resource Allocation

When accelerator.dra is true, accelerator resource (gpu) requirements are specified using Dynamic Resource Allocation. In particular, the accelerator.type is used to identify a ResourceClaimTemplate to create (from accelerator.resourceClaimTemplates). The vllm containers use resources.claims instead of resources.limits to request the necessary resources. For example, see values-dra.yaml.

Troubleshooting:

Differences between your environment and that in which the above examples were tested may mean the need to modify the input values files. Some common examples we are seen are:

Is the inference gateway listed in routing.parentRefs correct?
Do the labels/values in acceleratorTypes match those assigned to nodes in your cluster?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

README.md

Examples

Available Examples

Usage Examples

1. CPU-only

2. P/D disaggregation

3. Loading a model from a PVC

4. Intel XPU Examples

HTTPRoute Examples

Example: Route all requests to the same model

Example: Route requests with modified path

6. Dynamic Resource Allocation

Troubleshooting:

Uh oh!

FilesExpand file tree

examples

Directory actions

More options

Directory actions

More options

Latest commit

History

examples

Folders and files

parent directory

README.md

Examples

Available Examples

Usage Examples

1. CPU-only

2. P/D disaggregation

3. Loading a model from a PVC

4. Intel XPU Examples

HTTPRoute Examples

Example: Route all requests to the same model

Example: Route requests with modified path

6. Dynamic Resource Allocation

Troubleshooting: