This guide demonstrates how to deploy DeepSeek-R1-0528 using vLLM's P/D disaggregation support with NIXL in a wide expert parallel pattern with LeaderWorkerSets. This guide has been validated on:
- a 32xH200 cluster with InfiniBand networking
- a 32xH200 cluster on GKE with RoCE networking
- a 32xB200 cluster on GKE with RoCE networking
WARNING: We are still investigating and optimizing performance for other hardware and networking configurations.
In this example, we will demonstrate a deployment of DeepSeek-R1-0528 with:
- 1 DP=16 Prefill Worker
- 1 DP=16 Decode Worker
This guide requires 32 NVIDIA H200 or B200 GPUs (16 for each worker) and InfiniBand or RoCE RDMA networking. Check modelserver/base/decode.yaml and modelserver/base/prefill.yaml for detailed resource requirements.
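As a rough pre-flight check for the 32-GPU requirement, the sketch below sums the allocatable `nvidia.com/gpu` count across nodes. The helper name `sum_gpus` is hypothetical, and the commented `kubectl` pipeline assumes your nodes advertise GPUs under the `nvidia.com/gpu` resource name.

```shell
# Sum GPU counts read from stdin, one value per line.
# Non-numeric entries (for example "<none>" on CPU-only nodes) are ignored.
sum_gpus() {
  awk '$1 ~ /^[0-9]+$/ { total += $1 } END { print total + 0 }'
}

# Assumed usage against a live cluster (requires kubectl access):
#   kubectl get nodes --no-headers \
#     -o custom-columns='GPU:.status.allocatable.nvidia\.com/gpu' | sum_gpus
# The result should be at least 32 for this guide.

# Offline demonstration: four nodes with 8 GPUs each.
printf '8\n8\n8\n8\n' | sum_gpus   # prints 32
```

The count alone does not confirm RDMA connectivity between nodes; it only shows whether enough GPUs are schedulable.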
- Have the proper client tools installed on your local system to use this guide.
- Ensure your cluster infrastructure is sufficient to deploy high-scale inference:
  - You must have high speed inter-accelerator networking.
  - The pods leveraging inter-node EP must be deployed in a cluster environment with full mesh network connectivity.
    - NOTE: The DeepEP backend used in WideEP requires All-to-All RDMA connectivity. Every NIC on a host must be able to communicate with every NIC on all other hosts. Networks restricted to communicating only between matching NIC IDs (rail-only connectivity) will fail.
  - You have deployed the LeaderWorkerSet optional controller.
- Configure and deploy your Gateway control plane.
- Have the Monitoring stack installed on your system.
- Create a namespace for installation.
export NAMESPACE=llm-d-wide-ep # or any other namespace (shorter names recommended)
kubectl create namespace ${NAMESPACE}
- Create the llm-d-hf-token secret in your target namespace with the key HF_TOKEN matching a valid HuggingFace token to pull models.
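One possible way to create that secret is sketched below. The wrapper name `create_hf_token_secret` is hypothetical; the secret name `llm-d-hf-token` and key `HF_TOKEN` come from the step above, and the token value is assumed to be available in your shell.

```shell
# Create the llm-d-hf-token secret with a single HF_TOKEN key.
# Arguments: $1 = target namespace, $2 = HuggingFace token value.
create_hf_token_secret() {
  local namespace="$1"
  local token="$2"
  kubectl create secret generic llm-d-hf-token \
    --from-literal="HF_TOKEN=${token}" \
    --namespace "${namespace}"
}

# Assumed usage, with HF_TOKEN already exported in your shell:
#   create_hf_token_secret "${NAMESPACE}" "${HF_TOKEN}"
```

Passing the token as a literal keeps the example short; reading it from a file or a secret manager may suit shared environments better.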
cd guides/wide-ep-lws/

GKE and CoreWeave are tested Kubernetes providers for this well-lit path. You can customize the manifests if you run on other Kubernetes providers. Apply only the manifests that match your provider:

# Deploy on GKE
kubectl apply -k ./manifests/modelserver/gke -n ${NAMESPACE}
# Deploy on GKE for B200 on the a4 instance type to work around a known vLLM memory issue
kubectl apply -k ./manifests/modelserver/gke-a4 -n ${NAMESPACE}
# Deploy on CoreWeave
kubectl apply -k ./manifests/modelserver/coreweave -n ${NAMESPACE}

Install the InferencePool with the Helm command below that matches your gateway provider.
WARNING: kgateway is deprecated in llm-d and will be removed in the next release. Prefer agentgateway for new self-installed inference deployments. The current Gateway API Inference Extension chart uses provider.name=none for the agentgateway path; see the upstream inferencepool chart values for v1.4.0.
# GKE
helm install llm-d-infpool \
  -n ${NAMESPACE} \
  -f ./manifests/inferencepool.values.yaml \
  --set "provider.name=gke" \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
  --version v1.4.0

# Istio
helm install llm-d-infpool \
  -n ${NAMESPACE} \
  -f ./manifests/inferencepool.values.yaml \
  --set "provider.name=istio" \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
  --version v1.4.0

# agentgateway (uses provider.name=none)
helm install llm-d-infpool \
  -n ${NAMESPACE} \
  -f ./manifests/inferencepool.values.yaml \
  --set "provider.name=none" \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
  --version v1.4.0

Deploy the Gateway and HTTPRoute using the gateway recipe.
To see which gateway options are supported, refer to our gateway provider prereq doc. Gateway configurations per provider are tracked in the gateway-configurations directory.
You can also customize your gateway. For more information, see our gateway customization docs.
As with PD, the wide-ep-lws guide supports selective PD. For more information, refer to the corresponding section of the PD docs.
- First, you should be able to list all Helm releases installed into your chosen namespace:
helm list -n ${NAMESPACE}
NAME           NAMESPACE      REVISION  UPDATED                               STATUS    CHART                 APP VERSION
llm-d-infpool  llm-d-wide-ep  1         2025-08-24 13:14:53.355639 -0700 PDT  deployed  inferencepool-v1.4.0  v0.3.0
- Out of the box with this example you should have the following resources (if using Istio):
kubectl get all -n ${NAMESPACE}
NAME                                                         READY   STATUS    RESTARTS   AGE
pod/infra-wide-ep-inference-gateway-istio-74d5c66c86-h5mfn   1/1     Running   0          2m22s
pod/wide-ep-llm-d-decode-0                                   2/2     Running   0          2m13s
pod/wide-ep-llm-d-decode-0-1                                 2/2     Running   0          2m13s
pod/llm-d-infpool-epp-84dd98f75b-r6lvh                       1/1     Running   0          2m14s
pod/wide-ep-llm-d-prefill-0                                  1/1     Running   0          2m13s
pod/wide-ep-llm-d-prefill-0-1                                1/1     Running   0          2m13s

NAME                                            TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                        AGE
service/infra-wide-ep-inference-gateway-istio   ClusterIP   10.16.1.34    10.16.4.2     15021:30312/TCP,80:33662/TCP   2m22s
service/wide-ep-ip-1e480070                     ClusterIP   None          <none>        54321/TCP                      2d4h
service/wide-ep-llm-d-decode                    ClusterIP   None          <none>        <none>                         2m13s
service/llm-d-infpool-epp                       ClusterIP   10.16.1.137   <none>        9002/TCP                       2d4h
service/wide-ep-llm-d-prefill                   ClusterIP   None          <none>        <none>                         2m13s

NAME                                                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/infra-wide-ep-inference-gateway-istio   1/1     1            1           2m22s
deployment.apps/llm-d-infpool-epp                       1/1     1            1           2m14s

NAME                                                               DESIRED   CURRENT   READY   AGE
replicaset.apps/infra-wide-ep-inference-gateway-istio-74d5c66c86   1         1         1       2m22s
replicaset.apps/llm-d-infpool-epp-55bb9857cf                       1         1         1       2m14s

NAME                                       READY   AGE
statefulset.apps/wide-ep-llm-d-decode      1/1     2m13s
statefulset.apps/wide-ep-llm-d-decode-0    1/1     2m13s
statefulset.apps/wide-ep-llm-d-prefill     1/1     2m13s
statefulset.apps/wide-ep-llm-d-prefill-1   1/1     2m13s

NOTE: This assumes no other guide deployments in your given ${NAMESPACE} and you have not changed the default release names via the ${RELEASE_NAME} environment variable.
For instructions on getting started with inference requests, see our docs.
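For a quick smoke test before reading those docs, a single request can be sent through the gateway roughly as follows. This is a sketch under assumptions: `send_completion` is a hypothetical helper, the model is assumed to be served under its HuggingFace ID `deepseek-ai/DeepSeek-R1-0528`, and the gateway is assumed to expose vLLM's OpenAI-compatible `/v1/completions` route over plain HTTP.

```shell
# Send one completion request through the gateway.
# Arguments: $1 = gateway address (host or host:port), $2 = prompt text.
# The prompt is placed into JSON verbatim, so it must not contain double
# quotes or backslashes. Set MODEL to override the assumed model name.
send_completion() {
  local gateway="$1"
  local prompt="$2"
  local model="${MODEL:-deepseek-ai/DeepSeek-R1-0528}"
  curl -sS "http://${gateway}/v1/completions" \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"${model}\", \"prompt\": \"${prompt}\", \"max_tokens\": 32}"
}

# Assumed usage, with the gateway address in GATEWAY_IP:
#   send_completion "${GATEWAY_IP}" "Hello"
```

If the request fails right after installation, the servers may still be starting.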
NOTE: This example particularly benefits from using stern as described in the getting-started-inferencing docs, because although there are only 3 inferencing pods, they run 16 vLLM servers (ranks).
NOTE: Compared to the other examples, this one takes between 7 and 10 minutes for the vLLM API servers to start up, so it may take longer before you can interact with this example.
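Rather than polling by hand during that startup window, you could block until the pods report Ready. The sketch below wraps `kubectl wait`; the helper name `wait_for_pods` and the 15-minute default timeout are assumptions chosen to leave some margin above the 7 to 10 minutes noted above.

```shell
# Block until every pod in the namespace is Ready, or the timeout expires.
# Arguments: $1 = namespace, $2 = timeout (optional, defaults to 15m).
wait_for_pods() {
  local namespace="$1"
  local timeout="${2:-15m}"
  kubectl wait pod --all \
    --for=condition=Ready \
    --namespace "${namespace}" \
    --timeout="${timeout}"
}

# Assumed usage:
#   wait_for_pods "${NAMESPACE}"
# To follow logs from all decode ranks while waiting (requires stern):
#   stern wide-ep-llm-d-decode -n "${NAMESPACE}"
```

Pod readiness depends on the readiness probes defined in the manifests, so a Ready pod is a good signal but not a guarantee that every rank is serving.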
We deployed the default wide-ep-lws user guide on GKE (./manifests/modelserver/gke-a4).
- Provider: GKE
- Prefill: 1 instance with EP=16
- Decode: 1 instance with EP=16
- 4 a4-highgpu-8g VMs, 32 GPUs
We use the inference-perf benchmark tool to generate random datasets with 1K input length and 1K output length. This benchmark targets the batch use case: we look for the maximum throughput by sweeping request rates from low to high, up to 250 QPS.
- Deploy the wide-ep-lws stack following the Installation steps above. Once the stack is ready, obtain the gateway IP:
export GATEWAY_IP=$(kubectl get gateway/llm-d-inference-gateway -n ${NAMESPACE} -o jsonpath='{.status.addresses[0].value}')
- Follow the benchmark guide to deploy the benchmark tool and analyze the benchmark results. Notably, select the corresponding benchmark template:
export BENCHMARK_TEMPLATE="${BENCH_TEMPLATE_DIR}"/wide_ep_template.yaml
At request rate 250, we achieved the max throughput:
"throughput": {
"input_tokens_per_sec": 51218.79261732335,
"output_tokens_per_sec": 49783.58426326592,
"total_tokens_per_sec": 101002.37688058926,
"requests_per_sec": 50.02468992880545
}
This equals roughly 3,200 input tokens/s/GPU and 3,100 output tokens/s/GPU, dividing the input rate by the 16 prefill GPUs and the output rate by the 16 decode GPUs.
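The per-GPU figures can be reproduced with a small calculation. The sketch below assumes the input rate is attributed to the 16 prefill GPUs and the output rate to the 16 decode GPUs; `per_gpu` is a hypothetical helper name.

```shell
# Divide an aggregate tokens/s figure by a GPU count and round to the
# nearest whole token.
# Arguments: $1 = tokens per second, $2 = number of GPUs.
per_gpu() {
  awk -v rate="$1" -v gpus="$2" 'BEGIN { printf "%.0f\n", rate / gpus }'
}

per_gpu 51218.79261732335 16   # input tokens/s per prefill GPU, prints 3201
per_gpu 49783.58426326592 16   # output tokens/s per decode GPU, prints 3111
```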
To remove the deployment:
# From guides/wide-ep-lws
helm uninstall llm-d-infpool -n ${NAMESPACE}
kubectl delete -k ./manifests/modelserver/<gke|gke-a4|coreweave> -n ${NAMESPACE}
# Supported self-installed inference gateway recipe paths are agentgateway (preferred) and kgateway (deprecated migration path).
kubectl delete -k ../recipes/gateway/<gke-l7-regional-external-managed|istio|agentgateway|agentgateway-openshift|kgateway|kgateway-openshift> -n ${NAMESPACE}

For information on customizing a guide and tips to build your own, see our docs.

