Deploy multiple vLLM instances across separate GPU nodes with llm-d orchestrating intelligent request routing (prefix-cache-aware, queue-based, active-request scoring). Optionally integrate with Llama Stack for the Responses API.
Client → Llama Stack (optional) → Gateway → llm-d scheduler → vLLM pods (N x GPU nodes)
llm-d is a routing and orchestration layer, not a model-sharding tool. Each vLLM instance holds a complete copy of the model. llm-d intelligently routes requests across replicas to maximize KV cache hits and balance load.
| Requirement | Description |
|---|---|
| OpenShift cluster | 4.19+ with ROSA HCP or self-managed |
| RHOAI 3.4+ | Red Hat OpenShift AI operator (fast-3.x channel) |
| NFD Operator | Node Feature Discovery — detects GPU hardware. Create a NodeFeatureDiscovery instance after installing. |
| NVIDIA GPU Operator | Installs drivers and device plugins. Create a ClusterPolicy after installing. |
| Red Hat Connectivity Link | Provides Kuadrant/Authorino for auth — required by RHOAI 3.4 MaaS. Install from OperatorHub. |
| Authorino TLS configured | Authorino must have TLS enabled with a valid certificate. See RHOAI documentation. |
| PostgreSQL database | MaaS controller requires a maas-db-config secret in redhat-ods-applications with a DB_CONNECTION_URL key pointing to a PostgreSQL instance. See RHOAI documentation. |
| User Workload Monitoring | Must be enabled on the cluster for MaaS metrics. See OpenShift monitoring documentation. |
| Models-as-a-Service enabled | Must be enabled in the DataScienceCluster (see below) |
| Red Hat registry pull secret | Access to registry.redhat.io for vLLM images |
Note: The MaaS controller will fail to start if Authorino TLS, the PostgreSQL database secret, or User Workload Monitoring are not configured. Check the
maas-controllerpod logs for specific error messages if provisioning fails.
oc patch dsc default-dsc --type merge \
-p '{"spec":{"components":{"kserve":{"modelsAsService":{"managementState":"Managed"}}}}}'The following values are specific to your environment. Replace all <PLACEHOLDER> values in the commands and YAML files below with your own.
| Placeholder | Description | Example |
|---|---|---|
<CLUSTER_NAME> |
Your ROSA cluster name | my-cluster |
<REGION> |
AWS region where the cluster runs | us-east-2 |
<MODEL_URI> |
HuggingFace model URI (must fit on a single GPU) | hf://openai/gpt-oss-20b |
<MODEL_NAME> |
Model identifier used in API requests | openai/gpt-oss-20b |
<REPLICAS> |
Number of vLLM replicas (one per GPU node) | 6 |
<INSTANCE_TYPE> |
GPU instance type for the node pool | g6.xlarge (1x L4 23GB) |
<NODE_POOL_NAME> |
Label for GPU node pool scheduling | gpu-llmd-nodes |
<PULL_SECRET_FILE> |
Path to your Red Hat registry pull secret YAML | my-pull-secret.yaml |
<PULL_SECRET_NAME> |
Name of the pull secret in the cluster | my-pull-secret |
<VLLM_IMAGE> |
vLLM container image — get the latest digest from the Red Hat Ecosystem Catalog | registry.redhat.io/rhaiis/vllm-cuda-rhel9@sha256:... |
<GATEWAY_HOST> |
The maas-default-gateway external hostname (auto-assigned ELB or manually configured) | a1b2c3.us-east-2.elb.amazonaws.com |
<LLAMASTACK_ROUTE> |
Llama Stack external route hostname (if using Llama Stack) | llamastack-route-redhat-ods-applications.apps.example.com |
Choose a model that fits entirely on a single GPU — llm-d does not shard models across GPUs. Each replica loads the full model.
| Model | Size | Min GPU VRAM | Notes |
|---|---|---|---|
openai/gpt-oss-20b |
~20B params | ~16 GB (quantized) | Tested with this guide on L4 (23GB) nodes |
meta-llama/Llama-3.1-8B-Instruct |
8B params | ~8 GB (FP8) | Smaller model, works on most GPU types |
Set <MODEL_URI> to hf://<model_name> (e.g., hf://openai/gpt-oss-20b) and <MODEL_NAME> to the model identifier used in API requests (e.g., openai/gpt-oss-20b).
Create a machine pool with GPU nodes. Each node should have at least one GPU with enough VRAM to hold your model.
rosa create machinepool --cluster <CLUSTER_NAME> \
--name <NODE_POOL_NAME> \
--instance-type <INSTANCE_TYPE> \
--replicas <REPLICAS> \
--labels "node-pool=<NODE_POOL_NAME>" \
--region <REGION>Wait for all nodes to be ready and GPUs detected:
oc get nodes -l node-pool=<NODE_POOL_NAME> \
-o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu,GPU_PRODUCT:.metadata.labels.nvidia\.com/gpu\.product'Apply your Red Hat registry pull secret and link it to the default service account:
oc apply -f <PULL_SECRET_FILE> -n redhat-ods-applications
oc secrets link default <PULL_SECRET_NAME> --for=pull -n redhat-ods-applicationsDeploy the LLMInferenceService in the redhat-ods-applications namespace. This is required because the data-science-gateway only allows routes from this namespace.
A ready-to-use YAML template is at infrastructure/llm-d/llminferenceservice.yaml. Edit the placeholders and apply:
oc apply -f infrastructure/llm-d/llminferenceservice.yamlImportant: spec.router.scheduler: {} must be explicitly set. Without it, the controller skips scheduler/router creation and you get vLLM pods but no intelligent routing.
Gateway choice: The YAML uses maas-default-gateway instead of data-science-gateway. The data-science-gateway has an OAuth proxy designed for browser access, which blocks programmatic API calls. The maas-default-gateway has no OAuth proxy and is suitable for API clients like Llama Stack.
The default RHOAI network policy blocks port 8000. Apply the required network policies:
oc apply -f infrastructure/llm-d/network-policies.yamlSee infrastructure/llm-d/network-policies.yaml for details.
If integrating with Llama Stack, set these environment variables in your Llama Stack deployment:
| Variable | Value | Notes |
|---|---|---|
VLLM_URL |
http://<GATEWAY_HOST>/redhat-ods-applications/<SERVICE_NAME>/v1 |
Must end with /v1 — the OpenAI SDK appends /models, /chat/completions, etc. to this base URL |
VLLM_TLS_VERIFY |
false |
Required when using self-signed certs |
VLLM_API_KEY |
(leave empty) | Not needed — the maas-default-gateway has no auth |
To find your <GATEWAY_HOST>:
oc get llminferenceservice <SERVICE_NAME> -n redhat-ods-applications -o jsonpath='{.status.url}'If deploying Llama Stack in the same namespace with an external route, apply the Llama Stack network policy from network-policies.yaml to allow ingress from the OpenShift router.
# LLMInferenceService should show Ready: True
oc get llminferenceservice -n redhat-ods-applications
# Expected: N vLLM pods (1/1 Running) + 1 router-scheduler pod (2/2 Running)
oc get pods -n redhat-ods-applications | grep <SERVICE_NAME>
# InferencePool should exist
oc get inferencepool -n redhat-ods-applications# Direct to llm-d gateway
curl -s http://<GATEWAY_HOST>/redhat-ods-applications/<SERVICE_NAME>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"<MODEL_NAME>","messages":[{"role":"user","content":"Hello"}],"max_tokens":10}'
# Through Llama Stack (if configured)
curl -s https://<LLAMASTACK_ROUTE>/v1/responses \
-H "Content-Type: application/json" \
-d '{"model":"<MODEL_NAME>","input":"Who is the president of the United States?"}'Use the included test script to validate llm-d's intelligent routing:
uv run --with aiohttp python infrastructure/llm-d/test_distribution.py \
--url http://<GATEWAY_HOST>/redhat-ods-applications/<SERVICE_NAME>/v1 \
--model <MODEL_NAME>See infrastructure/llm-d/test_distribution.py for details.
- llm-d is a routing/orchestration layer. Each vLLM instance holds a complete copy of the model. llm-d routes requests intelligently (prefix-cache-aware, queue-based, active-request scoring) rather than round-robin.
- LLMInferenceService is the RHOAI-native way to deploy llm-d. It manages vLLM pods, the scheduler, InferencePool, and HTTPRoute automatically.
- RHOAI 3.4+ MaaS requires Red Hat Connectivity Link (Kuadrant/Authorino) for auth and rate limiting.
- Use
maas-default-gatewayfor programmatic/API access. Thedata-science-gatewayhas an OAuth proxy meant for browser access. - Custom NetworkPolicies are required. The default RHOAI network policy blocks port 8000, which vLLM uses for serving.
- Choose a model that fits on a single GPU. llm-d does not shard models across GPUs — each replica needs the full model in VRAM.