Direct vLLM runs the OpenAI-compatible vllm serve server directly as Kubernetes Deployment and Service resources. Use it when you want the newest vLLM model support or need a specific vLLM launch image before a managed provider has caught up.
For a less hands-on experience, prefer a managed provider such as Dynamo, KubeRay, KAITO, or llm-d when it supports your model and serving mode.
Use Direct vLLM when:
- the model is supported by vLLM but not yet available through another provider path;
- you need to choose a vLLM nightly, stable, or custom launch image;
- an official vLLM recipe provides known-good flags for the exact Hugging Face model ID;
- you want a plain Kubernetes Deployment instead of a provider-specific upstream CRD.
Avoid Direct vLLM when you need provider-managed routing, autoscaling, or production guardrails that are specific to Dynamo, KubeRay, KAITO, or llm-d.
Install the core controller first, then install the Direct vLLM provider shim.
If you have the repository checked out, use the provider Makefile (it sets the
image via kustomize and tracks the default vLLM tag from versions.env):
make controller-deploy
make -C providers/vllm deploy
Otherwise, apply the published manifests directly:
kubectl apply -f https://raw.githubusercontent.com/kaito-project/airunway/main/deploy/controller.yaml
kubectl apply -f https://raw.githubusercontent.com/kaito-project/airunway/main/providers/vllm/deploy/vllm.yaml
The vLLM provider registers an InferenceProviderConfig named vllm. The Web UI shows it as Direct vLLM after registration.
For gated models, create a Kubernetes secret in the same namespace as the ModelDeployment:
kubectl create secret generic vllm-hf-token \
  --from-literal=HF_TOKEN=<your-token> \
  -n <model-namespace>
Reference it from the deployment:
spec:
  secrets:
    huggingFaceToken: vllm-hf-token
Use explicit provider selection when you specifically want Direct vLLM:
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: phi4-direct-vllm
  namespace: default
spec:
  provider:
    name: vllm
  model:
    id: microsoft/Phi-4-mini-instruct
    source: huggingface
  engine:
    type: vllm
    image: vllm/vllm-openai:cu130-nightly
    args:
      tensor-parallel-size: "1"
  resources:
    gpu:
      count: 1
spec.engine.image is the preferred image override for Direct vLLM. The older top-level spec.image field still exists for compatibility, but do not set both to different values.
Direct vLLM renders a single OpenAI-compatible server container with production-oriented defaults:
- The vLLM server listens on 0.0.0.0:8000.
- The generated container, Service, and health probes all use port 8000.
- Startup, readiness, and liveness probes call the vLLM /health endpoint.
- The startup probe has a long failure window so large models have time to load before liveness/readiness checks begin.
- Multi-GPU deployments mount a memory-backed emptyDir at /dev/shm for tensor-parallel execution. The default shared-memory size is 20Gi.
Do not set host or port in spec.engine.args or spec.engine.extraArgs for Direct vLLM. Those flags are generated by the provider so the server, Service, and probes stay aligned.
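As a rough illustration of the rule above, a two-GPU deployment might carry only model-tuning flags in spec.engine.args and leave the network flags to the provider. This is a sketch, not a recommended configuration: the max-model-len value is arbitrary, and matching tensor-parallel-size to the GPU count is an assumption based on how vLLM tensor parallelism is normally configured.

```yaml
spec:
  provider:
    name: vllm
  engine:
    type: vllm
    args:
      # Tuning flags go here, written without leading dashes and as quoted strings.
      tensor-parallel-size: "2"   # assumed to match resources.gpu.count
      max-model-len: "8192"       # illustrative value only
      # No host or port entries: the provider generates them so the
      # server, Service, and probes all agree on 0.0.0.0:8000.
  resources:
    gpu:
      count: 2                    # more than one GPU, so /dev/shm is mounted
```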
spec.provider.overrides is a deep-merge escape hatch: its contents are merged into the generated Deployment so you can set fields the structured spec does not expose. The provider blocks the top-level apiVersion, kind, metadata, and status keys, but everything under spec.template.spec is mergeable — including securityContext, hostPath volumes, added Linux capabilities, serviceAccountName, and extra containers.
This means a user who can create a ModelDeployment can influence the resulting pod's security posture (for example, set runAsNonRoot: false or mount a host path) even if they cannot create Deployment/Pod objects directly.
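To make the risk concrete, the fragment below sketches what an overrides block could look like. It assumes the keys under overrides mirror the Deployment schema starting at spec, as the merge description implies. The ServiceAccount name and node label are hypothetical, and the securityContext entry is included only to show the kind of change a ModelDeployment author can make.

```yaml
spec:
  provider:
    name: vllm
    overrides:
      # Deep-merged into the generated Deployment.
      spec:
        template:
          spec:
            serviceAccountName: model-runner   # hypothetical ServiceAccount
            nodeSelector:
              gpu-pool: a100                   # hypothetical node label
            securityContext:
              runAsNonRoot: false              # weakens pod security; illustration of the risk only
```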
Treat ModelDeployment creation as a privileged, RBAC-gated action on clusters where pod-level security matters. If lower-trust users are allowed to create ModelDeployments, gate spec.provider.overrides with an external admission policy (e.g. OPA/Gatekeeper or Kyverno) or a Pod Security Standard that rejects privileged pod specs. This escape-hatch behavior matches the other in-repo providers (llm-d, Dynamo); it is intentional, not a Direct vLLM-specific weakness.
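One possible shape for such a gate is sketched below. It uses Kubernetes' built-in ValidatingAdmissionPolicy (admissionregistration.k8s.io/v1) rather than the Gatekeeper or Kyverno options named above, purely because it needs no extra components. The policy names are hypothetical, and the resource plural modeldeployments is an assumption about how the CRD is registered; check it with kubectl api-resources before relying on this.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: deny-modeldeployment-overrides      # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["airunway.ai"]
        apiVersions: ["v1alpha1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["modeldeployments"]     # assumed plural of ModelDeployment
  validations:
    # Reject any ModelDeployment that sets spec.provider.overrides.
    - expression: "!has(object.spec.provider) || !has(object.spec.provider.overrides)"
      message: "spec.provider.overrides is not allowed for this cluster"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: deny-modeldeployment-overrides      # hypothetical name
spec:
  policyName: deny-modeldeployment-overrides
  validationActions: ["Deny"]
```

A binding without matchResources applies cluster-wide. To restrict only lower-trust tenants, the binding could be scoped with a namespace selector instead.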
Direct vLLM advertises aggregated serving only. Use Dynamo, KubeRay, or llm-d for disaggregated prefill/decode serving.
The provider contains internal experimental prefill/decode rendering code, but it is not advertised as a supported capability because production-grade disaggregated vLLM typically also needs router/orchestration and KV-transfer plumbing. Do not rely on Direct vLLM for disaggregated serving until that path is promoted with end-to-end coverage.
The Web UI can look up official recipes from recipes.vllm.ai for an exact Hugging Face model ID match. When you apply a recipe, Airunway materializes the recipe into normal deployment fields:
- spec.engine.image
- spec.engine.args
- spec.engine.extraArgs
- spec.env
- GPU resource defaults
- recipe provenance annotations under metadata.annotations["airunway.ai/recipe.*"]
The controller does not fetch recipes during reconciliation. Recipe provenance annotations are informational; they do not replace the materialized image, args, env, or resources.
Server-side recipe alternative references are restricted to the configured recipes base URL before fetching.
Direct vLLM is explicit-only: it advertises no selection rules, so the controller never auto-selects it. Managed providers such as Dynamo and KubeRay remain the auto-selected defaults for GPU vLLM workloads. To use Direct vLLM, set the provider explicitly:
spec:
  provider:
    name: vllm
The provider records selected image details in status.image, including the requested image, resolved digest when available, source classification, and verification status. Digest resolution is reused when the requested image has not changed and the status already contains a resolved digest.
status.image.source is derived from the image reference, and UnsupportedImage is set only for custom:
- nightly: the provider default image, or an official vllm/vllm-openai repository tag containing nightly.
- stable: the official vllm/vllm-openai repository with the latest tag.
- launch: any tag containing launch (frozen launch-image snapshots).
- custom: anything else; this sets UnsupportedImage=True (the image is deployed as-is but is not recognized as a provider image).
A non-official repository or an unrecognized tag classifies as custom by design: the deployment still runs, and the condition is advisory. If you ship launch images under a different tag convention, include launch in the tag (or expect a custom classification).
Be aware of two behaviors when relying on tag-based images:
- The default image is pinned to a registry digest on the first reconcile. If the container registry is unreachable at that moment, digest resolution fails and the deployment does not create a pod until resolution succeeds. This is deliberate, because the default image is pinned for reproducibility, but it means the first reconcile is coupled to registry availability. If you need to deploy during a registry outage, set spec.engine.image to an image/tag you have already pulled.
- A resolved digest is not refreshed while the requested image is unchanged. Once a tag such as :cu130-nightly has been pinned to a digest, the provider keeps reusing that digest and does not re-pull the moving tag on later reconciles. To pick up a newer nightly build, change spec.engine.image (for example, pin a dated tag or digest, then move it) so the requested image differs and digest resolution runs again.
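One way to make image updates explicit is to request a digest reference instead of a moving tag, so every upgrade is a visible change to spec.engine.image. The fragment below is a sketch that assumes the provider accepts standard repository@sha256 references; the digest is left as a placeholder to fill in with the build you have verified.

```yaml
spec:
  provider:
    name: vllm
  engine:
    type: vllm
    # Placeholder digest: substitute the digest of the build you want to run.
    # Changing this value makes the requested image differ, which triggers
    # digest resolution again on the next reconcile.
    image: vllm/vllm-openai@sha256:<digest>
```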
If a deployment is stuck because the default image could not be resolved, check status.image.message and the ImageResolved condition, then either restore registry connectivity or switch to a locally available spec.engine.image.