Direct vLLM Provider

Direct vLLM runs vLLM's OpenAI-compatible server (vllm serve) directly as a Kubernetes Deployment and Service. Use it when you want the newest vLLM model support or need a specific vLLM launch image before a managed provider has caught up.

For a less hands-on experience, prefer a managed provider such as Dynamo, KubeRay, KAITO, or llm-d when it supports your model and serving mode.

When to use Direct vLLM

Use Direct vLLM when:

the model is supported by vLLM but not yet available through another provider path;
you need to choose a vLLM nightly, stable, or custom launch image;
an official vLLM recipe provides known-good flags for the exact Hugging Face model ID;
you want a plain Kubernetes Deployment instead of a provider-specific upstream CRD.

Avoid Direct vLLM when you need provider-managed routing, autoscaling, or production guardrails that are specific to Dynamo, KubeRay, KAITO, or llm-d.

Install the provider shim

Install the core controller first, then install the Direct vLLM provider shim.

If you have the repository checked out, use the provider Makefile (it sets the image via kustomize and tracks the default vLLM tag from versions.env):

make controller-deploy
make -C providers/vllm deploy

Otherwise, apply the published manifests directly:

kubectl apply -f https://raw.githubusercontent.com/kaito-project/airunway/main/deploy/controller.yaml
kubectl apply -f https://raw.githubusercontent.com/kaito-project/airunway/main/providers/vllm/deploy/vllm.yaml

The vLLM provider registers an InferenceProviderConfig named vllm. The Web UI shows it as Direct vLLM after registration.

Hugging Face token secret

For gated models, create a Kubernetes secret in the same namespace as the ModelDeployment:

kubectl create secret generic vllm-hf-token \
  --from-literal=HF_TOKEN=<your-token> \
  -n <model-namespace>

Reference it from the deployment:

spec:
  secrets:
    huggingFaceToken: vllm-hf-token

Basic deployment

Use explicit provider selection when you specifically want Direct vLLM:

apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: phi4-direct-vllm
  namespace: default
spec:
  provider:
    name: vllm
  model:
    id: microsoft/Phi-4-mini-instruct
    source: huggingface
  engine:
    type: vllm
    image: vllm/vllm-openai:cu130-nightly
    args:
      tensor-parallel-size: "1"
  resources:
    gpu:
      count: 1

spec.engine.image is the preferred image override for Direct vLLM. The older top-level spec.image field still exists for compatibility, but do not set both to different values.

Operational defaults

Direct vLLM renders a single OpenAI-compatible server container with production-oriented defaults:

The vLLM server listens on 0.0.0.0:8000.
The generated container, Service, and health probes all use port 8000.
Startup, readiness, and liveness probes call the vLLM /health endpoint.
The startup probe has a long failure window so large models have time to load before liveness/readiness checks begin.
Multi-GPU deployments mount a memory-backed emptyDir at /dev/shm for tensor-parallel execution. The default shared-memory size is 20Gi.

Do not set host or port in spec.engine.args or spec.engine.extraArgs for Direct vLLM. Those flags are generated by the provider so the server, Service, and probes stay aligned.
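The rule above lends itself to a pre-flight check before a manifest is applied. The following is a minimal sketch, not part of the provider: the function names are hypothetical, spec.engine.args is assumed to be a flag-to-value map (as in the basic deployment example), and spec.engine.extraArgs is assumed to be a list of raw CLI tokens.

```python
from __future__ import annotations

from typing import Optional

# Flags that the provider generates itself, per the guidance above.
RESERVED_FLAGS = {"host", "port"}


def _flag_name(token: str) -> Optional[str]:
    """Return the bare flag name for a CLI token such as '--port=9000'."""
    if not token.startswith("--"):
        return None  # a value token, not a flag
    return token[2:].split("=", 1)[0]


def find_reserved_flags(args: dict[str, str], extra_args: list[str]) -> list[str]:
    """List reserved flags found in either field, sorted for stable output.

    Assumption: `args` maps flag names to values and `extra_args` is a list
    of raw command-line tokens. Neither shape is confirmed by the provider.
    """
    found = {key.lstrip("-") for key in args if key.lstrip("-") in RESERVED_FLAGS}
    for token in extra_args:
        name = _flag_name(token)
        if name in RESERVED_FLAGS:
            found.add(name)
    return sorted(found)
```

An empty result means neither field tries to override the generated host or port. Any other result names the flags to remove before applying the ModelDeployment.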

Advanced overrides and the trust boundary

spec.provider.overrides is a deep-merge escape hatch: its contents are merged into the generated Deployment so you can set fields the structured spec does not expose. The provider blocks the top-level apiVersion, kind, metadata, and status keys, but everything under spec.template.spec is mergeable — including securityContext, hostPath volumes, added Linux capabilities, serviceAccountName, and extra containers.

This means a user who can create a ModelDeployment can influence the resulting pod's security posture (for example, set runAsNonRoot: false or mount a host path) even if they cannot create Deployment/Pod objects directly.

Treat ModelDeployment creation as a privileged, RBAC-gated action on clusters where pod-level security matters. If lower-trust users are allowed to create ModelDeployments, gate spec.provider.overrides with an external admission policy (for example, OPA Gatekeeper or Kyverno), or enforce a Pod Security Standard on the namespace so that privileged pod specs are rejected. This escape-hatch behavior matches the other in-repo providers (llm-d, Dynamo); it is intentional, not a Direct vLLM-specific weakness.
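To make the trust boundary concrete, the merge behavior described in this section can be modeled in a few lines. This is an illustrative sketch, not the provider's code: the function names are hypothetical, and it assumes that blocked keys are rejected with an error (they could equally be dropped silently) and that lists are replaced wholesale rather than merged element by element.

```python
from __future__ import annotations

import copy

# Top-level keys the provider blocks in spec.provider.overrides.
BLOCKED_TOP_LEVEL_KEYS = ("apiVersion", "kind", "metadata", "status")


def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a copy of `base` with `overrides` merged in recursively.

    Nested dicts are merged key by key; any other value (including lists)
    replaces the base value. The list behavior is an assumption.
    """
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def apply_overrides(deployment: dict, overrides: dict) -> dict:
    """Merge overrides into a generated Deployment, refusing blocked keys."""
    blocked = [key for key in BLOCKED_TOP_LEVEL_KEYS if key in overrides]
    if blocked:
        raise ValueError(f"blocked override keys: {', '.join(blocked)}")
    return deep_merge(deployment, overrides)


# Nothing under spec.template.spec is filtered, so an override can change
# the pod's service account or security context.
generated = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"serviceAccountName": "default"}}},
}
risky = {"spec": {"template": {"spec": {"securityContext": {"runAsNonRoot": False}}}}}
result = apply_overrides(generated, risky)
```

In this model the only filter is on the four top-level keys, which is why an admission policy has to inspect the override contents themselves.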

Serving modes

Direct vLLM advertises aggregated serving only. Use Dynamo, KubeRay, or llm-d for disaggregated prefill/decode serving.

The provider contains internal experimental prefill/decode rendering code, but it is not advertised as a supported capability because production-grade disaggregated vLLM typically also needs router/orchestration and KV-transfer plumbing. Do not rely on Direct vLLM for disaggregated serving until that path is promoted with end-to-end coverage.

Official vLLM recipes

The Web UI can look up official recipes from recipes.vllm.ai for an exact Hugging Face model ID match. When you apply a recipe, Airunway materializes the recipe into normal deployment fields:

spec.engine.image
spec.engine.args
spec.engine.extraArgs
spec.env
GPU resource defaults
recipe provenance annotations under metadata.annotations["airunway.ai/recipe.*"]

The controller does not fetch recipes during reconciliation. Recipe provenance annotations are informational; they do not replace the materialized image, args, env, or resources.

When a recipe references an alternative recipe, the server fetches that reference only if it falls under the configured recipes base URL.
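A base-URL restriction of this kind is usually a check on scheme, host, and path prefix made before any request is sent. The sketch below shows one way to write such a check with the Python standard library. It is not the server's implementation: the function name is hypothetical, and percent-encoded path tricks are deliberately out of scope.

```python
from __future__ import annotations

import posixpath
from urllib.parse import urljoin, urlsplit


def is_within_base(reference: str, base_url: str) -> bool:
    """True when `reference`, resolved against `base_url`, stays under it.

    The base URL should end with '/' if it names a directory, because
    relative references are resolved the way a browser would resolve them.
    """
    base = urlsplit(base_url)
    target = urlsplit(urljoin(base_url, reference))

    # Scheme and host must match exactly (host compared case-insensitively).
    if (target.scheme, target.netloc.lower()) != (base.scheme, base.netloc.lower()):
        return False

    # Normalize '..' segments, then compare on a trailing-slash boundary so
    # that '/recipes-evil' does not count as being under '/recipes'.
    base_path = posixpath.normpath(base.path or "/").rstrip("/") + "/"
    target_path = posixpath.normpath(target.path or "/") + "/"
    return target_path.startswith(base_path)
```

With a base of https://recipes.vllm.ai/, a reference to another host, a scheme-relative //host/path reference, or a downgrade to http is refused, while a path on the same host is accepted.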

Auto-selection behavior

Direct vLLM is explicit-only: it advertises no selection rules, so the controller never auto-selects it. Managed providers such as Dynamo and KubeRay remain the auto-selected defaults for GPU vLLM workloads. To use Direct vLLM, set the provider explicitly:

spec:
  provider:
    name: vllm

Image status

The provider records selected image details in status.image, including the requested image, resolved digest when available, source classification, and verification status. Digest resolution is reused when the requested image has not changed and the status already contains a resolved digest.
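The reuse rule can be read as a small decision function. The sketch below is a model of that rule only: ImageStatus and its field names are hypothetical stand-ins for status.image, whose actual schema is not shown here.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageStatus:
    """Hypothetical mirror of status.image; field names are assumptions."""

    requested: str
    resolved_digest: Optional[str] = None


def needs_digest_resolution(requested_image: str, status: Optional[ImageStatus]) -> bool:
    """Decide whether the digest must be resolved on this reconcile.

    Resolution is skipped only when the requested image is unchanged and a
    digest has already been recorded.
    """
    if status is None:
        return True  # first reconcile: nothing recorded yet
    if status.requested != requested_image:
        return True  # the requested image changed
    return not status.resolved_digest  # unchanged, but no digest yet
```

Under this model, a moving tag that was already pinned returns False on every later reconcile, so the recorded digest stays in place until the requested image string changes.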

Image source classification

The provider derives status.image.source from the image reference and sets the UnsupportedImage condition only for the custom source:

nightly — the provider default image, or an official vllm/vllm-openai repository tag containing nightly.
stable — the official vllm/vllm-openai repository with the latest tag.
launch — any tag containing launch (frozen launch-image snapshots).
custom — anything else; this sets UnsupportedImage=True (the image is deployed as-is but is not recognized as a provider image).

A non-official repository or an unrecognized tag classifies as custom by design: the deployment still runs, and the condition is advisory. If you ship launch images under a different tag convention, include launch in the tag (or expect a custom classification).
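The classification rules above can be written as a short function for checking how a given image reference would be labeled. This is a sketch under stated assumptions, not the provider's code: the function names are hypothetical, the default image is passed in rather than hard-coded, "the latest tag" is read as the literal tag latest, the rules are evaluated in the order listed, and the docker.io/ prefix is assumed to count as the official repository.

```python
from __future__ import annotations

# Assumption: the official repository with or without the docker.io prefix.
OFFICIAL_REPOS = {"vllm/vllm-openai", "docker.io/vllm/vllm-openai"}


def split_image(image: str) -> tuple[str, str]:
    """Split 'repo[:tag][@digest]' into (repo, tag); tag is '' when absent."""
    name = image.split("@", 1)[0]  # drop any digest suffix
    slash = name.rfind("/")
    colon = name.rfind(":")
    # A colon before the last slash belongs to a registry port, not a tag.
    if colon > slash:
        return name[:colon], name[colon + 1:]
    return name, ""


def classify_image(image: str, default_image: str) -> str:
    """Return 'nightly', 'stable', 'launch', or 'custom' for an image."""
    repo, tag = split_image(image)
    official = repo in OFFICIAL_REPOS
    if image == default_image or (official and "nightly" in tag):
        return "nightly"
    if official and tag == "latest":
        return "stable"
    if "launch" in tag:
        return "launch"
    return "custom"


def unsupported_image(source: str) -> bool:
    """UnsupportedImage is set only for the custom classification."""
    return source == "custom"
```

Note that a nightly tag on a non-official repository falls through to custom in this sketch, because only the launch rule applies to any repository.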

Registry coupling and digest refresh

Be aware of two behaviors when relying on tag-based images:

The default image is pinned to a registry digest on the first reconcile. If the container registry is unreachable at that moment, digest resolution fails and the deployment does not create a pod until resolution succeeds. This is deliberate — the default image is pinned for reproducibility — but it means the first reconcile is coupled to registry availability. If you need to deploy during a registry outage, set spec.engine.image to an image/tag you have already pulled.
A resolved digest is not refreshed while the requested image is unchanged. Once a tag such as :cu130-nightly has been pinned to a digest, the provider keeps reusing that digest and does not re-pull the moving tag on later reconciles. To pick up a newer nightly build, change spec.engine.image so the requested image differs and digest resolution runs again; for example, reference a dated tag or an explicit digest and update it whenever you want a newer build.

If a deployment is stuck because the default image could not be resolved, check status.image.message and the ImageResolved condition, then either restore registry connectivity or switch to a locally available spec.engine.image.
