Our sample models are packaged as Kustomize overlays that deploy:
| Resource | Purpose |
|---|---|
| LLMInferenceService | The LLM workload — the actual inference service (simulator, vLLM, etc.) |
| MaaSModelRef | Gives the MaaS system a reference to the model so it appears in the model catalog |
| MaaSAuthPolicy | Grants access to the model for specified groups (who can use it) |
| MaaSSubscription | Defines rate limits (token quotas) for specific groups |
For more detail on each resource, see Access and Quota Overview.
!!! tip "Create llm namespace (optional)"
    Our example models deploy to the `llm` namespace. If it does not exist, create it before deploying the samples below (this command is idempotent—safe to run even if the namespace already exists):

    ```bash
    kubectl create namespace llm --dry-run=client -o yaml | kubectl apply -f -
    ```
Deploying a model through MaaS follows a specific order. Each resource depends on the previous one. The following walkthrough deploys the simulator model step by step so you can see what each resource does.
Set the project root (run from the repository root):
```bash
PROJECT_DIR=$(git rev-parse --show-toplevel)
```

The LLMInferenceService is the actual inference workload. It must exist first and must reference the `maas-default-gateway` gateway so that traffic flows through MaaS for authentication and rate limiting.
```bash
kustomize build ${PROJECT_DIR}/docs/samples/maas-system/free/llm/ | kubectl apply -f -
```

This deploys the simulator workload (a lightweight mock that generates responses without running a real LLM). The resource is named `facebook-opt-125m-simulated` in the `llm` namespace. Verify it is ready:
```bash
kubectl get llminferenceservice -n llm
kubectl get pods -n llm
```

The MaaSModelRef registers the model with MaaS so it appears in the catalog and the `/v1/models` API. It references the LLMInferenceService by name. The maas-controller watches MaaSModelRef resources and populates `status.endpoint` and `status.phase` from the underlying LLMInferenceService.
```bash
kubectl apply -f ${PROJECT_DIR}/docs/samples/maas-system/free/maas/maas-model.yaml
```

After a short moment, the controller reconciles. Verify that the status is populated:
```bash
kubectl get maasmodelref -n llm facebook-opt-125m-simulated -o jsonpath='{.status.phase}' && echo
kubectl get maasmodelref -n llm facebook-opt-125m-simulated -o jsonpath='{.status.endpoint}' && echo
```

Expected output: `status.phase` should be `Ready` and `status.endpoint` should be a non-empty URL. If either is missing, wait briefly and retry—the controller may still be reconciling (see Verify Model Deployment below).
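Rather than retrying by hand, you can poll until the phase reaches `Ready`. The helper below is a sketch, not part of MaaS: the `wait_for` function name and its retry logic are ours, and it simply re-runs any command until the output matches an expected value.

```shell
# wait_for EXPECTED ATTEMPTS CMD...: re-run CMD until it prints EXPECTED,
# sleeping 1s between attempts; fail after ATTEMPTS tries.
wait_for() {
  local expected="$1" attempts="$2"
  shift 2
  local i out
  for ((i = 1; i <= attempts; i++)); do
    out="$("$@")"
    if [ "$out" = "$expected" ]; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep 1
  done
  echo "timed out waiting for '$expected' (last saw '$out')" >&2
  return 1
}

# Against the walkthrough's simulator model, usage would look like:
# wait_for Ready 30 kubectl get maasmodelref -n llm facebook-opt-125m-simulated -o jsonpath='{.status.phase}'
```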
The MaaSSubscription defines token rate limits (quotas) for groups. It references the MaaSModelRef by name and namespace. This controls how many tokens each group can consume per model.
Create the models-as-a-service namespace if it does not exist, then apply:
```bash
kubectl create namespace models-as-a-service --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f ${PROJECT_DIR}/docs/samples/maas-system/free/maas/maas-subscription.yaml
```

This sample grants `system:authenticated` (all authenticated users) a limit of 100 tokens per minute for the simulator model.
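To make the quota concrete, here is a toy accounting loop (plain shell, not part of MaaS) showing how a 100 tokens/min budget plays out across requests inside one window; the real rate limiter tracks usage per group per model, but the accept/reject decision follows the same arithmetic.

```shell
# Toy illustration of the sample's 100 tokens/min quota: requests within one
# window are accepted until the remaining budget cannot cover them.
quota=100
used=0
for request_tokens in 40 35 30; do
  if (( used + request_tokens > quota )); then
    echo "request for $request_tokens tokens: rejected (used $used/$quota)"
  else
    used=$(( used + request_tokens ))
    echo "request for $request_tokens tokens: accepted (used $used/$quota)"
  fi
done
# The third request is rejected: 75 + 30 would exceed the 100-token budget.
```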
The MaaSAuthPolicy defines who can access the model. It references the MaaSModelRef by name and namespace. Without this, requests to the model are denied even if the user has a subscription.
```bash
kubectl apply -f ${PROJECT_DIR}/docs/samples/maas-system/free/maas/maas-auth-policy.yaml
```

This sample grants access to `system:authenticated`. The maas-controller creates per-model AuthPolicies and TokenRateLimitPolicies that enforce this.
You have now deployed the full simulator stack manually. The sections below deploy all required objects (Model, ModelRef, Subscription, AuthPolicy) together using a single Kustomize command for each sample.
A lightweight mock service for testing that generates responses without running an actual language model. This sample deploys the full MaaS stack:
- LLMInferenceService — Simulator workload
- MaaSModelRef — Registers the model with MaaS
- MaaSAuthPolicy — Access for `system:authenticated` (all authenticated users)
- MaaSSubscription — Rate limit of 100 tokens/min for `system:authenticated`
```bash
PROJECT_DIR=$(git rev-parse --show-toplevel)
kustomize build ${PROJECT_DIR}/docs/samples/maas-system/free/ | kubectl apply -f -
```

The same simulator workload with premium access and higher rate limits:
- LLMInferenceService — Simulator workload
- MaaSModelRef — Registers the model with MaaS
- MaaSAuthPolicy — Access for the `premium-user` group only
- MaaSSubscription — Rate limit of 1000 tokens/min for `premium-user`
```bash
PROJECT_DIR=$(git rev-parse --show-toplevel)
kustomize build ${PROJECT_DIR}/docs/samples/maas-system/premium/ | kubectl apply -f -
```

An inference deployment that loads and runs a 125M-parameter model without the need for a GPU. This sample deploys the full MaaS stack:
- LLMInferenceService — vLLM CPU workload
- MaaSModelRef — Registers the model with MaaS
- MaaSAuthPolicy — Access for `system:authenticated` (all authenticated users)
- MaaSSubscription — Rate limit of 100 tokens/min for `system:authenticated`
```bash
PROJECT_DIR=$(git rev-parse --show-toplevel)
kustomize build ${PROJECT_DIR}/docs/samples/maas-system/facebook-opt-125m-cpu/ | kubectl apply -f -
```

Requires `nvidia.com/gpu` resources available in your cluster. This sample deploys the full MaaS stack:
- LLMInferenceService — vLLM GPU workload
- MaaSModelRef — Registers the model with MaaS
- MaaSAuthPolicy — Access for `system:authenticated` (all authenticated users)
- MaaSSubscription — Rate limit of 100 tokens/min for `system:authenticated`
```bash
PROJECT_DIR=$(git rev-parse --show-toplevel)
kustomize build ${PROJECT_DIR}/docs/samples/maas-system/qwen3/ | kubectl apply -f -
```

```bash
# Check LLMInferenceService status
kubectl get llminferenceservices -n llm

# Check pods
kubectl get pods -n llm
```

Validate MaaSModelRef status — The MaaS controller populates `status.endpoint` and `status.phase` on each MaaSModelRef from the LLMInferenceService. The MaaSModelRef `status.endpoint` should match the URL exposed by the LLMInferenceService (via the gateway). Verify:
```bash
# Check MaaSModelRef status (same namespace as the LLMInferenceService, e.g. llm)
kubectl get maasmodelref -n llm -o wide

# Verify status.endpoint is populated and phase is Ready
kubectl get maasmodelref -n llm -o jsonpath='{range .items[*]}{.metadata.name}: phase={.status.phase} endpoint={.status.endpoint}{"\n"}{end}'

# Compare with the LLMInferenceService — status.endpoint should match the URL from its status.addresses or status.url
kubectl get llminferenceservice -n llm -o yaml | grep "url:"
```

The `status.endpoint` on the MaaSModelRef is derived from the LLMInferenceService (the gateway-external URL, or `status.addresses`, or `status.url`), so both should show the same URL. You can also confirm via the Validation guide—the `/v1/models` API returns the same URL as the MaaSModelRef `status.endpoint`. If the phase is not `Ready` or the endpoint is empty, the MaaS controller may still be reconciling—wait a minute and recheck.
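When comparing the two URLs, it can help to normalize them to a bare hostname first. This is a small sketch using POSIX parameter expansion; the endpoint value below is a made-up example, not output from a real cluster.

```shell
# Hypothetical value as it might appear in MaaSModelRef status.endpoint:
endpoint="https://maas.apps.example.com/llm/facebook-opt-125m-simulated"

# Strip the scheme, then the path, leaving only the host for comparison:
host="${endpoint#*://}"
host="${host%%/*}"
echo "$host"   # prints: maas.apps.example.com
```

The same two expansions work on the LLMInferenceService URL, so a simple string equality check on the two hosts is enough to spot a mismatch.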
To expose an existing model through MaaS, you must:
- Ensure the LLMInferenceService uses the `maas-default-gateway` gateway
- Create a MaaSModelRef that references the LLMInferenceService
- Create a MaaSAuthPolicy and a MaaSSubscription to define access and rate limits
See Quota and Access Configuration for step-by-step instructions.
Gateway reference — If the model does not yet use the MaaS gateway:
```bash
kubectl patch llminferenceservice my-production-model -n llm --type='json' -p='[
  {
    "op": "add",
    "path": "/spec/gateway/refs/-",
    "value": {
      "name": "maas-default-gateway",
      "namespace": "openshift-ingress"
    }
  }
]'
```

The resulting spec includes the gateway reference:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: my-production-model
spec:
  gateway:
    refs:
      - name: maas-default-gateway
        namespace: openshift-ingress
```

Proceed to Validation to test and verify your deployment.