Documentation for developing the inference scheduler.
Note

Python is NOT required as of v0.5.1.

The project uses UDS (Unix Domain Socket) tokenization: tokenization is handled by a separate UDS tokenizer sidecar container, not by the EPP container itself. Previous embedded tokenizer approaches (daulet/tokenizers bindings, direct Python/vLLM linking) used in versions before v0.5.1 are deprecated and no longer used.
Building the UDS tokenizer image:
```shell
make image-build-uds-tokenizer
```

The image is tagged as ghcr.io/llm-d/llm-d-uds-tokenizer:dev by default. Override with:
```shell
UDS_TOKENIZER_TAG=v1.0.0 make image-build-uds-tokenizer
```

The following deployment creates a Kubernetes in Docker (KIND) cluster with an inference scheduler using a Gateway API implementation, connected to the vLLM simulator. To run the deployment, use the following command:
```shell
make env-dev-kind
```

This will create a kind cluster (or re-use an existing one) using the system's
local container runtime and deploy the development stack into the default
namespace.
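If you want to check whether a reusable cluster is already present before running the target, here is a small sketch. The cluster name `llm-d-inference-scheduler-dev` is inferred from the kubectl context used later in this guide and may differ in your setup:

```shell
# Check for an existing kind cluster before (re)deploying; falls back cleanly
# if no matching cluster (or no kind binary) is found.
if kind get clusters 2>/dev/null | grep -qx llm-d-inference-scheduler-dev; then
  echo "cluster exists"
else
  echo "no cluster yet"
fi
```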
Note
You can download the image locally using docker pull ghcr.io/llm-d/llm-d-inference-sim:latest, and the script will load it from your local Docker registry.
There are several ways to access the gateway:
Port forward:

```shell
$ kubectl --context llm-d-inference-scheduler-dev port-forward service/inference-gateway 8080:80
```

NodePort:
```shell
# Determine the k8s node address
$ kubectl --context llm-d-inference-scheduler-dev get node -o yaml | grep address
# The service is accessible over port 80 of the worker IP address.
```

LoadBalancer:
```shell
# Install and run cloud-provider-kind:
$ go install sigs.k8s.io/cloud-provider-kind@latest && cloud-provider-kind &
$ kubectl --context llm-d-inference-scheduler-dev get service inference-gateway
# Wait for the LoadBalancer External-IP to become available. The service is accessible over port 80.
```

You can now make requests against the IP:port from any of the access modes above:
```shell
$ curl -s -w '\n' http://<IP:port>/v1/completions -H 'Content-Type: application/json' -d '{"model":"food-review","prompt":"hi","max_tokens":10,"temperature":0}' | jq
```

By default, the created inference gateway can be accessed on port 30080. This can be overridden to any free port in the range 30000 to 32767 by running the above command as follows:
```shell
KIND_GATEWAY_HOST_PORT=<selected-port> make env-dev-kind
```

where <selected-port> is the port on your local machine you want to use to access the inference gateway.
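Before picking a port, you can sanity-check that it falls in the allowed range. A minimal sketch (the default of 30080 and the 30000-32767 range are taken from the text above; this check is not part of the Makefile):

```shell
# Validate that the chosen host port is inside the NodePort range 30000-32767.
KIND_GATEWAY_HOST_PORT="${KIND_GATEWAY_HOST_PORT:-30080}"
if [ "$KIND_GATEWAY_HOST_PORT" -ge 30000 ] && [ "$KIND_GATEWAY_HOST_PORT" -le 32767 ]; then
  echo "port $KIND_GATEWAY_HOST_PORT is in range"
else
  echo "port $KIND_GATEWAY_HOST_PORT is outside 30000-32767" >&2
  exit 1
fi
```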
Note
If you require significant customization of this environment beyond
what the standard deployment provides, you can use the deploy/components
with kustomize to build your own highly customized environment. You can use
the deploy/environments/kind deployment as a reference for your own.
To test your changes to llm-d-inference-scheduler in this environment, make your changes locally
and then re-run the deployment:
```shell
make env-dev-kind
```

This will build images with your recent changes and load the new images to the
cluster. By default the image tag will be dev. It will also load the llm-d-inference-sim image.
Note
The built image tag can be specified via the EPP_TAG environment variable so it is used in the deployment. For example:
```shell
EPP_TAG=0.0.4 make env-dev-kind
```

Note
If you want to load a different tag of llm-d-inference-sim, you can use the environment variable VLLM_SIMULATOR_TAG to specify it.
Note
If you are working on macOS with Apple Silicon, you must also set the environment variable GOOS=linux.
Then do a rollout of the EPP Deployment so that your recent changes are
reflected:
```shell
kubectl rollout restart deployment food-review-endpoint-picker
```

A Kubernetes cluster can be used for development and testing. The setup can be split into two parts:
- cluster-level infrastructure deployment (e.g., CRDs), and
- deployment of development environments on a per-namespace basis
This enables cluster sharing by multiple developers. In case of private/personal
clusters, the default namespace can be used directly.
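One common convention for picking a unique namespace on a shared cluster is to derive it from your username. A sketch (the -dev-environment suffix mirrors the example used later in this guide and is only a suggestion, not something the scripts require):

```shell
# Derive a per-developer namespace; an explicitly exported NAMESPACE wins.
NAMESPACE="${NAMESPACE:-${USER:-dev}-dev-environment}"
echo "using namespace: ${NAMESPACE}"
```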
Caution
On a shared cluster, do not run this unless you are the cluster admin and are certain it is appropriate, as it can be disruptive to other developers in the cluster.
The following will deploy all the infrastructure-level requirements (e.g. CRDs, Operators, etc.) to support the namespace-level development environments:
Install Gateway API + GIE CRDs:
```shell
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/latest/download/manifests.yaml
```

Install kgateway:
```shell
KGTW_VERSION=v2.0.2
helm upgrade -i --create-namespace --namespace kgateway-system --version $KGTW_VERSION kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds
helm upgrade -i --namespace kgateway-system --version $KGTW_VERSION kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway --set inferenceExtension.enabled=true
```

For more details, see the Gateway API Inference Extension getting started guide.
Note
This setup is currently very manual with regard to container images for the vLLM simulator and the EPP. You are expected to build and push images for both to your own private registry. Future iterations will provide automation to make this simpler.
To deploy a development environment to the cluster, you'll need to explicitly
provide a namespace. This can be default if this is your personal cluster,
but on a shared cluster you should pick something unique. For example:
```shell
export NAMESPACE=annas-dev-environment
```

Create the namespace:
```shell
kubectl create namespace ${NAMESPACE}
```

Set the default namespace for kubectl commands:
```shell
kubectl config set-context --current --namespace="${NAMESPACE}"
```

Note
If you are using OpenShift (oc CLI), you can use the following instead: oc project "${NAMESPACE}"
- Set Hugging Face token variable:
```shell
export HF_TOKEN="<HF_TOKEN>"
```

Download the llm-d-kv-cache repository (it contains the installation script and Helm chart used to install the vLLM environment):
```shell
cd .. && git clone git@github.com:llm-d/llm-d-kv-cache.git
```

If you prefer to clone it into the /tmp directory, make sure to update the VLLM_CHART_DIR environment variable:
```shell
export VLLM_CHART_DIR=<tmp_dir>/llm-d-kv-cache/vllm-setup-helm
```
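A quick way to confirm the chart location resolves before deploying. This sketch assumes the default clone location one directory up, inferred from the clone step above; adjust VLLM_CHART_DIR if you cloned elsewhere:

```shell
# Verify the vLLM Helm chart directory exists before running the deploy target.
VLLM_CHART_DIR="${VLLM_CHART_DIR:-../llm-d-kv-cache/vllm-setup-helm}"
if [ -d "$VLLM_CHART_DIR" ]; then
  echo "chart dir found: $VLLM_CHART_DIR"
else
  echo "chart dir missing: $VLLM_CHART_DIR"
fi
```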
Once all this is set up, you can deploy the environment:
```shell
make env-dev-kubernetes
```

This will deploy the entire stack to whatever namespace you chose.
Note
The model and images of each component can be replaced. See Environment Configuration for model settings.
You can test by exposing the inference gateway via port-forward:
```shell
kubectl port-forward service/inference-gateway 8080:80 -n "${NAMESPACE}"
```

And making requests with curl:
```shell
curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
  -d '{"model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","prompt":"hi","max_tokens":10,"temperature":0}' | jq
```

Note
If the response is empty or contains an error, jq may output a cryptic error. You can run the command without jq to debug raw responses.
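The note above can be illustrated with a small sketch: sniff whether the body looks like JSON before piping it to jq. The response string below is a stand-in for actual curl output, not a real gateway error:

```shell
# A non-JSON error body makes jq fail with a cryptic parse error; inspecting
# the raw text first makes the real problem visible.
body='upstream connect error'   # stand-in for: curl -s http://localhost:8080/v1/completions ...
case "$body" in
  '{'*|'['*) echo "looks like JSON: $body" ;;
  *)         echo "non-JSON response, inspect raw body: $body" ;;
esac
```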
1. Setting the EPP image registry and tag:
You can optionally set a custom image registry and tag (otherwise, defaults will be used):
```shell
export IMAGE_REGISTRY="<YOUR_REGISTRY>"
export EPP_TAG="<YOUR_TAG>"
```

Note
The full image reference will be constructed as ${EPP_IMAGE}, where EPP_IMAGE defaults to ${IMAGE_REGISTRY}/llm-d-inference-scheduler:${EPP_TAG}. For example, with IMAGE_REGISTRY=quay.io/<my-id> and EPP_TAG=v1.0.0, the final image will be quay.io/<my-id>/llm-d-inference-scheduler:v1.0.0.
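The defaulting described in the note can be sketched in shell. Variable names come from the note; the exact defaulting logic lives in the project's scripts and may differ, and the registry/tag values here are only example data:

```shell
# Assemble the image reference the same way the note describes.
IMAGE_REGISTRY="${IMAGE_REGISTRY:-quay.io/my-id}"
EPP_TAG="${EPP_TAG:-v1.0.0}"
EPP_IMAGE="${EPP_IMAGE:-${IMAGE_REGISTRY}/llm-d-inference-scheduler:${EPP_TAG}}"
echo "$EPP_IMAGE"
```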
2. Setting the vLLM replicas:
You can optionally set the vllm replicas:
```shell
export VLLM_REPLICA_COUNT=2
```

3. Setting the model name:
You can replace the model name that will be used in the system.
```shell
export MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.2
```

If you need to deploy a larger model, update the vLLM-related parameters according to the model's requirements. For example:
```shell
export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct
export PVC_SIZE=200Gi
export VLLM_MEMORY_RESOURCES=100Gi
export VLLM_GPU_MEMORY_UTILIZATION=0.95
export VLLM_TENSOR_PARALLEL_SIZE=2
export VLLM_GPU_COUNT_PER_INSTANCE=2
```

4. Additional environment settings:
More environment variable settings can be found in scripts/kubernetes-dev-env.sh.
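As a rough consistency check for the larger-model example above: the tensor-parallel size should evenly divide the GPUs allocated per instance. This divisibility rule is a general vLLM sizing guideline, not something these scripts enforce, and the values below are the 70B example's:

```shell
# Sanity-check that the per-instance GPU count is compatible with the
# tensor-parallel size.
VLLM_TENSOR_PARALLEL_SIZE=2
VLLM_GPU_COUNT_PER_INSTANCE=2
if [ $((VLLM_GPU_COUNT_PER_INSTANCE % VLLM_TENSOR_PARALLEL_SIZE)) -eq 0 ]; then
  echo "ok: TP=${VLLM_TENSOR_PARALLEL_SIZE} fits ${VLLM_GPU_COUNT_PER_INSTANCE} GPU(s)"
else
  echo "error: TP size must divide the per-instance GPU count" >&2
  exit 1
fi
```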
Warning
This is a very manual process at the moment. We expect to make this more automated in future iterations.
Make your changes locally and commit them. Then select an image tag based on
the git SHA and set your private registry:
```shell
export EPP_TAG=$(git rev-parse HEAD)
export IMAGE_REGISTRY="quay.io/<my-id>"
```

Build the image and tag it for your private registry:
```shell
make image-build
```

and push it:
```shell
make image-push
```

You can now re-deploy the environment with your changes (don't forget all of the required environment variables):
```shell
make env-dev-kubernetes
```

And test the changes.
To clean up the development environment and remove all deployed resources in your namespace, run:
```shell
make clean-env-dev-kubernetes
```

If you also want to remove the namespace entirely, run:
```shell
kubectl delete namespace ${NAMESPACE}
```

To uninstall the infrastructure-level deployment: Uninstall the GIE CRDs:
```shell
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/latest/download/manifests.yaml --ignore-not-found
```

Uninstall kgateway:
```shell
helm uninstall kgateway -n kgateway-system
helm uninstall kgateway-crds -n kgateway-system
```

For more details, see the Gateway API Inference Extension getting started guide.
The project uses a Prow-inspired ChatOps system to manage PR approvals via comment commands.
| Command | Policy | Description |
|---|---|---|
| `/approve` | OWNERS approvers | Approves all the files for the current PR. Adds the `approve` label. |
| `/approve cancel` | OWNERS approvers | Removes your approval of this pull request. Removes the `approve` label. |
| `/lgtm` | OWNERS approvers | Adds the `lgtm` label and enables auto-merge (squash). The PR merges automatically once the requirements below are met. |
| `/lgtm cancel` | OWNERS approvers | Removes the `lgtm` label and disables auto-merge. |
| `/hold` | Anyone with write access | Adds the `hold` label to prevent the PR from merging. |
| `/hold cancel` | Anyone with write access | Removes the `hold` label. |
For a PR to be merged, it must have:
- ✅ Both `lgtm` and `approve` labels - Required for merge approval
- ✅ No blocking labels - The `hold` label must not be present
- ✅ All required status checks passing - CI/CD checks must succeed
The gatekeeper workflow enforces these requirements as a required status check.
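The merge requirements above can be sketched as a small label check. This is example data only; the real gatekeeper workflow lives in CI and its implementation may differ:

```shell
# Evaluate the merge gate: lgtm + approve present, no hold, checks green.
labels="lgtm approve"
checks_passing=true
has_label() { echo " $labels " | grep -q " $1 "; }
if has_label lgtm && has_label approve && ! has_label hold && [ "$checks_passing" = true ]; then
  echo "PR can merge"
else
  echo "PR blocked"
fi
```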
When new commits are pushed to an approved PR, the lgtm label is automatically removed and auto-merge is disabled. This ensures approvals always reflect the latest code. The author must request a new /lgtm after pushing changes.
Note: The approve label is NOT automatically removed on new commits. If significant changes are made, reviewers should use /approve cancel to remove their approval.