261 changes: 239 additions & 22 deletions DEVELOPMENT.md
Documentation for developing the inference scheduler.

## Kind Development Environment

The following deployment creates a [Kubernetes in Docker (KIND)] cluster with an inference scheduler using a Gateway API implementation, connected to the vLLM simulator.
To run the deployment, use the following command:

```bash
make env-dev-kind
```

This will create a `kind` cluster (or reuse an existing one) using the system's
local container runtime and deploy the development stack into the `default`
namespace.

> [!NOTE]
> You can download the image locally using `docker pull ghcr.io/llm-d/llm-d-inference-sim:latest`, and the script will load it from your local Docker registry.

There are several ways to access the gateway:

**Port forward**:

```bash
$ kubectl --context llm-d-inference-scheduler-dev port-forward service/inference-gateway 8080:80
```

**NodePort**:

```bash
# Determine the k8s node address
$ kubectl --context llm-d-inference-scheduler-dev get node -o yaml | grep address
# The service is accessible over port 80 of the worker IP address.
```

**LoadBalancer**:

```bash
# Install and run cloud-provider-kind:
$ go install sigs.k8s.io/cloud-provider-kind@latest && cloud-provider-kind &
$ kubectl --context llm-d-inference-scheduler-dev get service inference-gateway
```

You can now make requests matching the IP:port of one of the access modes above:

```bash
$ curl -s -w '\n' http://<IP:port>/v1/completions -H 'Content-Type: application/json' -d '{"model":"food-review","prompt":"hi","max_tokens":10,"temperature":0}' | jq
```

By default, the created inference gateway can be accessed on port 30080. This can
be overridden to any free port in the range 30000 to 32767 by running the above
command as follows:

```bash
KIND_GATEWAY_HOST_PORT=<selected-port> make env-dev-kind
```

**Where:** `<selected-port>` is the port on your local machine you want to use to
access the inference gateway.
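
Since NodePort services only accept ports in the default 30000–32767 range, a quick sanity check can catch an invalid choice before deploying (an illustrative sketch, not part of the Makefile; the variable name matches the one used above):

```shell
# Illustrative sanity check: NodePort services only accept ports 30000-32767.
KIND_GATEWAY_HOST_PORT=30090
if [ "$KIND_GATEWAY_HOST_PORT" -ge 30000 ] && [ "$KIND_GATEWAY_HOST_PORT" -le 32767 ]; then
  echo "port OK"
else
  echo "port $KIND_GATEWAY_HOST_PORT is outside 30000-32767" >&2
fi
```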

> [!NOTE]
> If you require significant customization of this environment beyond
> what the standard deployment provides, you can use the `deploy/components`
> with `kustomize` to build your own highly customized environment. You can use
> the `deploy/environments/kind` deployment as a reference for your own.
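
As a starting point for such a customized environment, a `kustomize` overlay can use the kind deployment as its base (a hypothetical layout; the file and patch names are placeholders, and the relative path depends on where you place the overlay):

```yaml
# kustomization.yaml -- hypothetical overlay; adjust the relative path to
# deploy/environments/kind to match where you place this file.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../kind                 # the reference environment as the base
patches:
  - path: epp-patch.yaml    # your local customizations (placeholder name)
```

Build and apply it with `kustomize build <overlay-dir> | kubectl apply -f -`.
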
To test your changes to `llm-d-inference-scheduler` in this environment, make your changes locally
and then re-run the deployment:

```bash
make env-dev-kind
```

This will build images with your recent changes and load them into the
cluster. By default, the image tag will be `dev`. It will also load the `llm-d-inference-sim` image.

> [!NOTE]
> The built image tag can be specified via the `EPP_TAG` environment variable so it is used in the deployment. For example:

```bash
EPP_TAG=0.0.4 make env-dev-kind
```

> [!NOTE]
> If you want to load a different tag of `llm-d-inference-sim`, you can use the environment variable `VLLM_SIMULATOR_TAG` to specify it.

> [!NOTE]
> If you are working on macOS with Apple Silicon, you must also set the environment variable `GOOS=linux`.

Then do a rollout of the EPP `Deployment` so that your recent changes are
reflected:

```bash
kubectl rollout restart deployment food-review-endpoint-picker
```

## Kubernetes Development Environment

A Kubernetes cluster can be used for development and testing.
The setup can be split in two:

- cluster-level infrastructure deployment (e.g., CRDs), and
- deployment of development environments on a per-namespace basis

This enables cluster sharing by multiple developers. For private or personal
clusters, the `default` namespace can be used directly.

### Setup - Infrastructure

> [!CAUTION]
> On a shared cluster, do not run this unless you are the cluster admin and are
> _certain_ it is safe to do so, as it can be disruptive to other developers in
> the cluster.

The following will deploy all the infrastructure-level requirements (e.g. CRDs,
Operators, etc.) to support the namespace-level development environments:

Install GIE CRDs:

```bash
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/latest/download/manifests.yaml
```

Install kgateway:

```bash
KGTW_VERSION=v2.0.2
helm upgrade -i --create-namespace --namespace kgateway-system --version $KGTW_VERSION kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds
helm upgrade -i --namespace kgateway-system --version $KGTW_VERSION kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway --set inferenceExtension.enabled=true
```

For more details, see the Gateway API Inference Extension [getting started guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/).

### Setup - Developer Environment

> [!NOTE]
> This setup is currently very manual with regard to container
> images for the vLLM simulator and the EPP. It is expected that you build and
> push images for both to your own private registry. In future iterations, we
> will be providing automation around this to make it simpler.

To deploy a development environment to the cluster, you'll need to explicitly
provide a namespace. This can be `default` if this is your personal cluster,
but on a shared cluster you should pick something unique. For example:

```bash
export NAMESPACE=annas-dev-environment
```

Create the namespace:

```bash
kubectl create namespace ${NAMESPACE}
```
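
Namespace names must be valid DNS-1123 labels (lowercase alphanumerics and `-`, at most 63 characters, starting and ending with an alphanumeric). A quick illustrative check before creating one:

```shell
# Illustrative check that a namespace name is a valid DNS-1123 label.
NAMESPACE=annas-dev-environment
if echo "$NAMESPACE" | grep -Eq '^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$'; then
  echo "valid namespace name"
else
  echo "invalid namespace name: $NAMESPACE" >&2
fi
```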

Set the default namespace for `kubectl` commands:

```bash
kubectl config set-context --current --namespace="${NAMESPACE}"
```

> [!NOTE]
> If you are using OpenShift (oc CLI), you can use the following instead: `oc project "${NAMESPACE}"`

Set the Hugging Face token variable:

```bash
export HF_TOKEN="<HF_TOKEN>"
```

Download the `llm-d-kv-cache-manager` repository (you'll be using the installation script and Helm chart from it to install the vLLM environment):

```bash
cd .. && git clone git@github.com:llm-d/llm-d-kv-cache-manager.git
```

If you prefer to clone it into the `/tmp` directory, make sure to update the `VLLM_CHART_DIR` environment variable:
`export VLLM_CHART_DIR=<tmp_dir>/llm-d-kv-cache-manager/vllm-setup-helm`

Once all this is set up, you can deploy the environment:

```bash
make env-dev-kubernetes
```

This will deploy the entire stack to whatever namespace you chose.

> [!NOTE]
> The model and images of each component can be replaced. See [Environment Configuration](#environment-configuration) for model settings.

You can test by exposing the `inference gateway` via port-forward:

```bash
kubectl port-forward service/inference-gateway 8080:80 -n "${NAMESPACE}"
```

And making requests with `curl`:

```bash
curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
-d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"hi","max_tokens":10,"temperature":0}' | jq
```

> [!NOTE]
> If the response is empty or contains an error, jq may output a cryptic error. You can run the command without jq to debug raw responses.

#### Environment Configuration

**1. Setting the EPP image and tag:**

You can optionally set a custom EPP image (otherwise, the default will be used):

```bash
export EPP_TAG="<YOUR_TAG>"
export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
```

**2. Setting the vLLM replicas:**

You can optionally set the number of vLLM replicas:

```bash
export VLLM_REPLICA_COUNT=2
```

**3. Setting the model name:**

You can replace the model name that will be used in the system:

```bash
export MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.2
```

If you need to deploy a larger model, update the vLLM-related parameters according to the model's requirements. For example:

```bash
export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct
export PVC_SIZE=200Gi
export VLLM_MEMORY_RESOURCES=100Gi
export VLLM_GPU_MEMORY_UTILIZATION=0.95
export VLLM_TENSOR_PARALLEL_SIZE=2
export VLLM_GPU_COUNT_PER_INSTANCE=2
```
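
When sizing these values, note that with tensor parallelism alone the GPU count per instance typically equals the tensor-parallel size (a minimal illustrative consistency check, not part of the deployment scripts):

```shell
# With no pipeline parallelism, GPUs per instance should equal the tensor-parallel size.
VLLM_TENSOR_PARALLEL_SIZE=2
VLLM_GPU_COUNT_PER_INSTANCE=2
if [ "$VLLM_GPU_COUNT_PER_INSTANCE" -eq "$VLLM_TENSOR_PARALLEL_SIZE" ]; then
  echo "GPU settings consistent"
else
  echo "mismatch: ${VLLM_GPU_COUNT_PER_INSTANCE} GPUs vs TP=${VLLM_TENSOR_PARALLEL_SIZE}" >&2
fi
```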

**4. Additional environment settings:**

More environment variable settings can be found in `scripts/kubernetes-dev-env.sh`.

#### Development Cycle

> [!WARNING]
> This is a very manual process at the moment. We expect to make
> this more automated in future iterations.

Make your changes locally and commit them. Then select an image tag based on
the `git` SHA and set your private registry:

```bash
export EPP_TAG=$(git rev-parse HEAD)
export IMAGE_REGISTRY="quay.io/<my-id>"
```

Build and tag the image for your private registry:

```bash
make image-build
```

and push it:

```bash
make image-push
```

You can now re-deploy the environment with your changes (don't forget all of
the required environment variables):

```bash
make env-dev-kubernetes
```

And test the changes.

### Cleanup Environment

To clean up the development environment and remove all deployed resources in your namespace, run:

```bash
make clean-env-dev-kubernetes
```

If you also want to remove the namespace entirely, run:

```bash
kubectl delete namespace ${NAMESPACE}
```

To uninstall the infrastructure-level components:

Uninstall the GIE CRDs:

```bash
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/latest/download/manifests.yaml --ignore-not-found
```

Uninstall kgateway:

```bash
helm uninstall kgateway -n kgateway-system
helm uninstall kgateway-crds -n kgateway-system
```

For more details, see the Gateway API Inference Extension [getting started guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/).
13 changes: 13 additions & 0 deletions Makefile
env-dev-kind: image-build ## Run under kind ($(KIND_CLUSTER_NAME))
clean-env-dev-kind: ## Cleanup kind setup (delete cluster $(KIND_CLUSTER_NAME))
	@echo "INFO: cleaning up kind cluster $(KIND_CLUSTER_NAME)"
	kind delete cluster --name $(KIND_CLUSTER_NAME)


# Kubernetes Development Environment - Deploy
# This target deploys the inference scheduler stack in a specific namespace for development and testing.
.PHONY: env-dev-kubernetes
env-dev-kubernetes: check-kubectl check-kustomize check-envsubst
	IMAGE_REGISTRY=$(IMAGE_REGISTRY) ./scripts/kubernetes-dev-env.sh 2>&1

# Kubernetes Development Environment - Teardown
.PHONY: clean-env-dev-kubernetes
clean-env-dev-kubernetes: check-kubectl check-kustomize check-envsubst
	@CLEAN=true ./scripts/kubernetes-dev-env.sh 2>&1
	@echo "INFO: Finished cleanup of development environment for namespace $(NAMESPACE)"
2 changes: 1 addition & 1 deletion deploy/components/inference-gateway/inference-models.yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: ${MODEL_NAME_SAFE}
spec:
  modelName: ${MODEL_NAME}
  criticality: Critical
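
The switch to `${MODEL_NAME_SAFE}` is needed because Kubernetes object names must be DNS-1123 compliant, while model names like `meta-llama/Llama-3.1-8B-Instruct` contain `/`, `.`, and uppercase characters. A sketch of how such a name could be sanitized (the actual logic lives in the deployment scripts and may differ):

```shell
# Illustrative sanitization: lowercase, then replace any non [a-z0-9-] char with '-'.
MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
MODEL_NAME_SAFE=$(echo "$MODEL_NAME" | tr '[:upper:]' '[:lower:]' | tr -c 'a-z0-9\n-' '-')
echo "$MODEL_NAME_SAFE"   # meta-llama-llama-3-1-8b-instruct
```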
2 changes: 1 addition & 1 deletion deploy/components/vllm-sim/deployments.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${MODEL_NAME_SAFE}-vllm-sim
  labels:
    app: ${POOL_NAME}
spec:
apiVersion: gateway.kgateway.dev/v1alpha1
kind: GatewayParameters
metadata:
  name: custom-gw-params
spec:
  kube:
    envoyContainer:
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        runAsNonRoot: true
        runAsUser: "${PROXY_UID}"
    service:
      type: ${GATEWAY_SERVICE_TYPE}
      extraLabels:
        gateway: custom
    podTemplate:
      extraLabels:
        gateway: custom
      securityContext:
        seccompProfile:
          type: RuntimeDefault