Utilities and workflow helpers for managing LLM-D deployments in Kubernetes. Clone with submodules:

```bash
git clone --recursive https://github.com/LucasWilkinson/llm-d-utils
```

- Prerequisites
- Initial Setup
- Everyday Commands
- Benchmark Configuration
- Building Custom vLLM Images
- Troubleshooting
## Prerequisites

Make sure the following tools are installed and available in your PATH:

- `just` for running the recipes in this repo
- `kubectl` configured for the target cluster
- `helm`
- `stern` for streaming pod logs
- `watch`
- Optional: `fzf` for the nicer interactive pod pickers used by several recipes
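A quick way to confirm the required tools are on your PATH (plain shell, not part of the repo's recipes):

```bash
# Report any prerequisite tool that is missing from PATH
for tool in just kubectl helm stern watch; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done
```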
## Initial Setup

- **Create a `.env` file**

  The Justfile loads environment variables via `set dotenv-load`. Create a `.env` file in the project root with your configuration and secrets:

  ```
  USER_NAME=your-username
  HF_TOKEN=your-huggingface-token
  GH_TOKEN=your-github-token
  QUAY_REPO=your-quay-username
  QUAY_ROBOT=buildbot
  QUAY_PASSWORD=your-robot-account-token
  ```
  `USER_NAME` is used to generate your namespace: `USER_NAME + "-llm-d-wide-ep"` (defaults to your system username if not set).

  To get quay.io credentials:

  - Log into quay.io (via SSO)
  - Go to Account Settings → Robot Accounts
  - Create a new robot account (e.g., `buildbot`)
  - Copy the token and use it as `QUAY_PASSWORD`
  - `QUAY_REPO` should be your quay.io username (not the robot account name)
  - The full robot account name will be constructed as `QUAY_REPO` + `QUAY_ROBOT`

  **IMPORTANT**: Before building, you must also:

  - Create the repository `llm-d-cuda-dev` in quay.io (can be public or private)
  - Go to the repository → Settings → User and Robot Permissions
  - Add your robot account (`QUAY_REPO` + `QUAY_ROBOT`) with Write permission

  These values are required for the secret creation step below.
- **Point kubectl at your token file**

  Export the kubeconfig path you received from the platform (example path shown below):

  ```bash
  export KUBECONFIG=~/kubectl-token.txt
  ```
- **Create Kubernetes secrets**

  Run:

  ```bash
  just create-secrets
  ```

  This will create (or update) the `llm-d-hf-token`, `gh-token-secret`, and `registry-auth` secrets in your namespace using the values from `.env`.

- **(Optional) Set your kubectl namespace**

  To avoid specifying `-n {{NAMESPACE}}` manually, update your context with:

  ```bash
  just set-namespace
  ```
- **Deploy the workload**

  Launch the deployment using Kustomize and Helm:

  ```bash
  just start
  ```

  This will:

  - Deploy model servers using `kubectl apply -k` (CoreWeave variant)
  - Install the InferencePool via Helm (with Istio gateway)
  - Deploy the Istio gateway and HTTPRoute

  To tear it back down, run `just stop`. This removes the Helm release, model server manifests, and gateway resources. The deployment uses manifests from `llm-d/guides/wide-ep-lws/manifests/` and values from `llm-d/guides/wide-ep-lws/inferencepool.values.yaml`.

  The benchmarking helpers (e.g. `just run-bench`) default to the deployment's model (`deepseek-ai/DeepSeek-R1-0528`). If you change the model, update the `MODEL` variable near the top of the `Justfile` so the generated remote Justfile targets the right endpoint. A quick way to verify the deployment is shown after this list.
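Once `just start` finishes, a minimal sanity check might look like this (standard kubectl queries; substitute your own namespace, and note the exact resource names may differ in your cluster):

```bash
# Watch the model server pods come up
just status

# Confirm the Istio gateway and HTTPRoute were created
kubectl get gateway,httproute -n <your-namespace>
```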
## Everyday Commands

- `just start`: Deploy the full stack (model servers, InferencePool, gateway) using Kustomize and Helm.
- `just stop`: Tear down the deployment (removes Helm release, model server manifests, and gateway).
- `just restart`: Stop and start the deployment (`just stop && just start`).
- `just update-image TAG`: Update the `decode.yaml` and `prefill.yaml` manifests to use a custom image with the specified tag. Example: `just update-image test-latest-main`
- `just get-pods`: List all pods in the configured namespace.
- `just status`: Watch pod status in real time using `watch -n 2 kubectl get pods`.
- `just describe [name=pod-name]`: Describe a pod. If `name` is omitted, you'll get an interactive picker. Requires `fzf` for fuzzy selection, otherwise falls back to shell `select`.
- `just stern [name=pod-name] [-- <stern flags>]`: Stream logs from pods using stern. With no `name`, you get the interactive picker. Flags after `--` are forwarded to stern (e.g., `just stern -- -c vllm-worker`).
- `just print-gpus`: Show GPU allocation across all cluster nodes, grouped by node and namespace.
- `just cks-nodes`: Display CoreWeave node information (type, link speed, IB speed, reliability, etc.).
- `just start-bench`: Create the benchmark-interactive pod for running benchmarks.
- `just stop-bench`: Delete the benchmark-interactive pod.
- `just restart-bench`: Stop and start the benchmark pod (`just stop-bench && just start-bench`).
- `just interact-bench`: Open an interactive shell in the benchmark pod with the Justfile and scripts copied in.
- `just run-bench NAME [IN_TOKENS] [OUT_TOKENS] [NUM_PROMPTS] [CONCURRENCY_LEVELS]`: Run a benchmark with the specified name and parameters. Parameters are positional. Example: `just run-bench run1 256 1024 8192`. See "Benchmark Configuration" below for details.
- `just cp-results`: Copy the most recent benchmark results from the benchmark pod to `results/<timestamp>` locally.
- `just start-build-pod`: Create the buildah build pod for building custom vLLM images.
- `just stop-build-pod`: Delete the buildah build pod.
- `just build-image VLLM_COMMIT TAG [use_sccache]`: Build a custom vLLM image with the specified commit SHA and tag. `use_sccache` defaults to `true`. Example: `just build-image abc123def my-custom-tag false`
- `just set-namespace`: Update your kubectl context to default to the configured namespace.
- `just create-secrets`: Create or update Kubernetes secrets (HF token, GH token, registry auth) from the `.env` file.
- `just create-registry-auth`: Create or update only the registry authentication secret.
- `just print-results DIR STR`: Grep for a string in benchmark result logs and print sorted results.
- `just print-throughput DIR`: Print output token throughput from benchmark results in a directory.
- `just print-tpot DIR`: Print median time-per-output-token (TPOT) from benchmark results in a directory.
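As a sketch of how these recipes fit together, a typical benchmark session using only the commands above might look like this (the run name, token counts, and results directory are placeholders):

```bash
just start-bench                    # create the benchmark-interactive pod
just run-bench run1 256 1024 8192   # run a sweep under the name "run1"
just cp-results                     # copy results to results/<timestamp> locally
just print-throughput results/<timestamp>
just stop-bench                     # clean up the benchmark pod
```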
## Benchmark Configuration

`just run-bench` accepts parameters to tune the benchmark payload. Parameters can be passed either positionally or as named arguments:

Positional (recommended):

```bash
just run-bench run1 256 1024 8192
```

Named arguments:

```bash
just run-bench name=run1 in_tokens=256 out_tokens=1024 num_prompts=8192
```

- `name` (required): Benchmark run name for organizing results
- `in_tokens` (default `128`): Prompt length fed to `vllm bench`
- `out_tokens` (default `2048`): Target completion length
- `num_prompts` (default `16384`): Total requests per concurrency level
- `concurrency_levels` (default `'8192 16384 32768'`): Space-separated list of concurrency levels to sweep
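For example, to sweep a custom set of concurrency levels you could combine the named arguments listed above (the specific values here are illustrative):

```bash
# Sweep only two concurrency levels instead of the default three
just run-bench name=sweep1 in_tokens=256 out_tokens=1024 num_prompts=8192 concurrency_levels='4096 8192'
```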
These values are forwarded to the benchmark pod as environment variables. You can also invoke the benchmark manually:

```bash
kubectl exec -n NAMESPACE benchmark-interactive -- \
  env INPUT_TOKENS=256 OUTPUT_TOKENS=1024 NUM_PROMPTS=8192 \
  bash /app/run.sh
```

## Building Custom vLLM Images

To build a custom vLLM image with a specific commit:
- **Start the build pod:**

  ```bash
  just start-build-pod
  ```

- **Build and push the image:**

  ```bash
  just build-image VLLM_COMMIT_SHA TAG
  # Example: just build-image 8ce5d3198d00631a76e1aa02a57947b46bc7218c mtp-enabled
  ```

  This will:

  - Clone the llm-d repository
  - Update the Dockerfile with your specified vLLM commit
  - Build the image using buildah
  - Push to `quay.io/QUAY_REPO/llm-d-cuda-dev:TAG`
- **Update the manifests:**

  Edit `llm-d/guides/wide-ep-lws/manifests/modelserver/base/decode.yaml` and `prefill.yaml` to use your custom image (or see the `just update-image` tip after this list):

  ```yaml
  image: quay.io/your-repo/llm-d-cuda-dev:your-tag
  ```

- **Clean up the build pod:**

  ```bash
  just stop-build-pod
  ```
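Tip: instead of editing the YAML by hand in the "Update the manifests" step, the `just update-image` recipe described under "Everyday Commands" should accomplish the same edit:

```bash
# Point decode.yaml and prefill.yaml at the freshly pushed tag
just update-image my-custom-tag
```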
**Note**: The build takes 30-60+ minutes. Monitor progress with:

```bash
kubectl logs -f buildah-build -n your-namespace
```

## Troubleshooting

- If `just` reports missing environment variables, double-check your `.env` file and ensure you're running commands from the repository root.
- Kubernetes errors such as `CreateContainerConfigError` usually indicate a missing or misnamed secret; re-run `just create-secrets` after updating `.env`, or inspect the pod events via `just describe name=...`.
- For log streaming issues, ensure `stern` is installed and your kubeconfig points to the correct cluster.
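For secret-related failures in particular, a couple of plain kubectl checks (with your own namespace and pod name substituted) can confirm whether the expected secrets exist and what the pod events report:

```bash
# The three secrets created by `just create-secrets` should be listed here
kubectl get secrets -n <your-namespace>

# Recent events usually name the missing or misnamed secret explicitly
kubectl describe pod <pod-name> -n <your-namespace> | tail -n 20
```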
With the setup above you should be able to deploy, inspect, and debug the LLM-D workloads quickly using the provided Just recipes.