
Development

Documentation for developing the inference scheduler.

Requirements

Note

Python is NOT required as of v0.5.1. Tokenization is handled by a separate UDS (Unix Domain Socket) tokenizer sidecar container. Previous versions (< v0.5.1) used embedded Python tokenizers with daulet/tokenizers bindings, but these are now deprecated.

Tokenization Architecture

The project uses UDS (Unix Domain Socket) tokenization. Tokenization is handled by a separate UDS tokenizer sidecar container, not by the EPP container itself. Previous embedded tokenizer approaches (daulet/tokenizers, direct Python/vLLM linking) are deprecated and no longer used.

Building the UDS tokenizer image:

make image-build-uds-tokenizer

The image is tagged as ghcr.io/llm-d/llm-d-uds-tokenizer:dev by default. Override with:

UDS_TOKENIZER_TAG=v1.0.0 make image-build-uds-tokenizer

Kind Development Environment

The following deployment creates a Kubernetes in Docker (KIND) cluster with an inference scheduler using a Gateway API implementation, connected to the vLLM simulator. To run the deployment, use the following command:

make env-dev-kind

This will create a kind cluster (or reuse an existing one) using the system's local container runtime and deploy the development stack into the default namespace.

Note

You can pull the image locally with docker pull ghcr.io/llm-d/llm-d-inference-sim:latest, and the script will load it into the cluster from your local Docker images.

There are several ways to access the gateway:

Port forward:

$ kubectl --context llm-d-inference-scheduler-dev port-forward service/inference-gateway 8080:80

NodePort:

# Determine the k8s node address
$ kubectl --context llm-d-inference-scheduler-dev get node -o yaml | grep address
# The service is accessible over port 80 of the worker IP address.

LoadBalancer:

# Install and run cloud-provider-kind:
$ go install sigs.k8s.io/cloud-provider-kind@latest && cloud-provider-kind &
$ kubectl --context llm-d-inference-scheduler-dev get service inference-gateway
# Wait for the LoadBalancer External-IP to become available. The service is accessible over port 80.

You can now make requests using the IP:port from one of the access modes above:

$ curl -s -w '\n' http://<IP:port>/v1/completions -H 'Content-Type: application/json' -d '{"model":"food-review","prompt":"hi","max_tokens":10,"temperature":0}' | jq

By default, the created inference gateway can be accessed on port 30080. This can be overridden to any free port in the range of 30000 to 32767 by running the above command as follows:

KIND_GATEWAY_HOST_PORT=<selected-port> make env-dev-kind

Where <selected-port> is the port on your local machine you want to use to access the inference gateway.
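Putting the default port together with the completion request above, a quick smoke test might look like this (a sketch, assuming the default KIND_GATEWAY_HOST_PORT of 30080 and the food-review simulator model):

```shell
# Build the gateway URL from the (possibly overridden) host port.
GATEWAY_PORT="${KIND_GATEWAY_HOST_PORT:-30080}"   # 30080 is the documented default
URL="http://localhost:${GATEWAY_PORT}/v1/completions"
echo "$URL"
# Then issue the same completion request as above:
# curl -s -w '\n' "$URL" -H 'Content-Type: application/json' \
#   -d '{"model":"food-review","prompt":"hi","max_tokens":10,"temperature":0}' | jq
```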

Note

If you require significant customization of this environment beyond what the standard deployment provides, you can use the deploy/components with kustomize to build your own highly customized environment. You can use the deploy/environments/kind deployment as a reference for your own.

Development Cycle

To test your changes to llm-d-inference-scheduler in this environment, make your changes locally and then re-run the deployment:

make env-dev-kind

This will build images with your recent changes and load them into the cluster. By default the image tag will be dev. It will also load the llm-d-inference-sim image.

Note

The image tag used in the deployment can be specified via the EPP_TAG environment variable. For example:

EPP_TAG=0.0.4 make env-dev-kind

Note

If you want to load a different tag of llm-d-inference-sim, you can use the environment variable VLLM_SIMULATOR_TAG to specify it.

Note

If you are working on macOS with Apple Silicon, you must set the environment variable GOOS=linux.
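The note above can be scripted; a minimal sketch, assuming Apple Silicon reports Darwin/arm64 from uname:

```shell
# Export GOOS=linux only when building on Apple Silicon macOS,
# where the container images still need linux binaries.
goos_for_host() {
  case "$(uname -s)-$(uname -m)" in
    Darwin-arm64) echo linux ;;  # Apple Silicon: cross-compile
    *) echo "" ;;                # native build is fine elsewhere
  esac
}

g="$(goos_for_host)"
if [ -n "$g" ]; then
  export GOOS="$g"
fi
# make env-dev-kind   # then run the deployment as usual
```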

Then do a rollout of the EPP Deployment so that your recent changes are reflected:

kubectl rollout restart deployment food-review-endpoint-picker

Kubernetes Development Environment

A Kubernetes cluster can be used for development and testing. The setup is split into two parts:

  • cluster-level infrastructure deployment (e.g., CRDs), and
  • deployment of development environments on a per-namespace basis

This enables cluster sharing by multiple developers. In case of private/personal clusters, the default namespace can be used directly.

Setup - Infrastructure

Caution

On a shared cluster, do not run this unless you are the cluster admin and are certain it is needed, as it can be disruptive to other developers in the cluster.

The following will deploy all the infrastructure-level requirements (e.g. CRDs, Operators, etc.) to support the namespace-level development environments:

Install Gateway API + GIE CRDs:

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/latest/download/manifests.yaml

Install kgateway:

KGTW_VERSION=v2.0.2
helm upgrade -i --create-namespace --namespace kgateway-system --version $KGTW_VERSION kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds
helm upgrade -i --namespace kgateway-system --version $KGTW_VERSION kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway --set inferenceExtension.enabled=true

For more details, see the Gateway API Inference Extension getting started guide.

Setup - Developer Environment

Note

This setup is currently very manual with regard to container images for the vLLM simulator and the EPP. You are expected to build and push images for both to your own private registry. Future iterations will provide automation to make this simpler.

To deploy a development environment to the cluster, you'll need to explicitly provide a namespace. This can be default if this is your personal cluster, but on a shared cluster you should pick something unique. For example:

export NAMESPACE=annas-dev-environment

Create the namespace:

kubectl create namespace ${NAMESPACE}

Set the default namespace for kubectl commands:

kubectl config set-context --current --namespace="${NAMESPACE}"

Note

If you are using OpenShift (oc CLI), you can use the following instead: oc project "${NAMESPACE}"

Set the Hugging Face token variable:

export HF_TOKEN="<HF_TOKEN>"

Download the llm-d-kv-cache repository (which contains the installation script and Helm chart for the vLLM environment):

cd .. && git clone git@github.com:llm-d/llm-d-kv-cache.git

If you prefer to clone it into the /tmp directory, make sure to update the VLLM_CHART_DIR environment variable: export VLLM_CHART_DIR=<tmp_dir>/llm-d-kv-cache/vllm-setup-helm

Once all this is set up, you can deploy the environment:

make env-dev-kubernetes

This will deploy the entire stack to whatever namespace you chose.

Note

The model and images of each component can be replaced. See Environment Configuration for model settings.

You can test by exposing the inference gateway via port-forward:

kubectl port-forward service/inference-gateway 8080:80 -n "${NAMESPACE}"

And making requests with curl:

curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
  -d '{"model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","prompt":"hi","max_tokens":10,"temperature":0}' | jq

Note

If the response is empty or contains an error, jq may output a cryptic error. You can run the command without jq to debug raw responses.
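To see what went wrong, it helps to capture the HTTP status code alongside the raw body. A small sketch (the helper and variable names are illustrative; bash syntax assumed):

```shell
# Split the output of `curl -s -w $'\n%{http_code}'` into body and status.
split_status() {
  local resp="$1"
  STATUS="${resp##*$'\n'}"   # last line: the status code
  BODY="${resp%$'\n'*}"      # everything before it: the raw body
}

# Usage against the gateway (run manually; requires the port-forward above):
# split_status "$(curl -s -w $'\n%{http_code}' http://localhost:8080/v1/completions \
#   -H 'Content-Type: application/json' \
#   -d '{"model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","prompt":"hi","max_tokens":10,"temperature":0}')"
# echo "HTTP $STATUS"; echo "$BODY"
```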

Environment Configuration

1. Setting the EPP image registry and tag:

You can optionally set a custom image registry and tag (otherwise, defaults will be used):

export IMAGE_REGISTRY="<YOUR_REGISTRY>"
export EPP_TAG="<YOUR_TAG>"

Note

The full image reference will be constructed as ${EPP_IMAGE}, where EPP_IMAGE defaults to ${IMAGE_REGISTRY}/llm-d-inference-scheduler:${EPP_TAG}. For example, with IMAGE_REGISTRY=quay.io/<my-id> and EPP_TAG=v1.0.0, the final image will be quay.io/<my-id>/llm-d-inference-scheduler:v1.0.0.
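The defaulting described above can be sketched as a small helper (hedged: the real logic lives in the project's Makefile and scripts; this only mirrors the documented composition):

```shell
# Compose the EPP image reference from a registry and tag; the tag
# defaults to "dev", matching the kind development workflow.
epp_image() {
  local registry="$1" tag="${2:-dev}"
  echo "${registry}/llm-d-inference-scheduler:${tag}"
}

epp_image quay.io/my-id v1.0.0   # → quay.io/my-id/llm-d-inference-scheduler:v1.0.0
```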

2. Setting the vLLM replicas:

You can optionally set the number of vLLM replicas:

export VLLM_REPLICA_COUNT=2

3. Setting the model name:

You can replace the model name that will be used in the system:

export MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.2

If you need to deploy a larger model, update the vLLM-related parameters according to the model's requirements. For example:

export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct
export PVC_SIZE=200Gi
export VLLM_MEMORY_RESOURCES=100Gi
export VLLM_GPU_MEMORY_UTILIZATION=0.95
export VLLM_TENSOR_PARALLEL_SIZE=2
export VLLM_GPU_COUNT_PER_INSTANCE=2

4. Additional environment settings:

More environment variable settings can be found in scripts/kubernetes-dev-env.sh.

Development Cycle

Warning

This is a very manual process at the moment. We expect to make this more automated in future iterations.

Make your changes locally and commit them. Then select an image tag based on the git SHA and set your private registry:

export EPP_TAG=$(git rev-parse HEAD)
export IMAGE_REGISTRY="quay.io/<my-id>"

Build the image and tag the image for your private registry:

make image-build

and push it:

make image-push

You can now re-deploy the environment with your changes (don't forget all of the required environment variables):

make env-dev-kubernetes

And test the changes.
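The whole cycle can be strung together into one script; a sketch (hedged: DRY_RUN=echo prints each step so the sequence can be inspected, unset it to actually run the make targets):

```shell
# Inner development loop: tag from the git SHA, build, push, redeploy.
DRY_RUN="${DRY_RUN:-echo}"   # default: just print the commands

inner_loop() {
  EPP_TAG="$(git rev-parse HEAD 2>/dev/null || echo dev)"
  export EPP_TAG
  export IMAGE_REGISTRY="${IMAGE_REGISTRY:-quay.io/<my-id>}"   # your private registry
  $DRY_RUN make image-build
  $DRY_RUN make image-push
  $DRY_RUN make env-dev-kubernetes
}

inner_loop
```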

Cleanup Environment

To clean up the development environment and remove all deployed resources in your namespace, run:

make clean-env-dev-kubernetes

If you also want to remove the namespace entirely, run:

kubectl delete namespace ${NAMESPACE}

To uninstall the infrastructure-level development setup, first uninstall the GIE CRDs:

kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/latest/download/manifests.yaml --ignore-not-found

Uninstall kgateway:

helm uninstall kgateway -n kgateway-system
helm uninstall kgateway-crds -n kgateway-system

For more details, see the Gateway API Inference Extension getting started guide.

PR Approval Process

The project uses a Prow-inspired ChatOps system to manage PR approvals via comment commands.

Available Commands

| Command | Policy | Description |
| --- | --- | --- |
| /approve | OWNERS approvers | Approves all files in the current PR. Adds the approve label. |
| /approve cancel | OWNERS approvers | Removes your approval of this pull request. Removes the approve label. |
| /lgtm | OWNERS approvers | Adds the lgtm label and enables auto-merge (squash). The PR merges automatically once the requirements below are met. |
| /lgtm cancel | OWNERS approvers | Removes the lgtm label and disables auto-merge. |
| /hold | Anyone with write access | Adds the hold label to prevent the PR from merging. |
| /hold cancel | Anyone with write access | Removes the hold label. |

Merge Requirements

For a PR to be merged, it must have:

  • Both lgtm and approve labels - Required for merge approval
  • No blocking labels - The hold label must not be present
  • All required status checks passing - CI/CD checks must succeed

The gatekeeper workflow enforces these requirements as a required status check.

Approval Reset on New Commits

When new commits are pushed to an approved PR, the lgtm label is automatically removed and auto-merge is disabled. This ensures approvals always reflect the latest code. The author must request a new /lgtm after pushing changes.

Note: The approve label is NOT automatically removed on new commits. If significant changes are made, reviewers should use /approve cancel to remove their approval.