Benchmarking an Existing Stack
A simple, minimal example of using llm-d-benchmark to test an already deployed llm-d stack with inference-perf.
Note: For ease of presentation, the example assumes an OpenShift cluster and uses `oc`. For a Kubernetes cluster, replace `oc` with `kubectl`.
git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark
./setup/install_deps.sh
See list of dependencies.
Set up the stack for benchmarking. Since you are starting from an existing stack, you may need to restart some pods to create a clean baseline. For this simple example, if you want to compare different setups (e.g., various epp configurations), you have to set up each configuration manually and rerun the example for each one.
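For instance, a clean baseline can be obtained by restarting the model-serving deployments before each run. The sketch below is illustrative; `<decode-deployment>` is a placeholder for whatever deployment(s) your stack actually uses:

```bash
# Illustrative only: restart the serving deployments so caches and counters start fresh.
oc get deployments                                  # find the relevant deployment name(s)
oc rollout restart deployment/<decode-deployment>
oc rollout status deployment/<decode-deployment>    # wait until the new pods are Ready
```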
In this example, the benchmark sends requests to an infra-inference-scheduling-inference-gateway endpoint. Replace infra-inference-scheduling-inference-gateway with the name of your inference gateway service. Alternatively, use a pod name if you want to benchmark a single vLLM instance instead.
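If you are unsure of the service name, listing the services in the namespace is usually enough to spot it (the grep pattern below is just a heuristic):

```bash
# Gateway services typically contain "gateway" in their name.
oc get services | grep -i gateway
# Or list pods if you plan to benchmark a single vLLM instance instead.
oc get pods
```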
You will need a YAML specification that tells inference-perf how to generate the workload used to benchmark your stack. inference-perf will generate prompts (a.k.a. a data set) with timing (a.k.a. load).
Several workload examples are available under llm-d-benchmark/workload/profiles/inference-perf. We demonstrate with shared_prefix_synthetic.
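To see what is available in your checkout:

```bash
ls workload/profiles/inference-perf/
```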
For example, shared_prefix_synthetic.yaml.in:
load:
  type: constant
  stages:
  - rate: 2
    duration: 50
  - rate: 5
    duration: 50
  - rate: 8
    duration: 50
  - rate: 10
    duration: 50
  - rate: 12
    duration: 50
  - rate: 15
    duration: 50
  - rate: 20
    duration: 50
api:
  type: completion
  streaming: true
server:
  type: vllm
  model_name: REPLACE_ENV_LLMDBENCH_DEPLOY_CURRENT_MODEL
  base_url: REPLACE_ENV_LLMDBENCH_HARNESS_STACK_ENDPOINT_URL
  ignore_eos: true
tokenizer:
  pretrained_model_name_or_path: REPLACE_ENV_LLMDBENCH_DEPLOY_CURRENT_TOKENIZER
data:
  type: shared_prefix
  shared_prefix:
    num_groups: 32              # Number of distinct shared prefixes
    num_prompts_per_group: 32   # Number of unique questions per shared prefix
    system_prompt_len: 2048     # Length of the shared prefix (in tokens)
    question_len: 256           # Length of the unique question part (in tokens)
    output_len: 256             # Target length for the model's generated output (in tokens)
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: true
storage:
  local_storage:
    path: /workspace

If you want to create your own inference-perf profile, save your custom YAML specification with a `.yaml.in` suffix in the same directory.
The key parts are the load and data sections. load tells inference-perf how many sub-tests ("stages") to run; each stage has a desired QPS (queries per second) and duration. data defines how to create the data set; the configuration parameters differ for each type.
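As a rough sanity check on run length, the total number of requests follows directly from the stages: summing rate times duration for the profile above gives (2+5+8+10+12+15+20) x 50 = 3600 requests over a nominal 7 x 50 = 350 seconds of load (actual wall time can be longer, see the note on timing below). A quick shell check:

```bash
# Quick arithmetic for the profile above (not part of the tooling)
echo $(( (2 + 5 + 8 + 10 + 12 + 15 + 20) * 50 ))   # total requests: 3600
echo $(( 7 * 50 ))                                  # nominal load duration in seconds: 350
```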
Run `oc login ...`
Then, run
`oc project <_namespace-name_>`
or, if using kubectl:
`kubectl config set-context --current --namespace=<_namespace-name_>`
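To confirm you are pointed at the intended namespace (a minimal check):

```bash
oc project          # prints the current project/namespace
# kubectl equivalent:
kubectl config view --minify -o jsonpath='{..namespace}{"\n"}'
```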
You will also need the following:
- Work Directory: Choose a local work directory to save the results on your computer.
- Harness Profile: The name of your `.yaml.in` file without the suffix, e.g., `shared_prefix_synthetic`.
- PVC: A Persistent Volume Claim for storing benchmarking results. Must be one of the available PVCs in the cluster:
  `oc get persistentvolumeclaims -o name`
- Hugging-Face Token [Optional]: If none is specified then the `HF_TOKEN` used in the existing `llm-d` stack will be used:
  `oc get secrets llm-d-hf-token -o jsonpath='{.data.*}' | base64 -d`
Prepare a file ./myenv.sh with the following content (the file name must have a .sh suffix):
export LLMDBENCH_DEPLOY_METHODS="infra-inference-scheduling-inference-gateway"
export LLMDBENCH_HARNESS_NAME="inference-perf"
# export LLMDBENCH_HF_TOKEN=<_your Hugging Face token_>
# Work Directory; for example "/tmp/<_namespace_>"
export LLMDBENCH_CONTROL_WORK_DIR="<_name of your local Work Directory_>"
# Persistent Volume Claim
export LLMDBENCH_HARNESS_PVC_NAME="<_name of your Harness PVC_>"
# Timeout (seconds) for running a full test.
# If it expires, the benchmark still runs, but results are not collected to the local computer.
export LLMDBENCH_HARNESS_WAIT_TIMEOUT=3600

cd into the llm-d-benchmark root directory.
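Optionally, source the file first to confirm the values resolve and the PVC exists (a quick, illustrative check):

```bash
source ./myenv.sh
echo "work dir: ${LLMDBENCH_CONTROL_WORK_DIR}"
oc get pvc "${LLMDBENCH_HARNESS_PVC_NAME}"   # should be listed in the current namespace
```

Then launch the benchmark: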
./run.sh \
-c "$(realpath ./myenv.sh)" `# use full path` \
-w "shared_prefix_synthetic" `# <name of the Harness Profile>.yaml.in` \
# -s 10000 # uncomment to set a timeout on the command line
Wait for test completion... ⏳ ... ⏳ ... ⏳ ... ☕ ... ⏳ ... ⏳ ...
Note: Each stage in the test may run longer than the duration specified in the .yaml.in file. inference-perf sets the desired timing for each request, but waits until all requests complete (succeed or fail). inference-perf itself imposes no timeout; however, you can set timeouts in your llm-d stack.
The results are stored under the given Work Directory:
- `results` holds the numeric results and logs (each experiment has a unique sub-directory).
- `analysis` holds the plots (each experiment has a unique sub-directory).
- Other directories hold detailed information on the last run (e.g., the `.yaml` files used).
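For example, you can browse the local copies like this (the exact sub-directory names are generated per run):

```bash
ls "${LLMDBENCH_CONTROL_WORK_DIR}/results"
ls "${LLMDBENCH_CONTROL_WORK_DIR}/analysis"
```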
The results are not lost if the local run.sh times out or the connection to the cluster is lost.
run.sh creates a pod named access-to-harness-data... that provides access to the Harness PVC.
You can `oc rsh` or `kubectl exec -it` into the pod and run bash to view the results under the /requests directory.
You can use `oc rsync` or `kubectl cp` to fetch the results.
Find access pod name, e.g.,
$ oc get pods -l app=llm-d-benchmark-harness -o name
pod/access-to-harness-data-vllm-p2p-70b-chart-llama-3-70b-instruct-storage-claim
List latest results (kubectl uses slightly different syntax):
oc rsh pod/access-to-harness-data-vllm-p2p-70b-chart-llama-3-70b-instruct-storage-claim ls -lrt /requests | tail -3
drwxr-sr-x. 3 root 1001020000 13 Aug 5 18:17 inference-perf_1754416561_inference-gateway-70b-instruct
drwxr-sr-x. 3 root 1001020000 13 Aug 5 18:39 inference-perf_1754417987_inference-gateway-70b-instruct
drwxr-sr-x. 3 root 1001020000 13 Aug 5 19:02 inference-perf_1754419311_inference-gateway-70b-instruct
Fetch the results (kubectl uses slightly different syntax):
oc rsync access-to-harness-data-vllm-p2p-70b-chart-llama-3-70b-instruct-storage-claim:/requests/inference-perf_1754419311_inference-gateway-70b-instruct /tmp --no-perms
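With kubectl, the equivalent commands would look roughly like this (the pod and results directory names are taken from the listing above):

```bash
# List latest results
kubectl exec access-to-harness-data-vllm-p2p-70b-chart-llama-3-70b-instruct-storage-claim -- ls -lrt /requests | tail -3
# Fetch one results directory
kubectl cp access-to-harness-data-vllm-p2p-70b-chart-llama-3-70b-instruct-storage-claim:/requests/inference-perf_1754419311_inference-gateway-70b-instruct /tmp/inference-perf_1754419311_inference-gateway-70b-instruct
```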