This guide provides detailed instructions for configuring the llm-d-benchmark Kubernetes workflow to match your deployment and benchmarking requirements.
Before running the workflow, you'll need to customize several configuration files:
- Model Configuration - Specify which model to benchmark
- Workload Profiles and Scenarios - Choose benchmark parameters and test scenarios
- Environment Variables - Configure cluster-specific settings
- Scenario-Based Configuration - Use predefined hardware-optimized configurations
To customize which model is being benchmarked, update the model_name field in resources/benchmark-workload-configmap.yaml:
```yaml
data:
  llmdbench_workload.yaml: |
    # Change this to match your deployed model name, from model_endpoint/v1/models
    model_name: "meta-llama/Llama-3.2-3B-Instruct"
```

From ../workload/profiles/, you can choose from these benchmark profiles:

- sanity_short-input.yaml.in - Quick sanity test with short inputs
- sanity_long-input.yaml.in - Quick sanity test with long inputs
- sanity_sharegpt.yaml.in - Quick test using the ShareGPT dataset
- small_model_long_input.yaml.in - Optimized for small models (1B-3B parameters)
- medium_model_long_input.yaml.in - Optimized for medium models (8B parameters)
- large_model_long_input.yaml.in - Optimized for large models (70B+ parameters)
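The model_name must exactly match a model id served at your endpoint's /v1/models route. A minimal offline sketch of extracting that id from a saved response (in a live cluster you would fetch the JSON with curl against your endpoint; the file path and sample payload below are illustrative):

```shell
# Sample /v1/models payload; in practice fetch it with:
#   curl -s "$LLMDBENCH_HARNESS_ENDPOINT_URL/v1/models" > /tmp/models.json
cat > /tmp/models.json <<'EOF'
{"object":"list","data":[{"id":"meta-llama/Llama-3.2-3B-Instruct","object":"model"}]}
EOF
# Extract the served model id (plain sed; use jq if it is available)
sed -n 's/.*"id":"\([^"]*\)".*/\1/p' /tmp/models.json
```

The printed id is the value to copy into model_name verbatim, including the organization prefix.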
Configure the scenarios field in resources/benchmark-workload-configmap.yaml:
```yaml
scenarios: "long-input"  # Options: short-input, long-input, sharegpt
```

Scenario options:

- short-input - Tests with shorter prompts and responses
- long-input - Tests with longer prompts and responses
- sharegpt - Uses the ShareGPT conversation dataset
Customize the load testing parameters:
```yaml
qps_values: "0.1 0.25 0.5"  # Space-separated list of QPS values to test
```

Recommended QPS values by model size:

- Small models (1B-3B): "0.5 1.0 2.0"
- Medium models (8B): "0.1 0.25 0.5"
- Large models (70B+): "0.05 0.1 0.25"
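Each value in the space-separated list produces one benchmark run at that request rate. Conceptually, the sweep looks like this (a sketch of the semantics, not the harness's actual loop):

```shell
# One benchmark run per QPS value in the list
QPS_VALUES="0.1 0.25 0.5"
for qps in $QPS_VALUES; do
  echo "benchmark run at ${qps} QPS"
done
```

Starting from low rates and working upward, as in the recommendations above, helps locate the saturation point without wasting long runs at rates the model cannot sustain.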
Update resources/benchmark-env.yaml with your cluster-specific settings.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: benchmark-env
  namespace: llm-d-benchmark
data:
  # Core benchmark configuration
  LLMDBENCH_FMPERF_NAMESPACE: "llm-d-benchmark"                  # Namespace for benchmark jobs
  LLMDBENCH_HARNESS_STACK_TYPE: "vllm-prod"                      # "vllm-prod" for standalone, "llm-d" for llm-d stack
  LLMDBENCH_HARNESS_ENDPOINT_URL: "https://your-model-endpoint"  # UPDATE: Your model service endpoint
  LLMDBENCH_HARNESS_STACK_NAME: "standalone-vllm-llama-3b"       # Unique identifier for this benchmark run
  LLMDBENCH_HARNESS_WORKLOAD_FILE: "llmdbench_workload.yaml"     # Workload configuration file name
  LLMDBENCH_FMPERF_REPETITION: "1"                               # Number of times to repeat the benchmark
  LLMDBENCH_FMPERF_HARNESS_DIR: "/requests"                      # Directory to store results (keep as /requests)
```

LLMDBENCH_HARNESS_ENDPOINT_URL is the most critical variable to update. Point it to your deployed model service:
For standalone vLLM deployments, use the internal service URL, http://service-name.namespace.svc.cluster.local:port:

```yaml
LLMDBENCH_HARNESS_ENDPOINT_URL: "http://vllm-service.vllm-namespace.svc.cluster.local:8000"
```

For llm-d stack deployments, use the internal gateway URL, http://llm-d-inference-gateway.namespace.svc.cluster.local:80:

```yaml
LLMDBENCH_HARNESS_ENDPOINT_URL: "http://llm-d-inference-gateway.llm-d.svc.cluster.local:80"
```

Set LLMDBENCH_HARNESS_STACK_TYPE based on your deployment type:

- "vllm-prod" for standalone vLLM deployments
- "llm-d" for llm-d stack deployments
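Both internal URLs follow the standard Kubernetes service DNS pattern. A quick sketch of assembling one, where the service name, namespace, and port are placeholders for your own deployment:

```shell
SERVICE="vllm-service"      # placeholder: your Service's name
NAMESPACE="vllm-namespace"  # placeholder: the namespace the Service lives in
PORT="8000"                 # placeholder: the port exposed by the Service
# Cluster-internal DNS name: <service>.<namespace>.svc.cluster.local
ENDPOINT="http://${SERVICE}.${NAMESPACE}.svc.cluster.local:${PORT}"
echo "$ENDPOINT"
```

Because these names resolve only inside the cluster, the benchmark jobs must run in the same cluster as the model service; an external route or ingress URL would be needed otherwise.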
Choose a descriptive name for your benchmark run. Examples:

- standalone-vllm-3b-instruct
- llm-d-8b-base
- standalone-vllm-70b-instruct
```yaml
# Advanced configuration (optional)
LLMDBENCH_FMPERF_REPETITION: "3"        # Run the benchmark 3 times for better statistics
LLMDBENCH_CONTROL_WAIT_TIMEOUT: "3600"  # Increase timeout for large models (seconds)
```

Instead of manually configuring environment variables, you can use predefined scenarios from ../scenarios/. These scenarios contain optimized configurations for specific hardware and model combinations:

- ocp_H100_deployer_llama-70b.sh - H100 GPU with 70B model using the llm-d deployer
- ocp_H100_standalone_llama-70b.sh - H100 GPU with 70B model using standalone vLLM
- ocp_L40_deployer_llama-3b.sh - L40 GPU with 3B model using the llm-d deployer
- ocp_L40_standalone_llama-3b.sh - L40 GPU with 3B model using standalone vLLM
- ocp_L40_standalone_llama-8b.sh - L40 GPU with 8B model using standalone vLLM
- kubernetes_H200_deployer_llama-8b.sh - H200 GPU with 8B model using the llm-d deployer
- ocp_H100MIG_deployer_llama-3b.sh - H100 MIG with 3B model using the llm-d deployer
- ocp_H100MIG_deployer_llama-8b.sh - H100 MIG with 8B model using the llm-d deployer
To use a scenario:

- Choose a scenario that matches your hardware and model requirements
- View the scenario file to see the environment variables it sets:

  ```shell
  cat ../scenarios/ocp_L40_deployer_llama-3b.sh
  ```

- Copy the relevant variables to your resources/benchmark-env.yaml
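Since scenario files are lists of shell exports while benchmark-env.yaml stores plain key/value pairs, a small sed pass can help with the copy step. This is a convenience sketch only; values containing quotes or shell expansions may still need hand-editing, and the sample file written here stands in for a real scenario:

```shell
# Turn "export KEY=value" lines into ConfigMap-style '  KEY: "value"' entries
cat > /tmp/scenario.sh <<'EOF'
export LLMDBENCH_DEPLOY_MODEL_LIST=llama-3b
export LLMDBENCH_VLLM_COMMON_REPLICAS=1
EOF
sed -n 's/^export \([A-Za-z0-9_]*\)=\(.*\)$/  \1: "\2"/p' /tmp/scenario.sh
```

The output lines are indented two spaces so they can be pasted directly under the data: key of the ConfigMap.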
Example scenario content:

```shell
export LLMDBENCH_DEPLOY_MODEL_LIST=llama-3b
export LLMDBENCH_VLLM_COMMON_AFFINITY=nvidia.com/gpu.product:NVIDIA-L40S
export LLMDBENCH_VLLM_COMMON_PVC_STORAGE_CLASS=nfs-client-pokprod
export LLMDBENCH_VLLM_COMMON_REPLICAS=1
```

Example configuration for a standalone vLLM 3B deployment:

resources/benchmark-env.yaml:
```yaml
data:
  LLMDBENCH_HARNESS_STACK_TYPE: "vllm-prod"
  LLMDBENCH_HARNESS_ENDPOINT_URL: "http://vllm-service.vllm-ns.svc.cluster.local:8000"
  LLMDBENCH_HARNESS_STACK_NAME: "standalone-vllm-3b-instruct"
  LLMDBENCH_FMPERF_JOB_ID: "standalone-vllm-3b-instruct"
```

resources/benchmark-workload-configmap.yaml:
```yaml
data:
  llmdbench_workload.yaml: |
    model_name: "meta-llama/Llama-3.2-3B-Instruct"
    scenarios: "long-input"
    qps_values: "0.5 1.0 2.0"
```

Example configuration for an llm-d 70B deployment:

resources/benchmark-env.yaml:
```yaml
data:
  LLMDBENCH_HARNESS_STACK_TYPE: "llm-d"
  LLMDBENCH_HARNESS_ENDPOINT_URL: "http://llm-d-inference-gateway.llm-d.svc.cluster.local:80"
  LLMDBENCH_HARNESS_STACK_NAME: "llm-d-70b-instruct"
  LLMDBENCH_FMPERF_JOB_ID: "llm-d-70b-instruct"
  LLMDBENCH_CONTROL_WAIT_TIMEOUT: "3600"
```

resources/benchmark-workload-configmap.yaml:
```yaml
data:
  llmdbench_workload.yaml: |
    model_name: "meta-llama/Llama-3.1-70B-Instruct"
    scenarios: "long-input"
    qps_values: "0.05 0.1 0.25"
```
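Before launching a long 70B run, it can be worth sanity-checking the workload values locally. A minimal sketch that validates the qps_values string (values copied from the example above; the check itself is an illustration, not part of the harness):

```shell
# Reject empty entries or entries containing anything besides digits and dots
QPS_VALUES="0.05 0.1 0.25"
for q in $QPS_VALUES; do
  case "$q" in
    ''|*[!0-9.]*) echo "invalid QPS value: '$q'" >&2; exit 1 ;;
  esac
done
echo "qps_values OK: $QPS_VALUES"
```

Catching a malformed value here is far cheaper than discovering it after the benchmark job has already been scheduled on the cluster.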