A configurable pipeline for running GuideLLM benchmarks against LLM endpoints.
- GuideLLM Overview
- Key Features
- Usage
- Configuration Options
- Results
- Output Structure
- GuideLLM Workbench
## GuideLLM Overview

GuideLLM evaluates and optimizes LLM deployments by simulating real-world inference workloads to assess performance, resource requirements, and cost implications across different hardware configurations.
## Key Features

- Performance & Scalability Testing: Analyze LLM inference under various load scenarios to meet SLOs
- Resource & Cost Optimization: Determine optimal hardware configurations and deployment strategies
- Flexible Deployment: Support for Kubernetes Jobs and Tekton Pipelines with configurable parameters
- Automated Results: Timestamped output directories with comprehensive benchmark results
## Usage

### Kubernetes Job

```bash
# Apply the PVC
kubectl apply -f utils/jobs/pvc.yaml
```
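The PVC only needs to provide a writable volume for the benchmark output. The sketch below shows roughly what `utils/jobs/pvc.yaml` might contain; the claim name `guidellm-output-pvc` matches the Tekton workspace binding later in this README, while the storage size and access mode are assumptions to adjust for your cluster.

```bash
# Illustrative only: a minimal PVC for benchmark output.
# Size and access mode are assumptions; adjust for your storage class.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: guidellm-output-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
EOF
```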
```bash
# Apply the ConfigMap (optional)
kubectl apply -f pipeline/config.yaml
```
```bash
# Run the job with default settings
kubectl apply -f utils/jobs/guidellm-job.yaml

# Or customize environment variables
kubectl set env job/run-guidellm TARGET=http://my-endpoint:8000/v1
kubectl set env job/run-guidellm MODEL_NAME=my-model
```
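Once the job is created you can follow its progress with standard kubectl commands. This is a small usage sketch; the job name `run-guidellm` is taken from the examples above, and the timeout should match your benchmark duration.

```bash
# Stream the benchmark logs while the job runs
kubectl logs -f job/run-guidellm

# Block until the job completes (adjust the timeout to your benchmark duration)
kubectl wait --for=condition=complete job/run-guidellm --timeout=30m
```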
### Tekton Pipeline

```bash
# Apply the task and pipeline
kubectl apply -f pipeline/tekton-task.yaml
kubectl apply -f pipeline/tekton-pipeline.yaml

# Run with parameters
tkn pipeline start guidellm-benchmark-pipeline \
  --param target=http://llama32-3b.llama-serve.svc.cluster.local:8000/v1 \
  --param model-name=llama32-3b \
  --param processor=RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8 \
  --param data-config='{"type":"emulated","prompt_tokens":512,"output_tokens":128}' \
  --workspace name=shared-workspace,claimName=guidellm-output-pvc
```

If you need to install the `tkn` executable on macOS, run `brew install tektoncd-cli`.
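To follow the run after it starts, the Tekton CLI can stream logs from the most recent PipelineRun. This is a usage sketch alongside the pipeline, not part of the pipeline definition itself.

```bash
# Stream logs from the most recent PipelineRun
tkn pipelinerun logs --last -f

# List previous runs and their status
tkn pipelinerun list
```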
Once the Tekton pipeline starts, the GuideLLM benchmark CLI is invoked with the supplied parameters, and the benchmark begins simulating real-world inference workloads against the target endpoint.
## Configuration Options

- `TARGET`: Model endpoint URL
- `MODEL_NAME`: Model identifier
- `PROCESSOR`: Processor/model path
- `DATA_CONFIG`: JSON data configuration
- `OUTPUT_FILENAME`: Output file name
- `RATE_TYPE`: Rate type (synchronous/poisson)
- `MAX_SECONDS`: Maximum benchmark duration
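For reference, these variables roughly correspond to flags on the GuideLLM CLI that the benchmark container invokes. The sketch below is an approximation, not the actual entrypoint script: flag names assume a recent GuideLLM release, so verify them against `guidellm benchmark --help` before relying on it.

```bash
# Illustrative mapping from the environment variables above to a GuideLLM invocation.
# Flag names are assumptions based on recent guidellm releases; verify with --help.
guidellm benchmark \
  --target "${TARGET}" \
  --model "${MODEL_NAME}" \
  --processor "${PROCESSOR}" \
  --data "${DATA_CONFIG}" \
  --rate-type "${RATE_TYPE}" \
  --max-seconds "${MAX_SECONDS}" \
  --output-path "${OUTPUT_FILENAME}"
```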
## Results

The benchmark generates comprehensive performance metrics and visualizations. The results provide detailed insights into throughput, latency, resource utilization, and other key performance indicators to help optimize your LLM deployment strategy.
## Output Structure

Results are organized in timestamped directories:

```
/output/
├── model-name_YYYYMMDD_HHMMSS/
│   ├── benchmark-results.yaml
│   └── benchmark_info.txt
└── model-name_YYYYMMDD_HHMMSS.tar.gz
```
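When the benchmark runs in-cluster, `/output` lives on the `guidellm-output-pvc` volume. One way to pull results to your workstation is to mount the PVC in a short-lived helper pod and copy the archive out; the pod name and image below are arbitrary choices for this sketch, not part of the repo.

```bash
# Illustrative only: a throwaway pod that mounts the output PVC so results can be copied out.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: results-helper
spec:
  containers:
    - name: shell
      image: registry.access.redhat.com/ubi9/ubi
      command: ["sleep", "3600"]
      volumeMounts:
        - name: output
          mountPath: /output
  volumes:
    - name: output
      persistentVolumeClaim:
        claimName: guidellm-output-pvc
EOF

# Copy the results locally, then clean up
kubectl cp results-helper:/output ./guidellm-results
kubectl delete pod results-helper

# Unpack the timestamped archive, e.g.:
tar -xzf guidellm-results/model-name_YYYYMMDD_HHMMSS.tar.gz
```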
## GuideLLM Workbench

The GuideLLM Workbench provides a user-friendly Streamlit web interface for running benchmarks interactively with real-time monitoring and result visualization.
- Interactive Configuration: Easy-to-use forms for endpoint, authentication, and benchmark parameters
- Real-time Monitoring: Live metrics parsing during benchmark execution
- Quick Stats: Sidebar with key performance indicators (requests/sec, tokens/sec, latency, TTFT)
- Results History: Session-based storage with detailed result viewing
- Download Results: Export benchmark results as YAML files
- Comprehensive Results View: Detailed breakdown of all performance metrics
```bash
cd utils/guidellm-wb
pip install -r requirements.txt
streamlit run app.py
```

```bash
# Use pre-built container from registry
podman run -p 8501:8501 quay.io/rh-aiservices-bu/guidellm-wb:v1
```

The workbench will be available at http://localhost:8501 and provides an intuitive interface for configuring and running GuideLLM benchmarks with immediate feedback and comprehensive result analysis.
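If you prefer to build the workbench image yourself rather than pull it from quay.io, something like the following should work from the workbench directory, assuming the repo ships a Containerfile there (an assumption in this sketch).

```bash
# Assumes utils/guidellm-wb contains a Containerfile; adjust the path and tag as needed.
cd utils/guidellm-wb
podman build -t guidellm-wb:local .
podman run -p 8501:8501 guidellm-wb:local
```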