Summary:
GuideLLM is expected to be highly useful for benchmarking and evaluating our Inference Gateway project: [Gateway API Inference Extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension). However, to evaluate our performance meaningfully we need to operate at higher QPS with 6-10 backends. A sweep benchmark on our 10-vLLM-server setup (H100 GPUs in GKE) showed surprisingly low maximum KV cache utilization (~25%) and low QPS (a peak of 210 vs. an expected ~260). We need to identify the right settings, or fix any issue in GuideLLM's sweep mode, so that results match our previous benchmarks.
Detailed Benchmark Issue: Low KV Cache Utilization and QPS Limitations
Setup:
- 10 vLLM model servers on H100 80GB GPUs on GKE
- Kubernetes (GKE), c4-standard-192 machine with vLLM endpoints co-located in the same cluster
- GuideLLM benchmark command running on the c4-standard-192 hardware:
  ```bash
  guidellm benchmark --target http://34.2.30.26:80 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --processor meta-llama/Llama-3.1-8B-Instruct \
    --data '{"prompt_tokens":955,"output_tokens":162,"samples":1000,"prompt_tokens_stdev":76,"output_tokens_stdev":92,"prompt_tokens_min":4,"prompt_tokens_max":1024,"output_tokens_min":4,"output_tokens_max":1024}' \
    --rate-type sweep --rate 10 --max-seconds 300 \
    --warmup-percent 0.0 --cooldown-percent 0.0 \
    --output-path results/sweep_10_2025-04-24_20-09-17.json
  ```
- Set max concurrency (`GUIDELLM__MAX_CONCURRENCY`) to 16,900, derived from previous benchmarking results: a maximum of 260 QPS at KV cache saturation with a 65-second maximum E2E latency (see the sanity check below)
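
For reference, the 16,900 figure follows directly from Little's law (in-flight requests ≈ throughput × latency), using the numbers from our earlier runs. The export line assumes GuideLLM's standard environment-variable override for settings:

```bash
# Little's law: concurrent in-flight requests ≈ QPS × end-to-end latency
echo $((260 * 65))   # -> 16900

# Cap applied before launching the benchmark
export GUIDELLM__MAX_CONCURRENCY=16900
```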
Issue Observed:
- Only ~25% KV cache utilization was achieved in GuideLLM sweep mode (read from vLLM's metrics; see the spot-check below)
- Maximum observed QPS was limited to 210 (expected ~260 based on previous benchmarks)
- The waiting queue size remained low throughout the tests, confirming that the servers were never driven to saturation and throughput was not maximized
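
The utilization and queue numbers were taken from vLLM's Prometheus metrics. A minimal manual spot-check against a single backend, assuming the default metric names and serving port (our actual GKE scrape setup differs; this is the hand-run equivalent):

```bash
# Poll one vLLM backend's /metrics endpoint for cache usage and queue depth
# (vllm:gpu_cache_usage_perc is reported as a 0-1 fraction)
curl -s http://<pod-ip>:8000/metrics \
  | grep -E 'vllm:(gpu_cache_usage_perc|num_requests_waiting|num_requests_running)'
```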
Observations from Tests:
The following figures illustrate:
- Request/s variation: clearly shows how Request/s changes over the sweep, peaking around ~210.
- KV cache utilization: remained around 25%, indicating underutilization.
- Queue waiting size: remained low throughout the test, showing no significant queue buildup.
*(Figure: KV cache utilization during the sweep)*

*(Figure: QPS obtained during the sweep)*

*(Figure: waiting queue size during the sweep)*
Additional Context:
- Prior benchmarks conducted with `benchmark_serving.py` from the vLLM repository achieved significantly higher KV cache utilization (100%) and QPS (~260). Current tests, despite running on identical hardware (c4-standard-192), consistently fall short of these results (see the illustrative invocation below).
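
For comparison, the earlier runs used roughly the following shape of invocation. The exact flag values from those runs are not recorded in this issue, so treat this as an illustrative sketch of `benchmark_serving.py`'s random-dataset mode rather than the literal command:

```bash
# Illustrative only: approximates the shape of our earlier benchmark_serving.py runs
# (actual flag values from those runs were not preserved here)
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 34.2.30.26 --port 80 \
  --dataset-name random \
  --random-input-len 955 --random-output-len 162 \
  --request-rate 260 --num-prompts 5000
```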
Action Requested:
- Guidance on tuning or configuration adjustments needed to saturate the KV cache and reach the expected performance.
- Clarification on sweep mode limitations and best practices (e.g., whether a fixed-rate run, as sketched below, is the recommended way to force saturation).
- Recommendations or documentation updates for running non-streaming benchmarks.
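
As a possible workaround while the sweep behavior is investigated, GuideLLM also supports fixed-rate runs. A minimal sketch, assuming `--rate-type constant` interprets `--rate` as the target requests/second:

```bash
# Pin the request rate at the previously observed saturation point instead of sweeping
guidellm benchmark --target http://34.2.30.26:80 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --processor meta-llama/Llama-3.1-8B-Instruct \
  --data '{"prompt_tokens":955,"output_tokens":162,"samples":1000}' \
  --rate-type constant --rate 260 --max-seconds 300
```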