
GuideLLM Sweep Benchmark: Low KV Cache Utilization and Limited QPS on H100 GPUs #136


Summary:

GuideLLM should be very useful for benchmarking and evaluating our Inference Gateway project: [Gateway API Inference Extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension). However, we need to operate at higher QPS with 6-10 backends to evaluate our performance effectively. A sweep benchmark against our 10-server vLLM setup (H100 GPUs on GKE) showed surprisingly low peak KV cache utilization (around 25%) and low QPS (max 210 vs. an expected ~260). We would like guidance on the right settings, or a fix for any issue in GuideLLM's sweep mode, so that we can reproduce our previous results.


Detailed Benchmark Issue: Low KV Cache Utilization and QPS Limitations

Setup:

  • 10 vLLM model servers on H100 80GB GPUs on GKE
  • Kubernetes (GKE), c4-standard-192 machine with vLLM endpoints co-located in the same cluster
  • GuideLLM benchmark command running on the c4-standard-192 hardware:
    guidellm benchmark --target http://34.2.30.26:80 \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --processor meta-llama/Llama-3.1-8B-Instruct \
      --data '{"prompt_tokens":955,"output_tokens":162,"samples":1000,"prompt_tokens_stdev":76,"output_tokens_stdev":92,"prompt_tokens_min":4,"prompt_tokens_max":1024,"output_tokens_min":4,"output_tokens_max":1024}' \
      --rate-type sweep --rate 10 --max-seconds 300 \
      --warmup-percent 0.0 --cooldown-percent 0.0 \
      --output-path results/sweep_10_2025-04-24_20-09-17.json
  • Set max concurrency (GUIDELLM__MAX_CONCURRENCY) to 16,900, derived from previous benchmark results: max 260 QPS at KV cache saturation with a max end-to-end latency of 65 seconds (see the sketch below)
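
For reference, the 16,900 figure is just the product of the target rate and the latency bound (Little's law: in-flight requests ≈ arrival rate × end-to-end latency, i.e. 260 × 65 = 16,900). A minimal sketch of how the cap is set in the environment before running the benchmark command above:

    # Concurrency cap from Little's law: 260 req/s × 65 s ≈ 16,900 in-flight requests
    export GUIDELLM__MAX_CONCURRENCY=16900
    # ...then run the `guidellm benchmark` command shown above in the same shell.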

Issue Observed:

  • Only achieved ~25% KV cache utilization with GuideLLM sweep mode
  • Maximum QPS observed was limited to 210 (expected ~260 based on previous benchmarks)
  • Waiting queue size remained relatively low during the tests, confirming that throughput was not being maximized

Observations from Tests:

The following figures illustrate:

  1. Request/s variation: shows how requests/s changes over the sweep, peaking at ~210 requests/s.

  2. KV Cache Utilization: remained around 25%, indicating underutilization.

  3. Queue Waiting Size: remained low throughout the test, showing no significant queue buildup.

[Figure: KV cache utilization during the sweep]

[Figure: QPS obtained during the sweep]

[Figure: Waiting queue size during the sweep]
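
As a side note on how these metrics are watched during a sweep: the sketch below polls a single vLLM server's Prometheus endpoint for the standard vllm:gpu_cache_usage_perc and vllm:num_requests_waiting gauges. The pod address and port are placeholders, and it assumes the default vLLM /metrics endpoint is reachable from the benchmark machine.

    # Poll KV cache utilization and waiting-queue depth from one vLLM server
    # (placeholder address; assumes the default vLLM /metrics endpoint on port 8000).
    while true; do
      curl -s http://<vllm-pod-ip>:8000/metrics \
        | grep -E 'vllm:(gpu_cache_usage_perc|num_requests_waiting)'
      sleep 5
    done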

Additional Context:

  • Prior benchmarks conducted with benchmark_serving.py from the vLLM repository achieved significantly higher KV cache utilization (100%) and QPS (~260). Current tests, despite running on identical hardware (c4-standard-192), consistently fall short of those numbers; a rough equivalent of the earlier invocation is sketched below for comparison.
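
For reference only, the earlier runs used an invocation along these lines; the values mirror the GuideLLM data config above, and the exact flag set depends on the vLLM version, so treat this as an approximation rather than the exact command that was run:

    python benchmarks/benchmark_serving.py \
      --backend vllm \
      --base-url http://34.2.30.26:80 \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --dataset-name random \
      --random-input-len 955 \
      --random-output-len 162 \
      --num-prompts 1000 \
      --request-rate inf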

Action Requested:

  • Guidance on tuning or configuration adjustments to achieve expected performance by saturating KV Cache.
  • Clarification on sweep-mode limitations and best practices.
  • Recommendations or documentation updates for running non-streaming-mode benchmarks.
