
GuideLLM Sweep Benchmark: Low KV Cache Utilization and Limited QPS on H100 GPUs #136


Summary:

GuideLLM should be very useful for benchmarking and evaluating our Inference Gateway project: [Gateway API Inference Extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension). However, we need to operate at higher QPS with 6-10 backends to evaluate our performance effectively. A sweep benchmark against our 10-server vLLM setup (H100 GPUs on GKE) showed surprisingly low peak KV cache utilization (around 25%) and low QPS (max 210 vs. an expected ~260). We would like guidance on the right settings, or a fix for any issue in GuideLLM's sweep mode, so that we can reproduce our previous results.


Detailed Benchmark Issue: Low KV Cache Utilization and QPS Limitations

Setup:

  • 10 vLLM model servers on H100 80GB GPUs on GKE
  • Kubernetes (GKE), c4-standard-192 machine with vLLM endpoints co-located in the same cluster
  • GuideLLM benchmark command running on the c4-standard-192 hardware:
    guidellm benchmark --target http://34.2.30.26:80 \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --processor meta-llama/Llama-3.1-8B-Instruct \
      --data '{"prompt_tokens":955,"output_tokens":162,"samples":1000,"prompt_tokens_stdev":76,"output_tokens_stdev":92,"prompt_tokens_min":4,"prompt_tokens_max":1024,"output_tokens_min":4,"output_tokens_max":1024}' \
      --rate-type sweep --rate 10 --max-seconds 300 \
      --warmup-percent 0.0 --cooldown-percent 0.0 \
      --output-path results/sweep_10_2025-04-24_20-09-17.json
  • Set max concurrency (GUIDELLM__MAX_CONCURRENCY) to 16,900, derived from previous benchmark results: max 260 QPS at KV cache saturation with a max end-to-end latency of 65 seconds (see the sketch below)
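
For reference, the 16,900 figure is just the product of the target rate and the latency bound (Little's law: in-flight requests ≈ arrival rate × end-to-end latency, i.e. 260 × 65 = 16,900). A minimal sketch of how the cap is set in the environment before running the benchmark command above:

    # Concurrency cap from Little's law: 260 req/s × 65 s ≈ 16,900 in-flight requests
    export GUIDELLM__MAX_CONCURRENCY=16900
    # ...then run the `guidellm benchmark` command shown above in the same shell.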

Issue Observed:

  • Only achieved ~25% KV cache utilization with GuideLLM sweep mode
  • Maximum QPS observed was limited to 210 (expected ~260 based on previous benchmarks)
  • Waiting queue size remained relatively low during the tests, confirming that throughput was not being maximized

Observations from Tests:

The following figures illustrate:

  1. Request/s variation: shows how requests/s changes over the sweep, peaking at ~210 requests/s.

  2. KV Cache Utilization: remained around 25%, indicating underutilization.

  3. Queue Waiting Size: remained low throughout the test, showing no significant queue buildup.

[Figure: KV cache utilization during the sweep]

[Figure: QPS obtained during the sweep]

[Figure: Waiting queue size during the sweep]
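
As a side note on how these metrics are watched during a sweep: the sketch below polls a single vLLM server's Prometheus endpoint for the standard vllm:gpu_cache_usage_perc and vllm:num_requests_waiting gauges. The pod address and port are placeholders, and it assumes the default vLLM /metrics endpoint is reachable from the benchmark machine.

    # Poll KV cache utilization and waiting-queue depth from one vLLM server
    # (placeholder address; assumes the default vLLM /metrics endpoint on port 8000).
    while true; do
      curl -s http://<vllm-pod-ip>:8000/metrics \
        | grep -E 'vllm:(gpu_cache_usage_perc|num_requests_waiting)'
      sleep 5
    done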

Additional Context:

  • Prior benchmarks conducted with benchmark_serving.py from the vLLM repository achieved significantly higher KV cache utilization (100%) and QPS (~260). Current tests, despite running on identical hardware (c4-standard-192), consistently fall short of those numbers; a rough equivalent of the earlier invocation is sketched below for comparison.
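
For reference only, the earlier runs used an invocation along these lines; the values mirror the GuideLLM data config above, and the exact flag set depends on the vLLM version, so treat this as an approximation rather than the exact command that was run:

    python benchmarks/benchmark_serving.py \
      --backend vllm \
      --base-url http://34.2.30.26:80 \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --dataset-name random \
      --random-input-len 955 \
      --random-output-len 162 \
      --num-prompts 1000 \
      --request-rate inf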

Action Requested:

  • Guidance on tuning or configuration adjustments to achieve expected performance by saturating KV Cache.
  • Clarification on sweep-mode limitations and best practices.
  • Recommendations or documentation updates for running non-streaming-mode benchmarks.
