Benchmark throughput displayed is not the same as the one from vLLM #29

@TheRValiquette

On many occasions, the throughput value measured and displayed by Benchmark does not match the value reported in the vLLM log.

┌─────────────────┬────────────────────────────────────────────────────────────────┐
│ Parameter       │ Value                                                          │
├─────────────────┼────────────────────────────────────────────────────────────────┤
│ Max VUs         │ 128                                                            │
│ Duration        │ 60                                                             │
│ Warmup Duration │ 30                                                             │
│ Benchmark Kind  │ Sweep                                                          │
│ Rates           │ N/A                                                            │
│ Num Rates       │ 1                                                              │
│ Prompt Options  │ N/A                                                            │
│ Decode Options  │ num_tokens=Some(800),min_tokens=50,max_tokens=800,variance=100 │
│ Tokenizer       │ deepseek-ai/DeepSeek-R1-Distill-Llama-8B                       │
│ Extra Metadata  │ N/A                                                            │
└─────────────────┴────────────────────────────────────────────────────────────────┘


Run 1:

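┌────────────┬────────────┬───────────────────┬────────────┬───────────┬────────────────────┐
│ Benchmark  │ QPS        │ E2E Latency (avg) │ TTFT (avg) │ ITL (avg) │ Throughput         │
├────────────┼────────────┼───────────────────┼────────────┼───────────┼────────────────────┤
│ warmup     │ 0.07 req/s │ 13.95 sec         │ 3268.06 ms │ 15.22 ms  │ 50.42 tokens/sec   │
│ throughput │ 3.84 req/s │ 24.96 sec         │ 307.17 ms  │ 36.85 ms  │ 2560.47 tokens/sec │
└────────────┴────────────┴───────────────────┴────────────┴───────────┴────────────────────┘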

Run 2:

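┌────────────┬────────────┬───────────────────┬────────────┬───────────┬────────────────────┐
│ Benchmark  │ QPS        │ E2E Latency (avg) │ TTFT (avg) │ ITL (avg) │ Throughput         │
├────────────┼────────────┼───────────────────┼────────────┼───────────┼────────────────────┤
│ warmup     │ 0.08 req/s │ 13.30 sec         │ 1554.01 ms │ 14.70 ms  │ 60.15 tokens/sec   │
│ throughput │ 2.41 req/s │ 38.43 sec         │ 665.35 ms  │ 56.19 ms  │ 1596.76 tokens/sec │
└────────────┴────────────┴───────────────────┴────────────┴───────────┴────────────────────┘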

Log line from vLLM (similar in both runs):
INFO 06-13 10:23:08 [loggers.py:111] Engine 000: Avg prompt throughput: 548.9 tokens/s, Avg generation throughput: 2475.0 tokens/s, Running: 128 reqs,

The vLLM log line above was captured during Run 2. I would have expected Benchmark to report roughly 2475 tokens/sec for Run 2, not 1596 tokens/sec.
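As a rough cross-check, the reported figures can be related to each other with a few lines of Python. This is only a sketch under two assumptions that are not confirmed anywhere in this issue: that the Throughput column counts generated tokens per second over the whole phase, and that all 128 VUs are decoding concurrently, each emitting one token per average ITL. The helper names are hypothetical; the numbers are copied from the tables and the log line above.

```python
# Figures copied from the Run 1 / Run 2 tables and the vLLM log line.
MAX_VUS = 128
VLLM_GENERATION_TPS = 2475.0  # "Avg generation throughput" from the vLLM log

RUNS = {
    "run1": {"qps": 3.84, "itl_ms": 36.85, "throughput": 2560.47},
    "run2": {"qps": 2.41, "itl_ms": 56.19, "throughput": 1596.76},
}


def tokens_per_request(qps: float, throughput: float) -> float:
    """Average tokens per completed request implied by QPS and Throughput."""
    return throughput / qps


def decode_rate_estimate(concurrency: int, itl_ms: float) -> float:
    """Tokens/sec if every concurrent request emits one token per ITL.

    Assumption: all `concurrency` requests are in the decode phase at once.
    """
    return concurrency / (itl_ms / 1000.0)


for name, r in RUNS.items():
    per_req = tokens_per_request(r["qps"], r["throughput"])
    est = decode_rate_estimate(MAX_VUS, r["itl_ms"])
    print(
        f"{name}: reported={r['throughput']:.0f} tok/s, "
        f"implied tokens/request={per_req:.0f}, "
        f"ITL-based estimate={est:.0f} tok/s, "
        f"vLLM log={VLLM_GENERATION_TPS:.0f} tok/s"
    )
```

For Run 2 this gives about 663 tokens per request and an ITL-based estimate of about 2278 tokens/sec, which is closer to the vLLM figure than to the reported 1596.76 tokens/sec. That may hint that the two numbers are measured over different windows or in different ways, but the sketch alone cannot confirm it.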

Am I misunderstanding how that works?
