Description
Summary
I've been working on a project involving LMCache tuning, and we have been using this benchmark as part of our evaluation suite.
The metric we are focusing on is time to first token (TTFT), and many of the benchmark runs report 0 for it, which doesn't seem correct.
Additionally, we frequently see the average generation throughput reported as inf.
Is this working as designed? Could you offer some guidance on how to interpret these results?
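For context, here is a minimal sketch of how a zero TTFT could cascade into an inf average, assuming the summary computes per-request throughput as output tokens divided by the post-first-token window (the function name and record values below are hypothetical, not taken from multi-round-qa.py):

```python
def per_request_throughput(ttft_s, total_time_s, output_tokens):
    """Tokens/s over the generation window (after the first token).

    Hypothetical reconstruction of the summary math: if TTFT is logged
    as 0 and the request timestamps collapse, the window is 0 and the
    result is inf -- which then makes the average inf as well.
    """
    window = total_time_s - ttft_s
    return output_tokens / window if window > 0 else float("inf")

samples = [
    per_request_throughput(0.0, 0.0, 128),  # degenerate record -> inf
    per_request_throughput(0.5, 2.5, 100),  # healthy record -> 50.0
]
average = sum(samples) / len(samples)
print(samples, average)  # [inf, 50.0] inf
```

If this is roughly how the summary is computed, a single request with an unrecorded first-token timestamp would be enough to poison the averages we are seeing.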
Details
Command invoked was ./run_benchmarks.sh "meta-llama/Llama-3.1-70B-Instruct" http://khanhtest-vllm-fs-lmcache.ibm-cas-red-stack.svc.cluster.local:8000 /tmp/benchmark/lmcache_off all 1.34 2.0 3.0
/tmp/benchmark/lmcache_off_long_input_output_1.34.csv
[2025-06-16 11:44:01,006] WARNING: Processing the existing summary file /tmp/benchmark/lmcache_off_long_input_output_1.34.csv, ignoring all the other arguments (multi-round-qa.py:722:__main__)
[2025-06-16 11:44:01,008] INFO: Calculating performance summary (multi-round-qa.py:568:__main__)
==================== Performance summary ======================
QPS: 0.0000 reqs/s
Processing speed: 1.3941 reqs/s
Requests on-the-fly: 0
Input tokens per second: 29451.9382 tokens/s
Output tokens per second: 1.3941 tokens/s
Average generation throughput (per request): inf tokens/req/s
^^^^^^^^ inf??
Average TTFT: 0.0000s
^^^^^^^^^~ This takes no time at all?
Time range: 1750097945.6405346 - 1750098046.7831142 (101.14s)
===============================================================
lmcache_off_long_input_output_1.34.csv
The benchmark was run at commit 95b2939136ff003e2e4c67277ec82bf43ff8be34 on main.