
--batch-size CLI parameter is added #73


Closed
wants to merge 3 commits

Conversation

@parfeniukink (Contributor) commented Feb 25, 2025

Set up the environment

  1. Run the model via vllm or llama.cpp
  2. Execute the guidellm command

Command

guidellm --target "http://localhost:8080/v1" --model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" --tokenizer "hf-internal-testing/llama-tokenizer" --data-type emulated --data "prompt_tokens=512,generated_tokens=128" --rate-type constant --rate 2 --max-seconds 100 --batch-size 2

Output

  Generating report... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (1/1) [ 0:01:40 < 0:00:00 ]
╭─ GuideLLM Benchmarks Report (stdout) ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ╭─ Benchmark Report 1 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │
│ │ Backend(type=openai_server, target=http://localhost:8080/v1, model=Phi-3-mini-4k-instruct-q4.gguf)                                                                                     │ │
│ │ Data(type=emulated, source=prompt_tokens=128,generated_tokens=128, tokenizer=hf-internal-testing/llama-tokenizer)                                                                      │ │
│ │ Rate(type=constant, rate=(8.0,))                                                                                                                                                       │ │
│ │ Limits(max_number=None requests, max_duration=100 sec)                                                                                                                                 │ │
│ │                                                                                                                                                                                        │ │
│ │                                                                                                                                                                                        │ │
│ │ Requests Data by Benchmark                                                                                                                                                             │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓                                                                                │ │
│ │ ┃ Benchmark                 ┃ Requests Completed ┃ Request Failed ┃ Duration  ┃ Start Time ┃ End Time ┃                                                                                │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩                                                                                │ │
│ │ │ constant@8.00 req/sec     │ 12/12              │ 0/12           │ 90.03 sec │ 21:46:41   │ 21:48:11 │                                                                                │ │
│ │ └───────────────────────────┴────────────────────┴────────────────┴───────────┴────────────┴──────────┘                                                                                │ │
│ │                                                                                                                                                                                        │ │
│ │ Tokens Data by Benchmark                                                                                                                                                               │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓                                                                  │ │
│ │ ┃ Benchmark                 ┃ Prompt ┃ Prompt (1%, 5%, 50%, 95%, 99%)    ┃ Output ┃ Output (1%, 5%, 50%, 95%, 99%)  ┃                                                                  │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩                                                                  │ │
│ │ │ constant@8.00 req/sec     │ 128.25 │ 128.0, 128.0, 128.0, 129.0, 129.0 │ 117.42 │ 56.3, 65.5, 128.0, 128.0, 128.0 │                                                                  │ │
│ │ └───────────────────────────┴────────┴───────────────────────────────────┴────────┴─────────────────────────────────┘                                                                  │ │
│ │                                                                                                                                                                                        │ │
│ │ Performance Stats by Benchmark                                                                                                                                                         │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │ │
│ │ ┃                           ┃ Request Latency [1%, 5%, 10%, 50%, 90%, 95%,     ┃ Time to First Token [1%, 5%, 10%, 50%, 90%, 95%, ┃ Inter Token Latency [1%, 5%, 10%, 50%, 90%, 95%, ┃ │ │
│ │ ┃ Benchmark                 ┃ 99%] (sec)                                       ┃ 99%] (ms)                                        ┃ 99%] (ms)                                        ┃ │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ │
│ │ │ constant@8.00 req/sec     │ 7.76, 7.83, 7.91, 10.56, 16.01, 16.60, 17.16     │ 828.5, 830.6, 833.0, 4789.5, 9004.5, 9510.3,     │ 49.7, 51.1, 51.8, 55.0, 66.4, 70.4, 75.6         │ │ │
│ │ │                           │                                                  │ 9994.8                                           │                                                  │ │ │
│ │ └───────────────────────────┴──────────────────────────────────────────────────┴──────────────────────────────────────────────────┴──────────────────────────────────────────────────┘ │ │
│ │                                                                                                                                                                                        │ │
│ │ Performance Summary by Benchmark                                                                                                                                                       │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓                                            │ │
│ │ ┃ Benchmark                 ┃ Requests per Second ┃ Request Latency ┃ Time to First Token ┃ Inter Token Latency ┃ Output Token Throughput ┃                                            │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━┩                                            │ │
│ │ │ constant@8.00 req/sec     │ 0.13 req/sec        │ 11.60 sec       │ 4941.58 ms          │ 57.20 ms            │ 15.65 tokens/sec        │                                            │ │
│ │ └───────────────────────────┴─────────────────────┴─────────────────┴─────────────────────┴─────────────────────┴─────────────────────────┘                                            │ │
│ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

@parfeniukink parfeniukink self-assigned this Feb 25, 2025
@parfeniukink parfeniukink marked this pull request as draft February 25, 2025 20:01
@parfeniukink parfeniukink removed the request for review from markurtz February 25, 2025 20:01
@markurtz (Member) commented

Closing this out: as implemented, all this flag does is run a set number of requests, equal to the batch size, in parallel. To add true batch support, we would either need to run vLLM locally or go through the OpenAI batch processing API, which is a significant expansion in scope and work.
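The distinction in the closing comment can be sketched in a few lines: capping in-flight requests with a semaphore is client-side parallelism, not server-side batching of a single request. A minimal illustration, with all names hypothetical (`send_request` stands in for the real HTTP call to the backend):

```python
import asyncio

async def send_request(i: int) -> str:
    # Stand-in for an HTTP call to the backend; a real client would
    # POST to /v1/completions here.
    await asyncio.sleep(0.01)
    return f"response-{i}"

async def run_with_batch_size(num_requests: int, batch_size: int) -> list[str]:
    # At most batch_size requests are in flight at once. Each request
    # is still an independent call, so the server sees no batch.
    sem = asyncio.Semaphore(batch_size)

    async def bounded(i: int) -> str:
        async with sem:
            return await send_request(i)

    return await asyncio.gather(*(bounded(i) for i in range(num_requests)))

results = asyncio.run(run_with_batch_size(num_requests=12, batch_size=2))
```

True batching would instead require the backend (e.g. a locally run vLLM engine or the OpenAI batch API) to group multiple prompts into one forward pass, which is the scope expansion the comment refers to.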

@markurtz markurtz closed this Mar 10, 2025
@github-project-automation github-project-automation bot moved this from In progress to Done in GuideLLM Kanban Board Mar 10, 2025
@markurtz markurtz deleted the parfeniukink/batch-size-cli-parameter branch April 21, 2025 15:02
Labels
load-request, workstream
Projects
Status: Done
3 participants