
--batch-size CLI parameter is added #73


Closed
wants to merge 3 commits

Conversation

@parfeniukink (Contributor) commented Feb 25, 2025

Set up the environment

  1. Run the model via vllm or llama.cpp
  2. Execute the guidellm command

Command

guidellm --target "http://localhost:8080/v1" --model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" --tokenizer "hf-internal-testing/llama-tokenizer" --data-type emulated --data "prompt_tokens=512,generated_tokens=128" --rate-type constant --rate 2 --max-seconds 100 --batch-size 2

Output

  Generating report... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (1/1) [ 0:01:40 < 0:00:00 ]
╭─ GuideLLM Benchmarks Report (stdout) ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ╭─ Benchmark Report 1 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │
│ │ Backend(type=openai_server, target=http://localhost:8080/v1, model=Phi-3-mini-4k-instruct-q4.gguf)                                                                                     │ │
│ │ Data(type=emulated, source=prompt_tokens=128,generated_tokens=128, tokenizer=hf-internal-testing/llama-tokenizer)                                                                      │ │
│ │ Rate(type=constant, rate=(8.0,))                                                                                                                                                       │ │
│ │ Limits(max_number=None requests, max_duration=100 sec)                                                                                                                                 │ │
│ │                                                                                                                                                                                        │ │
│ │                                                                                                                                                                                        │ │
│ │ Requests Data by Benchmark                                                                                                                                                             │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓                                                                                │ │
│ │ ┃ Benchmark                 ┃ Requests Completed ┃ Request Failed ┃ Duration  ┃ Start Time ┃ End Time ┃                                                                                │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩                                                                                │ │
│ │ │ constant@8.00 req/sec     │ 12/12              │ 0/12           │ 90.03 sec │ 21:46:41   │ 21:48:11 │                                                                                │ │
│ │ └───────────────────────────┴────────────────────┴────────────────┴───────────┴────────────┴──────────┘                                                                                │ │
│ │                                                                                                                                                                                        │ │
│ │ Tokens Data by Benchmark                                                                                                                                                               │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓                                                                  │ │
│ │ ┃ Benchmark                 ┃ Prompt ┃ Prompt (1%, 5%, 50%, 95%, 99%)    ┃ Output ┃ Output (1%, 5%, 50%, 95%, 99%)  ┃                                                                  │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩                                                                  │ │
│ │ │ constant@8.00 req/sec     │ 128.25 │ 128.0, 128.0, 128.0, 129.0, 129.0 │ 117.42 │ 56.3, 65.5, 128.0, 128.0, 128.0 │                                                                  │ │
│ │ └───────────────────────────┴────────┴───────────────────────────────────┴────────┴─────────────────────────────────┘                                                                  │ │
│ │                                                                                                                                                                                        │ │
│ │ Performance Stats by Benchmark                                                                                                                                                         │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │ │
│ │ ┃                           ┃ Request Latency [1%, 5%, 10%, 50%, 90%, 95%,     ┃ Time to First Token [1%, 5%, 10%, 50%, 90%, 95%, ┃ Inter Token Latency [1%, 5%, 10%, 50%, 90%, 95%, ┃ │ │
│ │ ┃ Benchmark                 ┃ 99%] (sec)                                       ┃ 99%] (ms)                                        ┃ 99%] (ms)                                        ┃ │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ │
│ │ │ constant@8.00 req/sec     │ 7.76, 7.83, 7.91, 10.56, 16.01, 16.60, 17.16     │ 828.5, 830.6, 833.0, 4789.5, 9004.5, 9510.3,     │ 49.7, 51.1, 51.8, 55.0, 66.4, 70.4, 75.6         │ │ │
│ │ │                           │                                                  │ 9994.8                                           │                                                  │ │ │
│ │ └───────────────────────────┴──────────────────────────────────────────────────┴──────────────────────────────────────────────────┴──────────────────────────────────────────────────┘ │ │
│ │                                                                                                                                                                                        │ │
│ │ Performance Summary by Benchmark                                                                                                                                                       │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓                                            │ │
│ │ ┃ Benchmark                 ┃ Requests per Second ┃ Request Latency ┃ Time to First Token ┃ Inter Token Latency ┃ Output Token Throughput ┃                                            │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━┩                                            │ │
│ │ │ constant@8.00 req/sec     │ 0.13 req/sec        │ 11.60 sec       │ 4941.58 ms          │ 57.20 ms            │ 15.65 tokens/sec        │                                            │ │
│ │ └───────────────────────────┴─────────────────────┴─────────────────┴─────────────────────┴─────────────────────┴─────────────────────────┘                                            │ │
│ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

@parfeniukink parfeniukink self-assigned this Feb 25, 2025
@parfeniukink parfeniukink marked this pull request as draft February 25, 2025 20:01
@parfeniukink parfeniukink removed the request for review from markurtz February 25, 2025 20:01
@markurtz (Member) commented

Closing this out: as implemented, all this flag does is run a set number of requests, equal to the batch size, in parallel. To add true batch support, we would either need to run vLLM locally or go through the OpenAI batch processing API, which is a significant expansion in scope and work.
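The distinction in the closing comment can be sketched in a few lines: capping in-flight requests with a semaphore is client-side parallelism, not server-side batching of a single request. A minimal illustration, with all names hypothetical (`send_request` stands in for the real HTTP call to the backend):

```python
import asyncio

async def send_request(i: int) -> str:
    # Stand-in for an HTTP call to the backend; a real client would
    # POST to /v1/completions here.
    await asyncio.sleep(0.01)
    return f"response-{i}"

async def run_with_batch_size(num_requests: int, batch_size: int) -> list[str]:
    # At most batch_size requests are in flight at once. Each request
    # is still an independent call, so the server sees no batch.
    sem = asyncio.Semaphore(batch_size)

    async def bounded(i: int) -> str:
        async with sem:
            return await send_request(i)

    return await asyncio.gather(*(bounded(i) for i in range(num_requests)))

results = asyncio.run(run_with_batch_size(num_requests=12, batch_size=2))
```

True batching would instead require the backend (e.g. a locally run vLLM engine or the OpenAI batch API) to group multiple prompts into one forward pass, which is the scope expansion the comment refers to.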

@markurtz markurtz closed this Mar 10, 2025
@github-project-automation github-project-automation bot moved this from In progress to Done in GuideLLM Kanban Board Mar 10, 2025
@markurtz markurtz deleted the parfeniukink/batch-size-cli-parameter branch April 21, 2025 15:02
Labels
load-request, workstream
Projects
Status: Done
3 participants