Inconsistent Arguments Between llama-bench and llama-cli / llama-server #23979

d-shehu · 2026-06-01T16:30:32Z

Perhaps I'm missing something, but is there a reason I can't run llama-bench with the same arguments I use with llama-cli and llama-server? I had to strip out everything after -ngl, and then it failed to load because it wasn't using both GPUs.

Everything works fine with the CLI and server, and -ngl is clearly listed as supported. It's been a while (6+ months) since I last used the benchmark, but I don't recall having to customize the arguments. Thanks.

OS: Ubuntu 24.04
Release: b9222

error: invalid parameter for argument: -ngl

./llama-bench --model ./models/models--unsloth--Qwen3.6-27B-GGUF/UD-Q8_K_XL/Qwen3.6-27B-UD-Q8_K_XL.gguf \
  -t 16 \
  -dev Vulkan0,Vulkan1 \
  --tensor-split 30,30 \
  --mmap 0 \
  --no-warmup \
  -ngl all \
  -sm layer \
  --fit off \
  -fa auto \
  -ub 2048 -b 2048 \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  --spec-draft-ngl all \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --spec-draft-type-k q8_0 \
  --spec-draft-type-v q8_0

Usage

usage: ./llama-bench [options]

options:
  -h, --help
  --numa <distribute|isolate|numactl>         numa mode (default: disabled)
  -r, --repetitions <n>                       number of times to repeat each test (default: 5)
  --prio <-1|0|1|2|3>                         process/thread priority (default: 0)
  --delay <0...N> (seconds)                   delay between each test (default: 0)
  -o, --output <csv|json|jsonl|md|sql>        output format printed to stdout (default: md)
  -oe, --output-err <csv|json|jsonl|md|sql>   output format printed to stderr (default: none)
  --list-devices                              list available devices and exit
  -v, --verbose                               verbose output
  --progress                                  print test progress indicators
  --no-warmup                                 skip warmup runs before benchmarking
  -fitt, --fit-target <MiB>                   fit model to device memory with this margin per device in MiB (default: off)
  -fitc, --fit-ctx <n>                        minimum ctx size for --fit-target (default: 4096)
  -rpc, --rpc <rpc_servers>                   register RPC devices (comma separated)

test parameters:
  -m, --model <filename>                      (default: models/7B/ggml-model-q4_0.gguf)
  -hf, -hfr, --hf-repo <user>/<model>[:quant] Hugging Face model repository; quant is optional, case-insensitive
                                              default to Q4_K_M, or falls back to the first file in the repo if Q4_K_M doesn't exist.
                                              example: ggml-org/GLM-4.7-Flash-GGUF:Q4_K_M
                                              (default: unused)
  -hff, --hf-file <file>                      Hugging Face model file. If specified, it will override the quant in --hf-repo
                                              (default: unused)
  -hft, --hf-token <token>                    Hugging Face access token
                                              (default: value from HF_TOKEN environment variable)
  -p, --n-prompt <n>                          (default: 512)
  -n, --n-gen <n>                             (default: 128)
  -pg <pp,tg>                                 (default: )
  -d, --n-depth <n>                           (default: 0)
  -b, --batch-size <n>                        (default: 2048)
  -ub, --ubatch-size <n>                      (default: 512)
  -ctk, --cache-type-k <t>                    (default: f16)
  -ctv, --cache-type-v <t>                    (default: f16)
  -t, --threads <n>                           (default: 16)
  -C, --cpu-mask <hex,hex>                    (default: 0x0)
  --cpu-strict <0|1>                          (default: 0)
  --poll <0...100>                            (default: 50)
  -ngl, --n-gpu-layers <n>                    (default: 99)
  -ncmoe, --n-cpu-moe <n>                     (default: 0)
  -sm, --split-mode <none|layer|row|tensor>   (default: layer)
  -mg, --main-gpu <i>                         (default: 0)
  -nkvo, --no-kv-offload <0|1>                (default: 0)
  -fa, --flash-attn <0|1>                     (default: 0)
  -dev, --device <dev0/dev1/...>              (default: auto)
  -mmp, --mmap <0|1>                          (default: 1)
  -dio, --direct-io <0|1>                     (default: 0)
  -embd, --embeddings <0|1>                   (default: 0)
  -ts, --tensor-split <ts0/ts1/..>            (default: 0)
  -ot --override-tensor <tensor name pattern>=<buffer type>;...
                                              (default: disabled)
  -nopo, --no-op-offload <0|1>                (default: 0)
  --no-host <0|1>                             (default: 0)

Answered by Manoj-Gujare

Manoj-Gujare · 2026-06-01T17:36:29Z

Hi @d-shehu ,

llama-bench doesn't share the argument parser used by llama-cli and llama-server. It's a standalone benchmarking tool with its own (intentionally smaller) option set, defined separately in tools/llama-bench/llama-bench.cpp rather than through the shared common args. So nothing is broken on your end — several of your flags either don't exist in bench or use different conventions, and the command just needs to be translated.

There are three separate issues in your command.

1. `-ngl all` must be a number

The flag is supported, but on b9222 bench's -ngl / --n-gpu-layers takes an integer <n> (default 99), not the auto / all keyword that cli/server accept. The error invalid parameter for argument: -ngl is rejecting the value, not the flag. Use -ngl 99 (anything ≥ the model's layer count offloads everything).

On current master this was relaxed (bench now lists -ngl default -1), so keyword behavior may arrive in a later build — but your b9222 binary needs an explicit number.
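
If the same launch script feeds both llama-server and llama-bench, one way to handle the difference might be a small wrapper that turns the keyword into a number before calling bench. This is only a sketch: `normalize_ngl` is a made-up helper name, not something llama.cpp ships, and 99 is simply the bench default shown in the help output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of llama.cpp): convert a cli/server-style
# -ngl value into the explicit integer that the b9222 bench expects.
normalize_ngl() {
  case "$1" in
    all)
      # 99 is bench's own default per the help output quoted above.
      echo 99
      ;;
    ''|*[!0-9]*)
      # Empty or non-numeric values (for example "auto") have no assumed
      # equivalent here, so report them instead of guessing.
      echo "unsupported -ngl value for llama-bench: $1" >&2
      return 1
      ;;
    *)
      # Already a plain non-negative integer: pass it through unchanged.
      echo "$1"
      ;;
  esac
}

normalize_ngl all   # prints 99
normalize_ngl 40    # prints 40
```

The result can then be passed along as `-ngl "$(normalize_ngl all)"`.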

2. This is why it stopped using both GPUs — bench uses `/`, not `,`, for multi-GPU

In llama-bench a comma means "run another benchmark." The tool takes the cartesian product of every comma-separated value, so:

-dev Vulkan0,Vulkan1 → one test on Vulkan0, then a separate test on Vulkan1 — both single-GPU. That's exactly why neither run used both cards.
--tensor-split 30,30 → a test with split 30, then another with split 30 — not a 30/30 split.

Devices and tensor-splits that belong to a single run are slash-separated, which is why the help shows -dev <dev0/dev1/...> and -ts <ts0/ts1/..>. This is the behavior introduced when --device was added to bench in #16039: comma benchmarks each device separately, slash combines them into one run. So you want:

-dev Vulkan0/Vulkan1 -ts 30/30 -sm layer
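
If you keep your device and split lists in cli/server form, the translation is mechanical. A minimal sketch, assuming every comma in the value is meant as "same run" rather than "another benchmark" (`to_bench_list` is an illustrative name, not a llama.cpp tool):

```shell
#!/bin/sh
# Hypothetical helper: rewrite a cli/server-style comma list into the
# slash-separated form that llama-bench treats as one multi-GPU run.
to_bench_list() {
  printf '%s\n' "$1" | tr ',' '/'
}

dev=$(to_bench_list "Vulkan0,Vulkan1")
ts=$(to_bench_list "30,30")

# Print the translated fragment instead of running anything.
echo "-dev $dev -ts $ts -sm layer"
# prints: -dev Vulkan0/Vulkan1 -ts 30/30 -sm layer
```

Commas stay useful in bench when you really do want separate tests, so this conversion should only be applied to values that describe a single run.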

3. Flags that don't exist in bench (remove them)

--fit off — there's no --fit. Bench's fit feature is -fitt / --fit-target <MiB> and -fitc / --fit-ctx, and it's off by default, so just drop this.
-fa auto — on b9222 bench's flag is -fa, --flash-attn <0|1>. Use -fa 1. (master later changed this to <on|off|auto>, but your build only accepts 0/1.)
--spec-type, --spec-draft-n-max, --spec-draft-ngl, --spec-draft-type-k, --spec-draft-type-v — none of the speculative-decoding options exist in llama-bench. It has no draft-model path, so you can't benchmark spec-decoding through it; it only benches the target model on its own. To measure spec-decoding throughput, time it via llama-cli / llama-server instead.

--cache-type-k / --cache-type-v are fine — those long forms exist in bench as -ctk / -ctv.
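
For a shared argument list, the removals above could be scripted roughly like this. It is a sketch under two assumptions: each dropped flag is followed by exactly one value (true for the flags in the original command), and no argument contains spaces. `strip_unsupported` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical filter: drop --fit and the --spec-* options, each together
# with the single value that follows it, and keep everything else in order.
strip_unsupported() {
  skip=0
  out=""
  for arg in "$@"; do
    if [ "$skip" -eq 1 ]; then
      # This argument is the value of a flag we just dropped.
      skip=0
      continue
    fi
    case "$arg" in
      --spec-*|--fit)
        skip=1
        ;;
      *)
        # Append with a single space separator (none before the first item).
        out="${out:+$out }$arg"
        ;;
    esac
  done
  printf '%s\n' "$out"
}

strip_unsupported -ngl 99 --fit off --spec-type draft-mtp --spec-draft-n-max 3 -fa 1
# prints: -ngl 99 -fa 1
```

The filter only removes flags; values such as `-ngl all` or `-fa auto` still need to be converted separately.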

Corrected command

./llama-bench \
  --model ./models/models--unsloth--Qwen3.6-27B-GGUF/UD-Q8_K_XL/Qwen3.6-27B-UD-Q8_K_XL.gguf \
  -t 16 \
  -dev Vulkan0/Vulkan1 \
  -ts 30/30 \
  -sm layer \
  -ngl 99 \
  -fa 1 \
  -ub 2048 -b 2048 \
  --mmap 0 \
  --no-warmup \
  --cache-type-k q8_0 --cache-type-v q8_0

./llama-bench --help (the output you pasted) is the authoritative list of what your build accepts — anything not in it is a cli/server-only flag.
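
A quick way to check a flag against that list before running is to grep the help text for it as a whole token. The sketch below uses a three-line excerpt of the help output pasted above so it runs without the binary; `flag_in_help` is an illustrative helper, and with the real tool you would pipe `./llama-bench --help` into it (adding `2>&1` in case your build prints usage to stderr, which I have not checked).

```shell
#!/bin/sh
# Hypothetical helper: succeed if flag $1 appears as a complete token in the
# help text read from stdin. Requiring whitespace, a comma, or a line edge on
# both sides keeps --fit from matching --fit-target.
flag_in_help() {
  grep -qE -e "(^|[[:space:],])$1([[:space:],]|\$)"
}

# Excerpt copied from the help output quoted earlier in this thread.
help_excerpt='  -fitt, --fit-target <MiB>
  -ngl, --n-gpu-layers <n>
  -fa, --flash-attn <0|1>'

for flag in -ngl --fit --fit-target --spec-type; do
  if printf '%s\n' "$help_excerpt" | flag_in_help "$flag"; then
    echo "$flag: listed"
  else
    echo "$flag: not listed"
  fi
done
```

Against this excerpt, `-ngl` and `--fit-target` are reported as listed, while `--fit` and `--spec-type` are not.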

d-shehu Jun 1, 2026
Author

Thank you for the quick reply. I came up with something similar but it's great to get confirmation. And the explanation is useful!

Spec decoding aside, I was hoping to use llama-bench to sanity check my prompt processing performance which seemed unusually slow with Hermes agent.

I hoped to test with large contexts (32K -> 256K) to see the degradation in a controlled test. But I can only get up to 16K working, even though 256K works perfectly with llama-server without spilling over to RAM.

Also, adding the KV cache quantization flags from your command causes an error for me:

--cache-type-k q8_0 --cache-type-v q8_0

Error:

main: error: failed to create context with model '/home/automaton/models/models--unsloth--Qwen3.6-27B-GGUF/UD-Q8_K_XL/Qwen3.6-27B-UD-Q8_K_XL.gguf'

d-shehu · 2026-06-01T18:18:01Z

There also seems to be a bug where llama-bench doesn't work with q8_0 for the KV cache. This prevents benchmarking with more realistic context sizes, i.e. greater than 16K.

32K and above causes it to spill over into system RAM and start churning the CPUs, despite there being plenty of free VRAM even with an FP16 KV cache.

Nor does it seem to work with speculative decoding. Only trivial cases like this seem to work:

./llama-bench -m ./models/models--unsloth--Qwen3.6-27B-GGUF/UD-Q8_K_XL/Qwen3.6-27B-UD-Q8_K_XL.gguf  -t 16   -dev Vulkan0/Vulkan1   -ts 30/30   --mmap 0   -ngl 999   -sm layer  -fa auto   -ub 2048 -b 2048 -p 2048 -r 1
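
For the context-degradation test, one option that might be worth trying is the -d / --n-depth parameter from the help output, keeping -p fixed. This rests on an assumption the help text does not spell out: that -d sets how much context is already filled before the measured prompt. Since comma-separated values run as separate tests, a sweep can be passed as one list. The sketch below only builds and prints that list; `depth_list` is a hypothetical helper, and it does nothing about the RAM spill-over itself.

```shell
#!/bin/sh
# Hypothetical helper: build a comma-separated list of depths, doubling from
# $1 up to and including $2, for use as a single -d argument.
depth_list() {
  d=$1
  list=""
  # Refuse a non-positive start, which would never grow by doubling.
  [ "$d" -gt 0 ] || return 1
  while [ "$d" -le "$2" ]; do
    list="${list:+$list,}$d"
    d=$((d * 2))
  done
  printf '%s\n' "$list"
}

depths=$(depth_list 32768 262144)
echo "$depths"
# prints: 32768,65536,131072,262144

# Print the flags to append to the bench command rather than running it here.
echo "-p 2048 -d $depths -r 1"
```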
