Does llama.cpp ACTUALLY support pipeline parallelism? #20252

marlin-oss · 2026-03-08T23:11:51Z

marlin-oss
Mar 8, 2026

I'm really scratching my head here. The log says "llama_context: pipeline parallelism enabled". As far as I can tell, with layer split, it's only "batch parallel" or "pipeline sequential". Based on my understanding of the term "pipeline parallel", a model split between N GPUs should be able to process N concurrent requests "roughly" N times faster than a single request (minus overhead).

For example, with 2 GPUs and 2 Requests (prompt processing):

Step 1:
- GPU 1: processing request 1
- GPU 2: idle
Step 2:
- GPU 1: processing request 2
- GPU 2: processing request 1
Step 3:
- GPU 1: processing request 1
- GPU 2: processing request 2
...
Last step:
- GPU 1: idle
- GPU 2: processing request 2

While one gpu is idle, waiting for the others, it starts processing the next request - like a pipeline. I do not see this behavior. Only 1 GPU is ever processing at any given time. This makes token generation faster, but has minimal benefit for prompt processing.
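
The schedule described above can be sketched as a small simulation. This is a minimal illustration of an idealized pipeline, not llama.cpp's scheduler: the function names pipeline_schedule and ideal_speedup are hypothetical, and the sketch assumes every stage takes one equal time step with no transfer overhead.

```python
def pipeline_schedule(n_stages, n_items):
    """Return one row per time step; each row lists the item index
    each stage works on during that step (None means the stage is idle)."""
    n_steps = n_stages + n_items - 1
    schedule = []
    for step in range(n_steps):
        row = []
        for stage in range(n_stages):
            item = step - stage  # item i reaches stage s at step i + s
            row.append(item if 0 <= item < n_items else None)
        schedule.append(row)
    return schedule


def ideal_speedup(n_stages, n_items):
    """Ratio of back-to-back steps to pipelined steps (assumes equal-cost stages)."""
    sequential_steps = n_stages * n_items
    pipelined_steps = n_stages + n_items - 1
    return sequential_steps / pipelined_steps


if __name__ == "__main__":
    for step, row in enumerate(pipeline_schedule(2, 2), start=1):
        cells = ["idle" if item is None else f"request {item + 1}" for item in row]
        print(f"step {step}: " + ", ".join(f"GPU {g + 1}: {c}" for g, c in enumerate(cells)))
    print(f"ideal speedup: {ideal_speedup(2, 2):.2f}x")
```

With 2 stages and 2 items this gives 3 steps instead of 4; the ratio approaches the number of stages only as more items are kept in flight.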

I've tried every combination of batch, physical batch, continuous batching, parallel, and other flags I can think of. Am I missing something here? Is there a build flag?

Any help is much appreciated.

ggerganov · 2026-03-09T09:03:39Z

ggerganov
Mar 9, 2026
Maintainer

Yes, it is supported. You can read more about how it works in #6017. If you configure it correctly, the PP performance scales nearly linearly with the number of devices, even for a single request.

3 replies

marlin-oss Mar 9, 2026
Author

Thanks for the reply, I appreciate it.
If I understand the flags correctly, setting batch = ubatch should essentially disable pipeline parallelism?
With 2 GPUs and layer split, I see zero performance difference between -ub 512 -b 512 and -ub 512 -b 1024 (or larger or smaller overall) in llama-server or llama-batched-bench. Does this indicate that it isn't being enabled on my system?
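
One way to reason about what these two settings should do, as a rough sketch under stated assumptions: if a logical batch of n_batch tokens is split into ceil(n_batch / n_ubatch) micro-batches, and the devices behave like an idealized pipeline with equal-cost stages, the best-case speedup over running the stages back to back is M * S / (M + S - 1) for M micro-batches and S devices. The helper names below are hypothetical, and the model ignores transfer and synchronization overhead; it does not describe llama.cpp's actual scheduler.

```python
import math


def n_microbatches(n_batch, n_ubatch):
    """Number of micro-batches a logical batch is split into."""
    return math.ceil(n_batch / n_ubatch)


def ideal_pipeline_speedup(n_devices, n_batch, n_ubatch):
    """Best-case speedup of an idealized pipeline with equal-cost stages."""
    m = n_microbatches(n_batch, n_ubatch)
    return (m * n_devices) / (m + n_devices - 1)


if __name__ == "__main__":
    for n_batch in (512, 1024, 8192):
        m = n_microbatches(n_batch, 512)
        s = ideal_pipeline_speedup(n_devices=2, n_batch=n_batch, n_ubatch=512)
        print(f"-b {n_batch} -ub 512: {m} micro-batch(es), ideal speedup {s:.2f}x")
```

Under this model, -ub 512 -b 512 leaves nothing to overlap (a single micro-batch), while -ub 512 -b 1024 would allow at most about 1.33x on 2 devices, so identical timings for the two runs are not what the idealized model predicts.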

In llama-batched-bench -ub 512 -b 512, I see:

main: n_kv_max = 65536, n_batch = 512, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = 999, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |   19.837 |   412.96 |    3.483 |     9.19 |   23.320 |   352.66 |
|  8192 |     32 |    2 |  16448 |   38.933 |   420.82 |    3.729 |    17.16 |   42.662 |   385.54 |

-ub 512 -b 1024

main: n_kv_max = 65536, n_batch = 1024, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = 999, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |   20.005 |   409.51 |    3.434 |     9.32 |   23.438 |   350.88 |
|  8192 |     32 |    2 |  16448 |   39.201 |   417.95 |    3.688 |    17.35 |   42.889 |   383.50 |

This matches the performance I see in llama-server

Full log -ub 512 -b 512

./llama-batched-bench --device CUDA1,CUDA3 --model '/mnt/tmpfs/Devstral-Small-2-24B-Instruct-2512-UD-Q8_K_XL.gguf' -ngl 999 -c 65536 -ub 512 -b 512 -npp 8192 -ntg 32 -npl 1,2 -ts 1,1 -sm layer
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
build: 8233 (c5a778891) with GNU 12.3.0 for Linux x86_64
llama_model_load_from_file_impl: using device CUDA1 (Tesla P40) (0000:01:00.0) - 24292 MiB free
llama_model_load_from_file_impl: using device CUDA3 (Tesla P40) (0000:03:00.0) - 24292 MiB free
llama_model_loader: loaded meta data with 54 key-value pairs and 363 tensors from /mnt/tmpfs/Devstral-Small-2-24B-Instruct-2512-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mistral3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Devstral-Small-2-24B-Instruct-2512
llama_model_loader: - kv   3:                            general.version str              = 2512
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Devstral-Small-2-24B-Instruct-2512
llama_model_loader: - kv   6:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   7:                         general.size_label str              = 24B
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  10:                   general.base_model.count u32              = 1
llama_model_loader: - kv  11:                  general.base_model.0.name str              = Devstral Small 2 24B Instruct 2512
llama_model_loader: - kv  12:               general.base_model.0.version str              = 2512
llama_model_loader: - kv  13:          general.base_model.0.organization str              = Mistralai
llama_model_loader: - kv  14:              general.base_model.0.repo_url str              = https://huggingface.co/mistralai/Devs...
llama_model_loader: - kv  15:                               general.tags arr[str,2]       = ["mistral-common", "unsloth"]
llama_model_loader: - kv  16:                       mistral3.block_count u32              = 40
llama_model_loader: - kv  17:                    mistral3.context_length u32              = 393216
llama_model_loader: - kv  18:                  mistral3.embedding_length u32              = 5120
llama_model_loader: - kv  19:               mistral3.feed_forward_length u32              = 32768
llama_model_loader: - kv  20:              mistral3.attention.head_count u32              = 32
llama_model_loader: - kv  21:           mistral3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  22:                    mistral3.rope.freq_base f32              = 100000000.000000
llama_model_loader: - kv  23:  mistral3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  24:              mistral3.attention.key_length u32              = 128
llama_model_loader: - kv  25:            mistral3.attention.value_length u32              = 128
llama_model_loader: - kv  26:              mistral3.rope.dimension_count u32              = 128
llama_model_loader: - kv  27:                 mistral3.rope.scaling.type str              = yarn
llama_model_loader: - kv  28:               mistral3.rope.scaling.factor f32              = 48.000000
llama_model_loader: - kv  29:       mistral3.rope.scaling.yarn_beta_fast f32              = 32.000000
llama_model_loader: - kv  30:       mistral3.rope.scaling.yarn_beta_slow f32              = 1.000000
llama_model_loader: - kv  31:  mistral3.rope.scaling.yarn_log_multiplier f32              = 1.000000
llama_model_loader: - kv  32: mistral3.rope.scaling.original_context_length u32              = 8192
llama_model_loader: - kv  33:       mistral3.attention.temperature_scale f32              = 0.100000
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = tekken
llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,269443]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  39:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 11
llama_model_loader: - kv  41:                      tokenizer.ggml.tokens arr[str,131072]  = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  42:                      tokenizer.ggml.scores arr[i32,131072]  = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
llama_model_loader: - kv  43:                  tokenizer.ggml.token_type arr[i32,131072]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  44:                        mistral3.vocab_size u32              = 131072
llama_model_loader: - kv  45:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  46:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  47:                    tokenizer.chat_template str              = {#- Unsloth template fixes #}\n{%- set...
llama_model_loader: - kv  48:               general.quantization_version u32              = 2
llama_model_loader: - kv  49:                          general.file_type u32              = 7
llama_model_loader: - kv  50:                      quantize.imatrix.file str              = Devstral-Small-2-24B-Instruct-2512-GG...
llama_model_loader: - kv  51:                   quantize.imatrix.dataset str              = unsloth_calibration_Devstral-Small-2-...
llama_model_loader: - kv  52:             quantize.imatrix.entries_count u32              = 280
llama_model_loader: - kv  53:              quantize.imatrix.chunks_count u32              = 75
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:   67 tensors
llama_model_loader: - type q8_0:  215 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 26.99 GiB (9.84 BPW)
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 1000
load: token to piece cache size = 0.8498 MB
print_info: arch                  = mistral3
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 393216
print_info: n_embd                = 5120
print_info: n_embd_inp            = 5120
print_info: n_layer               = 40
print_info: n_head                = 32
print_info: n_head_kv             = 8
print_info: n_rot                 = 128
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 128
print_info: n_embd_head_v         = 128
print_info: n_gqa                 = 4
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-05
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 32768
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 0
print_info: rope scaling          = yarn
print_info: freq_base_train       = 100000000.0
print_info: freq_scale_train      = 0.0208333
print_info: n_ctx_orig_yarn       = 8192
print_info: rope_yarn_log_mul     = 1.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 14B
print_info: model params          = 23.57 B
print_info: general.name          = Devstral-Small-2-24B-Instruct-2512
print_info: vocab type            = BPE
print_info: n_vocab               = 131072
print_info: n_merges              = 269443
print_info: BOS token             = 1 '<s>'
print_info: EOS token             = 2 '</s>'
print_info: UNK token             = 0 '<unk>'
print_info: PAD token             = 11 '<pad>'
print_info: LF token              = 1010 'Ċ'
print_info: EOG token             = 2 '</s>'
print_info: max token length      = 150
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 39 repeating layers to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1280.00 MiB
load_tensors:        CUDA1 model buffer size = 13345.20 MiB
load_tensors:        CUDA3 model buffer size = 13016.07 MiB
.........................................................................................
llama_context: constructing llama_context
llama_context: setting new yarn_attn_factor = 1.0000 (mscale == 1.0, mscale_all_dim = 1.0)
llama_context: n_seq_max     = 2
llama_context: n_ctx         = 65536
llama_context: n_ctx_seq     = 32768
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 100000000.0
llama_context: freq_scale    = 0.0208333
llama_context: n_ctx_seq (32768) < n_ctx_train (393216) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     1.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =  5376.00 MiB
llama_kv_cache:      CUDA3 KV buffer size =  4864.00 MiB
llama_kv_cache: size = 10240.00 MiB ( 32768 cells,  40 layers,  2/2 seqs), K (f16): 5120.00 MiB, V (f16): 5120.00 MiB
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve:      CUDA1 compute buffer size =   556.05 MiB
sched_reserve:      CUDA3 compute buffer size =   434.05 MiB
sched_reserve:  CUDA_Host compute buffer size =   276.06 MiB
sched_reserve: graph nodes  = 1367
sched_reserve: graph splits = 3
sched_reserve: reserve took 221.00 ms, sched copies = 4

main: n_kv_max = 65536, n_batch = 512, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = 999, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |   19.837 |   412.96 |    3.483 |     9.19 |   23.320 |   352.66 |
|  8192 |     32 |    2 |  16448 |   38.933 |   420.82 |    3.729 |    17.16 |   42.662 |   385.54 |

llama_perf_context_print:        load time =   10751.65 ms
llama_perf_context_print: prompt eval time =   62784.77 ms / 24656 tokens (    2.55 ms per token,   392.71 tokens per second)
llama_perf_context_print:        eval time =    3482.22 ms /    32 runs   (  108.82 ms per token,     9.19 tokens per second)
llama_perf_context_print:       total time =   76738.57 ms / 24688 tokens
llama_perf_context_print:    graphs reused =          0

Full log -ub 512 -b 1024

./llama-batched-bench --device CUDA1,CUDA3 --model '/mnt/tmpfs/Devstral-Small-2-24B-Instruct-2512-UD-Q8_K_XL.gguf' -ngl 999 -c 65536 -ub 512 -b 1024 -npp 8192 -ntg 32 -npl 1,2 -ts 1,1 -sm layer
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
build: 8233 (c5a778891) with GNU 12.3.0 for Linux x86_64
llama_model_load_from_file_impl: using device CUDA1 (Tesla P40) (0000:01:00.0) - 24292 MiB free
llama_model_load_from_file_impl: using device CUDA3 (Tesla P40) (0000:03:00.0) - 24292 MiB free
llama_model_loader: loaded meta data with 54 key-value pairs and 363 tensors from /mnt/tmpfs/Devstral-Small-2-24B-Instruct-2512-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mistral3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Devstral-Small-2-24B-Instruct-2512
llama_model_loader: - kv   3:                            general.version str              = 2512
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Devstral-Small-2-24B-Instruct-2512
llama_model_loader: - kv   6:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   7:                         general.size_label str              = 24B
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  10:                   general.base_model.count u32              = 1
llama_model_loader: - kv  11:                  general.base_model.0.name str              = Devstral Small 2 24B Instruct 2512
llama_model_loader: - kv  12:               general.base_model.0.version str              = 2512
llama_model_loader: - kv  13:          general.base_model.0.organization str              = Mistralai
llama_model_loader: - kv  14:              general.base_model.0.repo_url str              = https://huggingface.co/mistralai/Devs...
llama_model_loader: - kv  15:                               general.tags arr[str,2]       = ["mistral-common", "unsloth"]
llama_model_loader: - kv  16:                       mistral3.block_count u32              = 40
llama_model_loader: - kv  17:                    mistral3.context_length u32              = 393216
llama_model_loader: - kv  18:                  mistral3.embedding_length u32              = 5120
llama_model_loader: - kv  19:               mistral3.feed_forward_length u32              = 32768
llama_model_loader: - kv  20:              mistral3.attention.head_count u32              = 32
llama_model_loader: - kv  21:           mistral3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  22:                    mistral3.rope.freq_base f32              = 100000000.000000
llama_model_loader: - kv  23:  mistral3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  24:              mistral3.attention.key_length u32              = 128
llama_model_loader: - kv  25:            mistral3.attention.value_length u32              = 128
llama_model_loader: - kv  26:              mistral3.rope.dimension_count u32              = 128
llama_model_loader: - kv  27:                 mistral3.rope.scaling.type str              = yarn
llama_model_loader: - kv  28:               mistral3.rope.scaling.factor f32              = 48.000000
llama_model_loader: - kv  29:       mistral3.rope.scaling.yarn_beta_fast f32              = 32.000000
llama_model_loader: - kv  30:       mistral3.rope.scaling.yarn_beta_slow f32              = 1.000000
llama_model_loader: - kv  31:  mistral3.rope.scaling.yarn_log_multiplier f32              = 1.000000
llama_model_loader: - kv  32: mistral3.rope.scaling.original_context_length u32              = 8192
llama_model_loader: - kv  33:       mistral3.attention.temperature_scale f32              = 0.100000
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = tekken
llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,269443]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  39:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 11
llama_model_loader: - kv  41:                      tokenizer.ggml.tokens arr[str,131072]  = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  42:                      tokenizer.ggml.scores arr[i32,131072]  = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
llama_model_loader: - kv  43:                  tokenizer.ggml.token_type arr[i32,131072]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  44:                        mistral3.vocab_size u32              = 131072
llama_model_loader: - kv  45:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  46:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  47:                    tokenizer.chat_template str              = {#- Unsloth template fixes #}\n{%- set...
llama_model_loader: - kv  48:               general.quantization_version u32              = 2
llama_model_loader: - kv  49:                          general.file_type u32              = 7
llama_model_loader: - kv  50:                      quantize.imatrix.file str              = Devstral-Small-2-24B-Instruct-2512-GG...
llama_model_loader: - kv  51:                   quantize.imatrix.dataset str              = unsloth_calibration_Devstral-Small-2-...
llama_model_loader: - kv  52:             quantize.imatrix.entries_count u32              = 280
llama_model_loader: - kv  53:              quantize.imatrix.chunks_count u32              = 75
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:   67 tensors
llama_model_loader: - type q8_0:  215 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 26.99 GiB (9.84 BPW)
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 1000
load: token to piece cache size = 0.8498 MB
print_info: arch                  = mistral3
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 393216
print_info: n_embd                = 5120
print_info: n_embd_inp            = 5120
print_info: n_layer               = 40
print_info: n_head                = 32
print_info: n_head_kv             = 8
print_info: n_rot                 = 128
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 128
print_info: n_embd_head_v         = 128
print_info: n_gqa                 = 4
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-05
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 32768
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 0
print_info: rope scaling          = yarn
print_info: freq_base_train       = 100000000.0
print_info: freq_scale_train      = 0.0208333
print_info: n_ctx_orig_yarn       = 8192
print_info: rope_yarn_log_mul     = 1.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 14B
print_info: model params          = 23.57 B
print_info: general.name          = Devstral-Small-2-24B-Instruct-2512
print_info: vocab type            = BPE
print_info: n_vocab               = 131072
print_info: n_merges              = 269443
print_info: BOS token             = 1 '<s>'
print_info: EOS token             = 2 '</s>'
print_info: UNK token             = 0 '<unk>'
print_info: PAD token             = 11 '<pad>'
print_info: LF token              = 1010 'Ċ'
print_info: EOG token             = 2 '</s>'
print_info: max token length      = 150
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 39 repeating layers to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1280.00 MiB
load_tensors:        CUDA1 model buffer size = 13345.20 MiB
load_tensors:        CUDA3 model buffer size = 13016.07 MiB
.........................................................................................
llama_context: constructing llama_context
llama_context: setting new yarn_attn_factor = 1.0000 (mscale == 1.0, mscale_all_dim = 1.0)
llama_context: n_seq_max     = 2
llama_context: n_ctx         = 65536
llama_context: n_ctx_seq     = 32768
llama_context: n_batch       = 1024
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 100000000.0
llama_context: freq_scale    = 0.0208333
llama_context: n_ctx_seq (32768) < n_ctx_train (393216) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     1.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =  5376.00 MiB
llama_kv_cache:      CUDA3 KV buffer size =  4864.00 MiB
llama_kv_cache: size = 10240.00 MiB ( 32768 cells,  40 layers,  2/2 seqs), K (f16): 5120.00 MiB, V (f16): 5120.00 MiB
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve:      CUDA1 compute buffer size =   556.05 MiB
sched_reserve:      CUDA3 compute buffer size =   434.05 MiB
sched_reserve:  CUDA_Host compute buffer size =   276.06 MiB
sched_reserve: graph nodes  = 1367
sched_reserve: graph splits = 3
sched_reserve: reserve took 222.68 ms, sched copies = 4

main: n_kv_max = 65536, n_batch = 1024, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = 999, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |   20.005 |   409.51 |    3.434 |     9.32 |   23.438 |   350.88 |
|  8192 |     32 |    2 |  16448 |   39.201 |   417.95 |    3.688 |    17.35 |   42.889 |   383.50 |

llama_perf_context_print:        load time =   10581.30 ms
llama_perf_context_print: prompt eval time =   63174.30 ms / 24656 tokens (    2.56 ms per token,   390.29 tokens per second)
llama_perf_context_print:        eval time =    3433.51 ms /    32 runs   (  107.30 ms per token,     9.32 tokens per second)
llama_perf_context_print:       total time =   76913.99 ms / 24688 tokens
llama_perf_context_print:    graphs reused =          0

ggerganov Mar 10, 2026
Maintainer

Hm, not sure. Try running the llama-bench tests from #6017. If you can't reproduce the results, either something regressed or there is something specific to your system.

marlin-oss Mar 11, 2026
Author

I ran the test and saw essentially no difference. I should be seeing some improvement here, right?
nvtop shows a sawtooth pattern where only one gpu is active at any given time.

cmake -B build -DGGML_CUDA=ON  -DGGML_BLAS=OFF -DLLAMA_CURL=OFF -DGGML_SCHED_MAX_COPIES=8

llama-bench --model '/mnt/tmpfs/Qwen3-14B-UD-Q8_K_XL.gguf' -ngl 999 --device CUDA1,CUDA1/CUDA2,CUDA1/CUDA2/CUDA3 -p 512,1024,2048,4096,8192 -b 8192
ggml_cuda_init: found 4 CUDA devices (Total VRAM: 84348 MiB):
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes, VRAM: 11011 MiB (10848 MiB free)
  Device 1: Tesla P40, compute capability 6.1, VMM: yes, VRAM: 24445 MiB (24293 MiB free)
  Device 2: Tesla P40, compute capability 6.1, VMM: yes, VRAM: 24445 MiB (24293 MiB free)
  Device 3: Tesla P40, compute capability 6.1, VMM: yes, VRAM: 24445 MiB (24293 MiB free)

| model          |      size |  params | backend | ngl | n_batch | dev               |   test |           t/s |
| -------------- | --------: | ------: | ------- | --: | ------: | ----------------- | -----: | ------------: |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1             |  pp512 | 410.94 ± 0.16 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1             | pp1024 | 402.04 ± 0.21 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1             | pp2048 | 385.97 ± 0.31 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1             | pp4096 | 356.10 ± 0.42 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1             | pp8192 | 292.27 ± 3.76 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1             |  tg128 |  15.17 ± 0.03 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2       |  pp512 | 407.75 ± 0.73 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2       | pp1024 | 414.67 ± 0.54 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2       | pp2048 | 408.47 ± 0.45 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2       | pp4096 | 381.33 ± 0.18 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2       | pp8192 | 333.17 ± 0.28 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2       |  tg128 |  15.08 ± 0.03 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2/CUDA3 |  pp512 | 407.75 ± 0.78 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2/CUDA3 | pp1024 | 414.82 ± 0.78 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2/CUDA3 | pp2048 | 407.60 ± 0.34 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2/CUDA3 | pp4096 | 380.85 ± 0.33 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2/CUDA3 | pp8192 | 332.99 ± 0.34 |
| qwen3 14B Q8_0 | 17.46 GiB | 14.77 B | CUDA    | 999 |    8192 | CUDA1/CUDA2/CUDA3 |  tg128 |  15.17 ± 0.02 |

build: 4d99d4508 (8279)

I also tested with a 24B model (on 2 and 3 GPUs) and saw a similar lack of improvement.
I also tested with CUDA_SCALE_LAUNCH_QUEUES=4x, which made no difference, for what it's worth.

gaugarg-nv · 2026-03-11T08:44:32Z

gaugarg-nv
Mar 11, 2026
Collaborator

I just tried this on 4xA40 GPUs, and I can see good scaling.

cmake -B build-A40 -DGGML_CUDA=ON
build-A40/bin/llama-bench -m ../models/Qwen3-14B-UD-Q8_K_XL.gguf -ngl 999 --device CUDA1,CUDA1/CUDA2,CUDA1/CUDA2/CUDA3 -p 512,1024,2048,4096,8192 -b 8192
ggml_cuda_init: found 4 CUDA devices (Total VRAM: 181960 MiB):
  Device 0: NVIDIA A40, compute capability 8.6, VMM: yes, VRAM: 45490 MiB (45221 MiB free)
  Device 1: NVIDIA A40, compute capability 8.6, VMM: yes, VRAM: 45490 MiB (45221 MiB free)
  Device 2: NVIDIA A40, compute capability 8.6, VMM: yes, VRAM: 45490 MiB (45221 MiB free)
  Device 3: NVIDIA A40, compute capability 8.6, VMM: yes, VRAM: 45490 MiB (45221 MiB free)
| model                          |       size |     params | backend    | ngl | n_batch | dev          |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------ | --------------: | -------------------: |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1        |           pp512 |      2393.11 ± 57.80 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1        |          pp1024 |       2286.60 ± 6.00 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1        |          pp2048 |       2130.31 ± 4.01 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1        |          pp4096 |       1857.38 ± 2.14 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1        |          pp8192 |       1462.26 ± 0.80 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1        |           tg128 |         31.69 ± 0.02 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2  |           pp512 |      2425.42 ± 17.42 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2  |          pp1024 |       2915.99 ± 1.91 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2  |          pp2048 |       3199.68 ± 1.65 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2  |          pp4096 |       3052.61 ± 0.72 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2  |          pp8192 |       2546.45 ± 0.33 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2  |           tg128 |         31.78 ± 0.01 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2/CUDA3 |           pp512 |      2429.78 ± 11.10 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2/CUDA3 |          pp1024 |       3255.04 ± 2.40 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2/CUDA3 |          pp2048 |       3956.43 ± 3.79 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2/CUDA3 |          pp4096 |       4044.25 ± 0.52 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2/CUDA3 |          pp8192 |       3530.37 ± 1.05 |
| qwen3 14B Q8_0                 |  17.46 GiB |    14.77 B | CUDA       | 999 |    8192 | CUDA1/CUDA2/CUDA3 |           tg128 |         31.78 ± 0.01 |

build: 5f91b1d5d (8286)
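
A quick way to quantify the scaling, as a minimal sketch: the pp8192 throughput values below are copied from the two llama-bench tables in this thread (the Tesla P40 run and the A40 run), and each multi-GPU result is divided by the single-GPU result for the same hardware. The names PP8192 and scaling are hypothetical, and the ± run-to-run deviations are ignored.

```python
# pp8192 throughput in tokens/s, keyed by number of GPUs used,
# copied from the llama-bench tables posted in this thread.
PP8192 = {
    "Tesla P40": {1: 292.27, 2: 333.17, 3: 332.99},
    "A40": {1: 1462.26, 2: 2546.45, 3: 3530.37},
}


def scaling(results):
    """Throughput of each configuration relative to the single-GPU run."""
    base = results[1]
    return {n_gpus: tps / base for n_gpus, tps in results.items()}


if __name__ == "__main__":
    for gpu, results in PP8192.items():
        for n_gpus, factor in scaling(results).items():
            print(f"{gpu} x{n_gpus}: {factor:.2f}x")
```

On these numbers the A40s reach roughly 1.74x with 2 GPUs and 2.41x with 3, while the P40s stay at about 1.14x for both.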

2 replies

marlin-oss Mar 11, 2026
Author

Thanks for testing that. It seems to be related to my hardware.
Are you on a dual-socket system by any chance?
Maybe it's P40 related. If someone running P40s could give it a try that would be great.

I've been looking into P2P, which they supposedly support; it might be causing issues if it's not working properly.
I'll try with just -DGGML_CUDA=ON but I doubt that's it.

gaugarg-nv Mar 11, 2026
Collaborator

Yes, this was a dual-socket system. But I have tested on single-socket systems too in the past, and it worked fine. I don't have access to P40s, so I can't test.

Can you try capturing an Nsight trace for the batch size 8192, pp8192 case and share it here?

dark-penguin · 2026-05-31T17:06:41Z

dark-penguin
May 31, 2026

@marlin-oss Hi! Did you figure it out? I see the same thing on dual RX6800. There is exactly zero difference between -b 1024 -ub 1024 and -b 2048 -ub 1024, or between runs that log "W sched_reserve: compute buffer allocation failed, retrying without pipeline parallelism" and runs that don't.

Seeing that there is a behavior to disable pipeline parallelism if you're low on memory, maybe it's disabled by something else as well, but without logging anything?
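
To at least rule out the logged cases, a captured log can be checked for the messages quoted in this thread. This is a minimal sketch: the three search strings are taken from the logs above, the function name summarize_log is hypothetical, and it only reports what the log says; it cannot detect a code path that disables pipelining without logging anything.

```python
import re

# Message fragments copied from the logs quoted in this thread.
ENABLED = "pipeline parallelism enabled"
FALLBACK = "retrying without pipeline parallelism"
COPIES_RE = re.compile(r"sched copies = (\d+)")


def summarize_log(text):
    """Report which pipeline-parallelism messages appear in a log dump."""
    copies = COPIES_RE.search(text)
    return {
        "enabled_logged": ENABLED in text,
        "fallback_logged": FALLBACK in text,
        "sched_copies": int(copies.group(1)) if copies else None,
    }


if __name__ == "__main__":
    sample = (
        "llama_context: pipeline parallelism enabled\n"
        "sched_reserve: reserve took 221.00 ms, sched copies = 4\n"
    )
    print(summarize_log(sample))
```

A run that logs both the "enabled" line and the fallback warning started with pipelining and then dropped it during buffer reservation, which is the case worth comparing against a clean run.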

0 replies
