Gemma 4: Split Mode Tensor + Speculative Decoding #21975

cmoncure · 2026-04-16T01:11:45Z

cmoncure
Apr 16, 2026

My use case is to speed up Gemma-4-31B by loading Gemma-4-26B-A4B as the draft model for speculative decoding, with the tensors split across two 48 GB cards.

I suspect there is a bug preventing this combination of features from working together. I can run both models simultaneously in various configurations, but not in the desired configuration with both tensor parallelism and speculative decoding enabled. In that case, the process fails because it tries to allocate too much VRAM. I cannot think of a limitation that would rule this configuration out.

☑️ 31B (GPU0) + 26B-A4B (GPU1), Split Mode: none, Speculative Decoding

  -fa on
  --ctx-size 262144
  -b 4096
  -ub 4096
  --numa isolate
  --threads 16
  --threads-batch 32
  -sm none
  -mg 0 
  --gpu-layers 99
  -m "$model"
  -md "$draft_model"
  -devd 1
  -ngld 99
  --draft-n 4

☑️ 31B, 26B-A4B, Split Mode: tensor (across 2 llama.cpp processes, one for each model)

  -fa on
  --ctx-size 262144
  -b 4096
  -ub 4096
  --numa isolate
  --threads 16
  --threads-batch 32
  -sm tensor
  -ts 1,1
  --gpu-layers 99
  -m "$model"

❌✅ 31B + 26B-A4B, Split Mode: tensor, Speculative Decoding

  -fa on
  --ctx-size 262144
  -b 4096
  -ub 4096
  --numa isolate
  --threads 16
  --threads-batch 32
  -sm tensor
  -ts 1,1 
  --gpu-layers 99
  -m "$model"
  -md "$draft_model"
  -ngld 99
  --draft-n 4

baramofme · 2026-04-16T15:21:30Z

baramofme
Apr 16, 2026

For speculative decoding, the draft model needs to predict and output words very quickly... so I understand that very small models are usually used for this purpose.

For example, for a 35B target model, the draft model would be about 0.5B to 1B.

2 replies

cmoncure Apr 17, 2026
Author

Theoretically, speculative decoding speed can be modeled as Min(TG(s) * AR, PP(b)), where:

TG(s) := token generation speed of small model
AR := specdec acceptance rate
PP(b) := prompt processing speed of big model

I use Gemma-4-31B as b, Gemma-4-26B-A4B as s. The smaller models are not fit for purpose as they have only half the context length.

For me PP(b) is about 2500 t/s and TG(b) is 45 t/s, but I get a TG(s) of 150 t/s and an acceptance rate of ~0.7, so I would expect a theoretical max TG(spec) of 105 t/s, which is a huge speedup.
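
To make the arithmetic explicit, here is a minimal Python sketch of that estimate. The function name `spec_decode_upper_bound` and its parameter names are made up for illustration; it only encodes the simplified Min(TG(s) * AR, PP(b)) model above and ignores any verification or scheduling overhead.

```python
def spec_decode_upper_bound(tg_small: float, acceptance: float, pp_big: float) -> float:
    """Rough upper bound on speculative token generation speed, in tokens/s.

    tg_small   -- TG(s): token generation speed of the draft model
    acceptance -- AR: fraction of drafted tokens the big model accepts
    pp_big     -- PP(b): prompt processing speed of the big model
    """
    # Accepted draft tokens per second, capped by how fast the big model
    # can process (verify) tokens in a batch.
    return min(tg_small * acceptance, pp_big)


# Numbers quoted in this comment.
estimate = spec_decode_upper_bound(tg_small=150.0, acceptance=0.7, pp_big=2500.0)
print(f"estimated TG(spec): {estimate:.1f} t/s")
```

With these inputs the draft term is the limiting one, since PP(b) is far higher than TG(s) * AR.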

baramofme Apr 17, 2026

To think the acceptance rate is 0.7... that's incredible.

cmoncure · 2026-04-17T13:30:13Z

cmoncure
Apr 17, 2026
Author

At any rate, my question is, before I go and report an issue, is this mode (tp + speculative decoding) supported and/or theoretically possible, or is there some nuance to why it can't work, despite both LLMs fitting in VRAM side-by-side, and my mental model is too shallow?

Again, Model A and Model B work with speculative decoding. Model A and Model B also work with split mode tensor, albeit in separate llama.cpp processes. Is there a reason why Model A and Model B can't be loaded with speculative decoding and split mode tensor, or is it just a bug?

0 replies

cmoncure · 2026-05-01T03:22:15Z

cmoncure
May 1, 2026
Author

I've gotten it to run with a newer build and a smaller quant of the MoE model. But this can't be right. There's no increase in performance at all??

Base model performance (gemma-4-31B-it-q8):
prompt eval time = 795.01 ms / 2033 tokens ( 0.39 ms per token, 2557.19 tokens per second)
eval time = 13667.24 ms / 641 tokens ( 21.32 ms per token, 46.90 tokens per second)
total time = 14462.25 ms / 2674 tokens

Draft model performance (gemma-4-26B-A4B-it-q2_S):
prompt eval time = 287.53 ms / 2033 tokens ( 0.14 ms per token, 7070.57 tokens per second)
eval time = 9316.17 ms / 1735 tokens ( 5.37 ms per token, 186.24 tokens per second)
total time = 9603.70 ms / 3768 tokens

Combined base + draft model performance:
prompt eval time = 797.26 ms / 2033 tokens ( 0.39 ms per token, 2549.98 tokens per second)
eval time = 14271.34 ms / 689 tokens ( 20.71 ms per token, 48.28 tokens per second)
total time = 15068.60 ms / 2722 tokens
draft acceptance rate = 0.70522 ( 378 accepted / 536 generated)
statistics draft: #calls(b,g,a) = 1 310 165, #gen drafts = 198, #acc drafts = 165, #gen tokens = 536, #acc tokens = 378, dur(b,g,a) = 0.001, 4919.207, 0.049 ms
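
As a sanity check on those numbers, a small Python sketch that derives a few ratios from the combined run. One assumption to flag loudly: I am reading the second value of `dur(b,g,a)` as the total time spent generating drafts; if that field means something else, the draft-related figures below do not hold.

```python
# Figures copied from the combined base + draft run above.
eval_ms = 14271.34         # eval time of the combined run
eval_tokens = 689          # tokens produced in that time
draft_generated = 536      # "#gen tokens"
draft_accepted = 378       # "#acc tokens"
draft_gen_ms = 4919.207    # ASSUMPTION: second value of dur(b,g,a) is draft generation time

acceptance = draft_accepted / draft_generated         # fraction of drafted tokens accepted
draft_share = draft_gen_ms / eval_ms                  # share of eval time spent drafting
draft_tps = draft_generated / (draft_gen_ms / 1000)   # effective draft speed in this run
overall_tps = eval_tokens / (eval_ms / 1000)          # overall generation speed

print(f"acceptance rate:      {acceptance:.3f}")
print(f"draft share of eval:  {draft_share:.1%}")
print(f"effective draft speed: {draft_tps:.1f} t/s")
print(f"overall speed:         {overall_tps:.1f} t/s")
```

If that reading of the log is right, drafting accounts for roughly a third of the eval time, and the draft model runs at roughly 109 t/s inside the combined run, compared with the 186 t/s it reaches on its own.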

0 replies

powerman · 2026-05-06T09:00:52Z

powerman
May 6, 2026

https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/
Any chance this will be supported by llama.cpp?

1 reply

julianlam May 7, 2026

#22673 is the PR to follow for MTP support in a combined model file. The separate assistant models (e.g. gemma-4-26B-A4B-it-assistant) are available to use separately, but do not come with llama.cpp compatible GGUFs 😢
