Replies: 4 comments 3 replies
-
For speculative decoding, the draft model needs to predict and output tokens very quickly, so my understanding is that very small models are usually used for this purpose. For example, for a 35B target model, the draft would be about 0.5B to 1B.
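To put a rough number on why a small draft model matters, here is a minimal sketch of the commonly used back-of-the-envelope speedup model for speculative decoding. It assumes every drafted token is accepted independently with the same probability `alpha`, and that verifying a batch of drafted tokens costs the same as one normal target forward pass. Neither assumption holds exactly in practice, and the names `expected_speedup`, `alpha`, `gamma`, and `cost_ratio` are illustrative, not llama.cpp parameters.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimate the speedup of speculative decoding over plain decoding.

    alpha      -- assumed per-token acceptance probability, in [0, 1)
    gamma      -- number of tokens drafted per verification step
    cost_ratio -- time per draft token / time per target forward pass
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    # Expected tokens produced per step: accepted drafts plus the one
    # token the target model always contributes.
    expected_tokens = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # Cost of one step, in units of a single target forward pass.
    step_cost = gamma * cost_ratio + 1.0
    return expected_tokens / step_cost


if __name__ == "__main__":
    # Hypothetical cost ratios: a tiny draft, a mid-size draft, a large draft.
    for cost_ratio in (0.03, 0.15, 0.5):
        s = expected_speedup(alpha=0.8, gamma=4, cost_ratio=cost_ratio)
        print(f"cost_ratio={cost_ratio:.2f} -> estimated speedup {s:.2f}x")
```

Under these assumptions, the same acceptance rate yields a much smaller gain as the draft model gets slower relative to the target, which is the usual argument for drafts in the 0.5B to 1B range.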
-
At any rate, my question is: before I go and report an issue, is this mode (tensor parallelism + speculative decoding) supported and/or theoretically possible? Or is there some nuance that explains why it can't work even though both LLMs fit in VRAM side by side, and my mental model is just too shallow? Again, Model A and Model B work with speculative decoding. Model A and Model B also work with split mode tensor, albeit in separate llama.cpp processes. Is there a reason why Model A and Model B can't be loaded with both speculative decoding and split mode tensor, or is it just a bug?
-
I've gotten it to run with a newer build and a smaller quant of the MoE model. But this can't be right. There's no increase in performance at all?
Base model performance (gemma-4-31B-it-q8):
Draft model performance (gemma-4-26B-A4B-it-q2_S):
Combined base + draft model performance:
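One way to sanity-check a result like this is to ask what acceptance rate the draft model would need just to break even, given the two standalone throughputs. The sketch below is an estimate under simplifying assumptions: independent per-token acceptance, and verification costing one target forward pass. The function name and the throughput numbers in the usage lines are hypothetical and are not the measurements from this thread.

```python
from typing import Optional


def min_acceptance_for_speedup(base_tps: float, draft_tps: float,
                               gamma: int) -> Optional[float]:
    """Smallest per-token acceptance rate at which speculative decoding
    breaks even, or None if it can never break even under this model.

    base_tps  -- standalone tokens/s of the target model
    draft_tps -- standalone tokens/s of the draft model
    gamma     -- number of tokens drafted per verification step
    """
    # Time per draft token relative to time per target token.
    cost_ratio = base_tps / draft_tps
    if cost_ratio >= 1.0:
        return None  # the draft is no faster than the target

    def speedup(alpha: float) -> float:
        # Expected tokens per step, written as a sum so alpha = 1 is safe.
        expected_tokens = sum(alpha ** i for i in range(gamma + 1))
        return expected_tokens / (gamma * cost_ratio + 1.0)

    # speedup() increases with alpha, so bisect for the break-even point.
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if speedup(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    # Hypothetical throughputs, for illustration only.
    print(min_acceptance_for_speedup(base_tps=30.0, draft_tps=60.0, gamma=4))
    print(min_acceptance_for_speedup(base_tps=30.0, draft_tps=300.0, gamma=4))
```

If the draft model is only modestly faster than the target, the required acceptance rate can be high enough that a flat result is plausible rather than a sign of a bug.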
-
https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/
-
My use case is to speed up Gemma-4-31B by loading Gemma-4-26B-A4B as the draft model for speculative decoding, and splitting the tensors across two 48 GB cards.
I suspect a bug is preventing this combination of features from working together. I can run both models simultaneously in various configurations, but not in the desired configuration with both tensor parallelism and speculative decoding enabled. In that case, the process fails because it tries to allocate too much VRAM. I cannot think of a limitation that would rule this configuration out.
☑️ 31B (GPU0) + 26B-A4B (GPU1), Split Mode: none, Speculative Decoding
☑️ 31B, 26B-A4B, Split Mode: tensor (across 2 llama.cpp processes, one for each model)
❌✅ 31B + 26B-A4B, Split Mode: tensor, Speculative Decoding
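The reasoning behind the list above can be made concrete with a rough per-GPU estimate of weight memory for each layout. This is only a sketch: it counts weights alone (no KV cache, compute buffers, or runtime overhead), assumes an even split across cards, and uses an assumed 8.5 bits per weight for both models, since the quantization used in these runs is not stated here. The helper names are illustrative.

```python
GIB = 2 ** 30


def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def per_gpu_weights_gib(placements, n_gpus: int):
    """Weight memory per GPU.

    placements -- list of (size_gib, [gpu indices the model is split across]);
                  each model is assumed to be split evenly over its GPUs.
    """
    usage = [0.0] * n_gpus
    for size_gib, gpus in placements:
        for gpu in gpus:
            usage[gpu] += size_gib / len(gpus)
    return usage


if __name__ == "__main__":
    # Assumed bits per weight; adjust to the quantization actually used.
    target = weights_gib(31, 8.5)  # Gemma-4-31B
    draft = weights_gib(26, 8.5)   # Gemma-4-26B-A4B

    # One model per card, no splitting.
    print(per_gpu_weights_gib([(target, [0]), (draft, [1])], n_gpus=2))
    # Both models split across both cards.
    print(per_gpu_weights_gib([(target, [0, 1]), (draft, [0, 1])], n_gpus=2))
```

By this estimate the weights alone need no more per card in the split layout than the larger model needs on its own card, so an out-of-memory failure would have to come from something other than the weights, such as buffers allocated per model or per device.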