Does llama.cpp ACTUALLY support pipeline parallelism? #20252
-
Yes, it is supported. You can read more about how it works in #6017. If you configure it correctly, PP performance scales nearly linearly with the number of devices, even for a single request.
-
I just tried this on 4xA40 GPUs, and I can see good scaling.
-
@marlin-oss Hi! Did you figure it out? I see the same thing on dual RX6800. There is exactly zero difference between Seeing that there is a behavior that disables pipeline parallelism if you're low on memory, maybe it's disabled by something else as well, but without logging anything?
-
I'm really scratching my head here. The log says "llama_context: pipeline parallelism enabled". As far as I can tell, with layer split, it's only "batch parallel" or "pipeline sequential". Based on my understanding of the term "pipeline parallel", a model split between N GPUs should be able to process N concurrent requests "roughly" N times faster than a single request (minus overhead).
For example, with 2 GPUs and 2 requests (prompt processing):
While one GPU is idle, waiting for the others, it starts processing the next request, like a pipeline. I do not see this behavior. Only 1 GPU is ever processing at any given time. This makes token generation faster, but has minimal benefit for prompt processing.
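To make the expected overlap concrete, here is a minimal toy model of the schedule described above. It is only a sketch under stated assumptions: every GPU (stage) takes exactly one time unit per request, transfers are free, and the function names are made up for illustration. It does not reflect how llama.cpp actually schedules work.

```python
# Toy model of an idealized pipeline: NOT llama.cpp's scheduler.
# Assumption: each stage (GPU) needs exactly 1 time unit per request
# and there is no transfer or synchronization overhead.

def pipeline_schedule(n_stages, n_requests):
    """Map each stage to a list of (start_time, request_id) slots."""
    schedule = {stage: [] for stage in range(n_stages)}
    for req in range(n_requests):
        for stage in range(n_stages):
            # Request `req` reaches stage `stage` after passing through
            # all earlier stages and waiting behind all earlier requests.
            schedule[stage].append((req + stage, req))
    return schedule

def makespan_pipelined(n_stages, n_requests):
    """Total time when stages overlap on different requests."""
    return n_stages + n_requests - 1

def makespan_sequential(n_stages, n_requests):
    """Total time when only one GPU is ever busy."""
    return n_stages * n_requests

def render(schedule, total_time):
    """Draw one row per GPU; '.' marks an idle time slot."""
    rows = []
    for stage, slots in schedule.items():
        row = ["."] * total_time
        for start, req in slots:
            row[start] = str(req % 10)  # one character per time slot
        rows.append(f"GPU{stage}: " + "".join(row))
    return "\n".join(rows)

if __name__ == "__main__":
    stages, requests = 2, 2
    total = makespan_pipelined(stages, requests)
    print(render(pipeline_schedule(stages, requests), total))
    print("pipelined:", total,
          "sequential:", makespan_sequential(stages, requests))
```

In this toy model, 2 GPUs and 2 requests finish in 3 time units instead of 4. The gain only approaches N times as the number of in-flight requests (or chunks of one request) grows well beyond the number of GPUs.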
I've tried every combination of batch, physical batch, continuous batching, parallel, and other flags I can think of. Am I missing something here? Is there a build flag?
Any help is much appreciated.