-
I think you have the wrong "config": on my "small" Ryzen 7940HS (~45 W), 128 GB RAM @ 5600 (8x Zen 4), and with this bench:
OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY="1,3,5,7,9,11,13,15" \
ZENDNNL_MATMUL_ALGO=2 \
./llama-bench -ctk bf16 -ctv bf16 -ub 4096 -b 8192 \
-r 3 \
-p "1,1,2,3,4,8,12,16,24,32,48,64,96,128,192,256,384,512,768,1024,1536,2048,3072,4096,8192" \
-p "4096,3072,2048,1536,1024,768,512,384,256,192,128,96,64,48,32,24,16,12,8,4,3,2,1" \
-n 16 \
-pg "512,64" \
-m Meta-Llama-3.1-8B-Instruct/BF16.gguf
I get:
On a Ryzen AI Max+ 395 (16x Zen 5), with OMP_NUM_THREADS=16 GOMP_CPU_AFFINITY="0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15" \
./llama-bench -ctk bf16 -ctv bf16 -ub 4096 -b 8192 \
-r 3 \
-p "1,1,2,3,4,8,12,16,24,32,48,64,96,128,192,256,384,512,768,1024,1536,2048,3072,4096,8192" \
-p "4096,3072,2048,1536,1024,768,512,384,256,192,128,96,64,48,32,24,16,12,8,4,3,2,1" \
-n 16 \
-pg "512,64" \
-m Meta-Llama-3.1-8B-Instruct/BF16.gguf
I don't know the topology of the 7950X, but you can try the same as with the MAX. Only at pp4096 are you faster than me... do you have a fast "fa" (flash attention) implementation enabled by default? If yes, we may get even faster PP with ZenDNN + your FA.
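For reference, a minimal sketch of toggling flash attention explicitly in `llama-bench` (flag spelling and defaults may differ between `llama.cpp` versions and `ik_llama.cpp`):

```
./llama-bench -m Meta-Llama-3.1-8B-Instruct/BF16.gguf -p 4096 -n 64 -fa 1
```

Comparing this against a run without `-fa` would show whether flash attention is what makes the difference at pp4096.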
-
I'm on Ubuntu 22.04 and use the stock compiler (GCC 11.4.0). I had to fix the […]. Then simply […] or […]. I never had to fool around with thread affinity on that box (and it seems strange that one needs to; it would be very surprising if kernel developers hadn't thought of not putting two high-utilization threads on the same core). What kind of performance do you get with […]? For […]
-
Oh, I see you had an […]
-
But, to be more comprehensive, here is what I get with ZenDNN using your exact command:
And here is what I get with […]:
Btw, when you run the […]?
-
One final thing: if you want to see really fast CPU prompt processing, do the following: […] Then run your benchmark with the just-created 8-bit quantized model. On my CPU I get 57% higher peak PP performance than […]
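The exact commands aren't shown above, so as a sketch only: one plausible way to get an 8-bit model with the stock tooling, assuming `llama-quantize` from the same build and `Q8_0` as the 8-bit type (not necessarily the author's exact recipe):

```
./llama-quantize Meta-Llama-3.1-8B-Instruct/BF16.gguf \
                 Meta-Llama-3.1-8B-Instruct/Q8_0.gguf Q8_0
./llama-bench -m Meta-Llama-3.1-8B-Instruct/Q8_0.gguf -p 512 -n 128 -t 16
```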
-
Thank you for all this information! Now […]
My point is to find out whether there is any interest in adding ZenDNN (I think it is better to use AOCL_gemm directly!). Note: I'll give -rtr a try. As for GOMP_CPU_AFFINITY="1,3,5,7,9,11,13,15": it prevents thread migration between cores. It is used on most HPC systems and allows better cache management; the L1/L2 cache contents don't have to migrate along with the thread, so it can add some performance.
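As a sketch of the same pinning idea, assuming the build uses OpenMP threading: the explicit list is what is used above, and `OMP_PLACES`/`OMP_PROC_BIND` is a portable alternative with a similar effect:

```
# explicit list (GNU OpenMP): one thread per physical core, no migration between cores
export OMP_NUM_THREADS=8
export GOMP_CPU_AFFINITY="1,3,5,7,9,11,13,15"

# portable alternative using the standard OpenMP environment variables
export OMP_PLACES=cores
export OMP_PROC_BIND=close
```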
-
So, more analysis with a more "professional" procedure 🤞
OS: Fedora Silverblue.
cmake is not good for building with ZenDNN on Ubuntu (and I'm not fluent with Ubuntu...), so no results with it. So... I will be curious to see what we can get with AOCL gemm (with repacking) + your FA...
-
Wow! Thanks! GCC-15 doesn't unroll a loop with a compile-time-known number of iterations? You can submit a PR if you want. I guess one needs to make it a compiler-specific macro and add it to all loops over […]
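As an illustration of what such a compiler-specific macro could look like (the names below are hypothetical, not taken from `ik_llama.cpp`):

```cpp
// Hypothetical names (IK_PRAGMA_, IK_UNROLL) -- a sketch, not ik_llama.cpp code.
#define IK_PRAGMA_(x) _Pragma(#x)

#if defined(__clang__)
#  define IK_UNROLL(n) IK_PRAGMA_(clang loop unroll_count(n))
#elif defined(__GNUC__)
#  define IK_UNROLL(n) IK_PRAGMA_(GCC unroll n)   // supported since GCC 8
#else
#  define IK_UNROLL(n)
#endif

// Usage: hint the compiler to fully unroll a loop with a compile-time known trip count.
static inline float sum8(const float * x) {
    float s = 0.0f;
    IK_UNROLL(8)
    for (int i = 0; i < 8; ++i) {
        s += x[i];
    }
    return s;
}
```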
-
I haven't been contributing to `llama.cpp` since I left the project in March of 2024, but apparently I'm still one of the top contributors there 20 months later, so got pinged in an RFC and subsequent PR that integrates ZenDNN into `llama.cpp`. ZenDNN is a matrix multiplication library specifically optimized for AMD CPUs. It supports `bf16` and `f32` GEMM. I haven't put a lot of effort into optimizing inference with floating point models (for me "Inference at the Edge" basically means using quantized models), so I decided to check if this could be something for `ik_llama.cpp` to handle `bf16` and `f32` models.

The RFC and PR provide benchmark results for a big-iron, 96-core Zen4 CPU. I don't have that, but I do have a 16-core Ryzen-7950X, which is also Zen4, so ZenDNN should be optimized for it.
So, I pulled and built the PR (it required a minor modification in the `CMakeLists.txt` file), and here is what we get with `llama-bench` on the 7950X for `bf16` LLaMA-3-8B:

I used `ZENDNNL_MATMUL_ALGO=2` as recommended. The default (whatever it is) gives a PP performance of 163 t/s.

In comparison, here is what we get with the `llama.cpp` CPU backend:

Aha. ZenDNN nearly doubles `llama.cpp` PP performance, but that's not really hard. TG, on the other hand, is almost 2X lower.

How does `ik_llama.cpp` compare? Here is what we get:
llama.cppfor PP. TG is faster thanllama.cppfor 1 and 2 threads, almost fully saturating BW with just 2 threads.llama.cppsomehow manages to saturate at a slight higher TG speed at 4 threads. Both are faster than ZenDNN with 16 threads with just a single thread (so more than 16X better energy efficiency when generating tokens)!Beta Was this translation helpful? Give feedback.