Commit 2154a0f
CUDA: enroll mul_mat_vec_q_moe into pdl (#24087)
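* Enroll mul_mat_vec_q_moe into PDL (programmatic dependent launch), boosting MTP performance on Blackwell (BW)
Data collected on a B4500; tok/s rises by roughly 4.5% to 6.0% on every prompt, with draft acceptance rates unchanged: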
Before
```
(llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py
code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8
code_cpp pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=212.8
explain_concept pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=196.4
summarize pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=226.6
qa_factual pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=225.1
translation pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=201.5
creative_short pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=197.2
stepwise_math pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=209.2
long_code_review pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=208.9
```
After
```
(llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py
code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=211.9
code_cpp pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=224.6
explain_concept pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=207.8
summarize pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=240.2
qa_factual pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=238.5
translation pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=213.4
creative_short pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=208.8
stepwise_math pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=221.7
long_code_review pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=220.7
```
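The gain per prompt can be read off the two runs above. The following is a minimal sketch that computes it, with the tok/s values copied from the Before and After logs; the helper name `speedup_percent` and the dictionary layout are illustrative assumptions and are not part of `mtp-bench.py`.

```python
# tok/s values copied from the "Before" and "After" runs above.
before = {
    "code_python": 202.8,
    "code_cpp": 212.8,
    "explain_concept": 196.4,
    "summarize": 226.6,
    "qa_factual": 225.1,
    "translation": 201.5,
    "creative_short": 197.2,
    "stepwise_math": 209.2,
    "long_code_review": 208.9,
}
after = {
    "code_python": 211.9,
    "code_cpp": 224.6,
    "explain_concept": 207.8,
    "summarize": 240.2,
    "qa_factual": 238.5,
    "translation": 213.4,
    "creative_short": 208.8,
    "stepwise_math": 221.7,
    "long_code_review": 220.7,
}


def speedup_percent(old: float, new: float) -> float:
    """Relative throughput gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Per-prompt gain, keyed by prompt name.
gains = {name: speedup_percent(before[name], after[name]) for name in before}

for name, gain in gains.items():
    print(f"{name:<18} {before[name]:6.1f} -> {after[name]:6.1f} tok/s  (+{gain:.1f}%)")

# Unweighted mean over the nine prompts.
mean_gain = sum(gains.values()) / len(gains)
print(f"mean gain: +{mean_gain:.1f}%")
```

Because `pred`, `draft`, `acc`, and `rate` are identical in both runs, the difference in tok/s reflects kernel execution speed rather than a change in speculative-decoding behavior.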
Server launched with:
```
➜ llama.cpp git:(osimons/enroll_mul_mat_vec_q_moe_into_PDL) ✗ ./build-x64-linux-gcc-reldbg/bin/llama-server \
-m /mnt/share/gguf/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -dio \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
-ngl all \
-fa on \
--host 0.0.0.0 \
--port 8080 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"
```
* LC to overlap with following kernels
1 parent 46fa662 commit 2154a0f
1 file changed
Lines changed: 11 additions & 3 deletions
Diff content was not captured; the changed hunks cover original lines 682-693, 707-712, 726-731, and 794-801 (new lines 682-697, 711-717, 731-738, and 801-809).