Skip to content

Commit da2326b

Browse files
committed
add example: optimizing code generation
1 parent da46545 commit da2326b

File tree

3 files changed

+169
-6
lines changed

3 files changed

+169
-6
lines changed

examples/README.md

Lines changed: 3 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,6 @@
1-
# Example Configurations
1+
# Example Configs and Use Cases
22

3-
Learning by example is best.
4-
5-
Here in the `examples/` folder are llama-swap configurations that can be used on your local LLM server.
6-
7-
## List
3+
A collections of usecases and examples for getting the most out of llama-swap.
84

95
* [Speculative Decoding](speculative-decoding/README.md) - using a small draft model can increase inference speeds from 20% to 40%. This example includes a configurations Qwen2.5-Coder-32B (2.5x increase) and Llama-3.1-70B (1.4x increase) in the best cases.
6+
* [Optimizing Code Generation](benchmark-snakegame/README.md) - find the optimal settings for your machine. This example demonstrates defining multiple configurations and testing which one is fastest.
Lines changed: 123 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,123 @@
1+
# Optimizing Code Generation with llama-swap
2+
3+
Finding the best mix of settings for your hardware can be time consuming. This example demonstrates using a custom configuration file to automate testing different scenarios to find the an optimal configuration.
4+
5+
The benchmark writes a snake game in Python, TypeScript, and Swift using the Qwen 2.5 Coder models. The experiments were done using a 3090 and a P40.
6+
7+
**Benchmark Scenarios**
8+
9+
Three scenarios are tested:
10+
11+
- 3090-only: Just the main model on the 3090
12+
- 3090-with-draft: the main and draft models on the 3090
13+
- 3090-P40-draft: the main model on the 3090 with the draft model offloaded to the P40
14+
15+
**Available Devices**
16+
17+
Use the following command to list available devices IDs for the configuration:
18+
19+
```
20+
$ /mnt/nvme/llama-server/llama-server-f3252055 --list-devices
21+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
22+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
23+
ggml_cuda_init: found 4 CUDA devices:
24+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
25+
Device 1: Tesla P40, compute capability 6.1, VMM: yes
26+
Device 2: Tesla P40, compute capability 6.1, VMM: yes
27+
Device 3: Tesla P40, compute capability 6.1, VMM: yes
28+
Available devices:
29+
CUDA0: NVIDIA GeForce RTX 3090 (24154 MiB, 406 MiB free)
30+
CUDA1: Tesla P40 (24438 MiB, 22942 MiB free)
31+
CUDA2: Tesla P40 (24438 MiB, 24144 MiB free)
32+
CUDA3: Tesla P40 (24438 MiB, 24144 MiB free)
33+
```
34+
35+
**Configuration**
36+
37+
The configuration file, `benchmark-config.yaml`, defines the three scenarios:
38+
39+
```yaml
40+
models:
41+
"3090-only":
42+
proxy: "http://127.0.0.1:9503"
43+
cmd: >
44+
/mnt/nvme/llama-server/llama-server-f3252055
45+
--host 127.0.0.1 --port 9503
46+
--flash-attn
47+
--slots
48+
49+
--model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
50+
-ngl 99
51+
--device CUDA0
52+
53+
--ctx-size 32768
54+
--cache-type-k q8_0 --cache-type-v q8_0
55+
56+
"3090-with-draft":
57+
proxy: "http://127.0.0.1:9503"
58+
# --ctx-size 28500 max that can fit on 3090 after draft model
59+
cmd: >
60+
/mnt/nvme/llama-server/llama-server-f3252055
61+
--host 127.0.0.1 --port 9503
62+
--flash-attn
63+
--slots
64+
65+
--model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
66+
-ngl 99
67+
--device CUDA0
68+
69+
--model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
70+
-ngld 99
71+
--draft-max 16
72+
--draft-min 4
73+
--draft-p-min 0.4
74+
--device-draft CUDA0
75+
76+
--ctx-size 28500
77+
--cache-type-k q8_0 --cache-type-v q8_0
78+
79+
"3090-P40-draft":
80+
proxy: "http://127.0.0.1:9503"
81+
cmd: >
82+
/mnt/nvme/llama-server/llama-server-f3252055
83+
--host 127.0.0.1 --port 9503
84+
--flash-attn --metrics
85+
--slots
86+
--model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
87+
-ngl 99
88+
--device CUDA0
89+
90+
--model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
91+
-ngld 99
92+
--draft-max 16
93+
--draft-min 4
94+
--draft-p-min 0.4
95+
--device-draft CUDA1
96+
97+
--ctx-size 32768
98+
--cache-type-k q8_0 --cache-type-v q8_0
99+
```
100+
101+
> Note in the `3090-with-draft` scenario the `--ctx-size` had to be reduced from 32768 to to accommodate the draft model.
102+
103+
104+
**Running the Benchmark**
105+
106+
To run the benchmark, execute the following commands:
107+
108+
1. `llama-swap -config benchmark-config.yaml`
109+
1. `./run-benchmark.sh http://localhost:8080 "3090-only" "3090-with-draft" "3090-P40-draft"`
110+
111+
The [benchmark script](run-benchmark.sh) generates a CSV output of the results, which can be converted to a Markdown table for readability.
112+
113+
**Results (tokens/second)**
114+
115+
| model | python | typescript | swift |
116+
|-----------------|--------|------------|-------|
117+
| 3090-only | 34.03 | 34.01 | 34.01 |
118+
| 3090-with-draft | 106.65 | 70.48 | 57.89 |
119+
| 3090-P40-draft | 81.54 | 60.35 | 46.50 |
120+
121+
Many different factors, like the programming language, can have big impacts on the performance gains. However, with a custom configuration file for benchmarking it is easy to test the different variations to discover what's best for your hardware.
122+
123+
Happy coding!
Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,43 @@
1+
#!/usr/bin/env bash
2+
3+
# This script generates a CSV file showing the token/second for generating a Snake Game in python, typescript and swift
4+
# It was created to test the effects of speculative decoding and the various draft settings on performance.
5+
#
6+
# Writing code with a low temperature seems to provide fairly consistent logic.
7+
#
8+
# Usage: ./benchmark.sh <url> <model1> [model2 ...]
9+
# Example: ./benchmark.sh http://localhost:8080 model1 model2
10+
11+
if [ "$#" -lt 2 ]; then
12+
echo "Usage: $0 <url> <model1> [model2 ...]"
13+
exit 1
14+
fi
15+
16+
url=$1; shift
17+
18+
echo "model,python,typescript,swift"
19+
20+
for model in "$@"; do
21+
22+
echo -n "$model,"
23+
24+
for lang in "python" "typescript" "swift"; do
25+
response=$(curl -s --url "$url/v1/chat/completions" -d "{\"messages\": [{\"role\": \"system\", \"content\": \"you only write code.\"}, {\"role\": \"user\", \"content\": \"write snake game in $lang\"}], \"temperature\": 0.1, \"model\":\"$model\"}")
26+
if [ $? -ne 0 ]; then
27+
time="error"
28+
else
29+
time=$(curl -s --url "$url/logs" | grep -oE '\d+(?:\.\d+)? tokens per second' | awk '{print $1}' | tail -n 1)
30+
if [ $? -ne 0 ]; then
31+
time="error"
32+
fi
33+
fi
34+
35+
if [ "$lang" != "swift" ]; then
36+
echo -n "$time,"
37+
else
38+
echo -n "$time"
39+
fi
40+
done
41+
42+
echo ""
43+
done

0 commit comments

Comments
 (0)