
Commit 0dd8923

antmikinka and claude committed
feat: Add Qwen3.5 benchmark comparison (2B vs 4B)
- Added benchmark scripts for model evaluation - Qwen3.5-2B-GGUF: 5.32 tok/s (winner) - Qwen3.5-4B-GGUF: 4.67 tok/s - Imported results into dashboard database - Added system information and performance matrices - 2B model shows 12% better throughput and 2.3x parameter efficiency Files: - scripts/benchmark_qwen.py: Benchmark execution tool - scripts/import_benchmarks_direct.py: Database import tool - benchmark_results.json: Raw benchmark data - BENCHMARK_COMPARISON_REPORT.md: Detailed analysis report - SYSTEM_INFO_AND_MATRIX.md: Hardware specs and performance matrices Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 47d6e2e commit 0dd8923

File tree

6 files changed (+1302 −0 lines)

BENCHMARK_COMPARISON_REPORT.md

Lines changed: 120 additions & 0 deletions
# Qwen3.5-2B-GGUF vs Qwen3.5-4B-GGUF Benchmark Report

**Date:** April 7, 2026
**Test Environment:** Lemonade Eval Dashboard
**Backend:** lemonade-server (llamacpp)
**Device:** GPU

---

## Executive Summary

Both Qwen3.5 models were benchmarked using the Lemonade Eval Dashboard with identical test conditions. The **Qwen3.5-2B-GGUF** model demonstrated superior token generation throughput compared to the Qwen3.5-4B-GGUF model.

### Winner: Qwen3.5-2B-GGUF

- **Best TPS:** 5.32 tokens/second (at 64-token prompt)
- **Advantage:** ~12% faster than the 4B variant

---
## Detailed Results

### Qwen3.5-2B-GGUF

| Prompt Length | Tokens/Second (Mean) | Std Dev | Latency (Mean) |
|---------------|----------------------|---------|----------------|
| 64 tokens     | 5.32 tok/s           | ±0.43   | 5.42s          |
| 128 tokens    | 5.20 tok/s           | ±0.19   | 5.90s          |
| 256 tokens    | 4.43 tok/s           | ±0.45   | 6.43s          |

**Best Performance:** 5.32 tok/s at 64-token prompt length

### Qwen3.5-4B-GGUF

| Prompt Length | Tokens/Second (Mean) | Std Dev | Latency (Mean) |
|---------------|----------------------|---------|----------------|
| 64 tokens     | 4.67 tok/s           | ±1.55   | 4.99s          |
| 128 tokens    | 3.53 tok/s           | ±0.54   | 4.31s          |
| 256 tokens    | 3.82 tok/s           | ±0.60   | 6.30s          |

**Best Performance:** 4.67 tok/s at 64-token prompt length

---
## Performance Comparison

### Token Generation Speed (TPS)

```
Qwen3.5-2B-GGUF ████████████████████████████████████████ 5.32 tok/s
Qwen3.5-4B-GGUF ████████████████████████████████         4.67 tok/s
```

### Key Findings

1. **Speed Advantage:** Qwen3.5-2B-GGUF is approximately 12% faster than Qwen3.5-4B-GGUF in token generation
   - 2B: 5.32 tok/s (best)
   - 4B: 4.67 tok/s (best)

2. **Consistency:** Qwen3.5-2B-GGUF shows more consistent performance across different prompt lengths
   - 2B: lower standard deviation (0.19-0.45)
   - 4B: higher variance at shorter prompts (1.55 std dev at 64 tokens)

3. **Prompt Length Impact:** Both models show decreased performance with longer prompts
   - 2B: 17% drop from 64 to 256 tokens
   - 4B: 18% drop from 64 to 256 tokens

4. **Latency:** Qwen3.5-4B-GGUF has lower mean latency at 128-token prompts
   - 2B: 5.90s average
   - 4B: 4.31s average

---
## Test Configuration

- **Iterations:** 5 benchmark runs + 1 warmup
- **Output Tokens:** 32 tokens per run
- **Prompt Lengths:** 64, 128, 256 tokens
- **Backend:** llamacpp (GPU)
- **Quantization:** Q4_K_XL (UD)

---
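The diff does not include the body of `scripts/benchmark_qwen.py`, but the configuration above implies a timing loop of roughly this shape. This is a minimal sketch: the `generate` callable is a stand-in for the real lemonade-server request, and the function names are assumptions, not code taken from the repository.

```python
import statistics
import time

def benchmark(generate, prompt, iterations=5, warmup=1, output_tokens=32):
    """Time `generate` repeatedly and summarize tokens/second per run.

    `generate(prompt, max_tokens)` must return the number of tokens it
    produced; here it stands in for the actual model request.
    """
    for _ in range(warmup):                      # warmup runs, discarded
        generate(prompt, output_tokens)
    tps = []
    for _ in range(iterations):
        start = time.perf_counter()
        produced = generate(prompt, output_tokens)
        elapsed = time.perf_counter() - start
        tps.append(produced / elapsed)           # tokens per second this run
    return {"mean": statistics.mean(tps), "stdev": statistics.stdev(tps)}
```

Running this once per model and prompt length (64, 128, 256) would yield the mean/std-dev cells reported in the tables above.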
## Database Records

Results have been imported into the Lemonade Eval Dashboard database:

- **Models Created:** Qwen3.5-2B-GGUF, Qwen3.5-4B-GGUF
- **Runs Created:** 2 benchmark runs
- **Metrics Recorded:** 12 total metrics (6 per model)

### Database IDs

- Qwen3.5-2B-GGUF Model ID: `3d32510e-27d8-4ce1-93fc-5b59ddc2e343`
- Qwen3.5-4B-GGUF Model ID: `cc182c7f-1c76-489c-891a-b44230352ec9`

---
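A direct import along the lines of `scripts/import_benchmarks_direct.py` could look like the sketch below. The table and column names here are illustrative assumptions; the dashboard's real schema is not shown in this diff.

```python
import sqlite3
import uuid

# Illustrative schema only -- the dashboard's actual tables and columns
# are assumptions, not taken from this diff.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (id TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE metrics (model_id TEXT, name TEXT, value REAL)")

# One row per model, one metric row per (prompt length, TPS) pair.
model_id = str(uuid.uuid4())
conn.execute("INSERT INTO models VALUES (?, ?)", (model_id, "Qwen3.5-2B-GGUF"))
for prompt_len, tps in [(64, 5.32), (128, 5.20), (256, 4.43)]:
    conn.execute("INSERT INTO metrics VALUES (?, ?, ?)",
                 (model_id, f"tps_{prompt_len}", tps))
conn.commit()
```

Parameterized queries (the `?` placeholders) keep the insert safe regardless of what appears in model names or metric labels.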
## Recommendations

### For High-Throughput Applications

**Choose Qwen3.5-2B-GGUF**

- Better sustained token generation rates
- More consistent performance across prompt lengths
- Lower memory footprint

### For Accuracy-Critical Applications

**Consider Qwen3.5-4B-GGUF**

- The larger model may have better reasoning capabilities
- Lower latency at medium prompt lengths (4.31s vs 5.90s at 128 tokens)
- Trade-off between speed and potential accuracy

---
## Files Generated

1. `benchmark_results.json` - Raw benchmark data
2. `scripts/benchmark_qwen.py` - Benchmark script
3. `scripts/import_benchmarks_direct.py` - Database import script
4. `BENCHMARK_COMPARISON_REPORT.md` - This report

---

*Report generated by Lemonade Eval Dashboard Benchmarking System*

SYSTEM_INFO_AND_MATRIX.md

Lines changed: 132 additions & 0 deletions
# System Information and Benchmark Matrix

## Hardware Configuration

| Component | Specification |
|-----------|---------------|
| Platform | Windows 11 (10.0.26200) |
| Processor | AMD Ryzen AI (Family 26, Model 36) |
| CPU Cores | 24 cores |
| RAM | 79.62 GB |
| Python Version | 3.12.11 |

## Software Stack

| Component | Version/Details |
|-----------|-----------------|
| lemonade-server | 10.0.0 |
| Backend | llamacpp (vulkan) |
| Quantization | Q4_K_XL (UD) |
| GPU Backend | Vulkan |
## Model Benchmark Matrix

### Token Generation Speed (tokens/second)

| Model | 64-token prompt | 128-token prompt | 256-token prompt | Best |
|-------|-----------------|------------------|------------------|------|
| Qwen3.5-2B-GGUF | 5.32 ± 0.43 | 5.20 ± 0.19 | 4.43 ± 0.45 | **5.32** |
| Qwen3.5-4B-GGUF | 4.67 ± 1.55 | 3.53 ± 0.54 | 3.82 ± 0.60 | 4.67 |
| **Advantage** | **+12.3%** | **+31.8%** | **+14.0%** | **2B wins** |

### Latency (Time to First Token, seconds)

| Model | 64-token prompt | 128-token prompt | 256-token prompt | Best |
|-------|-----------------|------------------|------------------|------|
| Qwen3.5-2B-GGUF | 5.42 ± 0.56 | 5.90 ± 0.24 | 6.43 ± 0.69 | 5.42s |
| Qwen3.5-4B-GGUF | 4.99 ± 1.32 | 4.31 ± 0.34 | 6.30 ± 0.74 | **4.31s** |
| **Advantage** | 4B faster | **4B faster** | Similar | **4B wins** |

### Performance per Billion Parameters

| Model | Parameters | Best TPS | TPS per Billion |
|-------|------------|----------|-----------------|
| Qwen3.5-2B-GGUF | ~2B | 5.32 | **2.66 tok/s/B** |
| Qwen3.5-4B-GGUF | ~4B | 4.67 | 1.17 tok/s/B |
| **Efficiency** | | | **2B is 2.3x more efficient** |
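The efficiency row follows directly from dividing each model's best TPS by its approximate parameter count:

```python
def tps_per_billion(best_tps, params_billion):
    """Parameter efficiency as reported in the matrix above."""
    return best_tps / params_billion

eff_2b = tps_per_billion(5.32, 2.0)   # 2.66 tok/s/B
eff_4b = tps_per_billion(4.67, 4.0)   # ~1.17 tok/s/B
ratio = eff_2b / eff_4b               # ~2.3x in favor of the 2B model
```

Note the parameter counts are nominal (~2B, ~4B), so the 2.3x figure is an approximation.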
## Performance Charts

### Token Generation Speed Comparison

```
Tokens/second (higher is better)

Qwen3.5-2B-GGUF ████████████████████████████████████████ 5.32
Qwen3.5-4B-GGUF ████████████████████████████████         4.67
                0         2         4         6
```

### Performance by Prompt Length

```
TPS by Prompt Length

64 tokens:
  2B ████████████████████████████████████████ 5.32
  4B ████████████████████████████████ 4.67

128 tokens:
  2B ██████████████████████████████████████ 5.20
  4B ██████████████████████████ 3.53

256 tokens:
  2B ████████████████████████████████ 4.43
  4B ██████████████████████████████ 3.82
```
## Model Characteristics

| Attribute | Qwen3.5-2B-GGUF | Qwen3.5-4B-GGUF |
|-----------|-----------------|-----------------|
| Model Size | 1.34 GB | 2.91 GB |
| Quantization | Q4_K_XL | Q4_K_XL |
| Family | Qwen | Qwen |
| Type | LLM (Vision-capable) | LLM (Vision-capable) |
| Best Use Case | High-throughput | Better reasoning |

## Database Records

| Table | Records Created | IDs |
|-------|-----------------|-----|
| models | 2 | 3d32510e..., cc182c7f... |
| runs | 2 | Benchmark runs |
| metrics | 12 | 6 per model |
## Benchmark Methodology

### Test Parameters

| Parameter | Value |
|-----------|-------|
| Iterations | 5 |
| Warmup Runs | 1 |
| Output Tokens | 32 |
| Prompt Lengths | 64, 128, 256 tokens |
| Backend | llamacpp (GPU/Vulkan) |

### Metrics Collected

- **TPS (Tokens Per Second):** Primary throughput metric
- **TTFT (Time To First Token):** Latency metric
- **Standard Deviation:** Consistency measure
- **Min/Max Values:** Performance bounds
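The aggregation behind these metrics is standard; a minimal sketch (the exact code in `scripts/benchmark_qwen.py` is not shown in this diff, so the function name and dict layout are assumptions):

```python
import statistics

def collect_metrics(samples):
    """Reduce raw per-run readings (e.g. TPS or TTFT values)
    to the summary statistics listed above."""
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),  # sample std dev (n - 1)
        "min": min(samples),
        "max": max(samples),
    }
```

With 5 iterations per configuration, `statistics.stdev` (the n − 1 sample estimator) is the appropriate choice over the population `pstdev`.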
## Files Created

| File | Purpose |
|------|---------|
| `scripts/benchmark_qwen.py` | Benchmark execution script |
| `scripts/import_benchmarks.py` | Dashboard API import |
| `scripts/import_benchmarks_direct.py` | Direct DB import |
| `benchmark_results.json` | Raw benchmark data |
| `BENCHMARK_COMPARISON_REPORT.md` | Detailed report |
| `SYSTEM_INFO_AND_MATRIX.md` | This document |
## Conclusions

1. **Speed Winner:** Qwen3.5-2B-GGUF delivers 12% better throughput
2. **Latency Winner:** Qwen3.5-4B-GGUF has lower latency at medium prompt lengths
3. **Efficiency Winner:** Qwen3.5-2B-GGUF is 2.3x more parameter-efficient
4. **Recommendation:** Use 2B for high-throughput workloads, 4B for complex reasoning tasks

---

*Generated: April 7, 2026*
*Lemonade Eval Dashboard Benchmarking System*
