
Commit 0dd8923

antmikinka and claude committed
feat: Add Qwen3.5 benchmark comparison (2B vs 4B)
- Added benchmark scripts for model evaluation - Qwen3.5-2B-GGUF: 5.32 tok/s (winner) - Qwen3.5-4B-GGUF: 4.67 tok/s - Imported results into dashboard database - Added system information and performance matrices - 2B model shows 12% better throughput and 2.3x parameter efficiency Files: - scripts/benchmark_qwen.py: Benchmark execution tool - scripts/import_benchmarks_direct.py: Database import tool - benchmark_results.json: Raw benchmark data - BENCHMARK_COMPARISON_REPORT.md: Detailed analysis report - SYSTEM_INFO_AND_MATRIX.md: Hardware specs and performance matrices Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 47d6e2e commit 0dd8923

File tree

6 files changed (+1302 −0 lines)

BENCHMARK_COMPARISON_REPORT.md

Lines changed: 120 additions & 0 deletions
# Qwen3.5-2B-GGUF vs Qwen3.5-4B-GGUF Benchmark Report

**Date:** April 7, 2026
**Test Environment:** Lemonade Eval Dashboard
**Backend:** lemonade-server (llamacpp)
**Device:** GPU

---

## Executive Summary

Both Qwen3.5 models were benchmarked using the Lemonade Eval Dashboard with identical test conditions. The **Qwen3.5-2B-GGUF** model demonstrated superior token generation throughput compared to the Qwen3.5-4B-GGUF model.

### Winner: Qwen3.5-2B-GGUF

- **Best TPS:** 5.32 tokens/second (at 64-token prompt)
- **Advantage:** ~12% faster than the 4B variant

---
## Detailed Results

### Qwen3.5-2B-GGUF

| Prompt Length | Tokens/Second (Mean) | Std Dev | Latency (Mean) |
|---------------|----------------------|---------|----------------|
| 64 tokens     | 5.32 tok/s           | ±0.43   | 5.42s          |
| 128 tokens    | 5.20 tok/s           | ±0.19   | 5.90s          |
| 256 tokens    | 4.43 tok/s           | ±0.45   | 6.43s          |

**Best Performance:** 5.32 tok/s at 64-token prompt length

### Qwen3.5-4B-GGUF

| Prompt Length | Tokens/Second (Mean) | Std Dev | Latency (Mean) |
|---------------|----------------------|---------|----------------|
| 64 tokens     | 4.67 tok/s           | ±1.55   | 4.99s          |
| 128 tokens    | 3.53 tok/s           | ±0.54   | 4.31s          |
| 256 tokens    | 3.82 tok/s           | ±0.60   | 6.30s          |

**Best Performance:** 4.67 tok/s at 64-token prompt length

---
## Performance Comparison

### Token Generation Speed (TPS)

```
Qwen3.5-2B-GGUF ████████████████████████████████████████ 5.32 tok/s
Qwen3.5-4B-GGUF ████████████████████████████████         4.67 tok/s
```

### Key Findings

1. **Speed Advantage:** Qwen3.5-2B-GGUF is approximately 12% faster than Qwen3.5-4B-GGUF in token generation
   - 2B: 5.32 tok/s (best)
   - 4B: 4.67 tok/s (best)

2. **Consistency:** Qwen3.5-2B-GGUF shows more consistent performance across different prompt lengths
   - 2B: lower standard deviation (0.19-0.45)
   - 4B: higher variance at shorter prompts (1.55 std dev at 64 tokens)

3. **Prompt Length Impact:** Both models show decreased performance with longer prompts
   - 2B: 17% drop from 64 to 256 tokens
   - 4B: 18% drop from 64 to 256 tokens

4. **Latency:** Qwen3.5-4B-GGUF has lower mean latency at 128-token prompts
   - 2B: 5.90s average
   - 4B: 4.31s average

---
## Test Configuration

- **Iterations:** 5 benchmark runs + 1 warmup
- **Output Tokens:** 32 tokens per run
- **Prompt Lengths:** 64, 128, 256 tokens
- **Backend:** llamacpp (GPU)
- **Quantization:** Q4_K_XL (UD)

---
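The diff does not include the body of `scripts/benchmark_qwen.py`, but the configuration above implies a timing loop of roughly this shape. This is a minimal sketch: the `generate` callable is a stand-in for the real lemonade-server request, and the function names are assumptions, not code taken from the repository.

```python
import statistics
import time

def benchmark(generate, prompt, iterations=5, warmup=1, output_tokens=32):
    """Time `generate` repeatedly and summarize tokens/second per run.

    `generate(prompt, max_tokens)` must return the number of tokens it
    produced; here it stands in for the actual model request.
    """
    for _ in range(warmup):                      # warmup runs, discarded
        generate(prompt, output_tokens)
    tps = []
    for _ in range(iterations):
        start = time.perf_counter()
        produced = generate(prompt, output_tokens)
        elapsed = time.perf_counter() - start
        tps.append(produced / elapsed)           # tokens per second this run
    return {"mean": statistics.mean(tps), "stdev": statistics.stdev(tps)}
```

Running this once per model and prompt length (64, 128, 256) would yield the mean/std-dev cells reported in the tables above.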
## Database Records

Results have been imported into the Lemonade Eval Dashboard database:

- **Models Created:** Qwen3.5-2B-GGUF, Qwen3.5-4B-GGUF
- **Runs Created:** 2 benchmark runs
- **Metrics Recorded:** 12 total metrics (6 per model)

### Database IDs

- Qwen3.5-2B-GGUF Model ID: `3d32510e-27d8-4ce1-93fc-5b59ddc2e343`
- Qwen3.5-4B-GGUF Model ID: `cc182c7f-1c76-489c-891a-b44230352ec9`

---
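A direct import along the lines of `scripts/import_benchmarks_direct.py` could look like the sketch below. The table and column names here are illustrative assumptions; the dashboard's real schema is not shown in this diff.

```python
import sqlite3
import uuid

# Illustrative schema only -- the dashboard's actual tables and columns
# are assumptions, not taken from this diff.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (id TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE metrics (model_id TEXT, name TEXT, value REAL)")

# One row per model, one metric row per (prompt length, TPS) pair.
model_id = str(uuid.uuid4())
conn.execute("INSERT INTO models VALUES (?, ?)", (model_id, "Qwen3.5-2B-GGUF"))
for prompt_len, tps in [(64, 5.32), (128, 5.20), (256, 4.43)]:
    conn.execute("INSERT INTO metrics VALUES (?, ?, ?)",
                 (model_id, f"tps_{prompt_len}", tps))
conn.commit()
```

Parameterized queries (the `?` placeholders) keep the insert safe regardless of what appears in model names or metric labels.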
## Recommendations

### For High-Throughput Applications

**Choose Qwen3.5-2B-GGUF**

- Better sustained token generation rates
- More consistent performance across prompt lengths
- Lower memory footprint

### For Accuracy-Critical Applications

**Consider Qwen3.5-4B-GGUF**

- The larger model may have better reasoning capabilities
- Lower latency at medium prompt lengths (4.31s vs 5.90s at 128 tokens)
- Trade-off between speed and potential accuracy

---
## Files Generated

1. `benchmark_results.json` - Raw benchmark data
2. `scripts/benchmark_qwen.py` - Benchmark script
3. `scripts/import_benchmarks_direct.py` - Database import script
4. `BENCHMARK_COMPARISON_REPORT.md` - This report

---

*Report generated by Lemonade Eval Dashboard Benchmarking System*

SYSTEM_INFO_AND_MATRIX.md

Lines changed: 132 additions & 0 deletions
# System Information and Benchmark Matrix

## Hardware Configuration

| Component | Specification |
|-----------|---------------|
| Platform | Windows 11 (10.0.26200) |
| Processor | AMD Ryzen AI (Family 26, Model 36) |
| CPU Cores | 24 cores |
| RAM | 79.62 GB |
| Python Version | 3.12.11 |

## Software Stack

| Component | Version/Details |
|-----------|-----------------|
| lemonade-server | 10.0.0 |
| Backend | llamacpp (vulkan) |
| Quantization | Q4_K_XL (UD) |
| GPU Backend | Vulkan |
## Model Benchmark Matrix

### Token Generation Speed (tokens/second)

| Model | 64-token prompt | 128-token prompt | 256-token prompt | Best |
|-------|-----------------|------------------|------------------|------|
| Qwen3.5-2B-GGUF | 5.32 ± 0.43 | 5.20 ± 0.19 | 4.43 ± 0.45 | **5.32** |
| Qwen3.5-4B-GGUF | 4.67 ± 1.55 | 3.53 ± 0.54 | 3.82 ± 0.60 | 4.67 |
| **Advantage** | **+12.3%** | **+31.8%** | **+14.0%** | **2B wins** |

### Latency (Time to First Token, seconds)

| Model | 64-token prompt | 128-token prompt | 256-token prompt | Best |
|-------|-----------------|------------------|------------------|------|
| Qwen3.5-2B-GGUF | 5.42 ± 0.56 | 5.90 ± 0.24 | 6.43 ± 0.69 | 5.42s |
| Qwen3.5-4B-GGUF | 4.99 ± 1.32 | 4.31 ± 0.34 | 6.30 ± 0.74 | **4.31s** |
| **Advantage** | 4B faster | **4B faster** | Similar | **4B wins** |

### Performance per Billion Parameters

| Model | Parameters | Best TPS | TPS per Billion |
|-------|------------|----------|-----------------|
| Qwen3.5-2B-GGUF | ~2B | 5.32 | **2.66 tok/s/B** |
| Qwen3.5-4B-GGUF | ~4B | 4.67 | 1.17 tok/s/B |
| **Efficiency** | | | **2B is 2.3x more efficient** |
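The efficiency row follows directly from dividing each model's best TPS by its approximate parameter count:

```python
def tps_per_billion(best_tps, params_billion):
    """Parameter efficiency as reported in the matrix above."""
    return best_tps / params_billion

eff_2b = tps_per_billion(5.32, 2.0)   # 2.66 tok/s/B
eff_4b = tps_per_billion(4.67, 4.0)   # ~1.17 tok/s/B
ratio = eff_2b / eff_4b               # ~2.3x in favor of the 2B model
```

Note the parameter counts are nominal (~2B, ~4B), so the 2.3x figure is an approximation.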
## Performance Charts

### Token Generation Speed Comparison

```
Tokens/second (higher is better)

Qwen3.5-2B-GGUF ████████████████████████████████████████ 5.32
Qwen3.5-4B-GGUF ████████████████████████████████         4.67
                0         2         4         6
```

### Performance by Prompt Length

```
TPS by Prompt Length

64 tokens:
  2B ████████████████████████████████████████ 5.32
  4B ████████████████████████████████ 4.67

128 tokens:
  2B ██████████████████████████████████████ 5.20
  4B ██████████████████████████ 3.53

256 tokens:
  2B ████████████████████████████████ 4.43
  4B ██████████████████████████████ 3.82
```
## Model Characteristics

| Attribute | Qwen3.5-2B-GGUF | Qwen3.5-4B-GGUF |
|-----------|-----------------|-----------------|
| Model Size | 1.34 GB | 2.91 GB |
| Quantization | Q4_K_XL | Q4_K_XL |
| Family | Qwen | Qwen |
| Type | LLM (Vision-capable) | LLM (Vision-capable) |
| Best Use Case | High-throughput | Better reasoning |

## Database Records

| Table | Records Created | IDs |
|-------|-----------------|-----|
| models | 2 | 3d32510e..., cc182c7f... |
| runs | 2 | Benchmark runs |
| metrics | 12 | 6 per model |
## Benchmark Methodology

### Test Parameters

| Parameter | Value |
|-----------|-------|
| Iterations | 5 |
| Warmup Runs | 1 |
| Output Tokens | 32 |
| Prompt Lengths | 64, 128, 256 tokens |
| Backend | llamacpp (GPU/Vulkan) |

### Metrics Collected

- **TPS (Tokens Per Second):** Primary throughput metric
- **TTFT (Time To First Token):** Latency metric
- **Standard Deviation:** Consistency measure
- **Min/Max Values:** Performance bounds
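The aggregation behind these metrics is standard; a minimal sketch (the exact code in `scripts/benchmark_qwen.py` is not shown in this diff, so the function name and dict layout are assumptions):

```python
import statistics

def collect_metrics(samples):
    """Reduce raw per-run readings (e.g. TPS or TTFT values)
    to the summary statistics listed above."""
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),  # sample std dev (n - 1)
        "min": min(samples),
        "max": max(samples),
    }
```

With 5 iterations per configuration, `statistics.stdev` (the n − 1 sample estimator) is the appropriate choice over the population `pstdev`.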
## Files Created

| File | Purpose |
|------|---------|
| `scripts/benchmark_qwen.py` | Benchmark execution script |
| `scripts/import_benchmarks.py` | Dashboard API import |
| `scripts/import_benchmarks_direct.py` | Direct DB import |
| `benchmark_results.json` | Raw benchmark data |
| `BENCHMARK_COMPARISON_REPORT.md` | Detailed report |
| `SYSTEM_INFO_AND_MATRIX.md` | This document |
## Conclusions

1. **Speed Winner:** Qwen3.5-2B-GGUF delivers 12% better throughput
2. **Latency Winner:** Qwen3.5-4B-GGUF has lower latency at medium prompt lengths
3. **Efficiency Winner:** Qwen3.5-2B-GGUF is 2.3x more parameter-efficient
4. **Recommendation:** Use 2B for high-throughput workloads, 4B for complex reasoning tasks

---

*Generated: April 7, 2026*
*Lemonade Eval Dashboard Benchmarking System*
