Description
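Using the LM Studio client, I tried the following main/draft model combinations for speculative decoding. In every case, generation was slower with the draft model than without it: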
- Llama 3.1 8B Q4 MLX (main) + Llama 3.2 1B Q4 MLX (draft): 38 t/s without the draft, 33.93 t/s with it
- Same main model with Llama 3.2 3B Q4 MLX as the draft (a smaller size gap to the main model): 38.7 t/s to 29.29 t/s
- Qwen 2.5 7B Q4 MLX (main) + Qwen 2.5 1B Q4 MLX (draft): 37.08 t/s to 22.54 t/s
Hardware: MacBook Pro with M1 Pro chip, 32 GB unified memory. I reproduced the same results with GGUF models as well.
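To put the regressions in relative terms, here is a small Python sketch that computes the percentage change in throughput for each run reported above. It only restates those measurements and does not call LM Studio; the helper names `percent_change` and `summarize` are hypothetical.

```python
# Throughput numbers from the runs above: (pair, t/s without draft, t/s with draft).
RUNS = [
    ("Llama 3.1 8B + Llama 3.2 1B", 38.0, 33.93),
    ("Llama 3.1 8B + Llama 3.2 3B", 38.7, 29.29),
    ("Qwen 2.5 7B + Qwen 2.5 1B", 37.08, 22.54),
]


def percent_change(baseline: float, with_draft: float) -> float:
    """Relative throughput change in percent; negative means the draft slowed generation."""
    return (with_draft - baseline) / baseline * 100.0


def summarize(runs):
    """Return (pair, change rounded to one decimal place) for each run."""
    return [(name, round(percent_change(base, draft), 1)) for name, base, draft in runs]


if __name__ == "__main__":
    for name, change in summarize(RUNS):
        print(f"{name}: {change:+.1f}%")
```

All three changes come out negative: roughly -10.7% for the Llama 1B draft, -24.3% for the Llama 3B draft, and -39.2% for the Qwen pair.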