Speculative Decoding Reduces tok/sec Speed on M1 Pro Mac

Using the LM Studio client, I tried the following combinations:

- Llama 3.1 8B Q4 MLX (main) + Llama 3.2 1B Q4 MLX (draft): speed went from 38 t/s (without draft) to 33.93 t/s (with draft)!
- Llama 3.1 8B Q4 MLX (main) + Llama 3.2 3B Q4 MLX (draft, a smaller size gap with the main model): 38.7 t/s to 29.29 t/s
- Qwen 2.5 7B Q4 MLX (main) + Qwen 2.5 1B Q4 MLX (draft): 37.08 t/s to 22.54 t/s

Hardware: MacBook Pro with M1 Pro chip, 32 GB unified memory. I reproduced the same results with GGUF models as well.
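
To put the slowdown in relative terms, here is a small sketch that computes the percent change in throughput for each pairing from the t/s figures listed above. The labels and the `relative_change` helper are just illustrative names, and the numbers are the single reported measurements, not averages over multiple runs. By this calculation the drops come out to roughly 11%, 24%, and 39%.

```python
# (label, tokens/sec without draft, tokens/sec with draft), copied from the list above.
# Labels are illustrative; values are the single reported measurements.
runs = [
    ("Llama 3.1 8B + Llama 3.2 1B draft", 38.0, 33.93),
    ("Llama 3.1 8B + Llama 3.2 3B draft", 38.7, 29.29),
    ("Qwen 2.5 7B + Qwen 2.5 1B draft", 37.08, 22.54),
]


def relative_change(baseline: float, with_draft: float) -> float:
    """Percent change in throughput relative to the no-draft baseline."""
    return (with_draft - baseline) / baseline * 100.0


changes = {label: relative_change(base, draft) for label, base, draft in runs}

for label, pct in changes.items():
    # A negative value means speculative decoding made generation slower.
    print(f"{label}: {pct:+.1f}%")
```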

Speculative Decoding Reduces tok/sec Speed on M1 Pro Mac #103

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Speculative Decoding Reduces tok/sec Speed on M1 Pro Mac #103

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions