Speculative Decoding Reduces tok/sec Speed on M1 Pro Mac #103

Open
@ibehnam

Description

Using the LM Studio client, I tried the following model combinations. In every case, enabling the draft model made generation slower:

  • Llama 3.1 8B Q4 MLX (main) + Llama 3.2 1B Q4 MLX (draft): 38 t/s without the draft model, 33.93 t/s with it.
  • Llama 3.1 8B Q4 MLX (main) + Llama 3.2 3B Q4 MLX (draft, i.e. a smaller size gap to the main model): 38.7 t/s without the draft model, 29.29 t/s with it.
  • Qwen 2.5 7B Q4 MLX (main) + Qwen 2.5 1B Q4 MLX (draft): 37.08 t/s without the draft model, 22.54 t/s with it.

Hardware: MacBook Pro with an M1 Pro chip and 32 GB of unified memory. I saw the same slowdown with GGUF models as well.
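
To put the numbers above in relative terms, here is a minimal sketch that computes the percentage drop in tokens/s for each pair. The dictionary labels and the `slowdown_pct` helper are made up for illustration; only the throughput figures come from the measurements listed above.

```python
# Throughput in tokens/s, copied from the measurements above:
# (without draft model, with draft model).
runs = {
    "Llama 3.1 8B + Llama 3.2 1B draft": (38.0, 33.93),
    "Llama 3.1 8B + Llama 3.2 3B draft": (38.7, 29.29),
    "Qwen 2.5 7B + Qwen 2.5 1B draft": (37.08, 22.54),
}


def slowdown_pct(baseline: float, with_draft: float) -> float:
    """Percent drop in tokens/s relative to the no-draft baseline."""
    return (baseline - with_draft) / baseline * 100.0


for name, (baseline, with_draft) in runs.items():
    print(f"{name}: {slowdown_pct(baseline, with_draft):.1f}% slower")
```

That works out to roughly an 11%, 24%, and 39% drop, respectively.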

Labels: enhancement (New feature or request)