
Feature/kv cache quantization#278

Open
cubist38 wants to merge 3 commits into main from feature/kv-cache-quantization

Conversation

cubist38 (Owner) commented Apr 8, 2026

feat: add KV cache quantization support for LM and multimodal models

Closes #276

Summary

  • Add --kv-bits, --kv-group-size, and --quantized-kv-start CLI options and matching YAML config fields to enable KV cache quantization (including TurboQuant) for the lm and multimodal model types
  • Thread the parameters through config, handlers, and model wrappers down to the underlying generate_step call in both mlx-lm and mlx-vlm
  • Validate the settings and warn when KV cache quantization is requested for unsupported model types (e.g., embeddings, whisper)
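The validation described in the last bullet could look roughly like the sketch below. The class and field names mirror the ones mentioned in this PR, but the body is a hypothetical illustration of the warn-and-ignore behavior, not the actual app/config.py code; the set of supported model types is an assumption based on the summary above.

```python
import warnings
from dataclasses import dataclass
from typing import Optional

# Assumed: only these model types support KV cache quantization (per the PR summary).
KV_QUANT_MODEL_TYPES = {"lm", "multimodal"}

@dataclass
class ModelEntryConfig:
    model_path: str
    model_type: str = "lm"
    kv_bits: Optional[int] = None      # e.g. 4 or 8; None disables quantization
    kv_group_size: int = 64            # group size for the quantized cache
    quantized_kv_start: int = 0        # cache offset at which quantization begins

    def __post_init__(self) -> None:
        # Warn and drop the setting for unsupported model types (e.g. embeddings, whisper).
        if self.kv_bits is not None and self.model_type not in KV_QUANT_MODEL_TYPES:
            warnings.warn(
                f"KV cache quantization is not supported for model_type="
                f"{self.model_type!r}; ignoring kv_bits={self.kv_bits}"
            )
            self.kv_bits = None
```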

Usage

CLI

mlx-openai-server launch --model-path <model> --kv-bits 4 --kv-group-size 64 --quantized-kv-start 0

YAML config

models:
  - model_path: mlx-community/Qwen3-VL-2B-Thinking-8bit
    model_type: multimodal
    kv_bits: 4
    kv_group_size: 64
    quantized_kv_start: 0

Changed files

  • app/cli.py — New --kv-bits, --kv-group-size, --quantized-kv-start CLI options
  • app/config.py — New fields on MLXServerConfig and ModelEntryConfig with validation
  • app/server.py — Pass KV params to LM and VLM handler constructors
  • app/handler/mlx_lm.py — Accept, store, and include KV params in model_params
  • app/handler/mlx_vlm.py — Accept, store, and include KV params in model_params
  • app/models/mlx_lm.py — Forward KV params to stream_generate
  • app/models/mlx_vlm.py — Forward KV params to stream_generate
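The last two files forward the parameters to stream_generate. A minimal sketch of that plumbing is shown below; the keyword names match the ones this PR threads through, but the helper itself is hypothetical and not taken from the PR's model wrappers. Omitting the kwargs entirely when kv_bits is unset keeps the default (unquantized) behavior of the underlying generator.

```python
def build_generation_kwargs(kv_bits=None, kv_group_size=64, quantized_kv_start=0, **extra):
    """Collect KV cache quantization kwargs for stream_generate,
    omitting them entirely when quantization is disabled."""
    kwargs = dict(extra)
    if kv_bits is not None:
        kwargs.update(
            kv_bits=kv_bits,
            kv_group_size=kv_group_size,
            quantized_kv_start=quantized_kv_start,
        )
    return kwargs

# A wrapper would then call something like:
#   stream_generate(model, tokenizer, prompt, **build_generation_kwargs(kv_bits=4))
```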

cubist38 added 3 commits on April 8, 2026 at 09:43

- Introduced kv_bits, kv_group_size, and quantized_kv_start parameters to control KV cache quantization.
- Enhanced stream_generate calls in both models to use the new quantization settings.

- Added kv_bits, kv_group_size, and quantized_kv_start parameters to the handlers to enable KV cache quantization.
- Updated initialization and request handling to incorporate the new quantization settings for both models.

- Added CLI options for kv_bits, kv_group_size, and quantized_kv_start to give users control over KV cache quantization.
- Updated MLXServerConfig and ModelEntryConfig with the new parameters and validation for model type compatibility.
- Modified the server handler to apply the quantization settings during model configuration.


Development

Successfully merging this pull request may close these issues.

Turboquant cache for multimodal
