
Feature/kv cache quantization#278

Open
cubist38 wants to merge 3 commits into main from feature/kv-cache-quantization

Conversation

cubist38 (Owner) commented Apr 8, 2026

feat: add KV cache quantization support for LM and multimodal models

Closes #276

Summary

  • Add --kv-bits, --kv-group-size, and --quantized-kv-start CLI options and matching YAML config fields to enable KV cache quantization (including TurboQuant) for the lm and multimodal model types
  • Thread the parameters through config, handlers, and model wrappers down to the underlying generate_step call in both mlx-lm and mlx-vlm
  • Validate the settings and warn when KV cache quantization is requested for unsupported model types (e.g., embeddings, whisper)
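The validation described in the last bullet could look roughly like the sketch below. The class and field names mirror the ones mentioned in this PR, but the body is a hypothetical illustration of the warn-and-ignore behavior, not the actual app/config.py code; the set of supported model types is an assumption based on the summary above.

```python
import warnings
from dataclasses import dataclass
from typing import Optional

# Assumed: only these model types support KV cache quantization (per the PR summary).
KV_QUANT_MODEL_TYPES = {"lm", "multimodal"}

@dataclass
class ModelEntryConfig:
    model_path: str
    model_type: str = "lm"
    kv_bits: Optional[int] = None      # e.g. 4 or 8; None disables quantization
    kv_group_size: int = 64            # group size for the quantized cache
    quantized_kv_start: int = 0        # cache offset at which quantization begins

    def __post_init__(self) -> None:
        # Warn and drop the setting for unsupported model types (e.g. embeddings, whisper).
        if self.kv_bits is not None and self.model_type not in KV_QUANT_MODEL_TYPES:
            warnings.warn(
                f"KV cache quantization is not supported for model_type="
                f"{self.model_type!r}; ignoring kv_bits={self.kv_bits}"
            )
            self.kv_bits = None
```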

Usage

CLI

mlx-openai-server launch --model-path <model> --kv-bits 4 --kv-group-size 64 --quantized-kv-start 0

YAML config

models:
  - model_path: mlx-community/Qwen3-VL-2B-Thinking-8bit
    model_type: multimodal
    kv_bits: 4
    kv_group_size: 64
    quantized_kv_start: 0

Changed files

  • app/cli.py — New --kv-bits, --kv-group-size, --quantized-kv-start CLI options
  • app/config.py — New fields on MLXServerConfig and ModelEntryConfig with validation
  • app/server.py — Pass KV params to LM and VLM handler constructors
  • app/handler/mlx_lm.py — Accept, store, and include KV params in model_params
  • app/handler/mlx_vlm.py — Accept, store, and include KV params in model_params
  • app/models/mlx_lm.py — Forward KV params to stream_generate
  • app/models/mlx_vlm.py — Forward KV params to stream_generate
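The last two files forward the parameters to stream_generate. A minimal sketch of that plumbing is shown below; the keyword names match the ones this PR threads through, but the helper itself is hypothetical and not taken from the PR's model wrappers. Omitting the kwargs entirely when kv_bits is unset keeps the default (unquantized) behavior of the underlying generator.

```python
def build_generation_kwargs(kv_bits=None, kv_group_size=64, quantized_kv_start=0, **extra):
    """Collect KV cache quantization kwargs for stream_generate,
    omitting them entirely when quantization is disabled."""
    kwargs = dict(extra)
    if kv_bits is not None:
        kwargs.update(
            kv_bits=kv_bits,
            kv_group_size=kv_group_size,
            quantized_kv_start=quantized_kv_start,
        )
    return kwargs

# A wrapper would then call something like:
#   stream_generate(model, tokenizer, prompt, **build_generation_kwargs(kv_bits=4))
```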

cubist38 added 3 commits on April 8, 2026 at 09:43

- Introduced kv_bits, kv_group_size, and quantized_kv_start parameters to control KV cache quantization.
- Enhanced stream_generate calls in both models to use the new quantization settings.

- Added kv_bits, kv_group_size, and quantized_kv_start parameters to the handlers to enable KV cache quantization.
- Updated initialization and request handling to incorporate the new quantization settings for both models.

- Added CLI options for kv_bits, kv_group_size, and quantized_kv_start to give users control over KV cache quantization.
- Updated MLXServerConfig and ModelEntryConfig with the new parameters and validation for model type compatibility.
- Modified the server handler to apply the quantization settings during model configuration.


Development

Successfully merging this pull request may close these issues.

Turboquant cache for multimodal
