# Continuous Batching

Continuous batching enables higher throughput when serving multiple concurrent users.

## Enabling Continuous Batching

```bash
vllm-mlx serve mlx-community/Qwen3-0.6B-8bit --continuous-batching
```
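Once the server is up, a quick request confirms it is responding. This is a minimal sketch assuming vllm-mlx exposes a vLLM-style OpenAI-compatible endpoint at `http://localhost:8000/v1` (the port used in the production example below); adjust the URL if you serve on a different port.

```python
# Minimal smoke test for the server started above.
# Assumes an OpenAI-compatible endpoint at http://localhost:8000/v1.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "mlx-community/Qwen3-0.6B-8bit",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```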

### With Paged Cache

For memory-efficient prefix sharing:

```bash
vllm-mlx serve mlx-community/Qwen3-0.6B-8bit --continuous-batching --use-paged-cache
```

## How It Works

### Simple Mode (Default)

- One request at a time
- Maximum throughput for a single user
- No overhead from batching

### Continuous Batching Mode

- Multiple requests processed together
- Better throughput for concurrent users
- Small overhead per request

### Paged Cache

- KV cache stored in fixed-size blocks
- Requests with the same system prompt reuse the same blocks (see the sketch below)
- Memory savings: 80%+ for 10+ concurrent users
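The block-sharing idea behind the paged cache can be sketched in a few lines. This is an illustrative model only, not the vllm-mlx implementation; the 64-token block size, class names, and ref-counting details are assumptions.

```python
# Illustrative sketch of paged-cache block sharing (not the actual vllm-mlx code).
# The KV cache is split into fixed-size blocks; requests with identical prefixes
# map to the same blocks, so only a reference count is bumped instead of
# recomputing and storing a duplicate cache.
from dataclasses import dataclass

BLOCK_SIZE = 64  # tokens per block (assumed value)

@dataclass
class Block:
    token_ids: tuple
    ref_count: int = 0

class PagedPrefixCache:
    def __init__(self) -> None:
        self.blocks: dict[tuple, Block] = {}  # block contents -> shared Block

    def allocate(self, token_ids: list[int]) -> list[Block]:
        """Return the blocks covering token_ids, reusing existing ones."""
        blocks = []
        for start in range(0, len(token_ids), BLOCK_SIZE):
            key = tuple(token_ids[start:start + BLOCK_SIZE])
            block = self.blocks.setdefault(key, Block(key))
            block.ref_count += 1  # a shared system prompt hits this path
            blocks.append(block)
        return blocks

    def release(self, blocks: list[Block]) -> None:
        """Drop one request's references; free blocks no one else uses."""
        for block in blocks:
            block.ref_count -= 1
            if block.ref_count == 0:
                self.blocks.pop(block.token_ids, None)
```

With 64-token blocks, two requests sharing the same 500-token system prompt would map to the same 8 blocks; only the reference counts differ, which is where the memory savings come from.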

## Performance Results

Continuous Batching Results (M4 Max, 128GB):

| Model | Single Request | Batch (5 req) | Speedup |
|---|---|---|---|
| Llama-3.2-1B-Instruct-4bit | 299.1 tok/s | 613.0 tok/s | 2.05x |
| Llama-3.2-3B-Instruct-4bit | 137.6 tok/s | 208.1 tok/s | 1.51x |
| Qwen3-0.6B-8bit | 328.1 tok/s | 1111.8 tok/s | 3.39x |
| Qwen3-30B-A3B-4bit | 98.1 tok/s | 233.3 tok/s | 2.38x |
| Qwen2.5-1.5B-Instruct-4bit | 196.9 tok/s | 322.2 tok/s | 1.64x |

Batching 5 concurrent requests yields a 1.5-3.4x throughput improvement, depending on the model.

## Streaming Performance

Streaming Performance (M4 Max, 128GB):

| Model | TTFT | Generation Speed |
|---|---|---|
| Llama-3.2-1B-Instruct-4bit | ~4.6ms | 218.9 tok/s |
| Llama-3.2-3B-Instruct-4bit | ~10.7ms | 93.6 tok/s |
| Qwen3-0.6B-8bit | ~3.0ms | 328.5 tok/s |
| Qwen3-30B-A3B-4bit | ~10.2ms | 98.4 tok/s |
| Qwen2.5-1.5B-Instruct-4bit | ~7.1ms | 140.3 tok/s |

TTFT = Time to First Token

## Streaming Configuration

Control token delivery with `--stream-interval`:

```bash
# Every token (smoothest)
vllm-mlx serve model --continuous-batching --stream-interval 1

# Batch tokens (better for high-latency networks)
vllm-mlx serve model --continuous-batching --stream-interval 5
```

| Value | Behavior |
|---|---|
| 1 | Send every token immediately |
| 2-5 | Batch tokens before sending |
| 10+ | Maximum throughput, chunkier output |
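On the client side, the stream is consumed the same way regardless of `--stream-interval`. A sketch using the `openai` Python package, assuming the server exposes an OpenAI-compatible endpoint at `http://localhost:8000/v1` (the API key is a placeholder; vllm-mlx may not require one):

```python
# Stream tokens from the server as they arrive.
# Assumes an OpenAI-compatible endpoint; the api_key value is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="mlx-community/Qwen3-0.6B-8bit",
    messages=[{"role": "user", "content": "Write a haiku about batching."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```

With `--stream-interval 1` each printed delta is a single token; larger values deliver the same text in fewer, larger chunks.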

## Memory Management

For large models, the prefix cache can consume significant memory. The memory-aware cache automatically manages this:

```bash
# Auto-detect (uses 20% of available RAM)
vllm-mlx serve model --continuous-batching

# Explicit limit
vllm-mlx serve model --continuous-batching --cache-memory-mb 2048

# Custom percentage
vllm-mlx serve model --continuous-batching --cache-memory-percent 0.10
```

| Option | Description |
|---|---|
| `--cache-memory-mb` | Set explicit limit in MB |
| `--cache-memory-percent` | Fraction of available RAM (default: 0.20) |
| `--no-memory-aware-cache` | Use legacy entry-count based cache |
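The eviction behavior these options control can be pictured with a small LRU sketch. This is illustrative only, not the vllm-mlx implementation; the class name and byte-accounting details are assumptions.

```python
# Illustrative memory-aware LRU cache (not the actual vllm-mlx code).
# Entries are tracked by size in bytes; once the configured budget is exceeded
# (e.g. 20% of available RAM, or an explicit --cache-memory-mb value),
# the least-recently-used entries are evicted first.
from collections import OrderedDict

class MemoryAwareCache:
    def __init__(self, budget_bytes: int) -> None:
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self.entries: "OrderedDict[str, tuple[object, int]]" = OrderedDict()

    def put(self, key: str, value: object, size_bytes: int) -> None:
        if key in self.entries:
            self.used_bytes -= self.entries.pop(key)[1]
        self.entries[key] = (value, size_bytes)
        self.used_bytes += size_bytes
        # Evict the oldest entries until we are back under budget.
        while self.used_bytes > self.budget_bytes and self.entries:
            _, (_, evicted_size) = self.entries.popitem(last=False)
            self.used_bytes -= evicted_size

    def get(self, key: str):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key][0]

# Example budget: an explicit 2048 MB limit, matching --cache-memory-mb 2048.
cache = MemoryAwareCache(budget_bytes=2048 * 1024 * 1024)
```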

## Prefix Cache

Prefix caching reuses KV cache for repeated prompts.

### How It Works

```text
User 1: System prompt (500 tokens) → Creates 8 blocks
User 2: Same system prompt → Shares 8 blocks (ref_count++)
User N: Same system prompt → Shares 8 blocks (ref_count++)
```

Memory savings: 80%+ for 10+ concurrent users

### Cache Key Strategy

- LLM: `hash(prompt)`
- Images: `hash(image_content) + hash(prompt)`
- Videos: `hash(video_path) + hash(fps) + hash(max_frames) + hash(prompt)`
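A sketch of how such keys could be derived, shown for illustration; the actual hash function and field ordering in vllm-mlx may differ.

```python
# Illustrative cache-key derivation following the strategy listed above
# (the real implementation's hash function and field order may differ).
import hashlib

def _h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def llm_key(prompt: str) -> str:
    return _h(prompt.encode())

def image_key(image_content: bytes, prompt: str) -> str:
    return _h(image_content) + _h(prompt.encode())

def video_key(video_path: str, fps: float, max_frames: int, prompt: str) -> str:
    return (
        _h(video_path.encode())
        + _h(str(fps).encode())
        + _h(str(max_frames).encode())
        + _h(prompt.encode())
    )
```

Identical prompts (and identical media parameters) produce identical keys, which is what makes the cache hits in the test below deterministic.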

### Testing Prefix Cache

```text
python tests/test_prefix_cache.py
======================================================================
  LLM PREFIX CACHE TEST
======================================================================
  Model: mlx-community/Qwen3-0.6B-8bit
  Expected behavior:
    - Same prompt → cache HIT
    - Different prompt → cache MISS or PREFIX_HIT (shared template tokens)
----------------------------------------------------------------------
  Results:
  Step   | Description         | Expected | Actual | Status
  -------+---------------------+----------+--------+-------
  1a     | First request       | MISS     | MISS   | PASS
  1b     | Same prompt         | HIT      | HIT    | PASS
  1c     | Different prompt    | MISS     | MISS   | PASS
  1d     | Return to prompt 1  | HIT      | HIT    | PASS
======================================================================
```

## Running Benchmarks

```bash
# Continuous batching benchmark
python tests/test_continuous_batching.py

# Prefix cache test
python tests/test_prefix_cache.py
```

## When to Use

| Scenario | Mode |
|---|---|
| Single user, maximum speed | Simple (default) |
| Multiple concurrent users | `--continuous-batching` |
| Large models (7B+) | `--continuous-batching --cache-memory-mb 2048` |
| Production with shared prompts | `--continuous-batching --use-paged-cache` |

## Production Setup

```bash
vllm-mlx serve mlx-community/Qwen3-0.6B-8bit \
  --continuous-batching \
  --use-paged-cache \
  --port 8000
```
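Continuous batching pays off when several clients hit the server at once. A hedged example that fires five concurrent requests with `asyncio` and the `openai` async client, reusing a shared system prompt so the paged cache can share blocks (endpoint, port, and placeholder API key assumed to match the command above):

```python
# Fire several concurrent requests at the server started above.
# Assumes an OpenAI-compatible endpoint; the api_key value is a placeholder.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

SYSTEM = "You are a concise assistant."  # shared prefix -> shared cache blocks

async def ask(question: str) -> str:
    resp = await client.chat.completions.create(
        model="mlx-community/Qwen3-0.6B-8bit",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": question},
        ],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main() -> None:
    questions = [f"Give me one interesting fact about planet #{i}." for i in range(1, 6)]
    answers = await asyncio.gather(*(ask(q) for q in questions))
    for answer in answers:
        print(answer, "\n---")

asyncio.run(main())
```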