
Thread-local generation stream (port mlx-lm#1090) #1050

Merged

Blaizzy merged 3 commits into main from pc/thread-local-generation-stream on Apr 24, 2026

Conversation

@Blaizzy (Owner) commented Apr 22, 2026

Summary

  • Ports the thread-local generation stream changes from mlx-lm#1090 into mlx-vlm.
  • Module-level generation_stream switched to mx.new_thread_local_stream(mx.default_device()); the new shape is sketched after this list.
  • BatchGenerator now accepts a stream= kwarg (other args made keyword-only) and routes wired_limit, remove(), and next() through self._stream; exposes a .stream property.
  • server.py: drops the module-level import and creates a local mx.default_stream(mx.default_device()) inside _run() and _run_speculative(), passing it to BatchGenerator(stream=...) so generation and synchronization run on the generator thread's default stream.
  • Bumps mlx>=0.31.2 and mlx-lm>=0.31.3 in requirements.txt (new_thread_local_stream requires MLX core 0.31.2).
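
In rough terms, the new wiring looks like the sketch below. This is illustrative only: it assumes the names mentioned above (generation_stream, BatchGenerator, stream=) and elides the real constructor arguments and decode logic.

```python
import mlx.core as mx

# Thread-local module stream: each thread that touches generation_stream gets
# a stream bound to that thread (needs mlx >= 0.31.2 per the PR description).
generation_stream = mx.new_thread_local_stream(mx.default_device())


class BatchGenerator:
    # Illustrative signature only; the real class takes more arguments.
    def __init__(self, model, processor, *, stream=None, **kwargs):
        # Fall back to the thread-local stream when no explicit stream is given.
        self._stream = stream if stream is not None else generation_stream

    @property
    def stream(self):
        return self._stream

    def next(self):
        # Generation work is routed through the generator's stream.
        with mx.stream(self._stream):
            ...  # decode step elided
```

The point of the stream= kwarg is that the caller (here, the server) decides which thread's stream the generator runs on, rather than having that fixed at module import time.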

Test plan

  • python -c "from mlx_vlm import generate, server" imports cleanly on mlx 0.31.2.
  • Run pytest mlx_vlm/tests/test_generate.py mlx_vlm/tests/test_batch_quantized_cache.py — BatchGenerator call sites already use kwargs past model, processor.
  • Start mlx_vlm.server and issue a multi-request load; confirm generation still completes and stays on a single stream (a rough concurrent-request snippet follows this list).
  • Exercise the speculative path (_run_speculative) with a draft model if available.
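
For the multi-request step, something like the following can serve as a quick smoke test. The host, port, route, and payload shape are assumptions about the local server setup, not part of this PR.

```python
# Hypothetical concurrent smoke test against a locally running mlx_vlm.server.
# Adjust URL and payload to match your actual server configuration.
import concurrent.futures
import json
import urllib.request

URL = "http://localhost:8080/generate"  # assumed host/port/route

def one_request(i):
    body = json.dumps({"prompt": f"Describe image {i}", "max_tokens": 32}).encode()
    req = urllib.request.Request(URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.status

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    statuses = list(pool.map(one_request, range(16)))
print(statuses)  # expect all 200s if the server kept up
```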

🤖 Generated with Claude Code

Switch generation_stream to mx.new_thread_local_stream and let
BatchGenerator accept a stream= kwarg, so the server can pass the
generator thread's default stream explicitly. Keeps generation and
synchronization on the same stream.

Requires mlx>=0.31.2 (for mx.new_thread_local_stream).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
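
The server-side wiring this commit describes might look roughly like the sketch below. Only BatchGenerator(stream=...), _run, and mx.default_stream(mx.default_device()) come from the PR description; the function body and remaining arguments are placeholders.

```python
import mlx.core as mx

def _run(model, processor, **kwargs):
    # Created inside the generator thread, so this is *that* thread's default
    # stream rather than a stream captured at module-import time.
    stream = mx.default_stream(mx.default_device())
    generator = BatchGenerator(model, processor, stream=stream, **kwargs)
    while True:
        batch = generator.next()  # decode step runs on the same stream
        if batch is None:
            break
        ...  # hand results back to the server loop (elided)
```
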
- Updated ResponseGenerator to load model resources in a dedicated thread, improving responsiveness.
- Introduced a wait_until_ready method to ensure the model is fully loaded before generating responses (sketched below).
- Added error handling for model loading failures, allowing for graceful degradation.
- Removed direct model loading from get_cached_model, streamlining the initialization process.

This change decouples model loading from response generation, so the server stays responsive while a model loads and load failures surface as errors rather than hangs.
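
A minimal sketch of the loading pattern this commit describes, assuming nothing beyond the names it mentions (ResponseGenerator, wait_until_ready); load_model here is a placeholder for whatever loader the server actually uses.

```python
import threading

class ResponseGenerator:
    def __init__(self, model_path):
        self._ready = threading.Event()
        self._load_error = None
        self._model = None
        # Load model resources on a dedicated thread so construction returns quickly.
        threading.Thread(target=self._load, args=(model_path,), daemon=True).start()

    def _load(self, model_path):
        try:
            self._model = load_model(model_path)  # placeholder loader
        except Exception as exc:
            # Record the failure so callers get an error instead of a hang.
            self._load_error = exc
        finally:
            self._ready.set()

    def wait_until_ready(self, timeout=None):
        if not self._ready.wait(timeout):
            raise TimeoutError("model did not finish loading in time")
        if self._load_error is not None:
            raise self._load_error
        return self._model
```
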
@Blaizzy merged commit 728fab1 into main on Apr 24, 2026
1 check passed
afanty2021 added a commit to afanty2021/mlx-vlm that referenced this pull request Apr 24, 2026
Merge changes from upstream:
- Blaizzy#1056: hunyuan_vl/gemma3n cache-offset optimization
- Blaizzy#1053: Fix DFlash speculative decoding (GPU hang, performance)
- Blaizzy#1050: Thread-local generation stream (port mlx-lm#1090)
- Blaizzy#1055: Close batch_generate/server decode gap + VLM fixes

Conflict resolution:
- requirements.txt: Mixed approach - mlx>=0.31.2 with transformers<5.4.0
  to maintain omlx compatibility while accepting mlx update

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Crash on mlx 0.31.2: 'There is no Stream(gpu, N) in current thread' when generate() runs in a worker thread
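
For context, a hypothetical minimal reproduction of that crash (not taken from the issue) is a stream obtained on the main thread being used from a worker thread on mlx 0.31.2:

```python
# Hypothetical repro sketch: a stream obtained on the main thread is used from
# a worker thread; on mlx 0.31.2 this is expected to fail with the error above.
import threading
import mlx.core as mx

main_stream = mx.default_stream(mx.default_device())

def worker():
    with mx.stream(main_stream):      # stream belongs to the main thread
        mx.eval(mx.zeros((4, 4)))     # "There is no Stream(gpu, N) in current thread"

t = threading.Thread(target=worker)
t.start()
t.join()
```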
