
Thread-local generation stream (port mlx-lm#1090) #1050

Merged

Blaizzy merged 3 commits into main from pc/thread-local-generation-stream on Apr 24, 2026

Conversation

@Blaizzy (Owner) commented Apr 22, 2026

Summary

  • Ports the thread-local generation stream changes from mlx-lm#1090 into mlx-vlm.
  • Module-level generation_stream switched to mx.new_thread_local_stream(mx.default_device()); the new shape is sketched after this list.
  • BatchGenerator now accepts a stream= kwarg (other args made keyword-only) and routes wired_limit, remove(), and next() through self._stream; exposes a .stream property.
  • server.py: drops the module-level import and creates a local mx.default_stream(mx.default_device()) inside _run() and _run_speculative(), passing it to BatchGenerator(stream=...) so generation and synchronization run on the generator thread's default stream.
  • Bumps mlx>=0.31.2 and mlx-lm>=0.31.3 in requirements.txt (new_thread_local_stream requires MLX core 0.31.2).
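
In rough terms, the new wiring looks like the sketch below. This is illustrative only: it assumes the names mentioned above (generation_stream, BatchGenerator, stream=) and elides the real constructor arguments and decode logic.

```python
import mlx.core as mx

# Thread-local module stream: each thread that touches generation_stream gets
# a stream bound to that thread (needs mlx >= 0.31.2 per the PR description).
generation_stream = mx.new_thread_local_stream(mx.default_device())


class BatchGenerator:
    # Illustrative signature only; the real class takes more arguments.
    def __init__(self, model, processor, *, stream=None, **kwargs):
        # Fall back to the thread-local stream when no explicit stream is given.
        self._stream = stream if stream is not None else generation_stream

    @property
    def stream(self):
        return self._stream

    def next(self):
        # Generation work is routed through the generator's stream.
        with mx.stream(self._stream):
            ...  # decode step elided
```

The point of the stream= kwarg is that the caller (here, the server) decides which thread's stream the generator runs on, rather than having that fixed at module import time.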

Test plan

  • python -c "from mlx_vlm import generate, server" imports cleanly on mlx 0.31.2.
  • Run pytest mlx_vlm/tests/test_generate.py mlx_vlm/tests/test_batch_quantized_cache.py — BatchGenerator call sites already use kwargs past model, processor.
  • Start mlx_vlm.server and issue a multi-request load; confirm generation still completes and stays on a single stream (a rough concurrent-request snippet follows this list).
  • Exercise the speculative path (_run_speculative) with a draft model if available.
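
For the multi-request step, something like the following can serve as a quick smoke test. The host, port, route, and payload shape are assumptions about the local server setup, not part of this PR.

```python
# Hypothetical concurrent smoke test against a locally running mlx_vlm.server.
# Adjust URL and payload to match your actual server configuration.
import concurrent.futures
import json
import urllib.request

URL = "http://localhost:8080/generate"  # assumed host/port/route

def one_request(i):
    body = json.dumps({"prompt": f"Describe image {i}", "max_tokens": 32}).encode()
    req = urllib.request.Request(URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.status

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    statuses = list(pool.map(one_request, range(16)))
print(statuses)  # expect all 200s if the server kept up
```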

🤖 Generated with Claude Code

Switch generation_stream to mx.new_thread_local_stream and let
BatchGenerator accept a stream= kwarg, so the server can pass the
generator thread's default stream explicitly. Keeps generation and
synchronization on the same stream.

Requires mlx>=0.31.2 (for mx.new_thread_local_stream).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
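
The server-side wiring this commit describes might look roughly like the sketch below. Only BatchGenerator(stream=...), _run, and mx.default_stream(mx.default_device()) come from the PR description; the function body and remaining arguments are placeholders.

```python
import mlx.core as mx

def _run(model, processor, **kwargs):
    # Created inside the generator thread, so this is *that* thread's default
    # stream rather than a stream captured at module-import time.
    stream = mx.default_stream(mx.default_device())
    generator = BatchGenerator(model, processor, stream=stream, **kwargs)
    while True:
        batch = generator.next()  # decode step runs on the same stream
        if batch is None:
            break
        ...  # hand results back to the server loop (elided)
```
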
- Updated ResponseGenerator to load model resources in a dedicated thread, improving responsiveness.
- Introduced a wait_until_ready method to ensure the model is fully loaded before generating responses (sketched below).
- Added error handling for model loading failures, allowing for graceful degradation.
- Removed direct model loading from get_cached_model, streamlining the initialization process.

This change decouples model loading from response generation, so the server stays responsive while a model loads and load failures surface as errors rather than hangs.
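
A minimal sketch of the loading pattern this commit describes, assuming nothing beyond the names it mentions (ResponseGenerator, wait_until_ready); load_model here is a placeholder for whatever loader the server actually uses.

```python
import threading

class ResponseGenerator:
    def __init__(self, model_path):
        self._ready = threading.Event()
        self._load_error = None
        self._model = None
        # Load model resources on a dedicated thread so construction returns quickly.
        threading.Thread(target=self._load, args=(model_path,), daemon=True).start()

    def _load(self, model_path):
        try:
            self._model = load_model(model_path)  # placeholder loader
        except Exception as exc:
            # Record the failure so callers get an error instead of a hang.
            self._load_error = exc
        finally:
            self._ready.set()

    def wait_until_ready(self, timeout=None):
        if not self._ready.wait(timeout):
            raise TimeoutError("model did not finish loading in time")
        if self._load_error is not None:
            raise self._load_error
        return self._model
```
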
@Blaizzy merged commit 728fab1 into main on Apr 24, 2026
1 check passed
afanty2021 added a commit to afanty2021/mlx-vlm that referenced this pull request Apr 24, 2026
Merge changes from upstream:
- Blaizzy#1056: hunyuan_vl/gemma3n cache-offset optimization
- Blaizzy#1053: Fix DFlash speculative decoding (GPU hang, performance)
- Blaizzy#1050: Thread-local generation stream (port mlx-lm#1090)
- Blaizzy#1055: Close batch_generate/server decode gap + VLM fixes

Conflict resolution:
- requirements.txt: Mixed approach - mlx>=0.31.2 with transformers<5.4.0
  to maintain omlx compatibility while accepting mlx update

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Crash on mlx 0.31.2: 'There is no Stream(gpu, N) in current thread' when generate() runs in a worker thread
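
For context, a hypothetical minimal reproduction of that crash (not taken from the issue) is a stream obtained on the main thread being used from a worker thread on mlx 0.31.2:

```python
# Hypothetical repro sketch: a stream obtained on the main thread is used from
# a worker thread; on mlx 0.31.2 this is expected to fail with the error above.
import threading
import mlx.core as mx

main_stream = mx.default_stream(mx.default_device())

def worker():
    with mx.stream(main_stream):      # stream belongs to the main thread
        mx.eval(mx.zeros((4, 4)))     # "There is no Stream(gpu, N) in current thread"

t = threading.Thread(target=worker)
t.start()
t.join()
```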
