Thread-local generation stream (port mlx-lm#1090) #1050
Merged
Conversation
Switch `generation_stream` to `mx.new_thread_local_stream` and let `BatchGenerator` accept a `stream=` kwarg, so the server can pass the generator thread's default stream explicitly. This keeps generation and synchronization on the same stream. Requires mlx>=0.31.2 (for `mx.new_thread_local_stream`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
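The core idea, a default stream that is per-thread rather than global, can be sketched in pure Python. This is a minimal illustration using stand-ins: `Stream` and `default_stream()` below are stubs, not the real MLX API, and exist only to show why a thread-local default keeps each worker's generation on its own stream.

```python
import threading

# Stub stream type: a stand-in for an MLX stream, tagged with its owner.
class Stream:
    def __init__(self, owner: str):
        self.owner = owner

_local = threading.local()

def default_stream() -> Stream:
    # Lazily create one stream per thread, mimicking a thread-local
    # generation stream: each thread sees its own default.
    if not hasattr(_local, "stream"):
        _local.stream = Stream(threading.current_thread().name)
    return _local.stream

# Demonstrate that two threads observe different default streams.
streams = {}

def record(key):
    streams[key] = default_stream()

t = threading.Thread(target=record, args=("worker",))
t.start()
t.join()
record("main")
```

With a module-level global stream, both entries would alias the same object; with the thread-local version, `streams["worker"]` and `streams["main"]` are distinct, which is what lets the server pin a generator thread's work and synchronization to a single stream.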
- Updated ResponseGenerator to load model resources in a dedicated thread, improving responsiveness.
- Introduced a wait_until_ready method to ensure the model is fully loaded before generating responses.
- Added error handling for model loading failures, allowing for graceful degradation.
- Removed direct model loading from get_cached_model, streamlining the initialization process.

This change enhances the overall architecture by decoupling model loading from response generation, ensuring better performance and reliability.
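The loader-thread pattern described above can be sketched with `threading.Event`. The class and method names (`ResponseGenerator`, `wait_until_ready`) follow the changelog, but the body is a hypothetical sketch, not the project's actual implementation; `load_fn` stands in for whatever loads the model.

```python
import threading

class ResponseGenerator:
    """Loads the model in a dedicated thread so construction returns fast."""

    def __init__(self, load_fn):
        self._ready = threading.Event()
        self._error = None
        self.model = None
        self._thread = threading.Thread(
            target=self._load, args=(load_fn,), daemon=True
        )
        self._thread.start()

    def _load(self, load_fn):
        try:
            self.model = load_fn()
        except Exception as exc:
            # Record the failure instead of crashing the thread, so callers
            # can degrade gracefully.
            self._error = exc
        finally:
            self._ready.set()

    def wait_until_ready(self, timeout=None):
        # Block until loading finished (successfully or not).
        if not self._ready.wait(timeout):
            raise TimeoutError("model load did not finish in time")
        if self._error is not None:
            raise RuntimeError("model failed to load") from self._error
        return self.model
```

A caller constructs the generator immediately and calls `wait_until_ready()` only when a response is actually needed, which is what decouples loading from request handling.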
afanty2021 added a commit to afanty2021/mlx-vlm that referenced this pull request on Apr 24, 2026
Merge changes from upstream:
- Blaizzy#1056: hunyuan_vl/gemma3n cache-offset optimization
- Blaizzy#1053: Fix DFlash speculative decoding (GPU hang, performance)
- Blaizzy#1050: Thread-local generation stream (port mlx-lm#1090)
- Blaizzy#1055: Close batch_generate/server decode gap + VLM fixes

Conflict resolution:
- requirements.txt: mixed approach, mlx>=0.31.2 with transformers<5.4.0 to maintain omlx compatibility while accepting the mlx update

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- `generation_stream` switched to `mx.new_thread_local_stream(mx.default_device())`.
- `BatchGenerator` now accepts a `stream=` kwarg (other args made keyword-only) and routes `wired_limit`, `remove()`, and `next()` through `self._stream`; exposes a `.stream` property.
- `server.py`: drops the module-level import and creates a local `mx.default_stream(mx.default_device())` inside `_run()` and `_run_speculative()`, passing it to `BatchGenerator(stream=...)` so generation and synchronization run on the generator thread's default stream.
- `mlx>=0.31.2` and `mlx-lm>=0.31.3` in requirements.txt (`new_thread_local_stream` requires MLX core 0.31.2).

Test plan
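The server-side wiring in the summary, creating the stream inside the generator thread and handing it to the generator, can be sketched as follows. `Stream`, `default_stream()`, `BatchGenerator`, and `_run()` here are pure-Python stand-ins that only mirror the shape of the change, not the MLX or mlx-vlm APIs.

```python
import threading

class Stream:
    pass

def default_stream() -> Stream:
    # Stand-in for the thread's default stream.
    return Stream()

class BatchGenerator:
    # stream is keyword-only, mirroring the PR's signature change.
    def __init__(self, model, *, stream):
        self.model = model
        self._stream = stream

    @property
    def stream(self):
        return self._stream

def _run(model, out):
    # The stream is created INSIDE the generator thread, so it belongs to
    # that thread, then passed explicitly so generation and any later
    # synchronization use the same stream.
    stream = default_stream()
    gen = BatchGenerator(model, stream=stream)
    out["same_stream"] = gen.stream is stream

result = {}
t = threading.Thread(target=_run, args=("model", result))
t.start()
t.join()
```

The point of passing `stream=` explicitly rather than re-deriving a default inside the generator is that the server controls which thread's stream is used, so there is no ambiguity if generator methods are later invoked from another thread.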
- `python -c "from mlx_vlm import generate, server"` imports cleanly on mlx 0.31.2.
- `pytest mlx_vlm/tests/test_generate.py mlx_vlm/tests/test_batch_quantized_cache.py` — BatchGenerator call sites already use kwargs past `model, processor`.
- Launch `mlx_vlm.server` and issue a multi-request load; confirm generation still completes and stays on a single stream.
- Exercise the speculative path (`_run_speculative`) with a draft model if available.

🤖 Generated with Claude Code