Summary

Fixes #1044.

This adds OpenAI-compatible `stop` handling to the MLX-VLM server for `/chat/completions`, plus matching mlx-vlm compatibility support for `/responses`, so both server endpoints honor the same caller intent. The implementation normalizes `stop` as either a single string or a list of one to four non-empty strings, then incrementally filters decoded text so that stop sequences are trimmed even when they span token/chunk boundaries. It is wired through the continuous batching, speculative decoding, and non-batching fallback paths. When a server-side stop sequence is matched, generation is cancelled and removed from the active batch, and the response reports `finish_reason: "stop"`. When a requested stop sequence is never emitted before generation exhausts the token limit, chat completions report `finish_reason: "length"`.

I also fixed a related Responses streaming consistency issue discovered while testing: `response.completed.response.output_text` now matches the final trimmed `response.output_text.done` text.
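The boundary-spanning trim described above can be sketched as follows. This is illustrative code of mine, not the PR's actual implementation: the idea is to hold back any tail of the decoded text that is a prefix of some stop sequence until a later chunk either completes the stop or rules it out.

```python
class StopFilter:
    """Incrementally filter decoded text, trimming stop sequences even
    when they span chunk boundaries (illustrative sketch, not mlx-vlm code)."""

    def __init__(self, stops):
        self.stops = stops    # normalized list of non-empty stop strings
        self.buf = ""         # held-back text that might begin a stop
        self.stopped = False

    def feed(self, chunk):
        """Return the text that is safe to emit after seeing this chunk."""
        if self.stopped:
            return ""
        self.buf += chunk
        # A stop sequence is fully present: emit everything before the
        # earliest match and discard the rest.
        hits = [i for s in self.stops if (i := self.buf.find(s)) != -1]
        if hits:
            self.stopped = True
            out, self.buf = self.buf[: min(hits)], ""
            return out
        # Otherwise hold back the longest tail that is a proper prefix
        # of some stop sequence; it may be completed by the next chunk.
        hold = 0
        for s in self.stops:
            for k in range(1, len(s)):
                if self.buf.endswith(s[:k]):
                    hold = max(hold, k)
        out = self.buf[: len(self.buf) - hold] if hold else self.buf
        self.buf = self.buf[len(self.buf) - hold:] if hold else ""
        return out

    def flush(self):
        """Emit any held-back text once generation ends without a stop."""
        out, self.buf = self.buf, ""
        return out
```

With `stops=["END"]`, feeding `"ld! EN"` emits only `"ld! "` and holds `"EN"`; a following `"D"` then completes the stop, so the returned text never contains the stop sequence.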
OpenAI spec validation

Validated against the current official OpenAI documented OpenAPI spec at https://app.stainless.com/api/spec/documented/openai/openapi.documented.yml:

- `stop` via `StopConfiguration`: a nullable string or an array with `minItems: 1` and `maxItems: 4`; returned text must not contain the stop sequence.
- `finish_reason` includes `stop` for a natural stop or a provided stop sequence, and `length` when the token limit is reached.
- The final streamed chat chunk is one whose `delta` is empty and whose `finish_reason` is `stop`.
- The `CreateResponse` schema does not define a top-level `stop` parameter, so `/responses` handling here is intentionally treated as mlx-vlm compatibility behavior rather than a claim that OpenAI Responses currently accepts top-level `stop`.
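The `StopConfiguration` shape maps to a small normalizer. The sketch below is my own illustration of those constraints (null, a string, or one to four non-empty strings), not the server's actual helper:

```python
from typing import List, Optional, Union

def normalize_stop(stop: Union[str, List[str], None]) -> List[str]:
    """Normalize the OpenAI `stop` parameter per StopConfiguration:
    null, a non-empty string, or an array of 1-4 non-empty strings.
    Illustrative sketch; mlx-vlm's real validation may differ."""
    if stop is None:
        return []
    if isinstance(stop, str):
        stop = [stop]
    if not isinstance(stop, list) or not 1 <= len(stop) <= 4:
        raise ValueError("stop must be a string or a list of 1-4 strings")
    if not all(isinstance(s, str) and s for s in stop):
        raise ValueError("each stop sequence must be a non-empty string")
    return list(stop)
```

Returning an always-list form lets the downstream filtering code treat the single-string and array cases uniformly.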
Validation

Automated tests:

```
uv run --with pytest python -m pytest mlx_vlm/tests/test_server.py  # 59 passed, 3 warnings
```

Additional checks:
Live worktree integration validation against `mlx-community/Qwen3.6-35B-A3B-nvfp4` on the local MLX-VLM server: the `length` case passed. Integration sweep result: 10 passed, 0 failed.

Broader repo note: `pytest -q` still fails during collection on an unrelated pre-existing mismatch in `mlx_vlm/tests/test_utils.py`, which imports `get_class_predicate` from `mlx_vlm.utils` even though that symbol is not present.
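For anyone reproducing the sweep, minimal `/chat/completions` request bodies that exercise the two finish reasons might look like this. The prompt, stop values, and token limits are illustrative, not the exact payloads used in the integration run:

```python
import json

# Expect finish_reason "stop": the stop sequence should appear early in
# the completion, and the returned text must not contain it.
stop_request = {
    "model": "mlx-community/Qwen3.6-35B-A3B-nvfp4",
    "messages": [{"role": "user", "content": "Count upward: 1, 2, 3, ..."}],
    "stop": [", 4"],
    "max_tokens": 64,
}

# Expect finish_reason "length": a stop string the model will never emit,
# combined with a tight token budget, exhausts generation first.
length_request = {**stop_request, "stop": "<<never-emitted>>", "max_tokens": 8}

body = json.dumps(stop_request)  # JSON body for POST /chat/completions
```

Both bodies are valid against the `StopConfiguration` shape (a single string or a list of one to four non-empty strings).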