Support OpenAI stop sequences in server #1069

Open

eloe wants to merge 1 commit into Blaizzy:main from eloe:codex/issue-1044-stop

Conversation

@eloe eloe commented Apr 25, 2026

Summary

Fixes #1044.

This adds OpenAI-compatible stop handling to the MLX-VLM server's /chat/completions endpoint, plus matching mlx-vlm compatibility support on /responses, so both endpoints honor the same caller intent.

The implementation normalizes stop as either a string or a list of one to four non-empty strings, then incrementally filters decoded text so stop sequences are trimmed even when they span token/chunk boundaries. It is wired through continuous batching, speculative decoding, and non-batching fallback paths. When a server-side stop sequence is matched, generation is cancelled/removed from the active batch and the response reports finish_reason: "stop". When a requested stop sequence is not emitted before generation exhausts, chat completions report finish_reason: "length".
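The incremental filtering described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code; the class and method names are hypothetical. The key idea is to hold back any emitted tail that could still be the prefix of a stop sequence, so a stop that straddles a token/chunk boundary is trimmed rather than leaked.

```python
# Hypothetical sketch of incremental stop-sequence filtering.
class StopFilter:
    def __init__(self, stop_sequences):
        self.stops = list(stop_sequences)
        self.buffer = ""
        self.stopped = False

    def _held_tail(self):
        # Length of the longest buffer suffix that is a proper prefix
        # of some stop sequence (a full match is handled in feed()).
        best = 0
        for stop in self.stops:
            for k in range(min(len(stop) - 1, len(self.buffer)), 0, -1):
                if self.buffer.endswith(stop[:k]):
                    best = max(best, k)
                    break
        return best

    def feed(self, chunk):
        """Consume a decoded chunk; return (text_safe_to_emit, stopped)."""
        if self.stopped:
            return "", True
        self.buffer += chunk
        # A completed stop sequence: emit only text before the earliest match.
        hits = [(self.buffer.find(s), s) for s in self.stops if s in self.buffer]
        if hits:
            idx, _ = min(hits)
            self.stopped = True
            out, self.buffer = self.buffer[:idx], ""
            return out, True
        # No full match yet: emit everything except a possible stop prefix.
        hold = self._held_tail()
        cut = len(self.buffer) - hold
        out, self.buffer = self.buffer[:cut], self.buffer[cut:]
        return out, False
```

For example, with stops `["END"]`, feeding `"Hello E"` then `"ND more"` emits only `"Hello "` and reports the stop, even though `"END"` spanned the two chunks.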

I also fixed a related Responses streaming consistency issue discovered while testing: response.completed.response.output_text now matches the final trimmed response.output_text.done text.
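A minimal invariant check for that consistency fix might look like the following; the event dict shapes follow the Responses streaming event names mentioned above, but the helper itself is hypothetical.

```python
# Hypothetical check: the final response.completed payload's output_text
# should equal the trimmed text from the response.output_text.done event.
def stream_is_consistent(events):
    done_text = None
    completed_text = None
    for ev in events:
        if ev.get("type") == "response.output_text.done":
            done_text = ev.get("text")
        elif ev.get("type") == "response.completed":
            completed_text = ev.get("response", {}).get("output_text")
    return done_text is not None and done_text == completed_text
```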

OpenAI spec validation

Validated against OpenAI's current official documented OpenAPI spec at https://app.stainless.com/api/spec/documented/openai/openapi.documented.yml.

  • Chat Completions defines stop via StopConfiguration: nullable string or array with minItems: 1 and maxItems: 4; returned text must not contain the stop sequence.
  • Chat completion finish_reason includes stop for a natural stop or provided stop sequence, and length when the token limit is reached.
  • Streaming chat examples finish with a final chunk whose delta is empty and whose finish_reason is stop.
  • The current OpenAI CreateResponse schema does not define a top-level stop parameter, so /responses handling here is intentionally treated as mlx-vlm compatibility behavior rather than a claim that OpenAI Responses currently accepts top-level stop.
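The StopConfiguration shape above (nullable string, or an array with minItems: 1 and maxItems: 4 of non-empty strings) reduces to a small normalization step; a sketch, with an illustrative function name rather than the server's actual API:

```python
# Sketch of stop-parameter validation per the StopConfiguration shape.
def normalize_stop(stop):
    """Return a list of stop strings, or None; raise ValueError on bad input."""
    if stop is None:
        return None
    if isinstance(stop, str):
        stop = [stop]
    if not isinstance(stop, list):
        raise ValueError("stop must be a string or a list of strings")
    if not 1 <= len(stop) <= 4:
        raise ValueError("stop must contain between 1 and 4 sequences")
    for s in stop:
        if not isinstance(s, str) or not s:
            raise ValueError("each stop sequence must be a non-empty string")
    return stop
```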

Validation

Automated tests:

uv run --with pytest python -m pytest mlx_vlm/tests/test_server.py
# 59 passed, 3 warnings

Additional checks:

python3 -m compileall -q mlx_vlm/server.py mlx_vlm/tests/test_server.py
git diff --check

Live worktree integration validation against mlx-community/Qwen3.6-35B-A3B-nvfp4 on the local MLX-VLM server:

  • Chat non-streaming string stop: passed
  • Chat streaming string stop: passed
  • Responses non-streaming string stop: passed
  • Responses streaming string stop: passed
  • Chat newline stop: passed
  • Responses newline stop: passed
  • No matching stop reaches length: passed
  • List stop chooses earliest emitted stop: passed
  • Rejects more than four stop strings: passed
  • Rejects empty stop string: passed

Integration sweep result: 10 passed, 0 failed.
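The "list stop chooses earliest emitted stop" case above reduces to picking the first occurrence across all candidate sequences; a sketch (helper name is illustrative):

```python
# Pick the earliest-occurring stop sequence in the emitted text, if any.
def earliest_stop(text, stops):
    """Return (index, stop) of the first stop occurrence, or None."""
    hits = [(text.find(s), s) for s in stops if s in text]
    return min(hits) if hits else None
```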

Broader repo note: pytest -q still fails during collection on an unrelated existing mismatch in mlx_vlm/tests/test_utils.py, which imports get_class_predicate from mlx_vlm.utils although that symbol is not present.

@eloe eloe marked this pull request as ready for review April 25, 2026 04:42
@eloe eloe marked this pull request as draft April 25, 2026 04:49
@eloe eloe force-pushed the codex/issue-1044-stop branch from 1be6bc1 to 740da67 on April 25, 2026 05:22
@eloe eloe force-pushed the codex/issue-1044-stop branch from 740da67 to e276f49 on April 25, 2026 05:26
@eloe eloe marked this pull request as ready for review April 25, 2026 05:31

Development

Successfully merging this pull request may close these issues.

OpenAI-compatible server accepts stop but silently ignores it on /chat/completions and /responses
