Add server json_schema response_format support #1047

avbiswas wants to merge 9 commits into Blaizzy:main

Conversation
Awesome work, there is an ongoing debate about this! Can we do it without adding new dependencies? Also, the changes are quite large. I'm thinking we already have logit processors and logprobs, so we can reuse those instead of duplicating.
Basically the flow is: forward pass -> model outputs logits -> we turn invalid ones to -inf (logit processor) -> sampler runs. Given the above two requirements, would you say we can reuse the current logit processing/logprobs architecture in a better way than we are currently using? I will send an update removing the fat (the CFG/Regex stuff in structured.py) to minimize the changes further.
Okay, I have deleted all the unnecessary CFG/Regex stuff in the structured.py file.
Edit: if the change still looks "big", note that some of it is just pytest scripts. I can cut those down too if you prefer.
Ok, let's keep llguidance. It's pretty small and has zero Python sub-dependencies.
Blaizzy left a comment
It looks good, I just have some general improvement areas.
Also:
- Please update the readme with a section on this feature with examples.
- Docstring on LLGuidanceLogitsProcessor could note the expected input_ids / logits shapes (1D/2D, (vocab,) / (1, vocab)).
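A docstring along those lines (hypothetical wording; the stub below is an illustration, not the PR's actual class body) might read:

```python
class LLGuidanceLogitsProcessor:
    """Constrain generation to a JSON Schema using llguidance.

    Expected shapes (hypothetical wording for the requested docstring):
        input_ids: 1D sequence of token ids for this request, shape (seq_len,).
        logits:    next-token logits, shape (vocab,) or (1, vocab);
                   a 2D (1, vocab) input is treated as a single row.
    """
```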
…rocessor integration
…en context anymore
Thanks! I (think I) have addressed all your comments. Awaiting your response. Readme also updated: 3038d30. The curl examples are a bit lengthy because JSON schemas are long to write. Feel free to trim or change the readme (or adjust location/verbiage) to your taste. I could add an example file if you want, but since it's mostly OpenAI API calling, I have avoided it. I also did not add multiple curl request examples because that would bloat the readme. Let me know if I missed anything or further changes are required. Thanks.
Signed-off-by: Avishek Biswas <sudavivi@gmail.com>
Hey @Blaizzy I think I made all the updates you requested. Hope you got the notification! Thanks :) |
Summary
Edge and small models struggle to generate correct structured outputs, especially for nested JSON schemas/lists.
This PR adds server-mode support for OpenAI-style structured outputs using:
```json
{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "result",
      "strict": true,
      "schema": {}
    }
  }
}
```

This PR uses llguidance and hijacks the logit_processor to do the same.
The goal is narrow: support `response_format.type == "json_schema"` in the HTTP server while preserving the current continuous batching implementation.

Only tested with: mlx-community/Qwen3.5-4B-MLX-4bit
Tested with image inputs and text-only inputs.
Note: if the user passes no json_schema, the current flow is preserved, i.e. none of the changes clash with the current request-processing flow.
Shoutouts to Kimi-K2.6 and GPT-5.4 for executing the PR.
Scope
Included:
- `response_format: {"type": "json_schema", ...}` in `/v1/chat/completions`.
- `text.format: {"type": "json_schema", ...}`.
- Constrained decoding via `llguidance`.

Not included:

- `json_object` mode (only `json_schema` is supported).

Approach
The implementation is intentionally small and follows the existing generation architecture.
1. Structured logits processor
A new `mlx_vlm.structured` module adds `LLGuidanceLogitsProcessor`, backed by `llguidance`. I have added llguidance as a dependency (note: the pydantic dependency mlx-vlm already had does not pull in llguidance).
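As a rough illustration of the mechanism (a hypothetical sketch, not the PR's actual class: `ConstrainedLogitsProcessor` and `allowed_token_ids` are stand-ins for the llguidance-backed matcher), the key move is masking schema-invalid tokens to -inf before the sampler runs:

```python
import math

class ConstrainedLogitsProcessor:
    """Hypothetical sketch: mask tokens that would violate the schema
    to -inf so the sampler can never pick them."""

    def __init__(self, allowed_token_ids):
        # `allowed_token_ids` stands in for the grammar matcher: given the
        # token history, it returns the set of valid next-token ids.
        self._allowed = allowed_token_ids

    def __call__(self, input_ids, logits):
        allowed = self._allowed(input_ids)
        return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

# Toy usage: pretend only token ids 0 and 2 are schema-valid at this step.
proc = ConstrainedLogitsProcessor(lambda history: {0, 2})
masked = proc([], [1.0, 5.0, 0.5, 3.0])
```

In the real implementation the matcher is stateful, which is why each batch entry gets its own `clone()` of the processor.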
The processor:

- compiles the JSON Schema into an `llguidance` grammar,
- masks schema-invalid tokens to -inf at each step,
- is `clone()`d per batch entry so entries do not share mutable matcher state.

2. Server response_format parsing
The server extracts JSON Schema from:
(Example)

```json
{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "animal_result",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": { "animal": { "type": "string" } },
        "required": ["animal"],
        "additionalProperties": false
      }
    }
  }
}
```

and:

```json
{
  "text": {
    "format": {
      "type": "json_schema",
      "name": "animal_result",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": { "animal": { "type": "string" } },
        "required": ["animal"],
        "additionalProperties": false
      }
    }
  }
}
```

Unsupported response format types should return a clear error.
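A minimal sketch of the extraction logic for the two request shapes above (hypothetical helper name and structure; the PR's actual server code may differ):

```python
def extract_json_schema(body: dict):
    """Pull the JSON Schema out of either supported request shape.

    Returns None when the request carries no schema, so the normal
    (unconstrained) flow is preserved.
    """
    rf = body.get("response_format")
    if rf is not None:
        if rf.get("type") != "json_schema":
            raise ValueError(f"unsupported response_format type: {rf.get('type')}")
        return rf["json_schema"]["schema"]
    fmt = body.get("text", {}).get("format")
    if fmt is not None:
        if fmt.get("type") != "json_schema":
            raise ValueError(f"unsupported text.format type: {fmt.get('type')}")
        return fmt["schema"]
    return None

schema = {"type": "object", "properties": {"animal": {"type": "string"}},
          "required": ["animal"], "additionalProperties": False}
a = extract_json_schema({"response_format": {"type": "json_schema",
    "json_schema": {"name": "animal_result", "strict": True, "schema": schema}}})
b = extract_json_schema({"text": {"format": {"type": "json_schema",
    "name": "animal_result", "strict": True, "schema": schema}}})
```

Both shapes resolve to the same schema dict, and an empty body falls through to `None`.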
How to Run
Start the server:
```shell
cd mlx-vlm
python -m mlx_vlm.server \
  --model mlx-community/Qwen3.5-4B-MLX-4bit \
  --host localhost \
  --port 8080
```

Send a structured Chat Completions request:
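The request body itself didn't survive this view; below is a hedged reconstruction in Python using only the endpoint and the `animal_result` schema shown elsewhere in this PR (the message text is illustrative, and the actual POST is commented out so the snippet doesn't require a running server):

```python
import json
from urllib import request

payload = {
    "model": "mlx-community/Qwen3.5-4B-MLX-4bit",
    "messages": [{"role": "user", "content": "Name the most common household animal."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "animal_result",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"animal": {"type": "string"}},
                "required": ["animal"],
                "additionalProperties": False,
            },
        },
    },
}
req = request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server from the previous step running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```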
Example Output
The response message content is a JSON object matching the schema:
```json
{
  "animal": "dog",
  "species": "Canis lupus familiaris",
  "habitat": "urban",
  "characteristics": ["loyal", "social", "domesticated"],
  "description": "A domesticated carnivore known for its companionship with humans."
}
```

Pydantic Example
Raw curl requests get unwieldy; in practice, developers generally use something like Pydantic.
Applications can define their response contract as a Pydantic model, convert it to JSON Schema, send that schema in `response_format`, and validate the returned content with the same model.

For image input, the same schema flow is used with an image URL payload:
```json
{
  "role": "user",
  "content": [
    { "type": "text", "text": "Identify the main animal in this image. Return only the structured object." },
    { "type": "image_url", "image_url": { "url": "/path/to/dog.jpeg" } }
  ]
}
```

Example validated image output from local testing:
```json
{
  "think": "The user wants me to identify the main animal in the image. Looking at the image",
  "animal": "The image shows a small, furry",
  "species": "dog",
  "habitat": "desert",
  "characteristics": ["golden fur", "floppy ears", "black nose", "orange collar", "puppy"],
  "description": "A golden retriever puppy sitting on a white surface."
}
```

Note: the image output above is schema-valid but not semantically perfect. The purpose of this PR is constrained JSON generation and batching preservation, not semantic extraction quality.
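Concretely, the Pydantic round trip described in this section might look like the sketch below (assumes pydantic v2, where `model_json_schema()` and `model_validate_json()` exist; the model fields mirror the example output, and the returned string is a shortened stand-in for a real server response):

```python
from pydantic import BaseModel

class AnimalResult(BaseModel):
    animal: str
    species: str
    habitat: str
    characteristics: list[str]
    description: str

# 1. Contract -> JSON Schema; this dict is what gets sent as
#    response_format.json_schema.schema.
schema = AnimalResult.model_json_schema()

# 2. Validate whatever the server returned with the same model.
returned = ('{"animal": "dog", "species": "Canis lupus familiaris", '
            '"habitat": "urban", "characteristics": ["loyal"], '
            '"description": "A domesticated carnivore."}')
result = AnimalResult.model_validate_json(returned)
```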
Validation
Testing was done locally with:
mlx-community/Qwen3.5-4B-MLX-4bit

Unit Tests
Focused tests:
Result:
Affected test files:
Result:
The warnings were unrelated SWIG deprecation warnings from imports.
Manual Server Validation
Structured text request with Pydantic validation:
Structured image request with Pydantic validation:
The validation harness converted a Pydantic `BaseModel` to JSON Schema with `model_json_schema()`, passed that schema via `response_format`, and validated returned content with `model_validate_json()`.

Comparison
Local benchmark runs compared structured output against prompt-only JSON instructions.
Structured output:
Prompt-only output:
`max_tokens`,

Small local smoke test with two text requests:
This comparison is intended as local validation only. The PR does not claim universal speedups across models or hardware.
Future Work
Support `json_object` mode separately if users want full JSON mode compatibility.