Add server json_schema response_format support #1047

Open · avbiswas wants to merge 9 commits into Blaizzy:main from avbiswas:dev
Conversation

avbiswas commented Apr 22, 2026

Summary

Small, edge-class models struggle to generate correct structured outputs, especially for nested JSON schemas and lists.
This PR adds server-mode support for OpenAI-style structured outputs using:

{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "result",
      "strict": true,
      "schema": {}
    }
  }
}

This PR uses llguidance and hooks into the logits processor to enforce the schema during decoding.

The goal is narrow: support response_format.type == "json_schema" in the HTTP server while preserving the current continuous batching implementation.

Only tested with: mlx-community/Qwen3.5-4B-MLX-4bit

Tested with image inputs and text-only inputs.

Note: if the user passes no json_schema, the current flow is preserved; none of the changes interfere with the existing request processing.

Shoutouts to Kimi-K2.6 and GPT-5.4 for executing the PR.

Scope

Included:

  • Parse response_format: {"type": "json_schema", ...} in /v1/chat/completions.
  • Parse Responses API-style text.format: {"type": "json_schema", ...}.
  • Build a JSON Schema constrained logits processor using llguidance.
  • Pass logits processors through the existing server generation and batching path.
  • Preserve per-sequence logits processor state while continuous batching is active.
  • Add tests for server parsing and batching-sensitive logits processor plumbing.

Not included:

  • OpenAI json_object mode (only json_schema is supported).
  • Speculative decoding with structured outputs.
  • Model-specific grammars for reasoning/thinking phases.

Approach

The implementation is intentionally small and follows the existing generation architecture.

1. Structured logits processor

A new mlx_vlm.structured module adds LLGuidanceLogitsProcessor, backed by llguidance.
I have added llguidance as a dependency (note: the pydantic dependency mlx-vlm already ships does not pull in llguidance).

The processor:

  • converts a JSON Schema into an llguidance grammar,
  • applies a token bitmask to logits before sampling,
  • tracks matcher state across generated tokens,
  • supports clone() so batch entries do not share mutable matcher state.
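
A minimal sketch of that contract, assuming a simplified matcher interface (the method names consume_token and allowed_tokens below are hypothetical placeholders for llguidance's real matcher API, and the class is illustrative rather than the PR's actual code):

```python
import copy

import numpy as np
import mlx.core as mx


class SchemaConstrainedProcessor:
    """Illustrative schema-constrained logits processor with clone() support."""

    def __init__(self, matcher):
        # matcher: a per-sequence grammar matcher compiled from a JSON Schema.
        self._matcher = matcher
        self._consumed = 0  # count of generated tokens already fed to the matcher

    def __call__(self, input_ids, logits):
        # Advance matcher state with any newly sampled tokens.
        for tok in input_ids[self._consumed:]:
            self._matcher.consume_token(int(tok))  # hypothetical call
        self._consumed = len(input_ids)

        # Ask the grammar which token ids are currently legal, then push every
        # other token to -inf so the sampler cannot select it.
        allowed = self._matcher.allowed_tokens()  # hypothetical call
        bias = np.full(logits.shape[-1], -np.inf, dtype=np.float32)
        bias[allowed] = 0.0
        return logits + mx.array(bias)

    def clone(self):
        # Each batch entry gets its own matcher: two sequences at different
        # decode positions must not share mutable matcher state.
        twin = SchemaConstrainedProcessor(copy.deepcopy(self._matcher))
        twin._consumed = self._consumed
        return twin
```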

2. Server response_format parsing

The server extracts JSON Schema from:

  • Chat Completions:
    (Example)
{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "animal_result",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "animal": {"type": "string"}
        },
        "required": ["animal"],
        "additionalProperties": false
      }
    }
  }
}
  • Responses API-style text format:
{
  "text": {
    "format": {
      "type": "json_schema",
      "name": "animal_result",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "animal": {"type": "string"}
        },
        "required": ["animal"],
        "additionalProperties": false
      }
    }
  }
}

Unsupported response format types should return a clear error.
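
A sketch of that extraction logic covering both payload shapes (the function name and exact error wording are illustrative, not the PR's code):

```python
def extract_json_schema(body: dict):
    # Chat Completions shape: response_format.json_schema.schema
    rf = body.get("response_format")
    if rf is not None:
        if rf.get("type") != "json_schema":
            raise ValueError(f"unsupported response_format type: {rf.get('type')!r}")
        return rf["json_schema"]["schema"]

    # Responses API shape: text.format with name/strict/schema inlined
    fmt = (body.get("text") or {}).get("format")
    if fmt is not None:
        if fmt.get("type") != "json_schema":
            raise ValueError(f"unsupported text.format type: {fmt.get('type')!r}")
        return fmt["schema"]

    # No schema supplied: fall through to the existing unconstrained flow.
    return None
```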

How to Run

Start the server:

cd mlx-vlm
python -m mlx_vlm.server \
  --model mlx-community/Qwen3.5-4B-MLX-4bit \
  --host localhost \
  --port 8080

Send a structured Chat Completions request:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3.5-4B-MLX-4bit",
    "messages": [
      {
        "role": "user",
        "content": "Pick one animal. Return only the structured object."
      }
    ],
    "max_tokens": 256,
    "temperature": 0,
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "animal_result",
        "strict": true,
        "schema": {
          "type": "object",
          "properties": {
            "animal": {
              "type": "string",
              "maxLength": 30
            },
            "species": {
              "type": "string",
              "maxLength": 50
            },
            "habitat": {
              "type": "string",
              "enum": ["forest", "desert", "ocean", "urban", "unknown"]
            },
            "characteristics": {
              "type": "array",
              "items": {
                "type": "string",
                "maxLength": 30
              },
              "maxItems": 5
            },
            "description": {
              "type": "string",
              "maxLength": 300
            }
          },
          "required": [
            "animal",
            "species",
            "habitat",
            "characteristics",
            "description"
          ],
          "additionalProperties": false
        }
      }
    }
  }'

Example Output

The response message content is a JSON object matching the schema:

{
  "animal": "dog",
  "species": "Canis lupus familiaris",
  "habitat": "urban",
  "characteristics": [
    "loyal",
    "social",
    "domesticated"
  ],
  "description": "A domesticated carnivore known for its companionship with humans."
}

Pydantic Example

Raw curl payloads are unwieldy; in practice, developers usually use something like Pydantic.

Applications can define their response contract as a Pydantic model, convert it to JSON Schema, send that schema in response_format, and validate the returned content with the same model.

import json
from typing import Literal
from urllib import request

from pydantic import BaseModel, ConfigDict, Field


class AnimalResult(BaseModel):
    model_config = ConfigDict(extra="forbid")

    animal: str = Field(max_length=30)
    species: str = Field(max_length=50)
    habitat: Literal["forest", "desert", "ocean", "urban", "unknown"]
    characteristics: list[str] = Field(max_length=5)
    description: str = Field(max_length=300)


payload = {
    "model": "mlx-community/Qwen3.5-4B-MLX-4bit",
    "messages": [
        {
            "role": "user",
            "content": "Pick one animal. Return only the structured object.",
        }
    ],
    "max_tokens": 256,
    "temperature": 0,
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "animal_result",
            "strict": True,
            "schema": AnimalResult.model_json_schema(),
        },
    },
}

req = request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

with request.urlopen(req, timeout=120) as resp:
    body = json.loads(resp.read().decode("utf-8"))

content = body["choices"][0]["message"]["content"]
result = AnimalResult.model_validate_json(content)

print(result)

For image input, the same schema flow is used with an image URL payload:

{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Identify the main animal in this image. Return only the structured object."
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "/path/to/dog.jpeg"
      }
    }
  ]
}

Example validated image output from local testing:

{
  "think": "The user wants me to identify the main animal in the image. Looking at the image",
  "animal": "The image shows a small, furry",
  "species": "dog",
  "habitat": "desert",
  "characteristics": [
    "golden fur",
    "floppy ears",
    "black nose",
    "orange collar",
    "puppy"
  ],
  "description": "A golden retriever puppy sitting on a white surface."
}

Note: the image output above is schema-valid but not semantically perfect. The purpose of this PR is constrained JSON generation and batching preservation, not semantic extraction quality.

Validation

Testing was done locally with:

  • Model: mlx-community/Qwen3.5-4B-MLX-4bit
  • Hardware: Apple M2 Max 32GB

Unit Tests

Focused tests:

uv run python -m pytest \
  mlx_vlm/tests/test_generate.py::TestBatchGenerator::test_generation_batch_applies_per_sequence_logits_processors \
  mlx_vlm/tests/test_server.py::TestResponseGenerator::test_generate_arguments_to_generate_kwargs \
  mlx_vlm/tests/test_server.py::TestResponseGenerator::test_extract_chat_response_format_json_schema \
  mlx_vlm/tests/test_server.py::TestResponseGenerator::test_extract_responses_text_format_json_schema \
  mlx_vlm/tests/test_server.py::TestResponseGenerator::test_build_structured_logits_processors_uses_tokenizer

Result:

5 passed, 2 warnings in 2.87s

Affected test files:

uv run python -m pytest mlx_vlm/tests/test_generate.py mlx_vlm/tests/test_server.py

Result:

87 passed, 2 warnings in 2.30s

The warnings were unrelated SWIG deprecation warnings from imports.

Manual Server Validation

Structured text request with Pydantic validation:

requests: 1
mode: sequential
schema_model: AnimalResult
valid=True
valid_pydantic: 1/1
total_elapsed: 4.48s

Structured image request with Pydantic validation:

requests: 1
mode: sequential
schema_model: AnimalResult
valid=True
valid_pydantic: 1/1
total_elapsed: 5.90s

The validation harness converted a Pydantic BaseModel to JSON Schema with model_json_schema(), passed that schema via response_format, and validated returned content with model_validate_json().

Comparison

Local benchmark runs compared structured output against prompt-only JSON instructions.

Structured output:

  • returned parseable JSON,
  • passed Pydantic validation for bounded schemas,
  • completed earlier because the grammar constrained output to the target object.

Prompt-only output:

  • often emitted reasoning text before JSON,
  • frequently exceeded max_tokens,
  • failed JSON/Pydantic validation.

Small local smoke test with two text requests:

| Mode | Schema | Total Time | Validation |
| --- | --- | --- | --- |
| Sequential | Yes | 6.08s | valid JSON |
| Concurrent | Yes | 5.64s | valid JSON |
| Sequential | No | 15.80s | invalid/truncated |
| Concurrent | No | 12.00s | invalid/truncated |

This comparison is intended as local validation only. The PR does not claim universal speedups across models or hardware.

Future Work

  • Thinking-aware structured output generation (currently one can add a "thinking" attribute to the JSON schema to artificially trigger a chain of thought; see the sketch after this list).
  • Add json_object mode separately if users want full JSON mode compatibility.
  • Support batch + structured output in non-server mode (I have this running locally, but did not want to send a bloated PR).
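
For the first item, a hypothetical illustration of the current workaround: give the schema a leading free-form field so the model can "think" inside the constrained object (whether this field is actually generated first depends on how the grammar orders properties):

```python
from pydantic import BaseModel, ConfigDict, Field


class AnimalWithThinking(BaseModel):
    model_config = ConfigDict(extra="forbid")

    # Free-form scratch space emitted before the answer fields, which
    # approximates a chain of thought inside the constrained output.
    thinking: str = Field(max_length=500)
    animal: str = Field(max_length=30)
```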

Blaizzy (Owner) commented Apr 22, 2026

Awesome work, there is an ongoing debate about this!

Can we do it without adding new dependencies?

Also, the changes are quite large.

I'm thinking we already have logits processors and logprobs, so we can reuse those instead of duplicating.

avbiswas (Author)

  1. We can get rid of some of the additional grammar support (like CFG/regex) in this pass, and just focus on structured outputs with json schema.

  2. I can raise a separate PR at a later date to support regex, CFG, and the other grammar types, so almost the entire structured.py file can be removed for now.

  3. Regarding the llguidance requirement: it helps a ton to derive grammars from the different formats, so the PR would get larger if we tried to implement those ourselves (especially if we later add CFG/regex as object types). Let me know if that's an issue.

  4. Regarding the logits processors, consider these two requirements for this PR to work:

  • We need per-request logits processors because different requests can have different schemas (or none). Request A and request B can be at separate points in the decoding process, so they can't share the same processor instance.

  • We need to apply the logits processing/token masking AFTER the model outputs logits and BEFORE the sampler runs.

Basically the flow is: forward pass -> model outputs logits -> we turn invalid ones to -inf (logit processor) -> sampler runs
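
A toy decode step illustrating that ordering (names like decode_step and the greedy stand-in sampler are illustrative, not mlx-vlm internals):

```python
import numpy as np


def decode_step(batch_logits, processors, histories):
    # batch_logits: (batch, vocab) raw logits from the forward pass.
    # processors[i]: per-request logits processor, or None for plain requests.
    # histories[i]: token ids generated so far for request i.
    next_tokens = []
    for i, logits in enumerate(batch_logits):
        if processors[i] is not None:
            # Masking happens AFTER the forward pass and BEFORE sampling.
            logits = processors[i](histories[i], logits)
        next_tokens.append(int(np.argmax(logits)))  # greedy stand-in sampler
    return next_tokens
```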

Given the above two requirements, would you say we can reuse the current logits processing/logprobs architecture in a better way than the PR currently does?

I will send an update removing the extra CFG/regex code and trimming structured.py to minimize the changes further.


avbiswas commented Apr 22, 2026

Okay, I have deleted all the unnecessary CFG/regex code in the structured.py file.

  1. Let me know if you have comments on the earlier question about the per-request logits processing.

  2. This change still keeps llguidance as a dependency; let me know if that's okay. If all we support is JSON schema, we can do without llguidance, but if we add other object types later, having llguidance reduces overall code because they already implement it. Your call!

Edit: if the change still looks big, note that some of it is just pytest scripts. I can cut those down too if you prefer.

Blaizzy (Owner) commented Apr 22, 2026

Ok, let's keep llguidance. It's pretty small, with zero Python sub-dependencies.

(Collapsed review threads on mlx_vlm/server.py, mlx_vlm/generate.py, and mlx_vlm/structured.py.)

Blaizzy (Owner) left a review:

It looks good, I just have some general improvement areas.

Also:

  • Please update the readme with a section on this feature with examples.
  • Docstring on LLGuidanceLogitsProcessor could note the expected input_ids / logits shapes (1D/2D, (vocab,) / (1, vocab)).
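
A docstring along those lines might read (hypothetical wording, covering the shapes requested above):

```python
class LLGuidanceLogitsProcessor:
    def __call__(self, input_ids, logits):
        """Constrain next-token logits to the compiled JSON Schema grammar.

        Args:
            input_ids: token ids generated so far; a 1D array of shape (seq,).
            logits: next-token logits of shape (vocab,) or (1, vocab).

        Returns:
            Logits of the same shape with grammar-invalid tokens set to -inf.
        """
        ...
```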

Blaizzy linked an issue Apr 22, 2026 that may be closed by this pull request.
avbiswas requested a review from Blaizzy on April 23, 2026 at 01:49.
avbiswas (Author) commented Apr 23, 2026

> It looks good, I just have some general improvement areas.
>
> Also:
>
>   • Please update the readme with a section on this feature with examples.
>   • Docstring on LLGuidanceLogitsProcessor could note the expected input_ids / logits shapes (1D/2D, (vocab,) / (1, vocab)).

Thanks! I (think I) have addressed all your comments. Awaiting your response.

Readme also updated: 3038d30

The curl examples are a bit lengthy because JSON schemas are long to write out. Feel free to trim the readme or adjust its location/verbiage to your taste.

I could add an example file if you want, but since it's mostly OpenAI API calling, I have avoided it. I also did not add multiple curl request examples because that would bloat the readme.

Let me know if I missed anything or further changes are required. Thanks.

avbiswas (Author)

Hey @Blaizzy I think I made all the updates you requested. Hope you got the notification! Thanks :)


Development

Successfully merging this pull request may close these issues:

  • Support for Structured Output for OpenAI API