
Reasoning Models

vllm-mlx supports reasoning models that show their thinking process before giving an answer. Models like Qwen3 and DeepSeek-R1 wrap their reasoning in <think>...</think> tags, and vllm-mlx can parse these tags to separate the reasoning from the final response.

Why Use Reasoning Parsing?

When a reasoning model generates output, it typically looks like this:

<think>
Let me analyze this step by step.
First, I need to consider the constraints.
The answer should be a prime number less than 10.
Checking: 2, 3, 5, 7 are all prime and less than 10.
</think>
The prime numbers less than 10 are: 2, 3, 5, 7.

Without reasoning parsing, you get the raw output with the tags included. With reasoning parsing enabled, the thinking process and final answer are separated into distinct fields in the API response.

Getting Started

Start the Server with Reasoning Parser

# For Qwen3 models
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

# For DeepSeek-R1 models
vllm-mlx serve mlx-community/DeepSeek-R1-Distill-Qwen-7B-4bit --reasoning-parser deepseek_r1

API Response Format

When reasoning parsing is enabled, the API response includes a reasoning field:

Non-streaming response:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The prime numbers less than 10 are: 2, 3, 5, 7.",
      "reasoning": "Let me analyze this step by step.\nFirst, I need to consider the constraints.\nThe answer should be a prime number less than 10.\nChecking: 2, 3, 5, 7 are all prime and less than 10."
    }
  }]
}

Streaming response:

Chunks are sent separately for reasoning and content. During the reasoning phase, chunks have reasoning populated. When the model transitions to the final answer, chunks have content populated:

{"delta": {"reasoning": "Let me analyze"}}
{"delta": {"reasoning": " this step by step."}}
{"delta": {"reasoning": "\nFirst, I need to"}}
...
{"delta": {"content": "The prime"}}
{"delta": {"content": " numbers less than 10"}}
{"delta": {"content": " are: 2, 3, 5, 7."}}

Using with OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Non-streaming
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What are the prime numbers less than 10?"}]
)

message = response.choices[0].message
print("Reasoning:", message.reasoning)  # The thinking process
print("Answer:", message.content)        # The final answer

Streaming with Reasoning

reasoning_text = ""
content_text = ""

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Solve: 2 + 2 = ?"}],
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if getattr(delta, 'reasoning', None):
        reasoning_text += delta.reasoning
        print(f"[Thinking] {delta.reasoning}", end="")
    if delta.content:
        content_text += delta.content
        print(delta.content, end="")

print(f"\n\nFinal reasoning: {reasoning_text}")
print(f"Final answer: {content_text}")

Supported Parsers

Qwen3 Parser (qwen3)

For Qwen3 models that use explicit <think> and </think> tags.

  • Requires both opening and closing tags
  • If tags are missing, output is treated as regular content
  • Best for: Qwen3-0.6B, Qwen3-4B, Qwen3-8B and similar models
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

DeepSeek-R1 Parser (deepseek_r1)

For DeepSeek-R1 models that may omit the opening <think> tag.

  • More lenient than Qwen3 parser
  • Handles cases where <think> is implicit
  • Content before </think> is treated as reasoning even without <think>
vllm-mlx serve mlx-community/DeepSeek-R1-Distill-Qwen-7B-4bit --reasoning-parser deepseek_r1

How It Works

The reasoning parser uses text-based detection to identify thinking tags in the model output. During streaming, it tracks the current position in the output to correctly route each token to either reasoning or content.

Model Output:   <think>Step 1: analyze...</think>The answer is 42.
                ├───────────────────────────────┤├───────────────┤
Parsed:         │           reasoning           ││    content    │
                └───────────────────────────────┘└───────────────┘

Rather than maintaining per-token state, the parser re-derives its context from the accumulated text on each chunk, making it robust in streaming scenarios where tokens may arrive at arbitrary chunk boundaries.
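The splitting logic described above can be sketched as follows. This is a simplified illustration, not the actual vllm-mlx implementation; the `strict` flag mimics the qwen3 parser (both tags required) and `strict=False` the more lenient deepseek_r1 behavior (implicit opening tag):

```python
def split_reasoning(text: str, strict: bool = True):
    """Split raw model output into (reasoning, content).

    strict=True  -> require both <think> and </think> (qwen3-style).
    strict=False -> treat everything before </think> as reasoning
                    even when <think> is missing (deepseek_r1-style).
    """
    open_tag, close_tag = "<think>", "</think>"
    if close_tag not in text:
        # No closing tag: nothing to separate, all output is content.
        return None, text
    before, _, after = text.partition(close_tag)
    if before.startswith(open_tag):
        return before[len(open_tag):].strip(), after.strip()
    if not strict:
        # Lenient mode: treat the opening <think> as implicit.
        return before.strip(), after.strip()
    # Strict mode without a matching <think>: leave output untouched.
    return None, text

split_reasoning("<think>Step 1</think>The answer is 42.")
# -> ("Step 1", "The answer is 42.")
split_reasoning("Step 1</think>The answer is 42.", strict=False)
# -> ("Step 1", "The answer is 42.")
```

For streaming, the same function can be re-run over the accumulated text as each chunk arrives, which is why arbitrary chunk boundaries do not break the split.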

Tips for Best Results

Prompting

Reasoning models work best when you encourage step-by-step thinking:

messages = [
    {"role": "system", "content": "Think through problems step by step before answering."},
    {"role": "user", "content": "What is 17 × 23?"}
]

Handling Missing Reasoning

Some prompts may not trigger reasoning. In these cases, reasoning will be None and all output goes to content:

message = response.choices[0].message
if message.reasoning:
    print(f"Model's thought process: {message.reasoning}")
print(f"Answer: {message.content}")

Temperature and Reasoning

Lower temperatures tend to produce more consistent reasoning patterns:

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain quantum entanglement"}],
    temperature=0.3  # More focused reasoning
)

Backward Compatibility

When --reasoning-parser is not specified, the server behaves as before:

  • Thinking tags are included in the content field
  • No reasoning field is added to responses

This ensures existing applications continue to work without changes.
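If restarting the server with a parser is not an option, the tags can also be separated client-side. A minimal sketch, assuming the qwen3-style explicit <think>...</think> format (the helper name is illustrative, not part of any API):

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def strip_think_tags(content: str):
    """Split a raw `content` string that still contains thinking tags
    into (reasoning, answer); reasoning is None when no tags are found."""
    match = THINK_RE.search(content)
    if match is None:
        return None, content
    reasoning = match.group(1).strip()
    answer = THINK_RE.sub("", content, count=1).strip()
    return reasoning, answer

strip_think_tags("<think>4 primes below 10</think>2, 3, 5, 7.")
# -> ("4 primes below 10", "2, 3, 5, 7.")
```

This reproduces the non-streaming behavior of the server-side parser, but only after the full response has arrived; for streaming you still want --reasoning-parser.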

Example: Math Problem Solver

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def solve_math(problem: str) -> dict:
    """Solve a math problem and return reasoning + answer."""
    response = client.chat.completions.create(
        model="default",
        messages=[
            {"role": "system", "content": "You are a math tutor. Show your work."},
            {"role": "user", "content": problem}
        ],
        temperature=0.2
    )

    message = response.choices[0].message
    return {
        "problem": problem,
        "work": message.reasoning,
        "answer": message.content
    }

result = solve_math("If a train travels 120 km in 2 hours, what is its average speed?")
print(f"Problem: {result['problem']}")
print(f"\nWork shown:\n{result['work']}")
print(f"\nFinal answer: {result['answer']}")

Curl Examples

Non-streaming

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "What is 15% of 80?"}]
  }'

Streaming

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "What is 15% of 80?"}],
    "stream": true
  }'

Troubleshooting

No reasoning field in response

  • Make sure you started the server with --reasoning-parser
  • Check that the model actually uses thinking tags (not all prompts trigger reasoning)

Reasoning appears in content

  • The model may not be using the expected tag format
  • Try a different parser (qwen3 vs deepseek_r1)

Truncated reasoning

  • Increase --max-tokens if the model is hitting the token limit mid-thought

Related