vllm-mlx supports reasoning models that show their thinking process before giving an answer. Models like Qwen3 and DeepSeek-R1 wrap their reasoning in `<think>...</think>` tags, and vllm-mlx can parse these tags to separate the reasoning from the final response.
When a reasoning model generates output, it typically looks like this:
```text
<think>
Let me analyze this step by step.
First, I need to consider the constraints.
The answer should be a prime number less than 10.
Checking: 2, 3, 5, 7 are all prime and less than 10.
</think>
The prime numbers less than 10 are: 2, 3, 5, 7.
```
Without reasoning parsing, you get the raw output with the tags included. With reasoning parsing enabled, the thinking process and final answer are separated into distinct fields in the API response.
```shell
# For Qwen3 models
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

# For DeepSeek-R1 models
vllm-mlx serve mlx-community/DeepSeek-R1-Distill-Qwen-7B-4bit --reasoning-parser deepseek_r1
```

When reasoning parsing is enabled, the API response includes a `reasoning` field.
Non-streaming response:
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The prime numbers less than 10 are: 2, 3, 5, 7.",
      "reasoning": "Let me analyze this step by step.\nFirst, I need to consider the constraints.\nThe answer should be a prime number less than 10.\nChecking: 2, 3, 5, 7 are all prime and less than 10."
    }
  }]
}
```

Streaming response:
Chunks are sent separately for reasoning and content. During the reasoning phase, chunks have `reasoning` populated. When the model transitions to the final answer, chunks have `content` populated:
```json
{"delta": {"reasoning": "Let me analyze"}}
{"delta": {"reasoning": " this step by step."}}
{"delta": {"reasoning": "\nFirst, I need to"}}
...
{"delta": {"content": "The prime"}}
{"delta": {"content": " numbers less than 10"}}
{"delta": {"content": " are: 2, 3, 5, 7."}}
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Non-streaming
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What are the prime numbers less than 10?"}]
)
message = response.choices[0].message
print("Reasoning:", message.reasoning)  # The thinking process
print("Answer:", message.content)       # The final answer
```

```python
# Streaming
reasoning_text = ""
content_text = ""
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Solve: 2 + 2 = ?"}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if hasattr(delta, "reasoning") and delta.reasoning:
        reasoning_text += delta.reasoning
        print(f"[Thinking] {delta.reasoning}", end="")
    if delta.content:
        content_text += delta.content
        print(delta.content, end="")
print(f"\n\nFinal reasoning: {reasoning_text}")
print(f"Final answer: {content_text}")
```

The `qwen3` parser is for Qwen3 models that use explicit `<think>` and `</think>` tags.
- Requires both opening and closing tags
- If tags are missing, output is treated as regular content
- Best for: Qwen3-0.6B, Qwen3-4B, Qwen3-8B, and similar models

```shell
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3
```

The `deepseek_r1` parser is for DeepSeek-R1 models that may omit the opening `<think>` tag.
- More lenient than the Qwen3 parser
- Handles cases where `<think>` is implicit
- Content before `</think>` is treated as reasoning even without `<think>`

```shell
vllm-mlx serve mlx-community/DeepSeek-R1-Distill-Qwen-7B-4bit --reasoning-parser deepseek_r1
```

The reasoning parser uses text-based detection to identify thinking tags in the model output. During streaming, it tracks the current position in the output to correctly route each token to either reasoning or content.
```text
Model Output: <think>Step 1: analyze...</think>The answer is 42.
              ├───────────────────────────────┤├───────────────┤
Parsed:       │           reasoning           ││    content    │
              └───────────────────────────────┘└───────────────┘
```
The parsing is stateless and uses the accumulated text to determine context, making it robust for streaming scenarios where tokens may arrive in arbitrary chunks.
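As an illustration, the tag-based splitting described above can be sketched as a single split on the closing tag. This is a simplified sketch, not the actual vllm-mlx parser; it mirrors the lenient behaviour in which an implicit opening tag is tolerated and a missing closing tag means everything is treated as content:

```python
def split_reasoning(text: str, open_tag: str = "<think>",
                    close_tag: str = "</think>"):
    """Split raw model output into (reasoning, content).

    Simplified sketch of tag-based detection, not the actual
    vllm-mlx implementation: tolerates an implicit opening tag;
    if the closing tag never appears, everything is content.
    """
    if close_tag not in text:
        return None, text  # no reasoning detected
    before, _, after = text.partition(close_tag)
    if before.startswith(open_tag):
        before = before[len(open_tag):]
    return before.strip(), after.strip()
```

For example, `split_reasoning("<think>Step 1</think>The answer is 42.")` returns `("Step 1", "The answer is 42.")`.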
Reasoning models work best when you encourage step-by-step thinking:
```python
messages = [
    {"role": "system", "content": "Think through problems step by step before answering."},
    {"role": "user", "content": "What is 17 × 23?"}
]
```

Some prompts may not trigger reasoning. In these cases, `reasoning` will be `None` and all output goes to `content`:
```python
message = response.choices[0].message
if message.reasoning:
    print(f"Model's thought process: {message.reasoning}")
print(f"Answer: {message.content}")
```

Lower temperatures tend to produce more consistent reasoning patterns:
```python
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain quantum entanglement"}],
    temperature=0.3  # More focused reasoning
)
```

When `--reasoning-parser` is not specified, the server behaves as before:
- Thinking tags are included in the `content` field
- No `reasoning` field is added to responses
This ensures existing applications continue to work without changes.
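If the parser is not enabled, the tags can still be handled client-side. The helper below is a minimal fallback sketch of my own (not part of vllm-mlx) that strips a leading `<think>...</think>` block from the raw `content`:

```python
import re

def strip_think_tags(content: str) -> str:
    """Remove a leading <think>...</think> block from raw model output.

    Client-side fallback sketch for servers started without
    --reasoning-parser; not part of vllm-mlx itself.
    """
    return re.sub(r"^\s*<think>.*?</think>\s*", "", content, flags=re.DOTALL)
```

Note the non-greedy `.*?` with `re.DOTALL`, so only the first thinking block is removed even if the answer itself happens to mention the tags.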
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def solve_math(problem: str) -> dict:
    """Solve a math problem and return reasoning + answer."""
    response = client.chat.completions.create(
        model="default",
        messages=[
            {"role": "system", "content": "You are a math tutor. Show your work."},
            {"role": "user", "content": problem}
        ],
        temperature=0.2
    )
    message = response.choices[0].message
    return {
        "problem": problem,
        "work": message.reasoning,
        "answer": message.content
    }

result = solve_math("If a train travels 120 km in 2 hours, what is its average speed?")
print(f"Problem: {result['problem']}")
print(f"\nWork shown:\n{result['work']}")
print(f"\nFinal answer: {result['answer']}")
```

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "What is 15% of 80?"}]
  }'
```

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "What is 15% of 80?"}],
    "stream": true
  }'
```

- Make sure you started the server with `--reasoning-parser`
- Check that the model actually uses thinking tags (not all prompts trigger reasoning)
- The model may not be using the expected tag format
- Try a different parser (`qwen3` vs `deepseek_r1`)
- Increase `--max-tokens` if the model is hitting the token limit mid-thought
- Supported Models - Models that support reasoning
- Server Configuration - All server options
- CLI Reference - Command line options