
Reasoning Models

vllm-mlx supports reasoning models that show their thinking process before giving an answer. Models like Qwen3 and DeepSeek-R1 wrap their reasoning in <think>...</think> tags, and vllm-mlx can parse these tags to separate the reasoning from the final response.

Why Use Reasoning Parsing?

When a reasoning model generates output, it typically looks like this:

<think>
Let me analyze this step by step.
First, I need to consider the constraints.
The answer should be a prime number less than 10.
Checking: 2, 3, 5, 7 are all prime and less than 10.
</think>
The prime numbers less than 10 are: 2, 3, 5, 7.

Without reasoning parsing, you get the raw output with the tags included. With reasoning parsing enabled, the thinking process and final answer are separated into distinct fields in the API response.

Getting Started

Start the Server with Reasoning Parser

# For Qwen3 models
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

# For DeepSeek-R1 models
vllm-mlx serve mlx-community/DeepSeek-R1-Distill-Qwen-7B-4bit --reasoning-parser deepseek_r1

API Response Format

When reasoning parsing is enabled, the API response includes a reasoning field:

Non-streaming response:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The prime numbers less than 10 are: 2, 3, 5, 7.",
      "reasoning": "Let me analyze this step by step.\nFirst, I need to consider the constraints.\nThe answer should be a prime number less than 10.\nChecking: 2, 3, 5, 7 are all prime and less than 10."
    }
  }]
}

Streaming response:

Chunks are sent separately for reasoning and content. During the reasoning phase, chunks have reasoning populated. When the model transitions to the final answer, chunks have content populated:

{"delta": {"reasoning": "Let me analyze"}}
{"delta": {"reasoning": " this step by step."}}
{"delta": {"reasoning": "\nFirst, I need to"}}
...
{"delta": {"content": "The prime"}}
{"delta": {"content": " numbers less than 10"}}
{"delta": {"content": " are: 2, 3, 5, 7."}}

Using with OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Non-streaming
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What are the prime numbers less than 10?"}]
)

message = response.choices[0].message
print("Reasoning:", message.reasoning)  # The thinking process
print("Answer:", message.content)        # The final answer

Streaming with Reasoning

reasoning_text = ""
content_text = ""

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Solve: 2 + 2 = ?"}],
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if getattr(delta, 'reasoning', None):
        reasoning_text += delta.reasoning
        print(f"[Thinking] {delta.reasoning}", end="")
    if delta.content:
        content_text += delta.content
        print(delta.content, end="")

print(f"\n\nFinal reasoning: {reasoning_text}")
print(f"Final answer: {content_text}")

Supported Parsers

Qwen3 Parser (qwen3)

For Qwen3 models that use explicit <think> and </think> tags.

  • Requires both opening and closing tags
  • If tags are missing, output is treated as regular content
  • Best for: Qwen3-0.6B, Qwen3-4B, Qwen3-8B and similar models
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

DeepSeek-R1 Parser (deepseek_r1)

For DeepSeek-R1 models that may omit the opening <think> tag.

  • More lenient than Qwen3 parser
  • Handles cases where <think> is implicit
  • Content before </think> is treated as reasoning even without <think>
vllm-mlx serve mlx-community/DeepSeek-R1-Distill-Qwen-7B-4bit --reasoning-parser deepseek_r1

How It Works

The reasoning parser uses text-based detection to identify thinking tags in the model output. During streaming, it tracks the current position in the output to correctly route each token to either reasoning or content.

Model Output:   <think>Step 1: analyze...</think>The answer is 42.
                ├───────────────────────────────┤├───────────────┤
Parsed:         │           reasoning           ││    content    │
                └───────────────────────────────┘└───────────────┘

Rather than maintaining per-token state, the parser re-derives its context from the accumulated text on each chunk, making it robust in streaming scenarios where tokens may arrive at arbitrary chunk boundaries.
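The splitting logic described above can be sketched as follows. This is a simplified illustration, not the actual vllm-mlx implementation; the `strict` flag mimics the qwen3 parser (both tags required) and `strict=False` the more lenient deepseek_r1 behavior (implicit opening tag):

```python
def split_reasoning(text: str, strict: bool = True):
    """Split raw model output into (reasoning, content).

    strict=True  -> require both <think> and </think> (qwen3-style).
    strict=False -> treat everything before </think> as reasoning
                    even when <think> is missing (deepseek_r1-style).
    """
    open_tag, close_tag = "<think>", "</think>"
    if close_tag not in text:
        # No closing tag: nothing to separate, all output is content.
        return None, text
    before, _, after = text.partition(close_tag)
    if before.startswith(open_tag):
        return before[len(open_tag):].strip(), after.strip()
    if not strict:
        # Lenient mode: treat the opening <think> as implicit.
        return before.strip(), after.strip()
    # Strict mode without a matching <think>: leave output untouched.
    return None, text

split_reasoning("<think>Step 1</think>The answer is 42.")
# -> ("Step 1", "The answer is 42.")
split_reasoning("Step 1</think>The answer is 42.", strict=False)
# -> ("Step 1", "The answer is 42.")
```

For streaming, the same function can be re-run over the accumulated text as each chunk arrives, which is why arbitrary chunk boundaries do not break the split.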

Tips for Best Results

Prompting

Reasoning models work best when you encourage step-by-step thinking:

messages = [
    {"role": "system", "content": "Think through problems step by step before answering."},
    {"role": "user", "content": "What is 17 × 23?"}
]

Handling Missing Reasoning

Some prompts may not trigger reasoning. In these cases, reasoning will be None and all output goes to content:

message = response.choices[0].message
if message.reasoning:
    print(f"Model's thought process: {message.reasoning}")
print(f"Answer: {message.content}")

Temperature and Reasoning

Lower temperatures tend to produce more consistent reasoning patterns:

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain quantum entanglement"}],
    temperature=0.3  # More focused reasoning
)

Backward Compatibility

When --reasoning-parser is not specified, the server behaves as before:

  • Thinking tags are included in the content field
  • No reasoning field is added to responses

This ensures existing applications continue to work without changes.
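If restarting the server with a parser is not an option, the tags can also be separated client-side. A minimal sketch, assuming the qwen3-style explicit <think>...</think> format (the helper name is illustrative, not part of any API):

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def strip_think_tags(content: str):
    """Split a raw `content` string that still contains thinking tags
    into (reasoning, answer); reasoning is None when no tags are found."""
    match = THINK_RE.search(content)
    if match is None:
        return None, content
    reasoning = match.group(1).strip()
    answer = THINK_RE.sub("", content, count=1).strip()
    return reasoning, answer

strip_think_tags("<think>4 primes below 10</think>2, 3, 5, 7.")
# -> ("4 primes below 10", "2, 3, 5, 7.")
```

This reproduces the non-streaming behavior of the server-side parser, but only after the full response has arrived; for streaming you still want --reasoning-parser.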

Example: Math Problem Solver

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def solve_math(problem: str) -> dict:
    """Solve a math problem and return reasoning + answer."""
    response = client.chat.completions.create(
        model="default",
        messages=[
            {"role": "system", "content": "You are a math tutor. Show your work."},
            {"role": "user", "content": problem}
        ],
        temperature=0.2
    )

    message = response.choices[0].message
    return {
        "problem": problem,
        "work": message.reasoning,
        "answer": message.content
    }

result = solve_math("If a train travels 120 km in 2 hours, what is its average speed?")
print(f"Problem: {result['problem']}")
print(f"\nWork shown:\n{result['work']}")
print(f"\nFinal answer: {result['answer']}")

Curl Examples

Non-streaming

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "What is 15% of 80?"}]
  }'

Streaming

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "What is 15% of 80?"}],
    "stream": true
  }'

Troubleshooting

No reasoning field in response

  • Make sure you started the server with --reasoning-parser
  • Check that the model actually uses thinking tags (not all prompts trigger reasoning)

Reasoning appears in content

  • The model may not be using the expected tag format
  • Try a different parser (qwen3 vs deepseek_r1)

Truncated reasoning

  • Increase --max-tokens if the model is hitting the token limit mid-thought

Related