[llmapi] NemotronV3ReasoningParser returns empty content when enable_thinking=False #14502

@zentradev-rabih

System Info

  • TensorRT-LLM version: 1.3.0rc9
  • Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (revision 4f0cf9d)
  • Reasoning parser: --reasoning_parser nano-v3
  • Tool parser: --tool_parser qwen3_coder
  • Backend: pytorch
  • GPU: NVIDIA RTX PRO 6000 Blackwell (96 GB, SM120)
  • Inference cmd: trtllm-serve serve <model> --backend pytorch --reasoning_parser nano-v3 --tool_parser qwen3_coder --enable_chunked_prefill --trust_remote_code ...

Description

NemotronV3ReasoningParser (registered as nano-v3 in tensorrt_llm/llmapi/reasoning_parser.py) returns empty content when a request is sent with chat_template_kwargs: {"enable_thinking": false} and the model either leaks tokens into the reasoning stream or fails to emit a closing </think> tag.

The parser inherits DeepSeekR1Parser's default behavior of routing tokens to reasoning_content until a </think> is seen. In thinking-disabled mode the model shouldn't be producing reasoning at all — but in practice it occasionally does (model behavior, not parser behavior). The current parser then returns:

  • content: empty string
  • reasoning_content: the actual answer

OpenAI-compatible chat clients render only content, so the user sees a blank response.
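To make the failure mode concrete, here is a minimal sketch of the "route everything to reasoning until </think>" behavior described above. It is a simplified stand-in, not the TensorRT-LLM implementation: the names split_until_think_end and ParsedOutput are hypothetical, and the real parser also handles streaming deltas.

```python
from typing import NamedTuple

THINK_END = "</think>"


class ParsedOutput(NamedTuple):
    content: str
    reasoning_content: str


def split_until_think_end(text: str) -> ParsedOutput:
    """Toy model: everything before THINK_END is treated as reasoning."""
    head, sep, tail = text.partition(THINK_END)
    if not sep:
        # No closing tag was ever seen, so the whole output stays in reasoning.
        return ParsedOutput(content="", reasoning_content=text)
    return ParsedOutput(content=tail.strip(), reasoning_content=head.strip())


# Thinking disabled, model answers directly and never emits </think>.
leaked = split_until_think_end("17 * 23 = 391")
# Thinking enabled, model closes the reasoning block as expected.
normal = split_until_think_end("17 * 20 + 17 * 3</think>391")
```

In the first case the answer ends up only in reasoning_content, which is the blank-response symptom reported here.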

Why this is a bug, not a feature

NemotronV3ReasoningParser already has a force_nonempty_content mechanism that does exactly the right thing — but it requires the caller to opt in via chat_template_kwargs.force_nonempty_content=True. That's a fine power-user toggle, but it doesn't match the documented intent of enable_thinking=False.

For comparison, the HuggingFace model card for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 ships a vLLM reasoning parser (super_v3_reasoning_parser.py, revision 4f0cf9d) that triggers the same swap when EITHER force_nonempty_content=True OR enable_thinking=False. So the TRT-LLM parser is out of sync with the model's documented inference behavior.

Reproduction (occurs ~5-15% of the time in our environment)

# Spin up trtllm-serve with Nemotron-3-Super-120B-NVFP4 + --reasoning_parser nano-v3
out=$(mktemp)
for i in $(seq 1 20); do
  curl -ksf https://localhost:8443/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"nemotron-super-120b-nvfp4",
         "messages":[{"role":"user","content":"What is 17 * 23?"}],
         "max_tokens":80,
         "chat_template_kwargs":{"enable_thinking":false}}' \
    | jq -r '(.choices[0].message.content // "") | length' >> "$out" &
done
wait
# Count empty responses (one line per request; 0 means content was empty or null)
grep -c '^0$' "$out"
rm -f "$out"

Without the fix, 1-3 of 20 responses come back with empty content while reasoning_content has the answer. With the proposed fix, 20/20 return content.
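For anyone who prefers to check saved responses offline, the same symptom can be detected with a small helper. This is only a sketch: it assumes each response is an already-parsed chat-completion dict with reasoning_content sitting next to content under choices[0].message, and the function names are hypothetical.

```python
from typing import Any, Iterable, Mapping


def is_blank_with_leaked_answer(response: Mapping[str, Any]) -> bool:
    """True when content is empty but reasoning_content carries text."""
    message = response["choices"][0]["message"]
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content") or ""
    return not content.strip() and bool(reasoning.strip())


def count_affected(responses: Iterable[Mapping[str, Any]]) -> int:
    """Count responses that show the empty-content symptom."""
    return sum(1 for r in responses if is_blank_with_leaked_answer(r))


# Hand-written sample payloads for illustration, not captured server output.
samples = [
    {"choices": [{"message": {"content": "391", "reasoning_content": ""}}]},
    {"choices": [{"message": {"content": "", "reasoning_content": "391"}}]},
    {"choices": [{"message": {"content": None, "reasoning_content": "391"}}]},
]
affected = count_affected(samples)
```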

Proposed fix

Extend the swap-gate in NemotronV3ReasoningParser so it ALSO fires when chat_template_kwargs.enable_thinking is False. Same semantics as force_nonempty_content=True, just gated on an additional condition.

Four touch points, all inside the same class in tensorrt_llm/llmapi/reasoning_parser.py:

  1. __init__: read enable_thinking flag, store as self._enable_thinking_is_false
  2. _maybe_swap_content: extend gate to (self._force_nonempty_content or self._enable_thinking_is_false)
  3. finish: same gate extension on the missing-closing-tag branch
  4. parse_delta: same gate extension on the accumulator

Behavior when thinking is ENABLED — unchanged.
Behavior when force_nonempty_content=True — unchanged.
The new branch only fires when enable_thinking=False AND the parser would otherwise return empty content.
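The gate extension can be sketched roughly as follows. This is not the 47-line diff and not the real class: SwapGateSketch, the constructor signature, and the method signatures are assumptions made for illustration, and only the attribute names and the gate expression come from the list above. What happens to reasoning_content after the swap (cleared here) is also an assumption.

```python
from typing import Any, Mapping, Optional, Tuple


class SwapGateSketch:
    """Hypothetical stand-in for the gate logic, not the TensorRT-LLM class."""

    def __init__(self, chat_template_kwargs: Optional[Mapping[str, Any]] = None):
        kwargs = chat_template_kwargs or {}
        self._force_nonempty_content = kwargs.get("force_nonempty_content") is True
        # Only an explicit False counts; a missing key leaves thinking enabled.
        self._enable_thinking_is_false = kwargs.get("enable_thinking") is False

    def _should_swap(self) -> bool:
        # The extended gate: either opt-in flag or thinking explicitly disabled.
        return self._force_nonempty_content or self._enable_thinking_is_false

    def _maybe_swap_content(
        self, content: str, reasoning_content: str
    ) -> Tuple[str, str]:
        """Move reasoning into content only when content would be empty."""
        if self._should_swap() and not content and reasoning_content:
            return reasoning_content, ""
        return content, reasoning_content


disabled = SwapGateSketch({"enable_thinking": False})
enabled = SwapGateSketch({"enable_thinking": True})
swapped = disabled._maybe_swap_content("", "391")
untouched = enabled._maybe_swap_content("", "391")
```

Checking `is False` rather than falsiness keeps requests that omit enable_thinking on the existing path, which matches the "unchanged when thinking is enabled" guarantee.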

47-line unified diff (no incidental edits — strictly within NemotronV3ReasoningParser) ready to submit. Happy to open a PR with unit tests as soon as this issue is approved.

Worth considering during review

  • Should the fix also extend to KimiK2ReasoningParser (also inherits DeepSeekR1Parser)? Kimi K2 doesn't expose an enable_thinking chat-template flag today, so unclear if it has the equivalent problem — happy to investigate as a follow-up.
  • Alternative API: instead of adding a new instance var, the constructor could just default self._force_nonempty_content = True when enable_thinking is False. Less code, same effect, but force_nonempty_content then becomes overloaded with two meanings. Open to either direction.
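The alternative in the second bullet could look something like this. It is a sketch only, with a hypothetical helper name (resolve_force_nonempty) and an assumed kwargs shape; it shows how the single flag would be derived in the constructor.

```python
from typing import Any, Mapping, Optional


def resolve_force_nonempty(
    chat_template_kwargs: Optional[Mapping[str, Any]] = None,
) -> bool:
    """Derive one flag: explicit opt-in, or thinking explicitly disabled."""
    kwargs = chat_template_kwargs or {}
    if kwargs.get("enable_thinking") is False:
        # Disabling thinking implies the swap, even without the opt-in flag.
        return True
    return bool(kwargs.get("force_nonempty_content", False))


implied = resolve_force_nonempty({"enable_thinking": False})
default = resolve_force_nonempty({})
overridden = resolve_force_nonempty(
    {"enable_thinking": False, "force_nonempty_content": False}
)
```

The last call shows the overloading concern: a caller who explicitly passes force_nonempty_content=False still gets the swap once enable_thinking is False.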

Environment evidence

Patched our local install of tensorrt_llm/llmapi/reasoning_parser.py with the proposed fix on 2026-05-21. Smoke verified: PING/PONG and a 17×23 prompt both return non-empty content with enable_thinking=False. Behavior with enable_thinking=True unchanged (clean <think>...</think> framing, correct math answer). No regression in force_nonempty_content=True path.
