System Info
- TensorRT-LLM version: 1.3.0rc9
- Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (revision 4f0cf9d)
- Reasoning parser: --reasoning_parser nano-v3
- Tool parser: --tool_parser qwen3_coder
- Backend: pytorch
- GPU: NVIDIA RTX PRO 6000 Blackwell (96 GB, SM120)
- Inference cmd: trtllm-serve serve <model> --backend pytorch --reasoning_parser nano-v3 --tool_parser qwen3_coder --enable_chunked_prefill --trust_remote_code ...
Description
NemotronV3ReasoningParser (registered as nano-v3 in tensorrt_llm/llmapi/reasoning_parser.py) returns empty content when a request is sent with chat_template_kwargs: {"enable_thinking": false} and the model either leaks tokens into the reasoning stream or fails to emit a closing </think> tag.
The parser inherits DeepSeekR1Parser's default behavior of routing tokens to reasoning_content until a </think> is seen. In thinking-disabled mode the model shouldn't be producing reasoning at all — but in practice it occasionally does (model behavior, not parser behavior). The current parser then returns:
- content: empty string
- reasoning_content: the actual answer
OpenAI-compatible chat clients render only content, so the user sees a blank response.
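The failure mode can be illustrated with a simplified stand-in for the routing rule described above. This is a sketch, not the TensorRT-LLM implementation: `route_until_close_tag` and `ParsedOutput` are hypothetical names, and the real parser works on streamed deltas rather than a complete string.

```python
from typing import NamedTuple


class ParsedOutput(NamedTuple):
    content: str
    reasoning_content: str


def route_until_close_tag(text: str, close_tag: str = "</think>") -> ParsedOutput:
    """Hypothetical stand-in: everything before close_tag counts as reasoning."""
    head, sep, tail = text.partition(close_tag)
    if not sep:
        # The closing tag never appeared, so the whole output is
        # classified as reasoning and content stays empty.
        return ParsedOutput(content="", reasoning_content=text)
    return ParsedOutput(content=tail.strip(), reasoning_content=head.strip())


# An output that contains the closing tag versus one that does not.
with_tag = route_until_close_tag("</think>17 * 23 = 391")
missing_tag = route_until_close_tag("17 * 23 = 391")
print(with_tag)
print(missing_tag)
```

In the second case the answer is present in the response, but only under reasoning_content, which is the blank-response symptom described above.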
Why this is a bug, not a feature
NemotronV3ReasoningParser already has a force_nonempty_content mechanism that does exactly the right thing — but it requires the caller to opt in via chat_template_kwargs.force_nonempty_content=True. That's a fine power-user toggle, but it doesn't match the documented intent of enable_thinking=False.
For comparison, the HuggingFace model card for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 ships a vLLM reasoning parser (super_v3_reasoning_parser.py, revision 4f0cf9d) that triggers the same swap when EITHER force_nonempty_content=True OR enable_thinking=False. So the TRT-LLM parser is out of sync with the model's documented inference behavior.
Reproduction (occurs ~5-15% of the time in our environment)
# Spin up trtllm-serve with Nemotron-3-Super-120B-NVFP4 + --reasoning_parser nano-v3
out=$(mktemp)
for i in $(seq 1 20); do
  curl -ksf https://localhost:8443/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"nemotron-super-120b-nvfp4",
         "messages":[{"role":"user","content":"What is 17 * 23?"}],
         "max_tokens":80,
         "chat_template_kwargs":{"enable_thinking":false}}' \
    | jq -r 'if (.choices[0].message.content // "") == "" then "EMPTY" else "OK" end' >> "$out" &
done
wait
# Count empty responses
grep -c '^EMPTY$' "$out"
Without the fix, 1-3 of 20 responses come back with empty content while reasoning_content has the answer. With the proposed fix, 20/20 return content.
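To separate "answer landed in reasoning_content" from "no answer at all" when tallying results, a small client-side classifier can help. The sketch below assumes the response shape used in this report (choices[0].message with content and reasoning_content fields). The helper names `classify` and `tally` are hypothetical, and the sample responses are hand-written illustrations, not captured server output.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def classify(response: Dict[str, Any]) -> str:
    """Label one chat completion response by where the answer ended up."""
    message = response["choices"][0]["message"]
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content") or ""
    if content.strip():
        return "ok"
    # Empty content: distinguish the swap case from a truly empty reply.
    return "answer_in_reasoning" if reasoning.strip() else "empty"


def tally(responses: Iterable[Dict[str, Any]]) -> Counter:
    return Counter(classify(r) for r in responses)


# Hand-written examples of the two shapes discussed in this issue.
sample = [
    {"choices": [{"message": {"content": "391", "reasoning_content": ""}}]},
    {"choices": [{"message": {"content": "", "reasoning_content": "391"}}]},
]
print(tally(sample))
```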
Proposed fix
Extend the swap-gate in NemotronV3ReasoningParser so it ALSO fires when chat_template_kwargs.enable_thinking is False. Same semantics as force_nonempty_content=True, just gated on an additional condition.
Four touch points, all inside the same class in tensorrt_llm/llmapi/reasoning_parser.py:
- __init__: read enable_thinking flag, store as self._enable_thinking_is_false
- _maybe_swap_content: extend gate to (self._force_nonempty_content or self._enable_thinking_is_false)
- finish: same gate extension on the missing-closing-tag branch
- parse_delta: same gate extension on the accumulator
Behavior when thinking is ENABLED — unchanged.
Behavior when force_nonempty_content=True — unchanged.
The new branch only fires when enable_thinking=False AND the parser would otherwise return empty content.
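The gate extension described above can be sketched as follows. This is an illustration of the intended semantics, not the actual diff: `SwapGateSketch` is a hypothetical stand-in class, the attribute names follow the touch points listed above, and clearing reasoning_content after the swap is an assumption made for the example.

```python
from typing import Any, Dict, Optional, Tuple


class SwapGateSketch:
    """Illustrative only; not the real NemotronV3ReasoningParser."""

    def __init__(self, chat_template_kwargs: Optional[Dict[str, Any]] = None) -> None:
        kwargs = chat_template_kwargs or {}
        self._force_nonempty_content = kwargs.get("force_nonempty_content") is True
        # Assumption: only an explicit False counts, so an absent flag
        # leaves the existing behavior untouched.
        self._enable_thinking_is_false = kwargs.get("enable_thinking") is False

    def _should_swap(self) -> bool:
        return self._force_nonempty_content or self._enable_thinking_is_false

    def maybe_swap_content(self, content: str, reasoning_content: str) -> Tuple[str, str]:
        # Swap only when the gate is open and content would otherwise be empty.
        if self._should_swap() and not content and reasoning_content:
            return reasoning_content, ""
        return content, reasoning_content


thinking_off = SwapGateSketch({"enable_thinking": False})
thinking_on = SwapGateSketch({"enable_thinking": True})
print(thinking_off.maybe_swap_content("", "391"))
print(thinking_on.maybe_swap_content("", "391"))
```

With thinking enabled, or with no flags at all, the pair is returned unchanged, which matches the "unchanged" guarantees listed above.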
A 47-line unified diff (no incidental edits, strictly within NemotronV3ReasoningParser) is ready to submit. Happy to open a PR with unit tests as soon as this issue is approved.
Worth considering during review
- Should the fix also extend to KimiK2ReasoningParser (also inherits DeepSeekR1Parser)? Kimi K2 doesn't expose an enable_thinking chat-template flag today, so it's unclear whether it has the equivalent problem. Happy to investigate as a follow-up.
- Alternative API: instead of adding a new instance variable, the constructor could just default self._force_nonempty_content = True when enable_thinking is False. Less code, same effect, but force_nonempty_content then becomes overloaded with two meanings. Open to either direction.
Environment evidence
Patched our local install of tensorrt_llm/llmapi/reasoning_parser.py with the proposed fix on 2026-05-21. Smoke verified: PING/PONG and a 17×23 prompt both return non-empty content with enable_thinking=False. Behavior with enable_thinking=True unchanged (clean <think>...</think> framing, correct math answer). No regression in force_nonempty_content=True path.