[llmapi] NemotronV3ReasoningParser returns empty content when enable_thinking=False #14502

## System Info

- TensorRT-LLM version: 1.3.0rc9
- Model: `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4` (revision `4f0cf9d`)
- Reasoning parser: `--reasoning_parser nano-v3`
- Tool parser: `--tool_parser qwen3_coder`
- Backend: pytorch
- GPU: NVIDIA RTX PRO 6000 Blackwell (96 GB, SM120)
- Inference cmd: `trtllm-serve serve <model> --backend pytorch --reasoning_parser nano-v3 --tool_parser qwen3_coder --enable_chunked_prefill --trust_remote_code ...`

## Description

`NemotronV3ReasoningParser` (registered as `nano-v3` in `tensorrt_llm/llmapi/reasoning_parser.py`) returns empty `content` when a request sets `chat_template_kwargs: {"enable_thinking": false}` and the model either leaks tokens into the reasoning stream or fails to emit a closing `</think>` tag.

The parser inherits `DeepSeekR1Parser`'s default behavior of routing tokens to `reasoning_content` until a `</think>` is seen. In thinking-disabled mode the model shouldn't be producing reasoning at all — but in practice it occasionally does (model behavior, not parser behavior). The current parser then returns:

- `content`: empty string
- `reasoning_content`: the actual answer

OpenAI-compatible chat clients render only `content`, so the user sees a blank response.
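The failure mode can be illustrated with a toy stand-in for the "route to `reasoning_content` until `</think>`" rule. This is a minimal sketch, not the TRT-LLM implementation: `ParsedMessage` and `parse_until_think_end` are hypothetical names, and the sample model outputs are assumed for illustration.

```python
from dataclasses import dataclass

THINK_END = "</think>"


@dataclass
class ParsedMessage:
    content: str
    reasoning_content: str


def parse_until_think_end(text: str) -> ParsedMessage:
    """Toy rule: text before </think> is reasoning, text after it is content."""
    if THINK_END in text:
        reasoning, content = text.split(THINK_END, 1)
        return ParsedMessage(content=content.strip(),
                             reasoning_content=reasoning.strip())
    # No closing tag seen: everything stays in the reasoning stream.
    return ParsedMessage(content="", reasoning_content=text.strip())


# Assumed well-formed output: the closing tag precedes the answer.
well_formed = parse_until_think_end("</think>17 * 23 = 391")
# Assumed problematic output: the answer arrives without a closing tag.
missing_tag = parse_until_think_end("17 * 23 = 391")

print(well_formed)
print(missing_tag)
```

In the second case the answer lands entirely in `reasoning_content`, which matches the blank-response symptom described above.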

## Why this is a bug, not a feature

`NemotronV3ReasoningParser` already has a `force_nonempty_content` mechanism that does exactly the right thing — but it requires the caller to opt in via `chat_template_kwargs.force_nonempty_content=True`. That's a fine power-user toggle, but it doesn't match the documented intent of `enable_thinking=False`.

For comparison, the HuggingFace model card for `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4` ships a vLLM reasoning parser (`super_v3_reasoning_parser.py`, revision `4f0cf9d`) that triggers the same swap when EITHER `force_nonempty_content=True` OR `enable_thinking=False`. So the TRT-LLM parser is out of sync with the model's documented inference behavior.

## Reproduction (occurs ~5-15% of the time in our environment)

```bash
# Spin up trtllm-serve with Nemotron-3-Super-120B-NVFP4 + --reasoning_parser nano-v3
out=$(mktemp)  # collects one line per successful request: EMPTY or OK
for i in $(seq 1 20); do
  curl -ksf https://localhost:8443/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"nemotron-super-120b-nvfp4",
         "messages":[{"role":"user","content":"What is 17 * 23?"}],
         "max_tokens":80,
         "chat_template_kwargs":{"enable_thinking":false}}' \
    | jq -r 'if (.choices[0].message.content // "") == "" then "EMPTY" else "OK" end' >> "$out" &
done
wait
# Count empty responses
echo "empty: $(grep -c '^EMPTY$' "$out") of 20"
```

Without the fix, 1-3 of 20 responses come back with empty `content` while `reasoning_content` has the answer. With the proposed fix, 20/20 return content.

## Proposed fix

Extend the swap-gate in `NemotronV3ReasoningParser` so it ALSO fires when `chat_template_kwargs.enable_thinking is False`. Same semantics as `force_nonempty_content=True`, just gated on an additional condition.

Four touch points, all inside the same class in `tensorrt_llm/llmapi/reasoning_parser.py`:

1. `__init__`: read `enable_thinking` flag, store as `self._enable_thinking_is_false`
2. `_maybe_swap_content`: extend gate to `(self._force_nonempty_content or self._enable_thinking_is_false)`
3. `finish`: same gate extension on the missing-closing-tag branch
4. `parse_delta`: same gate extension on the accumulator

Behavior when thinking is ENABLED — unchanged.
Behavior when `force_nonempty_content=True` — unchanged.
The new branch only fires when `enable_thinking=False` AND the parser would otherwise return empty content.
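The extended gate could look roughly like the sketch below. It is a standalone illustration under stated assumptions, not the actual diff: `SwapGateSketch`, `_should_swap`, and `maybe_swap` are hypothetical names, and the way `chat_template_kwargs` reaches the constructor is assumed rather than taken from the TRT-LLM source.

```python
from typing import Optional, Tuple


class SwapGateSketch:
    """Hypothetical stand-in for the gate logic, independent of TRT-LLM."""

    def __init__(self, chat_template_kwargs: Optional[dict] = None):
        kwargs = chat_template_kwargs or {}
        # Existing opt-in toggle.
        self._force_nonempty_content = kwargs.get("force_nonempty_content") is True
        # Proposed addition: true only when the caller explicitly disabled thinking.
        self._enable_thinking_is_false = kwargs.get("enable_thinking") is False

    def _should_swap(self) -> bool:
        return self._force_nonempty_content or self._enable_thinking_is_false

    def maybe_swap(self, content: str, reasoning_content: str) -> Tuple[str, str]:
        # Swap only when the gate is open AND content would otherwise be empty.
        if self._should_swap() and not content and reasoning_content:
            return reasoning_content, content
        return content, reasoning_content


thinking_off = SwapGateSketch({"enable_thinking": False})
thinking_on = SwapGateSketch({"enable_thinking": True})

print(thinking_off.maybe_swap("", "17 * 23 = 391"))
print(thinking_on.maybe_swap("", "still reasoning"))
```

Using `is False` rather than a falsy check keeps a missing `enable_thinking` key from opening the gate, so requests that never mention the flag behave as they do today.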

A 47-line unified diff (no incidental edits, strictly within `NemotronV3ReasoningParser`) is ready to submit. Happy to open a PR with unit tests as soon as this issue is approved.

## Worth considering during review

- Should the fix also extend to `KimiK2ReasoningParser` (also inherits `DeepSeekR1Parser`)? Kimi K2 doesn't expose an `enable_thinking` chat-template flag today, so unclear if it has the equivalent problem — happy to investigate as a follow-up.
- Alternative API: instead of adding a new instance var, the constructor could just default `self._force_nonempty_content = True` when `enable_thinking is False`. Less code, same effect, but `force_nonempty_content` then becomes overloaded with two meanings. Open to either direction.
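For the second bullet, one possible reading of "default `force_nonempty_content` to `True`" is sketched below. The helper name `resolve_force_nonempty_content` is hypothetical, and treating an explicit caller value as taking precedence is an assumption about what "default" should mean here.

```python
from typing import Optional


def resolve_force_nonempty_content(chat_template_kwargs: Optional[dict] = None) -> bool:
    """Hypothetical helper: derive the single flag from the request kwargs."""
    kwargs = chat_template_kwargs or {}
    if "force_nonempty_content" in kwargs:
        # An explicit caller value wins over the derived default.
        return bool(kwargs["force_nonempty_content"])
    # Otherwise default to True only when thinking was explicitly disabled.
    return kwargs.get("enable_thinking") is False


print(resolve_force_nonempty_content({"enable_thinking": False}))
print(resolve_force_nonempty_content({"enable_thinking": False,
                                      "force_nonempty_content": False}))
```

Under this reading a caller can still opt out with `force_nonempty_content=False` while thinking is disabled, which the OR-gate in the main proposal would not allow. That difference may be worth settling before choosing between the two designs.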

## Environment evidence

Patched our local install of `tensorrt_llm/llmapi/reasoning_parser.py` with the proposed fix on 2026-05-21. Smoke verified: `PING`/`PONG` and a 17×23 prompt both return non-empty content with `enable_thinking=False`. Behavior with `enable_thinking=True` unchanged (clean `<think>...</think>` framing, correct math answer). No regression in `force_nonempty_content=True` path.

