
[Bug]: Missing 'content' in response while 'reasoning_content' is present for GPT-OSS-120B #12030

@prakshalm

Description

System Info

NVIDIA-SMI 580.126.18
Driver Version: 580.126.18
CUDA Version: 13.0
GPU: 2 NVIDIA RTX PRO 6000
Memory-size: 97887MiB each
CPU Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit

Who can help?

I'm trying to host gpt-oss-120b on my 2x RTX PRO GPUs. Sometimes I get the 'content' key in the output and sometimes I don't. 'reasoning_content' is always present, but the final answer (i.e. 'content') is missing. What could the cause be? I read that reducing stream_interval would help, but the behavior is still very inconsistent.

MODEL_DIR="/workspace/models/gpt-oss-120b"
EXTRA_CONFIG="/workspace/extra_config.yml"
TP_SIZE=2
MAX_BATCH_SIZE=64
MAX_INPUT_LEN=64000
MAX_NUM_TOKEN=64000

extra_config.yml:

disable_overlap_scheduler: true
speculative_config:
  decoding_type: Eagle3
  max_draft_len: 6
  speculative_model: /workspace/models/eagle3-draft
enable_chunked_prefill: true
stream_interval: 1

python3 -m dynamo.trtllm \
    --model-path "${MODEL_DIR}" \
    --tensor-parallel-size "${TP_SIZE}" \
    --expert-parallel-size "2" \
    --max-batch-size "${MAX_BATCH_SIZE}" \
    --max-num-tokens "${MAX_NUM_TOKEN}" \
    --max-seq-len "${MAX_INPUT_LEN}" \
    --free-gpu-memory-fraction 0.85 \
    --dyn-tool-call-parser harmony \
    --store-kv etcd \
    --extra-engine-args "${EXTRA_CONFIG}" \
    --dyn-reasoning-parser gpt_oss

When sending a POST request to the server:

import requests

# Example prompt; the original value of test_prompt is not shown in the report.
test_prompt = "Hello, can you summarize this document?"

headers = {
    'Content-Type': 'application/json',
}

data = {
    "model": "/workspace/models/gpt-oss-120b",
    "messages": [
        {
            "role": "user",
            "content": test_prompt,
        }
    ],
    "max_tokens": 8192,
}

response = requests.post('http://localhost:8000/v1/chat/completions', headers=headers, json=data)
res = response.json()['choices'][0]['message']['content']  # raises KeyError when 'content' is missing
print(res)
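As a temporary client-side workaround while the root cause is investigated, the response can be read defensively with dict.get so a missing 'content' key does not raise a KeyError. This is a sketch operating on an already-parsed response body shaped like the one in "Actual behavior" below:

```python
def extract_answer(payload: dict) -> str:
    """Return the final answer ('content') if present, otherwise fall
    back to 'reasoning_content' so the caller always gets a string."""
    message = payload["choices"][0]["message"]
    content = message.get("content")
    if content is not None:
        return content
    # 'content' can be absent when only the reasoning channel was emitted
    return message.get("reasoning_content", "")

# Example with a response missing 'content', as in the report:
resp = {"choices": [{"message": {"role": "assistant",
                                 "reasoning_content": "...analysis..."}}]}
print(extract_answer(resp))  # falls back to the reasoning text
```

This only masks the symptom for logging or retry logic; it does not fix the missing final channel.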

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run:
python3 -m dynamo.trtllm \
    --model-path "${MODEL_DIR}" \
    --tensor-parallel-size "${TP_SIZE}" \
    --expert-parallel-size "2" \
    --max-batch-size "${MAX_BATCH_SIZE}" \
    --max-num-tokens "${MAX_NUM_TOKEN}" \
    --max-seq-len "${MAX_INPUT_LEN}" \
    --free-gpu-memory-fraction 0.85 \
    --dyn-tool-call-parser harmony \
    --store-kv etcd \
    --extra-engine-args "${EXTRA_CONFIG}" \
    --dyn-reasoning-parser gpt_oss

and then send the POST request shown above.
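Since the failure is intermittent, it may help to quantify how often it occurs. A small helper (hypothetical, operating on a batch of already-parsed response bodies) can report the fraction of responses that lack 'content':

```python
def missing_content_rate(responses: list[dict]) -> float:
    """Fraction of chat responses whose first choice has no 'content' key."""
    if not responses:
        return 0.0
    missing = sum(
        1 for r in responses
        if "content" not in r["choices"][0]["message"]
    )
    return missing / len(responses)

# Simulated batch: one response with 'content', one without.
batch = [
    {"choices": [{"message": {"content": "ok", "reasoning_content": "r"}}]},
    {"choices": [{"message": {"reasoning_content": "r only"}}]},
]
print(missing_content_rate(batch))  # 0.5
```

Running the same prompt N times through this helper would show whether stream_interval or the speculative-decoding settings change the failure rate.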

Expected behavior

Both the 'content' and 'reasoning_content' keys should be present inside the 'message' key of the final output.

Actual behavior

Current:

{'id': 'chatcmpl-b8a2ae2b-106d-4892-a04d-1b7ee1e8abaf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'reasoning_content': 'We need to output JSON with standard_declaration_key, confidence_score, justification. ....... So 9.\n\nConfidence_score integer 95.\n\nJustification: mention that filled and matches template 9.\n\nReturn JSON.\n\n'},
   'finish_reason': 'stop'}],
 'created': 1773046229,
 'model': '/workspace/models/gpt-oss-120b',
 'object': 'chat.completion',
 'usage': {'prompt_tokens': 3027,
  'completion_tokens': 602,
  'total_tokens': 3629,
  'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': 3027}}}

Additional notes

Using PyTorch as the backend.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Labels

  • Inference runtime<NV>: General operational aspects of TRTLLM execution not in other categories.
  • Pytorch<NV>: Pytorch backend related issues
  • bug: Something isn't working
