Description
System Info
NVIDIA-SMI 580.126.18
Driver Version: 580.126.18
CUDA Version: 13.0
GPU: 2 NVIDIA RTX PRO 6000
Memory-size: 97887MiB each
CPU Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Who can help?
I'm trying to host gpt-oss-120b on my 2x RTX PRO 6000 GPUs. Sometimes I get the 'content' key in the output and sometimes not. The 'reasoning_content' key is always there, but the final answer, i.e. 'content', is missing. What could be the issue? I read that reducing stream_interval would help, but it is still very inconsistent.
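One way to narrow this down is to request the completion with "stream": true and accumulate the per-channel deltas, to check whether the final channel ever emits tokens or the stream simply ends after the reasoning channel. A minimal sketch of the accumulation step; the chunk shape below mirrors OpenAI-style streaming chunks and is an assumption, and accumulate_deltas is my own helper name:

```python
# Hypothetical helper: collect streamed delta text per channel so we can
# see whether 'content' ever receives tokens or only 'reasoning_content' does.
# Chunk shape assumed to follow the OpenAI-style chat.completion.chunk format.

def accumulate_deltas(chunks):
    """Sum streamed delta text into 'reasoning_content' and 'content' buckets."""
    out = {"reasoning_content": "", "content": ""}
    for chunk in chunks:
        delta = chunk.get("choices", [{}])[0].get("delta", {})
        for key in out:
            piece = delta.get(key)
            if piece:
                out[key] += piece
    return out

# Example with two reasoning deltas and no final-answer delta -- the symptom
# described above would show up as an empty 'content' string.
sample = [
    {"choices": [{"delta": {"reasoning_content": "We need to output JSON"}}]},
    {"choices": [{"delta": {"reasoning_content": " with the score."}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
print(accumulate_deltas(sample))
```

If 'content' stays empty across the whole stream, the final channel is being dropped server-side rather than lost in client parsing.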
MODEL_DIR="/workspace/models/gpt-oss-120b"
EXTRA_CONFIG="/workspace/extra_config.yml"
TP_SIZE=2
MAX_BATCH_SIZE=64
MAX_INPUT_LEN=64000
MAX_NUM_TOKEN=64000
extra_config.yml:
disable_overlap_scheduler: true
speculative_config:
  decoding_type: Eagle3
  max_draft_len: 6
  speculative_model: /workspace/models/eagle3-draft
enable_chunked_prefill: true
stream_interval: 1
python3 -m dynamo.trtllm \
  --model-path "${MODEL_DIR}" \
  --tensor-parallel-size "${TP_SIZE}" \
  --expert-parallel-size "2" \
  --max-batch-size "${MAX_BATCH_SIZE}" \
  --max-num-tokens "${MAX_NUM_TOKEN}" \
  --max-seq-len "${MAX_INPUT_LEN}" \
  --free-gpu-memory-fraction 0.85 \
  --dyn-tool-call-parser harmony \
  --store-kv etcd \
  --extra-engine-args "${EXTRA_CONFIG}" \
  --dyn-reasoning-parser gpt_oss
When sending a POST request to the server:
import json
import requests
headers = {
'Content-Type': 'application/json',
}
data = {
"model": "/workspace/models/gpt-oss-120b",
"messages": [
{
"role": "user",
"content": test_prompt,
}],
"max_tokens": 8192,
}
response = requests.post('http://localhost:8000/v1/chat/completions', headers=headers, json=data)
res = response.json()['choices'][0]['message']['content']
print(res)
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Run:
python3 -m dynamo.trtllm \
  --model-path "${MODEL_DIR}" \
  --tensor-parallel-size "${TP_SIZE}" \
  --expert-parallel-size "2" \
  --max-batch-size "${MAX_BATCH_SIZE}" \
  --max-num-tokens "${MAX_NUM_TOKEN}" \
  --max-seq-len "${MAX_INPUT_LEN}" \
  --free-gpu-memory-fraction 0.85 \
  --dyn-tool-call-parser harmony \
  --store-kv etcd \
  --extra-engine-args "${EXTRA_CONFIG}" \
  --dyn-reasoning-parser gpt_oss
and then send the POST request shown above.
Expected behavior
Both the 'content' and 'reasoning_content' keys should be present inside the 'message' key of the final output.
Actual behavior
Current:
{'id': 'chatcmpl-b8a2ae2b-106d-4892-a04d-1b7ee1e8abaf',
'choices': [{'index': 0,
'message': {'role': 'assistant',
'reasoning_content': 'We need to output JSON with standard_declaration_key, confidence_score, justification. ....... So 9.\n\nConfidence_score integer 95.\n\nJustification: mention that filled and matches template 9.\n\nReturn JSON.\n\n'},
'finish_reason': 'stop'}],
'created': 1773046229,
'model': '/workspace/models/gpt-oss-120b',
'object': 'chat.completion',
'usage': {'prompt_tokens': 3027,
'completion_tokens': 602,
'total_tokens': 3629,
'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': 3027}}}
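On a response like the one above, the reproduction script's res = response.json()['choices'][0]['message']['content'] raises KeyError, since 'content' is absent entirely rather than null. A defensive client-side read that makes the failure observable while debugging; this is a workaround sketch, not the intended output, and read_answer is my own hypothetical helper:

```python
# Hypothetical defensive read: returns (content, reasoning_content) and
# tolerates the missing-'content' case instead of raising KeyError.

def read_answer(response_json):
    message = response_json["choices"][0]["message"]
    # .get() returns None when the final channel was dropped by the server.
    content = message.get("content")
    reasoning = message.get("reasoning_content", "")
    return content, reasoning

# Shape taken from the actual-behavior payload above (trimmed for brevity).
resp = {
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "reasoning_content": "We need to output JSON ...",
        },
        "finish_reason": "stop",
    }]
}

answer, reasoning = read_answer(resp)
print(answer)  # prints "None" when the bug reproduces
```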
additional notes
Using PyTorch as the backend.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.