[Bug] gpt-oss-120b + spec + overlap scheduler issues (garbage output & server failure on "tool_choice": "required") #12567

@harrisonlimh

Description
Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

While testing gpt-oss-120b with speculative decoding and the overlap scheduler, I noticed a few issues:

  1. Garbage output issue (potentially related to [Bug] gpt-oss outputs non-sense #12563 and [Bug] Looking to get an understanding for garbage output issue on concurrent XGrammar usage (SGLang, vLLM) mlc-ai/xgrammar#460)
  2. Using "tool_choice": "required" crashes the server

Logs

  1. First issue: garbage / low-quality output
$ curl -X POST      -H "Content-Type: application/json"      http://localhost:7080/v1/chat/completions      -d '{
         "model": "openai/gpt-oss-120b",
         "messages": [
           {"role": "user", "content": "Say Hello to John"}
         ],
         "tools": [
           {
             "type": "function",
             "function": {
               "name": "get_weather",
               "description": "Get the current weather for a specific location.",
               "parameters": {
                 "type": "object",
                 "properties": {
                   "location": {
                     "type": "string",
                     "description": "The city and state, e.g., San Francisco, CA"
                   },
                   "unit": {
                     "type": "string",
                     "enum": ["celsius", "fahrenheit"],
                     "description": "The unit for temperature"
                   }
                 },
                 "required": ["location"]
               }
             }
           }
         ]
       }'
{"id":"f7b96555bd58470eb2df210dca4f3efe","object":"chat.completion","created":1762196179,"model":"openai/gpt-oss-120b","choices":[{"index":0,"message":{"role":"assistant","content":"Hello, John!!.! \n\n\n\n?  (unint \n. \n\n.\n\n---\n\n---\n\n\n\n--- \n\n---. \n\n  #\n\n .. \n\n---\nThe user asked: \"\"? .... This seems nonsensical. Ap....\n\nWait? It looks like. ....\n\n\n\n\n\n\n\n\n\nProbably.\n\n() \n\nI shouldn't...\n\n\n\n Ap...\n\n? \n\n\n\n \n\n--- \n\n\n \n\n\nI\n\n\n\n \n\nMy... \n\n\n\n--- \n\n\n\n\n\n\n\n\n \nAp \n\n...<|end|><|start|>assistant<|channel|>final<|message|>Hello, John!","reasoning_content":". The user wants. Simple greeting. Answer.","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":200002}],"usage":{"prompt_tokens":155,"total_tokens":391,"completion_tokens":236,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
  2. Using "tool_choice": "required" crashes the server

Client log

$ curl -X POST \
     -H "Content-Type: application/json" \
     http://localhost:7080/v1/chat/completions \
     -d '{
         "model": "openai/gpt-oss-120b",
         "messages": [
           {"role": "user", "content": "Whats the weather in London?"}
         ],
         "tools": [
           {
             "type": "function",
             "function": {
               "name": "get_weather",
               "description": "Get the current weather for a specific location.",
               "parameters": {
                 "type": "object",
                 "properties": {
                   "location": {
                     "type": "string",
                     "description": "The city and state, e.g., San Francisco, CA"
                   },
                   "unit": {
                     "type": "string",
                     "enum": ["celsius", "fahrenheit"],
                     "description": "The unit for temperature"
                   }
                 },
                 "required": ["location"]
               }
             }
           }
         ],
         "stream": true,
         "tool_choice": "required"
       }'
curl: (7) Failed to connect to localhost port 7080 after 0 ms: Couldn't connect to server

Server logs

[2025-11-03 18:48:33] INFO:     127.0.0.1:34972 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-11-03 18:48:33] Receive: obj=GenerateReqInput(rid='5c0869f04dbd4b71b83da3e3298f57d8', http_worker_ipc=None, video_data=None, sampling_params={'temperature': 1.0, 'max_new_tokens': None, 'min_new_tokens': 0, 'stop': None, 'stop_token_ids': None, 'stop_regex': None, 'top_p': 1.0, 'top_k': 50, 'min_p': 0.0, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'regex': None, 'ebnf': None, 'n': 1, 'no_stop_trim': False, 'ignore_eos': False, 'skip_special_tokens': False, 'logit_bias': None, 'custom_params': None, 'json_schema': '{"type": "array", "minItems": 1, "items": {"type": "object", "anyOf": [{"properties": {"name": {"type": "string", "enum": ["get_weather"]}, "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "The city and state, e.g., San Francisco, CA"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit for temperature"}}, "required": ["location"]}}, "required": ["name", "parameters"]}]}}'}, return_logprob=False, logprob_start_len=-1, top_logprobs_num=0, token_ids_logprob=None, return_text_in_logprobs=True, stream=True, log_metrics=True, return_hidden_states=False, modalities=[], session_params=None, lora_id=None, custom_logit_processor=None, bootstrap_host=None, bootstrap_port=None, bootstrap_room=None, bootstrap_pair_key=None, data_parallel_rank=None, background=False, conversation_id=None, priority=None, extra_key=None, no_logs=False, custom_labels=None, return_bytes=False, return_entropy=False)
[2025-11-03 18:48:33 TP0] Prefill batch, #new-seq: 1, #new-token: 64, #cached-token: 128, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-11-03 18:48:34 TP7] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2810, in run_scheduler_process
    scheduler.event_loop_overlap()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 999, in event_loop_overlap
    batch_result = self.run_batch(batch)
                   ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1987, in run_batch
    batch_result = self.model_worker.forward_batch_generation(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker_v2.py", line 569, in forward_batch_generation
    batch_output.next_draft_input = self.draft_worker._draft_extend_for_prefill(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker_v2.py", line 412, in _draft_extend_for_prefill
    (input_ids[1:], next_token_ids[i].reshape(1))
                    ~~~~~~~~~~~~~~^^^
TypeError: 'NoneType' object is not subscriptable

[2025-11-03 18:48:34] SIGQUIT received. signum=None, frame=None. It usually means one child failed.

(TP0–TP6 each hit the same exception at the same call site; their tracebacks are identical to the TP7 traceback above and are omitted for brevity.)
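The tracebacks all point at `eagle_worker_v2.py:412`, where `next_token_ids[i]` is indexed while `next_token_ids` is `None` — i.e. the overlap scheduler appears to hand the draft worker a batch result whose sampled-token tensor has not been materialized yet. Below is a minimal standalone sketch of that failure mode; `draft_extend_for_prefill` is a stand-in for the real SGLang method, not its actual implementation.

```python
# Stand-in for eagle_worker_v2.py's _draft_extend_for_prefill: it assumes the
# target model's sampled token ids are already available. Under the overlap
# scheduler that assumption can break, and indexing None raises exactly the
# TypeError seen in the server logs above.
def draft_extend_for_prefill(next_token_ids, i):
    # Mirrors line 412: (input_ids[1:], next_token_ids[i].reshape(1))
    return next_token_ids[i]

try:
    draft_extend_for_prefill(None, 0)  # token ids never filled in
except TypeError as e:
    print(e)  # 'NoneType' object is not subscriptable
```

This reproduces the exception message byte-for-byte, which suggests a missing synchronization (or a missing None guard) between the target worker's sampling step and the draft-extend path, rather than a problem in the request itself.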

Without the speculative-decoding settings, we get the following:

  1. Reasonable output, without the excess tokens
curl -X POST      -H "Content-Type: application/json"      http://localhost:7080/v1/chat/completions      -d '{
         "model": "openai/gpt-oss-120b",
         "messages": [
           {"role": "user", "content": "Say Hello to John"}
         ],
         "tools": [
           {
             "type": "function",
             "function": {
               "name": "get_weather",
               "description": "Get the current weather for a specific location.",
               "parameters": {
                 "type": "object",
                 "properties": {
                   "location": {
                     "type": "string",
                     "description": "The city and state, e.g., San Francisco, CA"
                   },
                   "unit": {
                     "type": "string",
                     "enum": ["celsius", "fahrenheit"],
                     "description": "The unit for temperature"
                   }
                 },
                 "required": ["location"]
               }
             }
           }
         ]
       }'
{"id":"39722d4afb4e44f4bb5606fb995fd73b","object":"chat.completion","created":1762196874,"model":"openai/gpt-oss-120b","choices":[{"index":0,"message":{"role":"assistant","content":"Hello, John!","reasoning_content":"The user says \"Say Hello to John\". The assistant should respond with a greeting to John.\n\nWe just need to output \"Hello, John!\" or similar.","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":200002}],"usage":{"prompt_tokens":155,"total_tokens":201,"completion_tokens":46,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
  2. "tool_choice": "required" does not crash the server
curl -X POST      -H "Content-Type: application/json"      http://localhost:7080/v1/chat/completions      -d '{
         "model": "openai/gpt-oss-120b",
         "messages": [
           {"role": "user", "content": "Whats the weather in London?"}
         ],
         "tools": [
           {
             "type": "function",
             "function": {
               "name": "get_weather",
               "description": "Get the current weather for a specific location.",
               "parameters": {
                 "type": "object",
                 "properties": {
                   "location": {
                     "type": "string",
                     "description": "The city and state, e.g., San Francisco, CA"
                   },
                   "unit": {
                     "type": "string",
                     "enum": ["celsius", "fahrenheit"],
                     "description": "The unit for temperature"
                   }
                 },
                 "required": ["location"]
               }
             }
           }
         ],
         "tool_choice": "required"
       }'
{"id":"1734497ddb274729b6ee2c90ea94c1d0","object":"chat.completion","created":1762196900,"model":"openai/gpt-oss-120b","choices":[{"index":0,"message":{"role":"assistant","content":null,"reasoning_content":"Need to call get_weather function with location \"London\". Probably need unit default (maybe celsius). Use function call.","tool_calls":[{"id":"call_816f5ef7bf9b456abd7fb23d","index":0,"type":"function","function":{"name":"get_weather","arguments":"{\"location\": \"London\"}"}}]},"logprobs":null,"finish_reason":"tool_calls","matched_stop":null}],"usage":{"prompt_tokens":157,"total_tokens":212,"completion_tokens":55,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
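For reference, the server log above shows that "tool_choice": "required" is compiled into a json_schema constraint (an array of at least one tool call). The sketch below is a hand-rolled shape check, for illustration only, that the non-spec server's tool call conforms to a simplified copy of that schema; it is not SGLang's actual validation path, which goes through XGrammar.

```python
import json

# Simplified copy of the json_schema that "tool_choice": "required"
# compiled to in the server log (descriptions dropped).
schema = json.loads("""{"type": "array", "minItems": 1, "items": {"type": "object",
  "anyOf": [{"properties": {"name": {"type": "string", "enum": ["get_weather"]},
  "parameters": {"type": "object", "properties": {"location": {"type": "string"},
  "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}},
  "required": ["location"]}}, "required": ["name", "parameters"]}]}}""")

# Tool call returned by the non-spec server above.
tool_call = [{"name": "get_weather", "parameters": {"location": "London"}}]

def conforms(calls, schema):
    """Minimal shape check against the tool-call schema (illustration only)."""
    if not isinstance(calls, list) or len(calls) < schema.get("minItems", 0):
        return False
    item = schema["items"]["anyOf"][0]
    for call in calls:
        if any(k not in call for k in item["required"]):
            return False
        if call["name"] not in item["properties"]["name"]["enum"]:
            return False
        params_schema = item["properties"]["parameters"]
        if any(k not in call["parameters"] for k in params_schema["required"]):
            return False
    return True

print(conforms(tool_call, schema))  # True
```

So the constrained-generation path itself produces valid output here; the crash under the spec setting happens before the grammar ever matters, in the draft-extend step.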

Reproduction

We used B200 GPUs in the examples above.
Server startup

  • docker run -it --name "gpt_oss_120b_debug" --shm-size 256g --gpus all -v "$HOME/.cache/huggingface/:/root/.cache/huggingface" -v "/tmp:/tmp" --env "SGLANG_ENABLE_SPEC_V2=1" --ipc=host --network=host --privileged --entrypoint=bash lmsysorg/sglang:v0.5.4.post2
  • docker exec -it gpt_oss_120b_debug bash
  • Spec setting: python -m sglang.launch_server --port=7080 --model=/tmp/openai/gpt-oss-120b --trust-remote-code --tp=8 --max-queued-requests=256 --tool-call-parser=gpt-oss --reasoning-parser=gpt-oss --enable-metrics --enable-priority-scheduling --schedule-low-priority-values-first --priority-scheduling-preemption-threshold=1000 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --attention-backend trtllm_mha --max-running-requests=128 --speculative-algorithm EAGLE3 --speculative-draft-model-path=/tmp/lmsys/EAGLE3-gpt-oss-120b-bf16
  • Non-spec setting: python -m sglang.launch_server --port=7080 --model=/tmp/openai/gpt-oss-120b --trust-remote-code --tp=8 --max-queued-requests=256 --tool-call-parser=gpt-oss --reasoning-parser=gpt-oss --enable-metrics --enable-priority-scheduling --schedule-low-priority-values-first --priority-scheduling-preemption-threshold=1000

Environment

/sgl-workspace/sglang# python3 -m sglang.check_env
Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 580.65.06
PyTorch: 2.8.0+cu129
sglang: 0.5.4.post2
sgl_kernel: 0.3.16.post4
flashinfer_python: 0.4.1
triton: 3.4.0
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.3.4
aiohttp: 3.13.2
fastapi: 0.120.4
hf_transfer: 0.1.9
huggingface_hub: 0.36.0
interegular: 0.3.3
modelscope: 1.31.0
orjson: 3.11.4
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.2
pydantic: 2.12.3
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.25
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.72.0
litellm: Module Not Found
decord2: 2.0.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 0-55,112-167 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 0-55,112-167 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 0-55,112-167 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 0-55,112-167 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 56-111,168-223 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 56-111,168-223 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 56-111,168-223 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X 56-111,168-223 1 N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 1048576
