Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
 
Describe the bug
Testing gpt-oss-120b with speculative decoding and the overlap scheduler, I am noticing a few issues:
- Garbage output (potentially related to [Bug] gpt-oss outputs non-sense #12563 and [Bug] Looking to get an understanding for garbage output issue on concurrent XGrammar usage (SGLang, vLLM) mlc-ai/xgrammar#460)
- Using "tool_choice": "required" crashes the server
Logs
- First issue: garbage / low-quality output
 
$ curl -X POST \
     -H "Content-Type: application/json" \
     http://localhost:7080/v1/chat/completions \
     -d '{
         "model": "openai/gpt-oss-120b",
         "messages": [
           {"role": "user", "content": "Say Hello to John"}
         ],
         "tools": [
           {
             "type": "function",
             "function": {
               "name": "get_weather",
               "description": "Get the current weather for a specific location.",
               "parameters": {
                 "type": "object",
                 "properties": {
                   "location": {
                     "type": "string",
                     "description": "The city and state, e.g., San Francisco, CA"
                   },
                   "unit": {
                     "type": "string",
                     "enum": ["celsius", "fahrenheit"],
                     "description": "The unit for temperature"
                   }
                 },
                 "required": ["location"]
               }
             }
           }
         ]
       }'
{"id":"f7b96555bd58470eb2df210dca4f3efe","object":"chat.completion","created":1762196179,"model":"openai/gpt-oss-120b","choices":[{"index":0,"message":{"role":"assistant","content":"Hello, John!!.! \n\n\n\n?  (unint \n. \n\n.\n\n---\n\n---\n\n\n\n--- \n\n---. \n\n  #\n\n .. \n\n---\nThe user asked: \"\"? .... This seems nonsensical. Ap....\n\nWait? It looks like. ....\n\n\n\n\n\n\n\n\n\nProbably.\n\n() \n\nI shouldn't...\n\n\n\n Ap...\n\n? \n\n\n\n \n\n--- \n\n\n \n\n\nI\n\n\n\n \n\nMy... \n\n\n\n--- \n\n\n\n\n\n\n\n\n \nAp \n\n...<|end|><|start|>assistant<|channel|>final<|message|>Hello, John!","reasoning_content":". The user wants. Simple greeting. Answer.","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":200002}],"usage":{"prompt_tokens":155,"total_tokens":391,"completion_tokens":236,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
- Second issue: "tool_choice": "required" crashes the server
Client log
$ curl -X POST \
     -H "Content-Type: application/json" \
     http://localhost:7080/v1/chat/completions \
     -d '{
         "model": "openai/gpt-oss-120b",
         "messages": [
           {"role": "user", "content": "Whats the weather in London?"}
         ],
         "tools": [
           {
             "type": "function",
             "function": {
               "name": "get_weather",
               "description": "Get the current weather for a specific location.",
               "parameters": {
                 "type": "object",
                 "properties": {
                   "location": {
                     "type": "string",
                     "description": "The city and state, e.g., San Francisco, CA"
                   },
                   "unit": {
                     "type": "string",
                     "enum": ["celsius", "fahrenheit"],
                     "description": "The unit for temperature"
                   }
                 },
                 "required": ["location"]
               }
             }
           }
         ],
         "stream": true,
         "tool_choice": "required"
       }'
curl: (7) Failed to connect to localhost port 7080 after 0 ms: Couldn't connect to server
Server logs
[2025-11-03 18:48:33] INFO:     127.0.0.1:34972 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-11-03 18:48:33] Receive: obj=GenerateReqInput(rid='5c0869f04dbd4b71b83da3e3298f57d8', http_worker_ipc=None, video_data=None, sampling_params={'temperature': 1.0, 'max_new_tokens': None, 'min_new_tokens': 0, 'stop': None, 'stop_token_ids': None, 'stop_regex': None, 'top_p': 1.0, 'top_k': 50, 'min_p': 0.0, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'regex': None, 'ebnf': None, 'n': 1, 'no_stop_trim': False, 'ignore_eos': False, 'skip_special_tokens': False, 'logit_bias': None, 'custom_params': None, 'json_schema': '{"type": "array", "minItems": 1, "items": {"type": "object", "anyOf": [{"properties": {"name": {"type": "string", "enum": ["get_weather"]}, "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "The city and state, e.g., San Francisco, CA"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit for temperature"}}, "required": ["location"]}}, "required": ["name", "parameters"]}]}}'}, return_logprob=False, logprob_start_len=-1, top_logprobs_num=0, token_ids_logprob=None, return_text_in_logprobs=True, stream=True, log_metrics=True, return_hidden_states=False, modalities=[], session_params=None, lora_id=None, custom_logit_processor=None, bootstrap_host=None, bootstrap_port=None, bootstrap_room=None, bootstrap_pair_key=None, data_parallel_rank=None, background=False, conversation_id=None, priority=None, extra_key=None, no_logs=False, custom_labels=None, return_bytes=False, return_entropy=False)
[2025-11-03 18:48:33 TP0] Prefill batch, #new-seq: 1, #new-token: 64, #cached-token: 128, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-11-03 18:48:34 TP7] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2810, in run_scheduler_process
    scheduler.event_loop_overlap()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 999, in event_loop_overlap
    batch_result = self.run_batch(batch)
                   ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1987, in run_batch
    batch_result = self.model_worker.forward_batch_generation(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker_v2.py", line 569, in forward_batch_generation
    batch_output.next_draft_input = self.draft_worker._draft_extend_for_prefill(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker_v2.py", line 412, in _draft_extend_for_prefill
    (input_ids[1:], next_token_ids[i].reshape(1))
                    ~~~~~~~~~~~~~~^^^
TypeError: 'NoneType' object is not subscriptable
[The identical traceback repeats on TP0, TP1, TP2, TP3, TP4, TP5, and TP6; omitted for brevity.]
[2025-11-03 18:48:34] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
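For what it's worth, the traceback suggests next_token_ids is None by the time the grammar-constrained prefill reaches _draft_extend_for_prefill under the overlap scheduler. A minimal Python sketch of just the failure mode (the None value is my assumption about the batch state, not something I verified in the code):

# Hypothetical illustration only: indexing a None value raises exactly
# the TypeError reported on every TP rank above.
next_token_ids = None  # assumption: left unpopulated for this batch

try:
    next_token_ids[0].reshape(1)  # mirrors next_token_ids[i].reshape(1)
except TypeError as exc:
    print(exc)  # 'NoneType' object is not subscriptable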
Without the spec settings, here is what we get:
- Reasonable output without many unnecessary tokens
 
curl -X POST \
     -H "Content-Type: application/json" \
     http://localhost:7080/v1/chat/completions \
     -d '{
         "model": "openai/gpt-oss-120b",
         "messages": [
           {"role": "user", "content": "Say Hello to John"}
         ],
         "tools": [
           {
             "type": "function",
             "function": {
               "name": "get_weather",
               "description": "Get the current weather for a specific location.",
               "parameters": {
                 "type": "object",
                 "properties": {
                   "location": {
                     "type": "string",
                     "description": "The city and state, e.g., San Francisco, CA"
                   },
                   "unit": {
                     "type": "string",
                     "enum": ["celsius", "fahrenheit"],
                     "description": "The unit for temperature"
                   }
                 },
                 "required": ["location"]
               }
             }
           }
         ]
       }'
{"id":"39722d4afb4e44f4bb5606fb995fd73b","object":"chat.completion","created":1762196874,"model":"openai/gpt-oss-120b","choices":[{"index":0,"message":{"role":"assistant","content":"Hello, John!","reasoning_content":"The user says \"Say Hello to John\". The assistant should respond with a greeting to John.\n\nWe just need to output \"Hello, John!\" or similar.","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":200002}],"usage":{"prompt_tokens":155,"total_tokens":201,"completion_tokens":46,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
- "tool_choice": "required" doesn't fail the server
 
curl -X POST \
     -H "Content-Type: application/json" \
     http://localhost:7080/v1/chat/completions \
     -d '{
         "model": "openai/gpt-oss-120b",
         "messages": [
           {"role": "user", "content": "Whats the weather in London?"}
         ],
         "tools": [
           {
             "type": "function",
             "function": {
               "name": "get_weather",
               "description": "Get the current weather for a specific location.",
               "parameters": {
                 "type": "object",
                 "properties": {
                   "location": {
                     "type": "string",
                     "description": "The city and state, e.g., San Francisco, CA"
                   },
                   "unit": {
                     "type": "string",
                     "enum": ["celsius", "fahrenheit"],
                     "description": "The unit for temperature"
                   }
                 },
                 "required": ["location"]
               }
             }
           }
         ],
         "tool_choice": "required"
       }'
{"id":"1734497ddb274729b6ee2c90ea94c1d0","object":"chat.completion","created":1762196900,"model":"openai/gpt-oss-120b","choices":[{"index":0,"message":{"role":"assistant","content":null,"reasoning_content":"Need to call get_weather function with location \"London\". Probably need unit default (maybe celsius). Use function call.","tool_calls":[{"id":"call_816f5ef7bf9b456abd7fb23d","index":0,"type":"function","function":{"name":"get_weather","arguments":"{\"location\": \"London\"}"}}]},"logprobs":null,"finish_reason":"tool_calls","matched_stop":null}],"usage":{"prompt_tokens":157,"total_tokens":212,"completion_tokens":55,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
Reproduction
We are using B200 GPUs in the above examples; a quick liveness probe sketch follows the launch commands below.
Server startup
- docker run -it --name "gpt_oss_120b_debug" --shm-size 256g --gpus all -v "$HOME/.cache/huggingface/:/root/.cache/huggingface" -v "/tmp:/tmp" --env "SGLANG_ENABLE_SPEC_V2=1" --ipc=host --network=host --privileged --entrypoint=bash lmsysorg/sglang:v0.5.4.post2
- docker exec -it gpt_oss_120b_debug bash
- Spec setting: python -m sglang.launch_server --port=7080 --model=/tmp/openai/gpt-oss-120b --trust-remote-code --tp=8 --max-queued-requests=256 --tool-call-parser=gpt-oss --reasoning-parser=gpt-oss --enable-metrics --enable-priority-scheduling --schedule-low-priority-values-first --priority-scheduling-preemption-threshold=1000 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --attention-backend trtllm_mha --max-running-requests=128 --speculative-algorithm EAGLE3 --speculative-draft-model-path=/tmp/lmsys/EAGLE3-gpt-oss-120b-bf16
- Non-spec setting: python -m sglang.launch_server --port=7080 --model=/tmp/openai/gpt-oss-120b --trust-remote-code --tp=8 --max-queued-requests=256 --tool-call-parser=gpt-oss --reasoning-parser=gpt-oss --enable-metrics --enable-priority-scheduling --schedule-low-priority-values-first --priority-scheduling-preemption-threshold=1000
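The liveness probe mentioned above, as a minimal sketch assuming the standard OpenAI-compatible /v1/models endpoint is exposed:

import requests

# Quick check that the launched server is accepting requests before
# running the repro curls above.
resp = requests.get("http://localhost:7080/v1/models", timeout=5)
print(resp.status_code, resp.json())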
 
Environment
/sgl-workspace/sglang# python3 -m sglang.check_env
Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 580.65.06
PyTorch: 2.8.0+cu129
sglang: 0.5.4.post2
sgl_kernel: 0.3.16.post4
flashinfer_python: 0.4.1
triton: 3.4.0
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.3.4
aiohttp: 3.13.2
fastapi: 0.120.4
hf_transfer: 0.1.9
huggingface_hub: 0.36.0
interegular: 0.3.3
modelscope: 1.31.0
orjson: 3.11.4
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.2
pydantic: 2.12.3
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.25
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.72.0
litellm: Module Not Found
decord2: 2.0.0
NVIDIA Topology:
GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	0-55,112-167	0		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	0-55,112-167	0		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	0-55,112-167	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	0-55,112-167	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	56-111,168-223	1		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	56-111,168-223	1		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	56-111,168-223	1		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	56-111,168-223	1		N/A
Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
Hypervisor vendor: KVM
ulimit soft: 1048576