Description
Checklist
- I searched related issues but found no solution.
- The bug persists in the latest version.
- Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
- If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- Please use English. Otherwise, it will be closed.
Describe the bug
[2025-12-31 20:43:31 TP15 EP15] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 352, in __init__
self.capture()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 507, in capture
_capture_one_stream()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 494, in _capture_one_stream
) = self.capture_one_batch_size(bs, forward, stream_idx)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 697, in capture_one_batch_size
self.model_runner.tp_group.barrier()
File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 1282, in barrier
torch.distributed.barrier(group=self.cpu_group)
File "/opt/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
return func(*args, **kwargs)
File "/opt/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4818, in barrier
work = group.barrier(opts=opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 15 is running collective: CollectiveFingerPrint(SequenceNumber=17OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=45OpType=BARRIER).Collectives differ in the following aspects: Sequence number: 17vs 45
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2680, in run_scheduler_process
scheduler = Scheduler(
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 320, in __init__
self.tp_worker = TpModelWorker(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 248, in __init__
self._model_runner = ModelRunner(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 359, in __init__
self.initialize(min_per_gpu_memory)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 511, in initialize
self.init_device_graphs()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2448, in init_device_graphs
self.graph_runner = graph_runners[self.device](self)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 354, in __init__
raise Exception(
Exception: Capture cuda graph failed: Detected mismatch between collectives on ranks. Rank 15 is running collective: CollectiveFingerPrint(SequenceNumber=17OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=45OpType=BARRIER).Collectives differ in the following aspects: Sequence number: 17vs 45
Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose
[2025-12-31 20:43:31] Received sigquit from a child process. It usually means the child failed.
[2025-12-31 20:43:31] The same "Capture cuda graph failed: Detected mismatch between collectives on ranks" traceback (SequenceNumber 17 vs 45 against Rank 0) is repeated for TP13 EP13, TP14 EP14, and TP11 EP11; only the reporting rank differs.
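For context: the failing call is torch.distributed.barrier(group=self.cpu_group), a gloo barrier on the TP group, and the sequence numbers in the fingerprint are per-rank counters of collectives issued on that group, so 17 vs 45 suggests the node-1 ranks and rank 0 fell out of step well before CUDA graph capture. The snippet below is a minimal sketch, not sglang code, showing how the same c10d fingerprint check produces this error class when ranks stop issuing the same collectives in the same order; the file name and launch command are illustrative.

# Minimal sketch (not sglang code): with TORCH_DISTRIBUTED_DEBUG=DETAIL, c10d wraps each
# process group and all-gathers a CollectiveFingerPrint before every collective, which is
# the mechanism behind the "Detected mismatch between collectives on ranks" error above.
# Illustrative launch: TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nproc_per_node=2 mismatch_demo.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")  # RANK/WORLD_SIZE/MASTER_* come from torchrun
    if dist.get_rank() == 0:
        # Rank 0 issues a collective that rank 1 never calls, so the ranks reach the
        # pre-collective fingerprint check with different operations in flight.
        dist.broadcast(torch.zeros(1), src=0)
    # Rank 1's barrier fingerprint pairs with rank 0's broadcast fingerprint and the
    # wrapper raises the mismatch error instead of letting the job hang silently.
    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()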
Reproduction
Use the lmsysorg/sglang:v0.5.6.post2-rocm700-mi35x Docker image and set the following environment variables:
export MODEL=deepseek-ai/DeepSeek-R1-0528
export MASTER_ADDR=xxxx:20000
export GLOO_SOCKET_IFNAME=enp49s0f1np1
export NCCL_SOCKET_IFNAME=enp49s0f1np1
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
Node 0:
python3 -m sglang.launch_server \
--model-path "$MODEL" \
--trust-remote-code \
--tp 16 --ep 16 \
--dist-init-addr "$MASTER_ADDR" \
--nnodes 2 --node-rank 0 \
--host 0.0.0.0 --port 30000 --cuda-graph-bs 1 --mem-fraction-static 0.7
Node 1:
python3 -m sglang.launch_server \
--model-path "$MODEL" \
--trust-remote-code \
--tp 16 --ep 16 \
--dist-init-addr "$MASTER_ADDR" \
--nnodes 2 --node-rank 1 \
--host 0.0.0.0 --port 30000 --cuda-graph-bs 1 --mem-fraction-static 0.7
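Since the failing barrier runs on the gloo cpu_group that spans both nodes, a standalone two-process gloo barrier test can help rule out basic cross-node connectivity or GLOO_SOCKET_IFNAME issues before digging into sglang itself. This is a diagnostic sketch only; the script name, MASTER_PORT, RANK, and WORLD_SIZE below are placeholders, not values taken from this report.

# Diagnostic sketch: run one copy on each node with the same GLOO_SOCKET_IFNAME as above.
# Node 0: RANK=0 WORLD_SIZE=2 MASTER_ADDR=<node0 ip> MASTER_PORT=29501 python3 gloo_barrier_check.py
# Node 1: RANK=1 WORLD_SIZE=2 MASTER_ADDR=<node0 ip> MASTER_PORT=29501 python3 gloo_barrier_check.py
import datetime
import os
import torch.distributed as dist

def main():
    dist.init_process_group(
        backend="gloo",  # same backend as the cpu_group barrier that fails above
        init_method=f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}",
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
        timeout=datetime.timedelta(seconds=60),
    )
    for i in range(3):
        dist.barrier()
        print(f"rank {dist.get_rank()}: barrier {i} completed", flush=True)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this test completes on both nodes but the server still fails, the divergence is more likely in the per-node startup path than in the network configuration.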
Environment
root@chi-mi325x-pod2-100:/sgl-workspace# python3 -m sglang.check_env
Python: 3.10.12 (main, May 27 2025, 17:12:29) [GCC 11.4.0]
ROCM available: True
GPU 0,1,2,3,4,5,6,7:
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.4
ROCM_HOME: /opt/rocm
HIPCC: HIP version: 7.0.51831-a3e329ad8
ROCM Driver Version: 6.12.12
PyTorch: 2.9.0a0+git7bcbafe
sglang: 0.5.6.post2
sgl_kernel: 0.3.19
flashinfer_python: Module Not Found
flashinfer_cubin: Module Not Found
flashinfer_jit_cache: Module Not Found
triton: 3.4.0+git02502c86
transformers: 4.57.1
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.12.15
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.34.4
interegular: 0.3.3
modelscope: 1.33.0
orjson: 3.11.5
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.2
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: 0.9.2rc2.dev2065+g4f43dae12.rocm700
xgrammar: 0.1.27
openai: 1.99.1
tiktoken: 0.11.0
anthropic: 0.75.0
litellm: Module Not Found
decord2: 2.0.0
AMD Topology:
============================ ROCm System Management Interface ============================
=============================== Link Type between two GPUs ===============================
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 0 XGMI XGMI XGMI XGMI XGMI XGMI XGMI
GPU1 XGMI 0 XGMI XGMI XGMI XGMI XGMI XGMI
GPU2 XGMI XGMI 0 XGMI XGMI XGMI XGMI XGMI
GPU3 XGMI XGMI XGMI 0 XGMI XGMI XGMI XGMI
GPU4 XGMI XGMI XGMI XGMI 0 XGMI XGMI XGMI
GPU5 XGMI XGMI XGMI XGMI XGMI 0 XGMI XGMI
GPU6 XGMI XGMI XGMI XGMI XGMI XGMI 0 XGMI
GPU7 XGMI XGMI XGMI XGMI XGMI XGMI XGMI 0
================================== End of ROCm SMI Log ===================================
ulimit soft: 1048576