[Bug] Running DeepSeek-R1 on MI325x #16237

@MaoZiming

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

[2025-12-31 20:43:31 TP15 EP15] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 352, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 507, in capture
    _capture_one_stream()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 494, in _capture_one_stream
    ) = self.capture_one_batch_size(bs, forward, stream_idx)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 697, in capture_one_batch_size
    self.model_runner.tp_group.barrier()
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 1282, in barrier
    torch.distributed.barrier(group=self.cpu_group)
  File "/opt/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4818, in barrier
    work = group.barrier(opts=opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 15 is running collective: CollectiveFingerPrint(SequenceNumber=17OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=45OpType=BARRIER).Collectives differ in the following aspects:     Sequence number: 17vs 45

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2680, in run_scheduler_process
    scheduler = Scheduler(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 320, in __init__
    self.tp_worker = TpModelWorker(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 248, in __init__
    self._model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 359, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 511, in initialize
    self.init_device_graphs()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2448, in init_device_graphs
    self.graph_runner = graph_runners[self.device](self)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 354, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Detected mismatch between collectives on ranks. Rank 15 is running collective: CollectiveFingerPrint(SequenceNumber=17OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=45OpType=BARRIER).Collectives differ in the following aspects:     Sequence number: 17vs 45
Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 


[2025-12-31 20:43:31] Received sigquit from a child process. It usually means the child failed.
(The identical "Capture cuda graph failed: Detected mismatch between collectives on ranks" traceback and the subsequent "Received sigquit from a child process" message repeat for the remaining ranks on this node, e.g. TP11/EP11, TP13/EP13, and TP14/EP14; only the rank number in the CollectiveFingerPrint message differs.)
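For reference, the failing call is torch.distributed.barrier(group=self.cpu_group) on the CPU (gloo) process group during CUDA graph capture, and the fingerprint check reports rank 0 at sequence number 45 while the failing ranks are still at 17, i.e. rank 0 has issued more collectives on that group by the time it reaches this barrier. The check itself is the one enabled by TORCH_DISTRIBUTED_DEBUG=DETAIL, which the reproduction below exports. The following standalone sketch (illustrative only, not from the report; addresses and ports are placeholders) triggers the same class of mismatch with two local gloo ranks by having rank 0 issue one extra collective before a shared barrier:

# Illustrative sketch only: reproduces the same class of error that
# TORCH_DISTRIBUTED_DEBUG=DETAIL reports when ranks disagree on collectives.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    if rank == 0:
        # Rank 0 issues an extra collective, so its sequence number runs ahead.
        dist.all_reduce(torch.ones(1))
    # All ranks then hit a barrier; with DETAIL debugging the wrapper detects
    # that the ranks are not executing the same collective at the same step.
    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    # Placeholder rendezvous settings for a single-host, two-process test.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29555")
    os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
    mp.spawn(worker, args=(2,), nprocs=2)

Running this produces a "Detected mismatch between collectives on ranks" error analogous to the one above, which may help separate a genuine rank-divergence issue in the capture path from an environment problem.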

Reproduction

Use the Docker image lmsysorg/sglang:v0.5.6.post2-rocm700-mi35x.

export MODEL=deepseek-ai/DeepSeek-R1-0528
export MASTER_ADDR=xxxx:20000
export GLOO_SOCKET_IFNAME=enp49s0f1np1
export NCCL_SOCKET_IFNAME=enp49s0f1np1
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL

Node 0:

python3 -m sglang.launch_server \
  --model-path "$MODEL" \
  --trust-remote-code \
  --tp 16 --ep 16 \
  --dist-init-addr "$MASTER_ADDR" \
  --nnodes 2 --node-rank 0 \
  --host 0.0.0.0 --port 30000 --cuda-graph-bs 1 --mem-fraction-static 0.7

Node 1:

python3 -m sglang.launch_server \
  --model-path "$MODEL" \
  --trust-remote-code \
  --tp 16 --ep 16 \
  --dist-init-addr "$MASTER_ADDR" \
  --nnodes 2 --node-rank 1 \
  --host 0.0.0.0 --port 30000 --cuda-graph-bs 1 --mem-fraction-static 0.7
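
As one way to narrow the failure down, the cross-node CPU (gloo) group can be exercised with the same network settings but without SGLang. The sketch below is illustrative only (the script name and torchrun invocation are assumptions, not part of the report); it reuses GLOO_SOCKET_IFNAME and the master address from above and simply runs a few barriers across all 16 ranks:

# Illustrative gloo health check (hypothetical gloo_check.py, not from the report).
# Launch on each node with torchrun, e.g.:
#   torchrun --nnodes=2 --node-rank=<0 or 1> --nproc-per-node=8 \
#            --master-addr=<MASTER_ADDR host> --master-port=20000 gloo_check.py
# GLOO_SOCKET_IFNAME should be exported as in the reproduction above.
import torch.distributed as dist

def main() -> None:
    # torchrun provides RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT via env://.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    for i in range(5):
        dist.barrier()  # all 16 ranks must reach each barrier together
        if rank == 0:
            print(f"barrier {i} passed on all {dist.get_world_size()} ranks")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If these barriers also stall or report a mismatch, the problem is likely in the inter-node setup rather than in SGLang's CUDA graph capture path.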

Environment

root@chi-mi325x-pod2-100:/sgl-workspace# python3 -m sglang.check_env
Python: 3.10.12 (main, May 27 2025, 17:12:29) [GCC 11.4.0]
ROCM available: True
GPU 0,1,2,3,4,5,6,7: 
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.4
ROCM_HOME: /opt/rocm
HIPCC: HIP version: 7.0.51831-a3e329ad8
ROCM Driver Version: 6.12.12
PyTorch: 2.9.0a0+git7bcbafe
sglang: 0.5.6.post2
sgl_kernel: 0.3.19
flashinfer_python: Module Not Found
flashinfer_cubin: Module Not Found
flashinfer_jit_cache: Module Not Found
triton: 3.4.0+git02502c86
transformers: 4.57.1
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.12.15
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.34.4
interegular: 0.3.3
modelscope: 1.33.0
orjson: 3.11.5
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.2
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: 0.9.2rc2.dev2065+g4f43dae12.rocm700
xgrammar: 0.1.27
openai: 1.99.1
tiktoken: 0.11.0
anthropic: 0.75.0
litellm: Module Not Found
decord2: 2.0.0
AMD Topology: 


============================ ROCm System Management Interface ============================
=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         
GPU1   XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         
GPU2   XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         
GPU3   XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         
GPU4   XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         
GPU5   XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         
GPU6   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         
GPU7   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            
================================== End of ROCm SMI Log ===================================

ulimit soft: 1048576
root@chi-mi325x-pod2-100:/sgl-workspace# 
