[Bug] Running DeepSeek-R1 on MI325x #16237

@MaoZiming

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

[2025-12-31 20:43:31 TP15 EP15] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 352, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 507, in capture
    _capture_one_stream()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 494, in _capture_one_stream
    ) = self.capture_one_batch_size(bs, forward, stream_idx)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 697, in capture_one_batch_size
    self.model_runner.tp_group.barrier()
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 1282, in barrier
    torch.distributed.barrier(group=self.cpu_group)
  File "/opt/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4818, in barrier
    work = group.barrier(opts=opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 15 is running collective: CollectiveFingerPrint(SequenceNumber=17OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=45OpType=BARRIER).Collectives differ in the following aspects:     Sequence number: 17vs 45

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2680, in run_scheduler_process
    scheduler = Scheduler(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 320, in __init__
    self.tp_worker = TpModelWorker(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 248, in __init__
    self._model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 359, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 511, in initialize
    self.init_device_graphs()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2448, in init_device_graphs
    self.graph_runner = graph_runners[self.device](self)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 354, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Detected mismatch between collectives on ranks. Rank 15 is running collective: CollectiveFingerPrint(SequenceNumber=17OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=45OpType=BARRIER).Collectives differ in the following aspects:     Sequence number: 17vs 45
Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 


[2025-12-31 20:43:31] Received sigquit from a child process. It usually means the child failed.
(The identical "Capture cuda graph failed: Detected mismatch between collectives on ranks" traceback and the subsequent "Received sigquit from a child process" message repeat for the remaining ranks on this node, e.g. TP11/EP11, TP13/EP13, and TP14/EP14; only the rank number in the CollectiveFingerPrint message differs.)
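For reference, the failing call is torch.distributed.barrier(group=self.cpu_group) on the CPU (gloo) process group during CUDA graph capture, and the fingerprint check reports rank 0 at sequence number 45 while the failing ranks are still at 17, i.e. rank 0 has issued more collectives on that group by the time it reaches this barrier. The check itself is the one enabled by TORCH_DISTRIBUTED_DEBUG=DETAIL, which the reproduction below exports. The following standalone sketch (illustrative only, not from the report; addresses and ports are placeholders) triggers the same class of mismatch with two local gloo ranks by having rank 0 issue one extra collective before a shared barrier:

# Illustrative sketch only: reproduces the same class of error that
# TORCH_DISTRIBUTED_DEBUG=DETAIL reports when ranks disagree on collectives.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    if rank == 0:
        # Rank 0 issues an extra collective, so its sequence number runs ahead.
        dist.all_reduce(torch.ones(1))
    # All ranks then hit a barrier; with DETAIL debugging the wrapper detects
    # that the ranks are not executing the same collective at the same step.
    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    # Placeholder rendezvous settings for a single-host, two-process test.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29555")
    os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
    mp.spawn(worker, args=(2,), nprocs=2)

Running this produces a "Detected mismatch between collectives on ranks" error analogous to the one above, which may help separate a genuine rank-divergence issue in the capture path from an environment problem.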

Reproduction

Use the Docker image lmsysorg/sglang:v0.5.6.post2-rocm700-mi35x.

export MODEL=deepseek-ai/DeepSeek-R1-0528
export MASTER_ADDR=xxxx:20000
export GLOO_SOCKET_IFNAME=enp49s0f1np1
export NCCL_SOCKET_IFNAME=enp49s0f1np1
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL

Node 0:

python3 -m sglang.launch_server \
  --model-path "$MODEL" \
  --trust-remote-code \
  --tp 16 --ep 16 \
  --dist-init-addr "$MASTER_ADDR" \
  --nnodes 2 --node-rank 0 \
  --host 0.0.0.0 --port 30000 --cuda-graph-bs 1 --mem-fraction-static 0.7

Node 1:

python3 -m sglang.launch_server \
  --model-path "$MODEL" \
  --trust-remote-code \
  --tp 16 --ep 16 \
  --dist-init-addr "$MASTER_ADDR" \
  --nnodes 2 --node-rank 1 \
  --host 0.0.0.0 --port 30000 --cuda-graph-bs 1 --mem-fraction-static 0.7
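
As one way to narrow the failure down, the cross-node CPU (gloo) group can be exercised with the same network settings but without SGLang. The sketch below is illustrative only (the script name and torchrun invocation are assumptions, not part of the report); it reuses GLOO_SOCKET_IFNAME and the master address from above and simply runs a few barriers across all 16 ranks:

# Illustrative gloo health check (hypothetical gloo_check.py, not from the report).
# Launch on each node with torchrun, e.g.:
#   torchrun --nnodes=2 --node-rank=<0 or 1> --nproc-per-node=8 \
#            --master-addr=<MASTER_ADDR host> --master-port=20000 gloo_check.py
# GLOO_SOCKET_IFNAME should be exported as in the reproduction above.
import torch.distributed as dist

def main() -> None:
    # torchrun provides RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT via env://.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    for i in range(5):
        dist.barrier()  # all 16 ranks must reach each barrier together
        if rank == 0:
            print(f"barrier {i} passed on all {dist.get_world_size()} ranks")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If these barriers also stall or report a mismatch, the problem is likely in the inter-node setup rather than in SGLang's CUDA graph capture path.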

Environment

root@chi-mi325x-pod2-100:/sgl-workspace# python3 -m sglang.check_env
Python: 3.10.12 (main, May 27 2025, 17:12:29) [GCC 11.4.0]
ROCM available: True
GPU 0,1,2,3,4,5,6,7: 
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.4
ROCM_HOME: /opt/rocm
HIPCC: HIP version: 7.0.51831-a3e329ad8
ROCM Driver Version: 6.12.12
PyTorch: 2.9.0a0+git7bcbafe
sglang: 0.5.6.post2
sgl_kernel: 0.3.19
flashinfer_python: Module Not Found
flashinfer_cubin: Module Not Found
flashinfer_jit_cache: Module Not Found
triton: 3.4.0+git02502c86
transformers: 4.57.1
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.12.15
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.34.4
interegular: 0.3.3
modelscope: 1.33.0
orjson: 3.11.5
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.2
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: 0.9.2rc2.dev2065+g4f43dae12.rocm700
xgrammar: 0.1.27
openai: 1.99.1
tiktoken: 0.11.0
anthropic: 0.75.0
litellm: Module Not Found
decord2: 2.0.0
AMD Topology: 


============================ ROCm System Management Interface ============================
=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         
GPU1   XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         
GPU2   XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         
GPU3   XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         
GPU4   XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         
GPU5   XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         
GPU6   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         
GPU7   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            
================================== End of ROCm SMI Log ===================================

ulimit soft: 1048576
root@chi-mi325x-pod2-100:/sgl-workspace# 
