Commit 3ca3c40

[docs] add debug suggestion for ima (#1435)

1 parent 8b42882 commit 3ca3c40

File tree

2 files changed: +39 −1 lines


docs/en/developer_guide/debug.md

Lines changed: 21 additions & 1 deletion
@@ -48,4 +48,24 @@ Specifically, slime currently provides the following parameters for separate deb
4. `--load-debug-rollout-data /your/saved/debug/data_{rollout_id}.pt`
When enabled, data will be loaded from `args.load_debug_rollout_data.format(rollout_id=rollout_id)`, and SGLang will not be initialized (automatically setting `debug_train_only=True`). This method allows you to fix the input for the training part to tune it, for example, by switching between different parallelization strategies.
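The path templating described above can be sketched as follows. This is a minimal illustration only: `resolve_debug_path` is a hypothetical helper name, not part of slime; the docs only state that slime calls `str.format` with `rollout_id` on the flag value.

```python
# Minimal sketch of how the --load-debug-rollout-data value is resolved.
# `resolve_debug_path` is a hypothetical helper for illustration;
# per the docs, slime just calls str.format(rollout_id=...) on the flag value.
def resolve_debug_path(template: str, rollout_id: int) -> str:
    # Fill the {rollout_id} placeholder in the saved-data path template
    return template.format(rollout_id=rollout_id)

path = resolve_debug_path("/your/saved/debug/data_{rollout_id}.pt", rollout_id=7)
print(path)  # /your/saved/debug/data_7.pt
```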
## Debug sglang illegal memory access (IMA)
When running large-scale RL, you will occasionally run into illegal memory access (IMA) errors in SGLang. Here are some debugging suggestions based on our experience:
1. Enable `CUDA_LAUNCH_BLOCKING=1`
2. Enable or disable speculative decoding and CUDA graph to see if anything changes
IMA often shows up in the padding during CUDA graph replay, or in discrepancies between the draft model and the main model. Toggling these features helps narrow down the scope.
3. Turn off deepep
If you are using deepep during training or inference, you can try turning it off.
4. Try CUDA Core Dump to locate the failing kernel
We recommend reading this blog post from the vLLM team: [CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond](https://blog.vllm.ai/2025/08/11/cuda-debugging.html)
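The environment variables involved in steps 1 and 4 can be combined into a launch prelude like the sketch below. `CUDA_LAUNCH_BLOCKING`, `CUDA_ENABLE_COREDUMP_ON_EXCEPTION`, and `CUDA_COREDUMP_FILE` are standard CUDA driver environment variables; the dump path is a placeholder you should adapt.

```shell
# Sketch: environment setup for debugging an SGLang IMA.
# These are standard CUDA env vars; the dump path below is a placeholder.
export CUDA_LAUNCH_BLOCKING=1                   # serialize kernel launches so the error surfaces at the faulty launch
export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1      # write a GPU core dump on exceptions such as IMA
export CUDA_COREDUMP_FILE=/tmp/cuda_core_%h_%p  # %h expands to hostname, %p to pid
```

The resulting dump can then be inspected with `cuda-gdb` (e.g. `target cudacore /tmp/cuda_core_...`) to identify the failing kernel, as the vLLM post above describes in detail.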

docs/zh/developer_guide/debug.md

Lines changed: 18 additions & 0 deletions
@@ -47,3 +47,21 @@ slime 支持将训练部分和推理部分分开进行调试,从而实现:
3. `--load-debug-rollout-data /your/saved/debug/data_{rollout_id}.pt`
When enabled, data will be loaded from `args.load_debug_rollout_data.format(rollout_id=rollout_id)`, and sglang will not be initialized (automatically setting `debug_train_only=True`). You can use this to fix the input to the training part and tune it, for example by switching between different parallelization strategies.
## Debug sglang illegal memory access (IMA)
When running large-scale RL, you will occasionally run into SGLang IMA issues. Here are some debugging suggestions based on our experience:
1. Enable `CUDA_LAUNCH_BLOCKING=1`
2. Enable or disable speculative decoding and CUDA graph to see whether the problem goes away
IMA often appears in the padding during CUDA graph replay, or in discrepancies between speculative decoding and the main model. Trying different combinations of these switches usually narrows down the problem.
3. Turn off deepep
If deepep is enabled during training or inference, you can turn it off and check whether anything changes.
4. Try CUDA Core Dump to locate the failing kernel
We recommend this post from the vLLM team: [CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond](https://blog.vllm.ai/2025/08/11/cuda-debugging.html)
