Commit 3ca3c40

[docs] add debug suggestion for ima (#1435)

1 parent 8b42882 commit 3ca3c40

File tree

2 files changed: +39 −1 lines


docs/en/developer_guide/debug.md

Lines changed: 21 additions & 1 deletion
@@ -48,4 +48,24 @@ Specifically, slime currently provides the following parameters for separate deb
4. `--load-debug-rollout-data /your/saved/debug/data_{rollout_id}.pt`
When enabled, data will be loaded from `args.load_debug_rollout_data.format(rollout_id=rollout_id)`, and SGLang will not be initialized (automatically setting `debug_train_only=True`). This method allows you to fix the input for the training part to tune it, for example, by switching between different parallelization strategies.
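The path templating described above can be sketched as follows. This is a minimal illustration only: `resolve_debug_path` is a hypothetical helper name, not part of slime; the docs only state that slime calls `str.format` with `rollout_id` on the flag value.

```python
# Minimal sketch of how the --load-debug-rollout-data value is resolved.
# `resolve_debug_path` is a hypothetical helper for illustration;
# per the docs, slime just calls str.format(rollout_id=...) on the flag value.
def resolve_debug_path(template: str, rollout_id: int) -> str:
    # Fill the {rollout_id} placeholder in the saved-data path template
    return template.format(rollout_id=rollout_id)

path = resolve_debug_path("/your/saved/debug/data_{rollout_id}.pt", rollout_id=7)
print(path)  # /your/saved/debug/data_7.pt
```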
## Debug sglang illegal memory access (IMA)
When running large-scale RL, you will occasionally run into illegal memory access (IMA) errors in SGLang. Here are some debugging suggestions based on our experience:
1. Enable `CUDA_LAUNCH_BLOCKING=1`
2. Enable or disable speculative decoding and CUDA graph to see if anything changes
IMA often shows up in the padding during CUDA graph replay, or in discrepancies between the draft model and the main model. Toggling these features helps narrow down the scope.
3. Turn off deepep
If you are using deepep during training or inference, you can try turning it off.
4. Try CUDA Core Dump to locate the failing kernel
We recommend reading this blog post from the vLLM team: [CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond](https://blog.vllm.ai/2025/08/11/cuda-debugging.html)
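The environment variables involved in steps 1 and 4 can be combined into a launch prelude like the sketch below. `CUDA_LAUNCH_BLOCKING`, `CUDA_ENABLE_COREDUMP_ON_EXCEPTION`, and `CUDA_COREDUMP_FILE` are standard CUDA driver environment variables; the dump path is a placeholder you should adapt.

```shell
# Sketch: environment setup for debugging an SGLang IMA.
# These are standard CUDA env vars; the dump path below is a placeholder.
export CUDA_LAUNCH_BLOCKING=1                   # serialize kernel launches so the error surfaces at the faulty launch
export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1      # write a GPU core dump on exceptions such as IMA
export CUDA_COREDUMP_FILE=/tmp/cuda_core_%h_%p  # %h expands to hostname, %p to pid
```

The resulting dump can then be inspected with `cuda-gdb` (e.g. `target cudacore /tmp/cuda_core_...`) to identify the failing kernel, as the vLLM post above describes in detail.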

docs/zh/developer_guide/debug.md

Lines changed: 18 additions & 0 deletions
@@ -47,3 +47,21 @@ slime 支持将训练部分和推理部分分开进行调试,从而实现:
3. `--load-debug-rollout-data /your/saved/debug/data_{rollout_id}.pt`
When enabled, data will be loaded from `args.load_debug_rollout_data.format(rollout_id=rollout_id)`, and sglang will not be initialized (automatically setting `debug_train_only=True`). You can use this to fix the input to the training part and tune it, for example by switching between different parallelization strategies.
## Debug sglang illegal memory access (IMA)
When running large-scale RL, you will occasionally run into SGLang IMA issues. Here are some debugging suggestions based on our experience:
1. Enable `CUDA_LAUNCH_BLOCKING=1`
2. Enable or disable speculative decoding and CUDA graph to see whether the problem goes away
IMA often appears in the padding during CUDA graph replay, or in discrepancies between speculative decoding and the main model. Trying different combinations of these switches usually narrows down the problem.
3. Turn off deepep
If deepep is enabled during training or inference, you can turn it off and check whether anything changes.
4. Try CUDA Core Dump to locate the failing kernel
We recommend this post from the vLLM team: [CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond](https://blog.vllm.ai/2025/08/11/cuda-debugging.html)
