fix: explicitly delete forward_data_store to prevent GPU memory leak

User · User · commit e19c4438e1aa · 2026-02-27T13:04:34.000+08:00
On non-last pipeline stages, forward_data_store accumulates GPU tensors
from microbatch outputs that are never transferred to rollout_data. These
tensors were held in memory until the local variable went out of scope,
which in long-running training loops could delay GPU memory reclamation.

Explicitly delete forward_data_store after its data has been fully
consumed to release references to these tensors as early as possible.
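
A minimal sketch of the effect described above, assuming CPython reference
counting. `FakeTensor` and `collect` are hypothetical stand-ins, not slime or
PyTorch code. The extra check after the `del` exists only to make the earlier
release observable:

```python
import weakref


class FakeTensor:
    """Hypothetical stand-in for a GPU tensor (not a real torch.Tensor)."""

    def __init__(self, nbytes):
        self.nbytes = nbytes


def collect(release_early):
    # Hypothetical stand-in for the tail of a function like forward_step.
    forward_data_store = [FakeTensor(1024) for _ in range(4)]
    # A weak reference lets us observe the object's lifetime without extending it.
    probe = weakref.ref(forward_data_store[0])

    # Only summary data is copied out; the stored objects are not kept.
    rollout_data = {"count": len(forward_data_store)}

    if release_early:
        # Drop the last strong reference to the list and its elements.
        del forward_data_store

    # Anything that runs from here on sees the objects only if they are still held.
    alive_during_tail = probe() is not None
    return rollout_data, alive_during_tail
```

With `release_early=True` the stored objects are gone before the function
finishes its remaining work. With `release_early=False` they survive until the
frame exits. For real CUDA tensors, dropping the last reference returns the
memory to PyTorch's caching allocator, not necessarily to the driver.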
diff --git a/slime/backends/megatron_utils/model.py b/slime/backends/megatron_utils/model.py
@@ -293,6 +293,10 @@ def forward_step(
                     origin_values[origin_index] = value
                 values = origin_values
             rollout_data[f"{store_prefix}{key}"] = values
+    
+    # Explicitly release forward_data_store to avoid a GPU memory leak
+    del forward_data_store
+    
     return rollout_data