fix: explicitly delete forward_data_store to prevent GPU memory leak #1638

lilei199908 wants to merge 1 commit into main
Conversation
Pull request overview
This PR fixes a GPU memory leak in the forward_only() function by explicitly deleting forward_data_store after its contents have been consumed. The issue occurs when GPU tensors accumulated in forward_data_store during forward passes are not immediately released on non-last pipeline stages, causing unnecessary memory pressure in long-running training loops.
Changes:
- Adds an explicit del forward_data_store statement after the rollout data population block to immediately release GPU memory references
- Includes a comment explaining the purpose of the deletion
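A minimal sketch of the pattern described above. Everything here except forward_data_store and rollout_data is a hypothetical reconstruction for illustration, not the actual slime code:

```python
def consume_forward_data(forward_data_store, rollout_data, is_last_stage, store_prefix=""):
    """Copy needed values into rollout_data, then drop the store.

    Hypothetical stand-in for the tail of forward_only(); the real function's
    signature and extraction logic differ.
    """
    if is_last_stage:
        for batch in forward_data_store:
            for key, values in batch.items():
                rollout_data[f"{store_prefix}{key}"] = values
    # Explicitly delete forward_data_store to prevent GPU memory leak:
    # dropping the local reference lets the accumulated tensors be freed
    # now, rather than when the function eventually returns.
    del forward_data_store
    return rollout_data
```

On last stages the data is copied out first; on non-last stages nothing is extracted, which is exactly the case where the early deletion matters.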
```python
values = origin_values
rollout_data[f"{store_prefix}{key}"] = values

# 显式释放 forward_data_store 以避免显存泄漏
```
The comment is written in Chinese while all other comments in this file are in English. For consistency with the rest of the codebase, this comment should be in English. Consider changing it to: "Explicitly delete forward_data_store to prevent GPU memory leak"
```diff
- # 显式释放 forward_data_store 以避免显存泄漏
+ # Explicitly delete forward_data_store to prevent GPU memory leak
```
On non-last pipeline stages, forward_data_store accumulates GPU tensors from microbatch outputs that are never transferred to rollout_data. These tensors were held in memory until the local variable went out of scope, which in long-running training loops could delay GPU memory reclamation. Explicitly delete forward_data_store after its data has been fully consumed to release references to these tensors as early as possible.
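The lifetime effect described above can be illustrated with plain Python object lifetimes, no GPU required. In this sketch, FakeTensor and forward_like are invented stand-ins; weakrefs let us observe exactly when the store's contents become collectable under CPython's reference counting:

```python
import gc
import weakref

class FakeTensor:
    """Stand-in for a GPU tensor; we only care about when it is freed."""
    pass

def forward_like(use_del):
    store = [FakeTensor() for _ in range(3)]   # accumulated microbatch outputs
    refs = [weakref.ref(t) for t in store]     # observe liveness without keeping tensors alive
    if use_del:
        del store                              # drop the only strong references now
    gc.collect()
    alive = sum(r() is not None for r in refs)
    # ...imagine long-running work here, during which `store` would otherwise
    # still be in scope and its tensors still resident...
    return alive

print(forward_like(use_del=True))   # 0: tensors freed as soon as store is deleted
print(forward_like(use_del=False))  # 3: tensors stay live until the function exits
```

The same mechanism applies to CUDA tensors: once the last Python reference is dropped, PyTorch can return the backing allocation to its caching allocator immediately.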
e19c443 to 718c3d4
Summary
Fixes a GPU memory leak in forward_only() in slime/backends/megatron_utils/model.py.

Problem

forward_data_store is a list of dicts accumulated across all microbatches during the forward pass. On non-last pipeline stages, mpu.is_pipeline_last_stage() is False, so none of the data in forward_data_store is extracted into rollout_data. As a result, any GPU tensors held inside forward_data_store remain live until the Python garbage collector reclaims them, which may be delayed in long-running training loops, causing unnecessary GPU memory pressure.

Fix
Add an explicit del forward_data_store after its contents have been fully consumed (after the rollout_data population block), so GPU tensor references are dropped as early as possible.

Impact
- GPU memory held by forward_data_store is freed immediately after use, reducing peak memory consumption during rollout.
- No behavior change: the deletion happens only after all needed data has been extracted into rollout_data.

Notes
This is a targeted, minimal fix. A complementary improvement would be to call torch.cuda.empty_cache() after the deletion if aggressive memory reclamation is needed, but that is left as a separate concern.
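If that follow-up were pursued, it could look like the guarded helper below. This is a hypothetical sketch, not part of the PR; empty_cache() returns cached blocks to the driver but has its own cost, so it is only worthwhile when other processes on the device need the memory:

```python
def maybe_empty_cache():
    """Best-effort: return cached CUDA blocks to the driver after a del.

    Hypothetical helper, not part of this PR. Safe to call in CPU-only
    environments: returns True only if a cache flush actually happened.
    """
    try:
        import torch
    except ImportError:
        return False  # torch not installed; nothing to do
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False
```

Note that empty_cache() does not free tensors that are still referenced; it only releases allocator blocks whose tensors are already gone, which is why the del must come first.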