fix: explicitly delete forward_data_store to prevent GPU memory leak #1638

lilei199908 wants to merge 1 commit into main
Conversation
Pull request overview
This PR fixes a GPU memory leak in the forward_only() function by explicitly deleting forward_data_store after its contents have been consumed. The issue occurs when GPU tensors accumulated in forward_data_store during forward passes are not immediately released on non-last pipeline stages, causing unnecessary memory pressure in long-running training loops.
Changes:
- Adds an explicit del forward_data_store statement after the rollout data population block to immediately release GPU memory references
- Includes a comment explaining the purpose of the deletion
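A minimal sketch of the pattern described above. Everything here except forward_data_store and rollout_data is a hypothetical reconstruction for illustration, not the actual slime code:

```python
def consume_forward_data(forward_data_store, rollout_data, is_last_stage, store_prefix=""):
    """Copy needed values into rollout_data, then drop the store.

    Hypothetical stand-in for the tail of forward_only(); the real function's
    signature and extraction logic differ.
    """
    if is_last_stage:
        for batch in forward_data_store:
            for key, values in batch.items():
                rollout_data[f"{store_prefix}{key}"] = values
    # Explicitly delete forward_data_store to prevent GPU memory leak:
    # dropping the local reference lets the accumulated tensors be freed
    # now, rather than when the function eventually returns.
    del forward_data_store
    return rollout_data
```

On last stages the data is copied out first; on non-last stages nothing is extracted, which is exactly the case where the early deletion matters.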
```python
values = origin_values
rollout_data[f"{store_prefix}{key}"] = values

# 显式释放 forward_data_store 以避免显存泄漏
```
The comment is written in Chinese while all other comments in this file are in English. For consistency with the rest of the codebase, this comment should be in English. Consider changing it to: "Explicitly delete forward_data_store to prevent GPU memory leak"
```diff
- # 显式释放 forward_data_store 以避免显存泄漏
+ # Explicitly delete forward_data_store to prevent GPU memory leak
```
On non-last pipeline stages, forward_data_store accumulates GPU tensors from microbatch outputs that are never transferred to rollout_data. These tensors were held in memory until the local variable went out of scope, which in long-running training loops could delay GPU memory reclamation. Explicitly delete forward_data_store after its data has been fully consumed to release references to these tensors as early as possible.
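The lifetime effect described above can be illustrated with plain Python object lifetimes, no GPU required. In this sketch, FakeTensor and forward_like are invented stand-ins; weakrefs let us observe exactly when the store's contents become collectable under CPython's reference counting:

```python
import gc
import weakref

class FakeTensor:
    """Stand-in for a GPU tensor; we only care about when it is freed."""
    pass

def forward_like(use_del):
    store = [FakeTensor() for _ in range(3)]   # accumulated microbatch outputs
    refs = [weakref.ref(t) for t in store]     # observe liveness without keeping tensors alive
    if use_del:
        del store                              # drop the only strong references now
    gc.collect()
    alive = sum(r() is not None for r in refs)
    # ...imagine long-running work here, during which `store` would otherwise
    # still be in scope and its tensors still resident...
    return alive

print(forward_like(use_del=True))   # 0: tensors freed as soon as store is deleted
print(forward_like(use_del=False))  # 3: tensors stay live until the function exits
```

The same mechanism applies to CUDA tensors: once the last Python reference is dropped, PyTorch can return the backing allocation to its caching allocator immediately.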
e19c443 to 718c3d4
Summary
Fixes a GPU memory leak in forward_only() in slime/backends/megatron_utils/model.py.

Problem

forward_data_store is a list of dicts accumulated across all microbatches during the forward pass. On non-last pipeline stages, mpu.is_pipeline_last_stage() is False, so none of the data in forward_data_store is extracted into rollout_data. As a result, any GPU tensors held inside forward_data_store remain live until the Python garbage collector reclaims them, which may be delayed in long-running training loops, causing unnecessary GPU memory pressure.

Fix
Add an explicit del forward_data_store after its contents have been fully consumed (after the rollout_data population block), so GPU tensor references are dropped as early as possible.

Impact
- GPU memory held by forward_data_store is freed immediately after use, reducing peak memory consumption during rollout.
- No behavior change: the deletion happens only after all needed data has been extracted into rollout_data.

Notes
This is a targeted, minimal fix. A complementary improvement would be to call torch.cuda.empty_cache() after the deletion if aggressive memory reclamation is needed, but that is left as a separate concern.
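If that follow-up were pursued, it could look like the guarded helper below. This is a hypothetical sketch, not part of the PR; empty_cache() returns cached blocks to the driver but has its own cost, so it is only worthwhile when other processes on the device need the memory:

```python
def maybe_empty_cache():
    """Best-effort: return cached CUDA blocks to the driver after a del.

    Hypothetical helper, not part of this PR. Safe to call in CPU-only
    environments: returns True only if a cache flush actually happened.
    """
    try:
        import torch
    except ImportError:
        return False  # torch not installed; nothing to do
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False
```

Note that empty_cache() does not free tensors that are still referenced; it only releases allocator blocks whose tensors are already gone, which is why the del must come first.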