Commit e19c443
fix: explicitly delete forward_data_store to prevent GPU memory leak
On non-last pipeline stages, forward_data_store accumulates GPU tensors
from microbatch outputs that are never transferred to rollout_data. These
tensors were held in memory until the local variable went out of scope,
which in long-running training loops could delay GPU memory reclamation.
Explicitly delete forward_data_store after its data has been fully
consumed to release references to these tensors as early as possible.

1 parent 0257bd6 commit e19c443
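The pattern described in the commit message can be illustrated with a minimal sketch. Only the variable names `forward_data_store` and `rollout_data` come from the commit; `FakeTensor`, `consume`, and `forward_step_sketch` are hypothetical stand-ins, and the actual changed lines are not reproduced here. The sketch uses plain Python objects and a weak reference to show that, under CPython's reference counting, `del` releases the stored objects at that point rather than when the enclosing function returns.

```python
import weakref


class FakeTensor:
    """Hypothetical stand-in for a GPU tensor held in forward_data_store."""

    def __init__(self, nbytes):
        self.nbytes = nbytes


def consume(store):
    """Hypothetical helper: copy out what the last stage needs from the store."""
    return [t.nbytes for t in store]


def forward_step_sketch(num_microbatches, is_last_stage):
    # Assumption: one output object is appended per microbatch.
    forward_data_store = [FakeTensor(1024) for _ in range(num_microbatches)]
    # Weak reference used only to observe when the first output is freed.
    probe = weakref.ref(forward_data_store[0])

    rollout_data = {}
    if is_last_stage:
        # Only the last pipeline stage transfers data out of the store.
        rollout_data["sizes"] = consume(forward_data_store)

    # The store has been fully consumed; drop the only reference now
    # instead of waiting for the function to return.
    del forward_data_store

    # Under CPython reference counting the objects are already gone here.
    freed_early = probe() is None

    # ... any remaining (possibly long-running) work would follow here ...
    return rollout_data, freed_early


rollout, freed = forward_step_sketch(4, is_last_stage=False)
print(rollout, freed)
```

The sketch assumes that `forward_data_store` holds the only references to the outputs; if another object also refers to a tensor, `del` alone does not free it. With real PyTorch CUDA tensors, freed memory generally returns to PyTorch's caching allocator for reuse rather than to the driver.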
1 file changed, +4 -0 lines changed

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 293 | 293 | |
| 294 | 294 | |
| 295 | 295 | |
| | 296 | + |
| | 297 | + |
| | 298 | + |
| | 299 | + |
| 296 | 300 | |
| 297 | 301 | |
| 298 | 302 | |