Skip to content

[perf, training_utils] fix: Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time.#5549

Open
sheilaliuxl wants to merge 2 commits intoverl-project:mainfrom
sheilaliuxl:dev-gc
Open

[perf, training_utils] fix: Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time.#5549
sheilaliuxl wants to merge 2 commits intoverl-project:mainfrom
sheilaliuxl:dev-gc

Conversation

@sheilaliuxl
Copy link
Contributor

@sheilaliuxl sheilaliuxl commented Mar 10, 2026

What does this PR do?

Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: https://github.com/verl-project/verl/pulls?q=is%3Apr+is%3Aopen+gc
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

I: Step Time
It decreases training step time consistently with gc_freq=1 (Dropping first 10 steps and last 3 steps):

  • All jobs run on the same node ip-10-4-145-27, to reduce inconsistency across different nodes
  1. Median time: 0.8652 ==> 0.7247 (-16.2%)
  2. Avg time: 0.8828 ==> 0.7324 (-17.0%)
  3. Std time also decreases 0.0696 ==> 0.0362 (-48.0%) or improvement of 1.92x
    • Important to compare across experiments whether step time reduction/ MFU increase is significant

II: MFU

MFU increases slightly on the worker as well, from gc_freq = -1 to 1:

  1. avg MFU increases 0.2221 ==> 0.2297 (+3.4%): 2026-03-13 17:34:52,918 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 0.2297 [+ 0.0076] ( 0.0020), 0.2301) # train/mfu
  2. std MFU decreases 00093 ==> 0.0020 (-78.5%) or improvement of 4.7x
../gc_config_pbtxt-5d4ffd4f--92593-run-main-gc-03-j-01-seq04k-ns01-sp01-bs01-pad1-gc-1--20260313.100151/sliuxl-qwen2d5_3b-verl-92593.out:hostname=ip-10-4-145-27
../gc_config_pbtxt-5d4ffd4f--92594-run-main-gc-03-j-02-seq04k-ns01-sp01-bs01-pad1-gc0--20260313.100156/sliuxl-qwen2d5_3b-verl-92594.out:hostname=ip-10-4-145-27
../gc_config_pbtxt-5d4ffd4f--92595-run-main-gc-03-j-03-seq04k-ns01-sp01-bs01-pad1-gc1--20260313.100201/sliuxl-qwen2d5_3b-verl-92595.out:hostname=ip-10-4-145-27
../gc_config_pbtxt-5d4ffd4f--92596-run-main-gc-03-j-04-seq04k-ns01-sp01-bs01-pad1-gc5--20260313.100206/sliuxl-qwen2d5_3b-verl-92596.out:hostname=ip-10-4-145-27


../gc_config_pbtxt-5d4ffd4f--92593-run-main-gc-03-j-01-seq04k-ns01-sp01-bs01-pad1-gc-1--20260313.100151/run_verl.sh:    trainer.gc_freq=-1 \
../gc_config_pbtxt-5d4ffd4f--92594-run-main-gc-03-j-02-seq04k-ns01-sp01-bs01-pad1-gc0--20260313.100156/run_verl.sh:    trainer.gc_freq=0 \
../gc_config_pbtxt-5d4ffd4f--92595-run-main-gc-03-j-03-seq04k-ns01-sp01-bs01-pad1-gc1--20260313.100201/run_verl.sh:    trainer.gc_freq=1 \
../gc_config_pbtxt-5d4ffd4f--92596-run-main-gc-03-j-04-seq04k-ns01-sp01-bs01-pad1-gc5--20260313.100206/run_verl.sh:    trainer.gc_freq=5 \

1. Step Time

+ python /fsx/ubuntu/users/sliuxl/tb.py --files '../*run-main-gc-03*/events*' --metric 'train/time(s)' --format 10.4f
2026-03-13 17:34:42,155 [tb.py:235] INFO - Set benchmark with index 0
2026-03-13 17:34:42,156 [tb.py:264] INFO - [243/256] ../gc_config_pbtxt-5d4ffd4f--92593-run-main-gc-03-j-01-seq04k-ns01-sp01-bs01-pad1-gc-1--20260313.100151/events.out.tfevents.1773421377.ip-10-4-145-27.2342312.0: (avg (std), med) = (    0.8828 (    0.0696),     0.8652)  # train/time(s)
2026-03-13 17:34:42,246 [tb.py:264] INFO - [243/256] ../gc_config_pbtxt-5d4ffd4f--92594-run-main-gc-03-j-02-seq04k-ns01-sp01-bs01-pad1-gc0--20260313.100156/events.out.tfevents.1773421852.ip-10-4-145-27.2346125.0: (avg (std), med) = (    0.8562 (    0.0977),     0.9187)  # train/time(s)
2026-03-13 17:34:42,336 [tb.py:264] INFO - [243/256] ../gc_config_pbtxt-5d4ffd4f--92595-run-main-gc-03-j-03-seq04k-ns01-sp01-bs01-pad1-gc1--20260313.100201/events.out.tfevents.1773422312.ip-10-4-145-27.2349978.0: (avg (std), med) = (    0.7324 (    0.0362),     0.7247)  # train/time(s)
2026-03-13 17:34:42,428 [tb.py:264] INFO - [243/256] ../gc_config_pbtxt-5d4ffd4f--92596-run-main-gc-03-j-04-seq04k-ns01-sp01-bs01-pad1-gc5--20260313.100206/events.out.tfevents.1773422840.ip-10-4-145-27.2353891.0: (avg (std), med) = (    0.7509 (    0.0366),     0.7474)  # train/time(s)

2. All Metrics

  1. data/train_seq_len* metrics are exactly the same
  2. train/time(s) decrease and train/mfu increases
+ python /fsx/ubuntu/users/sliuxl/tb.py --files '../*run-main-gc-03*/events*' --metric '*' --format 10.4f
2026-03-13 17:34:52,636 [tb.py:206] WARNING - Unknonw metric `*`!
2026-03-13 17:34:52,636 [tb.py:216] INFO - 

[00/04] File ../gc_config_pbtxt-5d4ffd4f--92593-run-main-gc-03-j-01-seq04k-ns01-sp01-bs01-pad1-gc-1--20260313.100151/events.out.tfevents.1773421377.ip-10-4-145-27.2342312.0:
2026-03-13 17:34:52,729 [tb.py:235] INFO - Set benchmark with index 0
2026-03-13 17:34:52,730 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    8.0000 [     0.0000] (    0.0000),     8.0000)  # data/train_seq_len_count
2026-03-13 17:34:52,730 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_max
2026-03-13 17:34:52,730 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_mean
2026-03-13 17:34:52,731 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_median
2026-03-13 17:34:52,731 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_min
2026-03-13 17:34:52,731 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # data/train_seq_len_std
2026-03-13 17:34:52,731 [tb.py:264] INFO - [243/256] : (avg (std), med) = (32768.0000 [     0.0000] (    0.0000), 32768.0000)  # train/global_tokens
2026-03-13 17:34:52,731 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.2912 [     0.0000] (    0.1269),     0.2669)  # train/grad_norm
2026-03-13 17:34:52,731 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0199 [     0.0000] (    0.0046),     0.0199)  # train/loss
2026-03-13 17:34:52,732 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # train/lr
2026-03-13 17:34:52,732 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.2221 [     0.0000] (    0.0093),     0.2253)  # train/mfu
2026-03-13 17:34:52,732 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.8828 [     0.0000] (    0.0696),     0.8652)  # train/time(s)
2026-03-13 17:34:52,732 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0043 [     0.0000] (    0.0023),     0.0043)  # train/total_tokens(B)
2026-03-13 17:34:52,732 [tb.py:264] INFO - [00/01] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # val/loss
2026-03-13 17:34:52,732 [tb.py:216] INFO - 

[01/04] File ../gc_config_pbtxt-5d4ffd4f--92594-run-main-gc-03-j-02-seq04k-ns01-sp01-bs01-pad1-gc0--20260313.100156/events.out.tfevents.1773421852.ip-10-4-145-27.2346125.0:
2026-03-13 17:34:52,823 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    8.0000 [     0.0000] (    0.0000),     8.0000)  # data/train_seq_len_count
2026-03-13 17:34:52,823 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_max
2026-03-13 17:34:52,823 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_mean
2026-03-13 17:34:52,824 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_median
2026-03-13 17:34:52,824 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_min
2026-03-13 17:34:52,824 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # data/train_seq_len_std
2026-03-13 17:34:52,824 [tb.py:264] INFO - [243/256] : (avg (std), med) = (32768.0000 [     0.0000] (    0.0000), 32768.0000)  # train/global_tokens
2026-03-13 17:34:52,824 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.2900 [-    0.0012] (    0.1189),     0.2685)  # train/grad_norm
2026-03-13 17:34:52,824 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0199 [     0.0000] (    0.0046),     0.0199)  # train/loss
2026-03-13 17:34:52,825 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # train/lr
2026-03-13 17:34:52,825 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.2288 [+    0.0067] (    0.0029),     0.2290)  # train/mfu
2026-03-13 17:34:52,825 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.8562 [-    0.0266] (    0.0977),     0.9187)  # train/time(s)
2026-03-13 17:34:52,825 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0043 [     0.0000] (    0.0023),     0.0043)  # train/total_tokens(B)
2026-03-13 17:34:52,825 [tb.py:264] INFO - [00/01] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # val/loss
2026-03-13 17:34:52,825 [tb.py:216] INFO - 

[02/04] File ../gc_config_pbtxt-5d4ffd4f--92595-run-main-gc-03-j-03-seq04k-ns01-sp01-bs01-pad1-gc1--20260313.100201/events.out.tfevents.1773422312.ip-10-4-145-27.2349978.0:
2026-03-13 17:34:52,916 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    8.0000 [     0.0000] (    0.0000),     8.0000)  # data/train_seq_len_count
2026-03-13 17:34:52,916 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_max
2026-03-13 17:34:52,916 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_mean
2026-03-13 17:34:52,916 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_median
2026-03-13 17:34:52,917 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_min
2026-03-13 17:34:52,917 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # data/train_seq_len_std
2026-03-13 17:34:52,917 [tb.py:264] INFO - [243/256] : (avg (std), med) = (32768.0000 [     0.0000] (    0.0000), 32768.0000)  # train/global_tokens
2026-03-13 17:34:52,917 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.2897 [-    0.0015] (    0.1242),     0.2696)  # train/grad_norm
2026-03-13 17:34:52,917 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0198 [-    0.0000] (    0.0046),     0.0200)  # train/loss
2026-03-13 17:34:52,918 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # train/lr
2026-03-13 17:34:52,918 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.2297 [+    0.0076] (    0.0020),     0.2301)  # train/mfu
2026-03-13 17:34:52,918 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.7324 [-    0.1504] (    0.0362),     0.7247)  # train/time(s)
2026-03-13 17:34:52,918 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0043 [     0.0000] (    0.0023),     0.0043)  # train/total_tokens(B)
2026-03-13 17:34:52,918 [tb.py:264] INFO - [00/01] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # val/loss
2026-03-13 17:34:52,918 [tb.py:216] INFO - 

[03/04] File ../gc_config_pbtxt-5d4ffd4f--92596-run-main-gc-03-j-04-seq04k-ns01-sp01-bs01-pad1-gc5--20260313.100206/events.out.tfevents.1773422840.ip-10-4-145-27.2353891.0:
2026-03-13 17:34:53,009 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    8.0000 [     0.0000] (    0.0000),     8.0000)  # data/train_seq_len_count
2026-03-13 17:34:53,009 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_max
2026-03-13 17:34:53,009 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_mean
2026-03-13 17:34:53,010 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_median
2026-03-13 17:34:53,010 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_min
2026-03-13 17:34:53,010 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # data/train_seq_len_std
2026-03-13 17:34:53,010 [tb.py:264] INFO - [243/256] : (avg (std), med) = (32768.0000 [     0.0000] (    0.0000), 32768.0000)  # train/global_tokens
2026-03-13 17:34:53,010 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.2907 [-    0.0005] (    0.1227),     0.2702)  # train/grad_norm
2026-03-13 17:34:53,010 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0199 [+    0.0000] (    0.0046),     0.0200)  # train/loss
2026-03-13 17:34:53,011 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # train/lr
2026-03-13 17:34:53,011 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.2289 [+    0.0068] (    0.0020),     0.2292)  # train/mfu
2026-03-13 17:34:53,011 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.7509 [-    0.1320] (    0.0366),     0.7474)  # train/time(s)
2026-03-13 17:34:53,011 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0043 [     0.0000] (    0.0023),     0.0043)  # train/total_tokens(B)
2026-03-13 17:34:53,011 [tb.py:264] INFO - [00/01] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # val/loss

API and Usage Example

Add trainer.gc_freq=1 to training config:

../gc_config_pbtxt-5d4ffd4f--92595-run-main-gc-03-j-03-seq04k-ns01-sp01-bs01-pad1-gc1--20260313.100201/run_verl.sh:    trainer.gc_freq=1 \

Design & Code Changes

N.A.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to manually control garbage collection during training to stabilize step times. A gc_freq configuration is added to disable automatic GC and periodically run gc.collect(). While the core idea is sound and the performance benefits are clear from the description, the implementation has a critical flaw: it doesn't guarantee that the garbage collector will be re-enabled if the training loop exits unexpectedly or completes normally without hitting the final step condition. This can lead to memory leaks. I've added comments with suggestions to use a try...finally block to make the GC management robust.

@sheilaliuxl sheilaliuxl changed the title [perf, training_utils] fix: Run gc regularly in trainer/sft_trainer<, _ray>.py to avoid spikes in step time. [perf, training_utils] fix: Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time. Mar 11, 2026
@sheilaliuxl sheilaliuxl changed the title [perf, training_utils] fix: Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time. [DRAFT][perf, training_utils] fix: Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time. Mar 13, 2026
@sheilaliuxl sheilaliuxl changed the title [DRAFT][perf, training_utils] fix: Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time. [perf, training_utils] fix: Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time. Mar 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants