[perf, training_utils] fix: Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time. by sheilaliuxl · Pull Request #5549 · verl-project/verl

sheilaliuxl · 2026-03-10T21:20:36Z

What does this PR do?

Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time.

NOTE: Current PR is on top of [training_utils] fix: Fix total train_time in sft_trainer.py and add 1) step time 2) seq len stats for training data. #5579

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: https://github.com/verl-project/verl/pulls?q=is%3Apr+is%3Aopen+gc
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
- If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

I: Step Time
It decreases training step time consistently with gc_freq=1 (Dropping first 10 steps and last 3 steps):

All jobs run on the same node ip-10-4-145-27, to reduce inconsistency across different nodes

Median time: 0.8652 ==> 0.7247 (-16.2%)
Avg time: 0.8828 ==> 0.7324 (-17.0%)
Std time also decreases 0.0696 ==> 0.0362 (-48.0%) or improvement of 1.92x
- Important to compare across experiments whether step time reduction/ MFU increase is significant

II: MFU

MFU increases slightly on the worker as well, from gc_freq = -1 to 1:

avg MFU increases 0.2221 ==> 0.2297 (+3.4%): 2026-03-13 17:34:52,918 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 0.2297 [+ 0.0076] ( 0.0020), 0.2301) # train/mfu
std MFU decreases 00093 ==> 0.0020 (-78.5%) or improvement of 4.7x

../gc_config_pbtxt-5d4ffd4f--92593-run-main-gc-03-j-01-seq04k-ns01-sp01-bs01-pad1-gc-1--20260313.100151/sliuxl-qwen2d5_3b-verl-92593.out:hostname=ip-10-4-145-27
../gc_config_pbtxt-5d4ffd4f--92594-run-main-gc-03-j-02-seq04k-ns01-sp01-bs01-pad1-gc0--20260313.100156/sliuxl-qwen2d5_3b-verl-92594.out:hostname=ip-10-4-145-27
../gc_config_pbtxt-5d4ffd4f--92595-run-main-gc-03-j-03-seq04k-ns01-sp01-bs01-pad1-gc1--20260313.100201/sliuxl-qwen2d5_3b-verl-92595.out:hostname=ip-10-4-145-27
../gc_config_pbtxt-5d4ffd4f--92596-run-main-gc-03-j-04-seq04k-ns01-sp01-bs01-pad1-gc5--20260313.100206/sliuxl-qwen2d5_3b-verl-92596.out:hostname=ip-10-4-145-27


../gc_config_pbtxt-5d4ffd4f--92593-run-main-gc-03-j-01-seq04k-ns01-sp01-bs01-pad1-gc-1--20260313.100151/run_verl.sh:    trainer.gc_freq=-1 \
../gc_config_pbtxt-5d4ffd4f--92594-run-main-gc-03-j-02-seq04k-ns01-sp01-bs01-pad1-gc0--20260313.100156/run_verl.sh:    trainer.gc_freq=0 \
../gc_config_pbtxt-5d4ffd4f--92595-run-main-gc-03-j-03-seq04k-ns01-sp01-bs01-pad1-gc1--20260313.100201/run_verl.sh:    trainer.gc_freq=1 \
../gc_config_pbtxt-5d4ffd4f--92596-run-main-gc-03-j-04-seq04k-ns01-sp01-bs01-pad1-gc5--20260313.100206/run_verl.sh:    trainer.gc_freq=5 \

1. `Step Time`

+ python /fsx/ubuntu/users/sliuxl/tb.py --files '../*run-main-gc-03*/events*' --metric 'train/time(s)' --format 10.4f
2026-03-13 17:34:42,155 [tb.py:235] INFO - Set benchmark with index 0
2026-03-13 17:34:42,156 [tb.py:264] INFO - [243/256] ../gc_config_pbtxt-5d4ffd4f--92593-run-main-gc-03-j-01-seq04k-ns01-sp01-bs01-pad1-gc-1--20260313.100151/events.out.tfevents.1773421377.ip-10-4-145-27.2342312.0: (avg (std), med) = (    0.8828 (    0.0696),     0.8652)  # train/time(s)
2026-03-13 17:34:42,246 [tb.py:264] INFO - [243/256] ../gc_config_pbtxt-5d4ffd4f--92594-run-main-gc-03-j-02-seq04k-ns01-sp01-bs01-pad1-gc0--20260313.100156/events.out.tfevents.1773421852.ip-10-4-145-27.2346125.0: (avg (std), med) = (    0.8562 (    0.0977),     0.9187)  # train/time(s)
2026-03-13 17:34:42,336 [tb.py:264] INFO - [243/256] ../gc_config_pbtxt-5d4ffd4f--92595-run-main-gc-03-j-03-seq04k-ns01-sp01-bs01-pad1-gc1--20260313.100201/events.out.tfevents.1773422312.ip-10-4-145-27.2349978.0: (avg (std), med) = (    0.7324 (    0.0362),     0.7247)  # train/time(s)
2026-03-13 17:34:42,428 [tb.py:264] INFO - [243/256] ../gc_config_pbtxt-5d4ffd4f--92596-run-main-gc-03-j-04-seq04k-ns01-sp01-bs01-pad1-gc5--20260313.100206/events.out.tfevents.1773422840.ip-10-4-145-27.2353891.0: (avg (std), med) = (    0.7509 (    0.0366),     0.7474)  # train/time(s)

2. All Metrics

data/train_seq_len* metrics are exactly the same
train/time(s) decrease and train/mfu increases

+ python /fsx/ubuntu/users/sliuxl/tb.py --files '../*run-main-gc-03*/events*' --metric '*' --format 10.4f
2026-03-13 17:34:52,636 [tb.py:206] WARNING - Unknonw metric `*`!
2026-03-13 17:34:52,636 [tb.py:216] INFO - 

[00/04] File ../gc_config_pbtxt-5d4ffd4f--92593-run-main-gc-03-j-01-seq04k-ns01-sp01-bs01-pad1-gc-1--20260313.100151/events.out.tfevents.1773421377.ip-10-4-145-27.2342312.0:
2026-03-13 17:34:52,729 [tb.py:235] INFO - Set benchmark with index 0
2026-03-13 17:34:52,730 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    8.0000 [     0.0000] (    0.0000),     8.0000)  # data/train_seq_len_count
2026-03-13 17:34:52,730 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_max
2026-03-13 17:34:52,730 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_mean
2026-03-13 17:34:52,731 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_median
2026-03-13 17:34:52,731 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_min
2026-03-13 17:34:52,731 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # data/train_seq_len_std
2026-03-13 17:34:52,731 [tb.py:264] INFO - [243/256] : (avg (std), med) = (32768.0000 [     0.0000] (    0.0000), 32768.0000)  # train/global_tokens
2026-03-13 17:34:52,731 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.2912 [     0.0000] (    0.1269),     0.2669)  # train/grad_norm
2026-03-13 17:34:52,731 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0199 [     0.0000] (    0.0046),     0.0199)  # train/loss
2026-03-13 17:34:52,732 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # train/lr
2026-03-13 17:34:52,732 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.2221 [     0.0000] (    0.0093),     0.2253)  # train/mfu
2026-03-13 17:34:52,732 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.8828 [     0.0000] (    0.0696),     0.8652)  # train/time(s)
2026-03-13 17:34:52,732 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0043 [     0.0000] (    0.0023),     0.0043)  # train/total_tokens(B)
2026-03-13 17:34:52,732 [tb.py:264] INFO - [00/01] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # val/loss
2026-03-13 17:34:52,732 [tb.py:216] INFO - 

[01/04] File ../gc_config_pbtxt-5d4ffd4f--92594-run-main-gc-03-j-02-seq04k-ns01-sp01-bs01-pad1-gc0--20260313.100156/events.out.tfevents.1773421852.ip-10-4-145-27.2346125.0:
2026-03-13 17:34:52,823 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    8.0000 [     0.0000] (    0.0000),     8.0000)  # data/train_seq_len_count
2026-03-13 17:34:52,823 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_max
2026-03-13 17:34:52,823 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_mean
2026-03-13 17:34:52,824 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_median
2026-03-13 17:34:52,824 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_min
2026-03-13 17:34:52,824 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # data/train_seq_len_std
2026-03-13 17:34:52,824 [tb.py:264] INFO - [243/256] : (avg (std), med) = (32768.0000 [     0.0000] (    0.0000), 32768.0000)  # train/global_tokens
2026-03-13 17:34:52,824 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.2900 [-    0.0012] (    0.1189),     0.2685)  # train/grad_norm
2026-03-13 17:34:52,824 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0199 [     0.0000] (    0.0046),     0.0199)  # train/loss
2026-03-13 17:34:52,825 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # train/lr
2026-03-13 17:34:52,825 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.2288 [+    0.0067] (    0.0029),     0.2290)  # train/mfu
2026-03-13 17:34:52,825 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.8562 [-    0.0266] (    0.0977),     0.9187)  # train/time(s)
2026-03-13 17:34:52,825 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0043 [     0.0000] (    0.0023),     0.0043)  # train/total_tokens(B)
2026-03-13 17:34:52,825 [tb.py:264] INFO - [00/01] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # val/loss
2026-03-13 17:34:52,825 [tb.py:216] INFO - 

[02/04] File ../gc_config_pbtxt-5d4ffd4f--92595-run-main-gc-03-j-03-seq04k-ns01-sp01-bs01-pad1-gc1--20260313.100201/events.out.tfevents.1773422312.ip-10-4-145-27.2349978.0:
2026-03-13 17:34:52,916 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    8.0000 [     0.0000] (    0.0000),     8.0000)  # data/train_seq_len_count
2026-03-13 17:34:52,916 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_max
2026-03-13 17:34:52,916 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_mean
2026-03-13 17:34:52,916 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_median
2026-03-13 17:34:52,917 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_min
2026-03-13 17:34:52,917 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # data/train_seq_len_std
2026-03-13 17:34:52,917 [tb.py:264] INFO - [243/256] : (avg (std), med) = (32768.0000 [     0.0000] (    0.0000), 32768.0000)  # train/global_tokens
2026-03-13 17:34:52,917 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.2897 [-    0.0015] (    0.1242),     0.2696)  # train/grad_norm
2026-03-13 17:34:52,917 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0198 [-    0.0000] (    0.0046),     0.0200)  # train/loss
2026-03-13 17:34:52,918 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # train/lr
2026-03-13 17:34:52,918 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.2297 [+    0.0076] (    0.0020),     0.2301)  # train/mfu
2026-03-13 17:34:52,918 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.7324 [-    0.1504] (    0.0362),     0.7247)  # train/time(s)
2026-03-13 17:34:52,918 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0043 [     0.0000] (    0.0023),     0.0043)  # train/total_tokens(B)
2026-03-13 17:34:52,918 [tb.py:264] INFO - [00/01] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # val/loss
2026-03-13 17:34:52,918 [tb.py:216] INFO - 

[03/04] File ../gc_config_pbtxt-5d4ffd4f--92596-run-main-gc-03-j-04-seq04k-ns01-sp01-bs01-pad1-gc5--20260313.100206/events.out.tfevents.1773422840.ip-10-4-145-27.2353891.0:
2026-03-13 17:34:53,009 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    8.0000 [     0.0000] (    0.0000),     8.0000)  # data/train_seq_len_count
2026-03-13 17:34:53,009 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_max
2026-03-13 17:34:53,009 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_mean
2026-03-13 17:34:53,010 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_median
2026-03-13 17:34:53,010 [tb.py:264] INFO - [243/256] : (avg (std), med) = ( 4096.0000 [     0.0000] (    0.0000),  4096.0000)  # data/train_seq_len_min
2026-03-13 17:34:53,010 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # data/train_seq_len_std
2026-03-13 17:34:53,010 [tb.py:264] INFO - [243/256] : (avg (std), med) = (32768.0000 [     0.0000] (    0.0000), 32768.0000)  # train/global_tokens
2026-03-13 17:34:53,010 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.2907 [-    0.0005] (    0.1227),     0.2702)  # train/grad_norm
2026-03-13 17:34:53,010 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0199 [+    0.0000] (    0.0046),     0.0200)  # train/loss
2026-03-13 17:34:53,011 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # train/lr
2026-03-13 17:34:53,011 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.2289 [+    0.0068] (    0.0020),     0.2292)  # train/mfu
2026-03-13 17:34:53,011 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.7509 [-    0.1320] (    0.0366),     0.7474)  # train/time(s)
2026-03-13 17:34:53,011 [tb.py:264] INFO - [243/256] : (avg (std), med) = (    0.0043 [     0.0000] (    0.0023),     0.0043)  # train/total_tokens(B)
2026-03-13 17:34:53,011 [tb.py:264] INFO - [00/01] : (avg (std), med) = (    0.0000 [     0.0000] (    0.0000),     0.0000)  # val/loss

API and Usage Example

Add trainer.gc_freq=1 to training config:

../gc_config_pbtxt-5d4ffd4f--92595-run-main-gc-03-j-03-seq04k-ns01-sp01-bs01-pad1-gc1--20260313.100201/run_verl.sh:    trainer.gc_freq=1 \

Design & Code Changes

N.A.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
If your PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.

gemini-code-assist

Code Review

This pull request introduces a mechanism to manually control garbage collection during training to stabilize step times. A gc_freq configuration is added to disable automatic GC and periodically run gc.collect(). While the core idea is sound and the performance benefits are clear from the description, the implementation has a critical flaw: it doesn't guarantee that the garbage collector will be re-enabled if the training loop exits unexpectedly or completes normally without hitting the final step condition. This can lead to memory leaks. I've added comments with suggestions to use a try...finally block to make the GC management robust.

verl/trainer/sft_trainer.py

verl/trainer/sft_trainer_ray.py

…n stats for training data.

sheilaliuxl requested review from PeterSH6, eric-haibin-lin, tongyx361 and vermouth1992 as code owners March 10, 2026 21:20

gemini-code-assist bot reviewed Mar 10, 2026

View reviewed changes

verl/trainer/sft_trainer.py Outdated Show resolved Hide resolved

verl/trainer/sft_trainer_ray.py Outdated Show resolved Hide resolved

sheilaliuxl force-pushed the dev-gc branch from 7e6a1d7 to 4a60f75 Compare March 10, 2026 21:29

vermouth1992 reviewed Mar 11, 2026

View reviewed changes

verl/trainer/sft_trainer_ray.py Outdated Show resolved Hide resolved

sheilaliuxl force-pushed the dev-gc branch from 4a60f75 to 9705a0a Compare March 11, 2026 02:35

sheilaliuxl changed the title ~~[perf, training_utils] fix: Run gc regularly in trainer/sft_trainer<, _ray>.py to avoid spikes in step time.~~ [perf, training_utils] fix: Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time. Mar 11, 2026

Fix total train_time in sft_trainer.py and add 1) step time 2) seq le…

bae054c

…n stats for training data.

sheilaliuxl changed the title ~~[perf, training_utils] fix: Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time.~~ [DRAFT][perf, training_utils] fix: Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time. Mar 13, 2026

Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time.

6ee9db5

sheilaliuxl force-pushed the dev-gc branch from 9705a0a to 6ee9db5 Compare March 13, 2026 17:37

sheilaliuxl changed the title ~~[DRAFT][perf, training_utils] fix: Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time.~~ [perf, training_utils] fix: Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time. Mar 13, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[perf, training_utils] fix: Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time.#5549

[perf, training_utils] fix: Run gc regularly in trainer/sft_trainer.py to avoid spikes in step time.#5549
sheilaliuxl wants to merge 2 commits intoverl-project:mainfrom
sheilaliuxl:dev-gc

sheilaliuxl commented Mar 10, 2026 •

edited

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sheilaliuxl commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Checklist Before Starting

Test

1. Step Time

2. All Metrics

API and Usage Example

Design & Code Changes

Checklist Before Submitting

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sheilaliuxl commented Mar 10, 2026 •

edited

Loading

1. `Step Time`