[Bug] OOM and Worker Timeout during GRPO Training with Agent Tool Use #148

@james-yw

System Info:

  • Framework: verl / verl-tool
  • Model: Qwen2.5-Math-1.5B
  • Hardware: 8 × A100 80GB, ~1 TB RAM
  • Algorithm: GRPO
  • Ray Version: 2.53.0
  • vllm Version: 0.11.0

Description:
Training consistently crashes after roughly 74 to 119 steps (4 to 6 hours of wall-clock time).

Case 1 (OOM):
Ray kills workers due to memory pressure. System RAM usage hits >95%.

  • ray::KernelActor.execute consumes up to 138 GB.
  • FSDP param/optimizer offloading is enabled.
  • Log snippet: ray.exceptions.RayTaskError(OutOfMemoryError): Task was killed due to the node running low on memory.
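To narrow down which process group is growing in Case 1, one option is to log resident memory per process name at every training step and compare the numbers over time. The sketch below is not part of verl or verl-tool: `rss_by_name` and `sample_processes` are hypothetical helper names, and it assumes psutil (which appears in the package list in this report) is importable. Process names are whatever the OS reports and may be truncated, and RSS counts shared pages once per process that maps them, so summed totals can overstate real usage.

```python
from collections import defaultdict


def rss_by_name(samples):
    """Sum RSS per process name.

    `samples` is an iterable of (name, rss_bytes) pairs. Returns a list of
    (name, rss_gib) tuples sorted from largest to smallest.
    """
    totals = defaultdict(int)
    for name, rss_bytes in samples:
        totals[name] += rss_bytes
    return sorted(
        ((name, total / 2**30) for name, total in totals.items()),
        key=lambda item: item[1],
        reverse=True,
    )


def sample_processes():
    """Collect (name, rss_bytes) for every process visible to this user.

    Hypothetical helper: psutil is imported lazily so that rss_by_name can be
    used (and tested) without it.
    """
    import psutil  # third-party dependency

    samples = []
    for proc in psutil.process_iter(attrs=["name", "memory_info"]):
        mem = proc.info["memory_info"]
        if mem is not None:  # None when access to the process is denied
            samples.append((proc.info["name"] or "?", mem.rss))
    return samples


def log_top_memory(step, top_n=5):
    """Print the top_n process groups by RSS for one training step."""
    for name, gib in rss_by_name(sample_processes())[:top_n]:
        print(f"step={step} {name}: {gib:.1f} GiB")
```

Calling `log_top_memory(step)` once per step from the driver would show whether the growth is steady across steps (suggesting a leak) or jumps at a particular phase such as rollout or parameter offloading.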

Case 2 (Timeout):
Frequent "Attempt 1 failed due to timeout for traj_id" messages appear during AgentLoopWorker execution, followed by job termination.
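For readers unfamiliar with the pattern behind a message like this, the sketch below shows a generic per-attempt timeout with retries built on `asyncio.wait_for`. It is an illustration only and makes no claim about how AgentLoopWorker or verl-tool implement their retry logic; `call_with_retries`, `make_call`, `timeout_s`, `max_attempts`, and `backoff_s` are all hypothetical names.

```python
import asyncio


async def call_with_retries(make_call, timeout_s, max_attempts=3, backoff_s=0.0):
    """Await make_call() with a per-attempt timeout, retrying on timeout.

    `make_call` is a zero-argument function returning a fresh coroutine,
    because a coroutine object cannot be awaited twice.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except asyncio.TimeoutError:
            print(f"Attempt {attempt} failed due to timeout")
            if attempt == max_attempts:
                raise  # out of attempts: surface the timeout to the caller
            # Linear backoff before the next attempt (assumed policy).
            await asyncio.sleep(backoff_s * attempt)
```

With a pattern like this, a first-attempt timeout is recoverable as long as a later attempt succeeds, so the messages alone do not necessarily explain the job termination. If the tool server is itself starved of memory (as in Case 1), the timeouts may be a symptom rather than a separate cause.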

Steps to reproduce:
bash examples/train/math_tir/train_1.5b_grpo.sh
or
bash examples/train/math_tir/train_1.5b_dapo.sh
or
bash examples/train/math_tir/train_1.5b_mt_grpo.sh

Environment:

Package                            Version       Editable project location
---------------------------------- ------------- --------------------------------------------
absl-py                            2.4.0
accelerate                         1.12.0
acecoder                           0.0.1
aiofiles                           25.1.0
aiohappyeyeballs                   2.6.1
aiohttp                            3.13.3
aiohttp-cors                       0.8.1
aiosignal                          1.4.0
annotated-doc                      0.0.4
annotated-types                    0.7.0
anthropic                          0.77.1
antlr4-python3-runtime             4.9.3
anyio                              4.12.1
appdirs                            1.4.4
astor                              0.8.1
asttokens                          3.0.1
async-timeout                      5.0.1
attrs                              25.4.0
audioread                          3.1.0
av                                 16.1.0
beautifulsoup4                     4.14.3
blake3                             1.0.8
bs4                                0.0.2
cachetools                         7.0.0
cbor2                              5.8.0
certifi                            2026.1.4
cffi                               2.0.0
chardet                            5.2.0
charset-normalizer                 3.4.4
click                              8.2.1
cloudpickle                        3.1.2
codetiming                         1.4.0
colorful                           0.5.8
compressed-tensors                 0.11.0
cryptography                       46.0.4
cuda-bindings                      12.9.4
cuda-pathfinder                    1.3.3
cupy-cuda12x                       13.6.0
datasets                           4.5.0
decorator                          5.2.1
depyf                              0.19.0
dill                               0.4.0
diskcache                          5.6.3
distlib                            0.4.0
distro                             1.9.0
dnspython                          2.8.0
docstring-parser                   0.17.0
einops                             0.8.2
email-validator                    2.3.0
evalplus                           0.3.1
exceptiongroup                     1.3.1
executing                          2.2.1
fastapi                            0.128.0
fastapi-cli                        0.0.20
fastapi-cloud-cli                  0.11.0
fastar                             0.8.0
fastrlock                          0.8.3
filelock                           3.20.3
fire                               0.7.1
flash-attn                         2.8.3
frozendict                         2.4.7
frozenlist                         1.8.0
fsspec                             2025.10.0
func-timeout                       4.3.5
gguf                               0.17.1
gitdb                              4.0.12
gitpython                          3.1.46
google-ai-generativelanguage       0.6.15
google-api-core                    2.29.0
google-api-python-client           2.189.0
google-auth                        2.48.0
google-auth-httplib2               0.3.0
google-generativeai                0.8.6
googleapis-common-protos           1.72.0
grpcio                             1.76.0
grpcio-status                      1.71.2
h11                                0.16.0
hf-xet                             1.2.0
httpcore                           1.0.9
httplib2                           0.31.2
httptools                          0.7.1
httpx                              0.28.1
huggingface-hub                    0.36.1
hydra-core                         1.3.2
idna                               3.11
importlib-metadata                 8.7.1
interegular                        0.3.3
ipython                            8.38.0
jedi                               0.19.2
jinja2                             3.1.6
jiter                              0.13.0
joblib                             1.5.3
jsonschema                         4.26.0
jsonschema-specifications          2025.9.1
langid                             1.1.6
lark                               1.2.2
latex2sympy2-extended              1.11.0
lazy-loader                        0.4
librosa                            0.11.0
llguidance                         0.7.30
llvmlite                           0.44.0
lm-format-enforcer                 0.11.3
markdown                           3.10.1
markdown-it-py                     4.0.0
markupsafe                         3.0.3
math-verify                        0.9.0
matplotlib-inline                  0.2.1
mdurl                              0.1.2
mistral-common                     1.9.0
mpmath                             1.3.0
msgpack                            1.1.2
msgspec                            0.20.0
multidict                          6.7.1
multipledispatch                   1.0.0
multiprocess                       0.70.18
networkx                           3.4.2
ninja                              1.13.0
nltk                               3.9.2
numba                              0.61.2
numpy                              1.26.4
nvidia-cublas-cu12                 12.8.4.1
nvidia-cuda-cupti-cu12             12.8.90
nvidia-cuda-nvrtc-cu12             12.8.93
nvidia-cuda-runtime-cu12           12.8.90
nvidia-cudnn-cu12                  9.10.2.21
nvidia-cufft-cu12                  11.3.3.83
nvidia-cufile-cu12                 1.13.1.3
nvidia-curand-cu12                 10.3.9.90
nvidia-cusolver-cu12               11.7.3.90
nvidia-cusparse-cu12               12.5.8.93
nvidia-cusparselt-cu12             0.7.1
nvidia-nccl-cu12                   2.27.3
nvidia-nvjitlink-cu12              12.8.93
nvidia-nvshmem-cu12                3.4.5
nvidia-nvtx-cu12                   12.8.90
omegaconf                          2.3.0
openai                             2.16.0
openai-harmony                     0.0.8
opencensus                         0.11.4
opencensus-context                 0.1.3
opencv-python-headless             4.11.0.86
opentelemetry-api                  1.39.1
opentelemetry-exporter-prometheus  0.60b1
opentelemetry-proto                1.39.1
opentelemetry-sdk                  1.39.1
opentelemetry-semantic-conventions 0.60b1
orjson                             3.11.7
outlines-core                      0.2.11
packaging                          25.0
pandas                             2.3.3
parso                              0.8.5
partial-json-parser                0.2.1.1.post7
pdfminer-six                       20251230
pdfplumber                         0.11.9
peft                               0.18.1
pexpect                            4.9.0
pillow                             12.1.0
platformdirs                       4.5.1
pooch                              1.9.0
prometheus-client                  0.24.1
prometheus-fastapi-instrumentator  7.1.0
prompt-toolkit                     3.0.52
propcache                          0.4.1
proto-plus                         1.27.1
protobuf                           5.29.5
psutil                             7.2.2
psutils                            3.3.11
ptyprocess                         0.7.0
pure-eval                          0.2.3
puremagic                          1.30
py-cpuinfo                         9.0.0
py-spy                             0.4.1
pyarrow                            23.0.0
pyasn1                             0.6.2
pyasn1-modules                     0.4.2
pybase64                           1.4.3
pybind11                           3.0.1
pycountry                          24.6.1
pycparser                          3.0
pydantic                           2.12.5
pydantic-core                      2.41.5
pydantic-extra-types               2.11.0
pydantic-settings                  2.12.0
pyext                              0.7
pygments                           2.19.2
pylatexenc                         2.10
pyparsing                          3.3.2
pypdf                              6.6.2
pypdfium2                          5.3.0
python-dateutil                    2.9.0.post0
python-dotenv                      1.2.1
python-json-logger                 4.0.0
python-multipart                   0.0.22
pytz                               2025.2
pyvers                             0.1.0
pyyaml                             6.0.3
pyzmq                              27.1.0
qwen-omni-utils                    0.0.8
qwen-vl-utils                      0.0.14
ray                                2.53.0
referencing                        0.37.0
regex                              2026.1.15
requests                           2.32.5
rich                               14.3.2
rich-toolkit                       0.18.1
rignore                            0.7.6
rpds-py                            0.30.0
rsa                                4.9.1
safetensors                        0.7.0
scikit-learn                       1.7.2
scipy                              1.15.3
sentencepiece                      0.2.1
sentry-sdk                         2.51.0
setproctitle                       1.3.7
setuptools                         80.10.2
shellingham                        1.5.4
six                                1.17.0
smart-open                         7.5.0
smmap                              5.0.2
sniffio                            1.3.1
soundfile                          0.13.1
soupsieve                          2.8.3
soxr                               1.0.0
stack-data                         0.6.3
starlette                          0.50.0
stop-sequencer                     1.2.3
sympy                              1.14.0
tempdir                            0.7.1
tensorboard                        2.20.0
tensorboard-data-server            0.7.2
tensordict                         0.10.0
termcolor                          3.3.0
threadpoolctl                      3.6.0
tiktoken                           0.12.0
timeout-decorator                  0.5.0
tokenizers                         0.22.2
tomli                              2.4.0
torch                              2.8.0
torchaudio                         2.8.0
torchdata                          0.11.0
torchvision                        0.23.0
tqdm                               4.67.3
traitlets                          5.14.3
transformers                       4.57.6
tree-sitter                        0.25.2
tree-sitter-python                 0.25.0
triton                             3.4.0
typer                              0.21.1
typer-slim                         0.21.1
typing-extensions                  4.15.0
typing-inspection                  0.4.2
tzdata                             2025.3
uritemplate                        4.2.0
urllib3                            2.6.3
uvicorn                            0.40.0
uvloop                             0.22.1
verl                               0.7.0.dev0    /home/aiops/yangwen/workspace/verl-tool/verl
verl-tool                          0.1.0         /home/aiops/yangwen/workspace/verl-tool
virtualenv                         20.36.1
vllm                               0.11.0
wandb                              0.24.1
watchfiles                         1.1.1
wcwidth                            0.5.3
websockets                         16.0
werkzeug                           3.1.5
wget                               3.2
wrapt                              2.1.1
xformers                           0.0.32.post1
xgrammar                           0.1.25
xxhash                             3.6.0
yarl                               1.22.0
zipp                               3.23.0

Log files:

20260204_160017_train_1.5b_grpo.log
20260204_174243_train_1.5b_dapo.log
20260204_175555_train_1.5b_mt_grpo.log
