System Info:
- Framework: verl / verl-tool
- Model: Qwen2.5-Math-1.5B
- Hardware: 8 × A100 80G, ~1TB RAM
- Algorithm: GRPO
- Ray Version: 2.53.0
- vllm Version: 0.11.0
Description:
The training process consistently crashes after roughly 74-119 steps (about 4-6 hours).
Case 1 (OOM):
Ray kills workers due to memory pressure. System RAM usage hits >95%.
- ray::KernelActor.execute consuming up to 138GB.
- FSDP param/optimizer offloading is enabled.
- Log snippet:
ray.exceptions.RayTaskError(OutOfMemoryError): Task was killed due to the node running low on memory.
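To see how quickly system RAM approaches the >95% level described in Case 1, a small logger can be run next to the training job. The following is only a minimal sketch, assuming a Linux host that exposes /proc/meminfo; the helper names parse_meminfo, used_fraction and sample are hypothetical and not part of verl, verl-tool or Ray.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_fraction(fields: dict[str, int]) -> float:
    """Fraction of RAM in use, computed as 1 - MemAvailable / MemTotal."""
    return 1.0 - fields["MemAvailable"] / fields["MemTotal"]


def sample(path: str = "/proc/meminfo") -> float:
    """Read the current memory state once (Linux only) and return the used fraction."""
    return used_fraction(parse_meminfo(Path(path).read_text()))
```

Calling sample() once per minute and writing the value alongside the training step would show whether usage grows steadily over the 4-6 hours or jumps shortly before Ray kills the worker.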
Case 2 (Timeout):
Frequent "Attempt 1 failed due to timeout for traj_id" messages appear during AgentLoopWorker execution, followed by job termination.
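To quantify how often these timeouts occur in a log, the messages can be counted per attempt number. This is a rough sketch: the regular expression is an assumption based only on the message text quoted in Case 2, and the real log lines may carry extra prefixes or a different layout. count_timeouts is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed message shape, taken from the text quoted in Case 2;
# adjust the pattern if the actual log format differs.
TIMEOUT_RE = re.compile(r"Attempt (\d+) failed due to timeout for traj_id")


def count_timeouts(lines):
    """Return a Counter mapping attempt number -> number of timeout messages."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Passing an open file handle for one of the attached logs (for example 20260204_160017_train_1.5b_grpo.log) as lines gives the totals; a high count for attempts beyond the first would suggest that retries are also timing out.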
Steps to reproduce:
bash examples/train/math_tir/train_1.5b_grpo.sh
or
bash examples/train/math_tir/train_1.5b_dapo.sh
or
bash examples/train/math_tir/train_1.5b_mt_grpo.sh
Environment
Package Version Editable project location
---------------------------------- ------------- --------------------------------------------
absl-py 2.4.0
accelerate 1.12.0
acecoder 0.0.1
aiofiles 25.1.0
aiohappyeyeballs 2.6.1
aiohttp 3.13.3
aiohttp-cors 0.8.1
aiosignal 1.4.0
annotated-doc 0.0.4
annotated-types 0.7.0
anthropic 0.77.1
antlr4-python3-runtime 4.9.3
anyio 4.12.1
appdirs 1.4.4
astor 0.8.1
asttokens 3.0.1
async-timeout 5.0.1
attrs 25.4.0
audioread 3.1.0
av 16.1.0
beautifulsoup4 4.14.3
blake3 1.0.8
bs4 0.0.2
cachetools 7.0.0
cbor2 5.8.0
certifi 2026.1.4
cffi 2.0.0
chardet 5.2.0
charset-normalizer 3.4.4
click 8.2.1
cloudpickle 3.1.2
codetiming 1.4.0
colorful 0.5.8
compressed-tensors 0.11.0
cryptography 46.0.4
cuda-bindings 12.9.4
cuda-pathfinder 1.3.3
cupy-cuda12x 13.6.0
datasets 4.5.0
decorator 5.2.1
depyf 0.19.0
dill 0.4.0
diskcache 5.6.3
distlib 0.4.0
distro 1.9.0
dnspython 2.8.0
docstring-parser 0.17.0
einops 0.8.2
email-validator 2.3.0
evalplus 0.3.1
exceptiongroup 1.3.1
executing 2.2.1
fastapi 0.128.0
fastapi-cli 0.0.20
fastapi-cloud-cli 0.11.0
fastar 0.8.0
fastrlock 0.8.3
filelock 3.20.3
fire 0.7.1
flash-attn 2.8.3
frozendict 2.4.7
frozenlist 1.8.0
fsspec 2025.10.0
func-timeout 4.3.5
gguf 0.17.1
gitdb 4.0.12
gitpython 3.1.46
google-ai-generativelanguage 0.6.15
google-api-core 2.29.0
google-api-python-client 2.189.0
google-auth 2.48.0
google-auth-httplib2 0.3.0
google-generativeai 0.8.6
googleapis-common-protos 1.72.0
grpcio 1.76.0
grpcio-status 1.71.2
h11 0.16.0
hf-xet 1.2.0
httpcore 1.0.9
httplib2 0.31.2
httptools 0.7.1
httpx 0.28.1
huggingface-hub 0.36.1
hydra-core 1.3.2
idna 3.11
importlib-metadata 8.7.1
interegular 0.3.3
ipython 8.38.0
jedi 0.19.2
jinja2 3.1.6
jiter 0.13.0
joblib 1.5.3
jsonschema 4.26.0
jsonschema-specifications 2025.9.1
langid 1.1.6
lark 1.2.2
latex2sympy2-extended 1.11.0
lazy-loader 0.4
librosa 0.11.0
llguidance 0.7.30
llvmlite 0.44.0
lm-format-enforcer 0.11.3
markdown 3.10.1
markdown-it-py 4.0.0
markupsafe 3.0.3
math-verify 0.9.0
matplotlib-inline 0.2.1
mdurl 0.1.2
mistral-common 1.9.0
mpmath 1.3.0
msgpack 1.1.2
msgspec 0.20.0
multidict 6.7.1
multipledispatch 1.0.0
multiprocess 0.70.18
networkx 3.4.2
ninja 1.13.0
nltk 3.9.2
numba 0.61.2
numpy 1.26.4
nvidia-cublas-cu12 12.8.4.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12 9.10.2.21
nvidia-cufft-cu12 11.3.3.83
nvidia-cufile-cu12 1.13.1.3
nvidia-curand-cu12 10.3.9.90
nvidia-cusolver-cu12 11.7.3.90
nvidia-cusparse-cu12 12.5.8.93
nvidia-cusparselt-cu12 0.7.1
nvidia-nccl-cu12 2.27.3
nvidia-nvjitlink-cu12 12.8.93
nvidia-nvshmem-cu12 3.4.5
nvidia-nvtx-cu12 12.8.90
omegaconf 2.3.0
openai 2.16.0
openai-harmony 0.0.8
opencensus 0.11.4
opencensus-context 0.1.3
opencv-python-headless 4.11.0.86
opentelemetry-api 1.39.1
opentelemetry-exporter-prometheus 0.60b1
opentelemetry-proto 1.39.1
opentelemetry-sdk 1.39.1
opentelemetry-semantic-conventions 0.60b1
orjson 3.11.7
outlines-core 0.2.11
packaging 25.0
pandas 2.3.3
parso 0.8.5
partial-json-parser 0.2.1.1.post7
pdfminer-six 20251230
pdfplumber 0.11.9
peft 0.18.1
pexpect 4.9.0
pillow 12.1.0
platformdirs 4.5.1
pooch 1.9.0
prometheus-client 0.24.1
prometheus-fastapi-instrumentator 7.1.0
prompt-toolkit 3.0.52
propcache 0.4.1
proto-plus 1.27.1
protobuf 5.29.5
psutil 7.2.2
psutils 3.3.11
ptyprocess 0.7.0
pure-eval 0.2.3
puremagic 1.30
py-cpuinfo 9.0.0
py-spy 0.4.1
pyarrow 23.0.0
pyasn1 0.6.2
pyasn1-modules 0.4.2
pybase64 1.4.3
pybind11 3.0.1
pycountry 24.6.1
pycparser 3.0
pydantic 2.12.5
pydantic-core 2.41.5
pydantic-extra-types 2.11.0
pydantic-settings 2.12.0
pyext 0.7
pygments 2.19.2
pylatexenc 2.10
pyparsing 3.3.2
pypdf 6.6.2
pypdfium2 5.3.0
python-dateutil 2.9.0.post0
python-dotenv 1.2.1
python-json-logger 4.0.0
python-multipart 0.0.22
pytz 2025.2
pyvers 0.1.0
pyyaml 6.0.3
pyzmq 27.1.0
qwen-omni-utils 0.0.8
qwen-vl-utils 0.0.14
ray 2.53.0
referencing 0.37.0
regex 2026.1.15
requests 2.32.5
rich 14.3.2
rich-toolkit 0.18.1
rignore 0.7.6
rpds-py 0.30.0
rsa 4.9.1
safetensors 0.7.0
scikit-learn 1.7.2
scipy 1.15.3
sentencepiece 0.2.1
sentry-sdk 2.51.0
setproctitle 1.3.7
setuptools 80.10.2
shellingham 1.5.4
six 1.17.0
smart-open 7.5.0
smmap 5.0.2
sniffio 1.3.1
soundfile 0.13.1
soupsieve 2.8.3
soxr 1.0.0
stack-data 0.6.3
starlette 0.50.0
stop-sequencer 1.2.3
sympy 1.14.0
tempdir 0.7.1
tensorboard 2.20.0
tensorboard-data-server 0.7.2
tensordict 0.10.0
termcolor 3.3.0
threadpoolctl 3.6.0
tiktoken 0.12.0
timeout-decorator 0.5.0
tokenizers 0.22.2
tomli 2.4.0
torch 2.8.0
torchaudio 2.8.0
torchdata 0.11.0
torchvision 0.23.0
tqdm 4.67.3
traitlets 5.14.3
transformers 4.57.6
tree-sitter 0.25.2
tree-sitter-python 0.25.0
triton 3.4.0
typer 0.21.1
typer-slim 0.21.1
typing-extensions 4.15.0
typing-inspection 0.4.2
tzdata 2025.3
uritemplate 4.2.0
urllib3 2.6.3
uvicorn 0.40.0
uvloop 0.22.1
verl 0.7.0.dev0 /home/aiops/yangwen/workspace/verl-tool/verl
verl-tool 0.1.0 /home/aiops/yangwen/workspace/verl-tool
virtualenv 20.36.1
vllm 0.11.0
wandb 0.24.1
watchfiles 1.1.1
wcwidth 0.5.3
websockets 16.0
werkzeug 3.1.5
wget 3.2
wrapt 2.1.1
xformers 0.0.32.post1
xgrammar 0.1.25
xxhash 3.6.0
yarl 1.22.0
zipp 3.23.0
Log files:
20260204_160017_train_1.5b_grpo.log
20260204_174243_train_1.5b_dapo.log
20260204_175555_train_1.5b_mt_grpo.log