Description
I’m trying to run the train_ppo_qwen_base_math_lv35_1_node.sh script on a node with 8 A800 GPUs, but the job keeps failing with RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable. It’s strange because other CUDA jobs run fine on the same node. Can you help me figure out what’s going on? Thanks a bunch! Here is the command I submitted and the full output:
$ ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{
"pip": ["ray==2.12.0", "latex2sympy2", "timeout_decorator"]
}' -- /bin/bash /research/d1/gds/ytyang/kwchen/simpleRL-reason/train/examples/script/train_ppo_qwen_base_math_lv35_1_node.sh
Job submission server address: http://127.0.0.1:8265
Job 'raysubmit_chztKAY2bCTz8zMz' submitted successfully
Next steps
Query the logs of the job:
ray job logs raysubmit_chztKAY2bCTz8zMz
Query the status of the job:
ray job status raysubmit_chztKAY2bCTz8zMz
Request the job to be stopped:
ray job stop raysubmit_chztKAY2bCTz8zMz
Tailing logs until the job exits (disable with --no-wait):
[2025-02-11 10:35:40,750] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The default cache directory for DeepSpeed Triton autotune, /uac/gds/ytyang/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
2025-02-11 10:36:06,877 INFO worker.py:1429 -- Using address 192.168.50.187:6379 set in the environment variable RAY_ADDRESS
2025-02-11 10:36:06,877 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 192.168.50.187:6379...
2025-02-11 10:36:06,895 INFO worker.py:1740 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
(pid=2013613) [2025-02-11 10:36:13,623] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=2013613) Warning: The default cache directory for DeepSpeed Triton autotune, /uac/gds/ytyang/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
(pid=2013729) [2025-02-11 10:36:24,593] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=2013728) [2025-02-11 10:36:24,638] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=2013729) Warning: The default cache directory for DeepSpeed Triton autotune, /uac/gds/ytyang/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
(pid=2013937) [2025-02-11 10:36:36,200] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=2013728) Warning: The default cache directory for DeepSpeed Triton autotune, /uac/gds/ytyang/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
(pid=2013938) [2025-02-11 10:36:36,176] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=2013937) Warning: The default cache directory for DeepSpeed Triton autotune, /uac/gds/ytyang/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
(ActorModelRayActorBOX pid=2013613) [2025-02-11 10:36:42,938] [INFO] [comm.py:652:init_distributed] cdb=None
(ActorModelRayActorBOX pid=2013613) [2025-02-11 10:36:42,939] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
(pid=2013938) Warning: The default cache directory for DeepSpeed Triton autotune, /uac/gds/ytyang/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Traceback (most recent call last):
File "/research/d1/gds/ytyang/kwchen/simpleRL-reason/train/openrlhf/cli/train_ppo_ray_box.py", line 395, in
train(args)
File "/research/d1/gds/ytyang/kwchen/simpleRL-reason/train/openrlhf/cli/train_ppo_ray_box.py", line 148, in train
ray.get(refs)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::ReferenceModelRayActor.init_model_from_pretrained() (pid=2013729, ip=192.168.50.187, actor_id=bd89a830d95b2b36d5753bd108000000, repr=<openrlhf.trainer.ray.launcher.ReferenceModelRayActor object at 0x7f87600676a0>)
File "/research/d1/gds/ytyang/kwchen/simpleRL-reason/train/openrlhf/trainer/ray/launcher.py", line 75, in init_model_from_pretrained
model = Actor(
File "/research/d1/gds/ytyang/kwchen/simpleRL-reason/train/openrlhf/models/actor.py", line 65, in init
self.model = AutoModelForCausalLM.from_pretrained(
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
return model_class.from_pretrained(
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4097, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
f(module, *args, **kwargs)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1083, in init
self.model = Qwen2Model(config)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
f(module, *args, **kwargs)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 789, in init
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
f(module, *args, **kwargs)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 144, in init
self.weight = Parameter(torch.empty((num_embeddings, embedding_dim), **factory_kwargs),
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 240, in wrapped_fn
tensor: Tensor = fn(*args, **kwargs)
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(ActorModelRayActorBOX pid=2013613) [2025-02-11 10:36:44,327] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
(ActorModelRayActorBOX pid=2013613) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
(ReferenceModelRayActor pid=2013729) [2025-02-11 10:36:44,382] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 0, num_elems = 0.00B
(ReferenceModelRayActor pid=2013937) [2025-02-11 10:36:42,937] [INFO] [comm.py:652:init_distributed] cdb=None [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(ReferenceModelRayActor pid=2013729) [2025-02-11 10:36:42,937] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
(ReferenceModelRayActor pid=2013937) [2025-02-11 10:36:44,335] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2 [repeated 3x across cluster]
(ReferenceModelRayActor pid=2013937) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda'). [repeated 3x across cluster]
Job 'raysubmit_chztKAY2bCTz8zMz' failed
Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(ActorModelRayActorBOX pid=2013613) [2025-02-11 10:36:44,327] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
(ActorModelRayActorBOX pid=2013613) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
(ReferenceModelRayActor pid=2013729) [2025-02-11 10:36:44,382] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 0, num_elems = 0.00B
(ReferenceModelRayActor pid=2013937) [2025-02-11 10:36:42,937] [INFO] [comm.py:652:init_distributed] cdb=None [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(ReferenceModelRayActor pid=2013729) [2025-02-11 10:36:42,937] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
(ReferenceModelRayActor pid=2013937) [2025-02-11 10:36:44,335] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2 [repeated 3x across cluster]
(ReferenceModelRayActor pid=2013937) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda'). [repeated 3x across cluster]
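
In case it helps with diagnosis: here is a minimal check I can run outside Ray (just a sketch; it only assumes PyTorch from the same conda env and the 8 GPUs being visible) to see whether each device can still be allocated on at all:

import torch

# Try a tiny allocation on every visible GPU. A device that is already held
# by another process while in Exclusive_Process compute mode typically fails
# here with the same "busy or unavailable" error as in the traceback above.
for i in range(torch.cuda.device_count()):
    try:
        torch.empty(1, device=f"cuda:{i}")
        print(f"cuda:{i} ({torch.cuda.get_device_name(i)}): OK")
    except RuntimeError as err:
        print(f"cuda:{i}: FAILED - {err}")

nvidia-smi --query-gpu=index,compute_mode --format=csv should additionally show whether the cards are in Default or Exclusive_Process compute mode, and plain nvidia-smi lists any processes currently holding them. I can post that output as well if it is useful.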