
Unavailable CUDA Devices #33

Open
@kaiwenKevinn

Description


I’m trying to run the train_ppo_qwen_base_math_lv35_1_node.sh script on a node with 8 A800 GPUs, but I keep getting RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable. It’s strange because other GPU jobs run fine on the same node. Can you help me figure out what’s going on? Thanks a bunch! Here’s the command and the full output:

$ ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{
        "pip": ["ray==2.12.0", "latex2sympy2", "timeout_decorator"]
    }' \
    -- /bin/bash /research/d1/gds/ytyang/kwchen/simpleRL-reason/train/examples/script/train_ppo_qwen_base_math_lv35_1_node.sh
Job submission server address: http://127.0.0.1:8265


Job 'raysubmit_chztKAY2bCTz8zMz' submitted successfully

Next steps
Query the logs of the job:
ray job logs raysubmit_chztKAY2bCTz8zMz
Query the status of the job:
ray job status raysubmit_chztKAY2bCTz8zMz
Request the job to be stopped:
ray job stop raysubmit_chztKAY2bCTz8zMz

Tailing logs until the job exits (disable with --no-wait):
[2025-02-11 10:35:40,750] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The default cache directory for DeepSpeed Triton autotune, /uac/gds/ytyang/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
2025-02-11 10:36:06,877 INFO worker.py:1429 -- Using address 192.168.50.187:6379 set in the environment variable RAY_ADDRESS
2025-02-11 10:36:06,877 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 192.168.50.187:6379...
2025-02-11 10:36:06,895 INFO worker.py:1740 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
(pid=2013613) [2025-02-11 10:36:13,623] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=2013613) Warning: The default cache directory for DeepSpeed Triton autotune, /uac/gds/ytyang/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
(pid=2013729) [2025-02-11 10:36:24,593] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=2013728) [2025-02-11 10:36:24,638] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=2013729) Warning: The default cache directory for DeepSpeed Triton autotune, /uac/gds/ytyang/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
(pid=2013937) [2025-02-11 10:36:36,200] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=2013728) Warning: The default cache directory for DeepSpeed Triton autotune, /uac/gds/ytyang/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
(pid=2013938) [2025-02-11 10:36:36,176] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=2013937) Warning: The default cache directory for DeepSpeed Triton autotune, /uac/gds/ytyang/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
(ActorModelRayActorBOX pid=2013613) [2025-02-11 10:36:42,938] [INFO] [comm.py:652:init_distributed] cdb=None
(ActorModelRayActorBOX pid=2013613) [2025-02-11 10:36:42,939] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
(pid=2013938) Warning: The default cache directory for DeepSpeed Triton autotune, /uac/gds/ytyang/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Traceback (most recent call last):
File "/research/d1/gds/ytyang/kwchen/simpleRL-reason/train/openrlhf/cli/train_ppo_ray_box.py", line 395, in
train(args)
File "/research/d1/gds/ytyang/kwchen/simpleRL-reason/train/openrlhf/cli/train_ppo_ray_box.py", line 148, in train
ray.get(refs)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::ReferenceModelRayActor.init_model_from_pretrained() (pid=2013729, ip=192.168.50.187, actor_id=bd89a830d95b2b36d5753bd108000000, repr=<openrlhf.trainer.ray.launcher.ReferenceModelRayActor object at 0x7f87600676a0>)
File "/research/d1/gds/ytyang/kwchen/simpleRL-reason/train/openrlhf/trainer/ray/launcher.py", line 75, in init_model_from_pretrained
model = Actor(
File "/research/d1/gds/ytyang/kwchen/simpleRL-reason/train/openrlhf/models/actor.py", line 65, in init
self.model = AutoModelForCausalLM.from_pretrained(
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
return model_class.from_pretrained(
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4097, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
f(module, *args, **kwargs)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1083, in init
self.model = Qwen2Model(config)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
f(module, *args, **kwargs)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 789, in init
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
f(module, *args, **kwargs)
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 144, in init
self.weight = Parameter(torch.empty((num_embeddings, embedding_dim), **factory_kwargs),
File "/research/d1/gds/ytyang/anaconda3/envs/open_reasoner/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 240, in wrapped_fn
tensor: Tensor = fn(*args, **kwargs)
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(ActorModelRayActorBOX pid=2013613) [2025-02-11 10:36:44,327] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
(ActorModelRayActorBOX pid=2013613) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
(ReferenceModelRayActor pid=2013729) [2025-02-11 10:36:44,382] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 0, num_elems = 0.00B
(ReferenceModelRayActor pid=2013937) [2025-02-11 10:36:42,937] [INFO] [comm.py:652:init_distributed] cdb=None [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(ReferenceModelRayActor pid=2013729) [2025-02-11 10:36:42,937] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
(ReferenceModelRayActor pid=2013937) [2025-02-11 10:36:44,335] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2 [repeated 3x across cluster]
(ReferenceModelRayActor pid=2013937) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda'). [repeated 3x across cluster]


Job 'raysubmit_chztKAY2bCTz8zMz' failed

Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(ActorModelRayActorBOX pid=2013613) [2025-02-11 10:36:44,327] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
(ActorModelRayActorBOX pid=2013613) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
(ReferenceModelRayActor pid=2013729) [2025-02-11 10:36:44,382] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 0, num_elems = 0.00B
(ReferenceModelRayActor pid=2013937) [2025-02-11 10:36:42,937] [INFO] [comm.py:652:init_distributed] cdb=None [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(ReferenceModelRayActor pid=2013729) [2025-02-11 10:36:42,937] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
(ReferenceModelRayActor pid=2013937) [2025-02-11 10:36:44,335] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2 [repeated 3x across cluster]
(ReferenceModelRayActor pid=2013937) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda'). [repeated 3x across cluster]
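
In case it helps with triage: this error usually means the process could not create a CUDA context on the device, most often because the GPUs are set to EXCLUSIVE_PROCESS compute mode (only one process per GPU, which conflicts with several Ray actors sharing a device) or because a stale process is still holding them. Below is a minimal diagnostic sketch, not part of the repo, assuming torch and nvidia-smi are available on the node:

```python
# Hypothetical standalone check: query the compute mode of each GPU and try to
# create a context on every visible device, to see which ones report
# "busy or unavailable".
import subprocess
import torch

# "Exclusive_Process" here would explain the error when multiple Ray actors
# land on the same GPU; "Default" allows shared access.
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv"],
    capture_output=True, text=True,
).stdout)

for i in range(torch.cuda.device_count()):
    try:
        torch.empty(1, device=f"cuda:{i}")  # forces context creation plus a tiny allocation
        print(f"cuda:{i} OK ({torch.cuda.get_device_name(i)})")
    except RuntimeError as err:
        print(f"cuda:{i} FAILED: {err}")
```

If a device fails this check even outside Ray, the problem is at the driver/compute-mode level rather than in the training script.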
