The program was killed unexpectedly #38

Open
@mynewstart

Hi,

I was running the code on 2 A100 (80GB) nodes with --ref_num_gpus_per_node 4 --critic_num_gpus_per_node 4 --actor_num_gpus_per_node 4 --vllm_num_engines 4. After the program had been running for more than two days, I encountered the following error.
Any insights into this issue?

The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffef198b110f1b3efb8333a4fe04000000 Worker ID: c237cd74d17618f81305f8f732e5b0a799ce11de53af69c921f07a59 Node ID: 4d3b63937bd85e078738478cc923e6247d68a8ac7aa6c2b78c093a39 Worker IP address: 172.31.137.112 Worker port: 10023 Worker PID: 59378 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
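
The raylet message says to check the logs for the dead worker, so a first step is to pull out the node IP and worker PID it reports. Below is a minimal sketch of that extraction in Python. The names parse_worker_death and FIELDS are hypothetical helpers written for this thread, not part of Ray, and the field labels are simply the ones that appear in the message quoted above. With the default Ray setup, the worker logs would normally be under /tmp/ray/session_latest/logs on the node with that IP, and the kernel log on that node is where an OOM-killer kill (cause 1) would show up; both are assumptions about this cluster's configuration.

```python
import re

# Field labels exactly as they appear in the raylet message quoted above.
FIELDS = (
    "Worker ID",
    "Node ID",
    "Worker IP address",
    "Worker port",
    "Worker PID",
    "Worker exit type",
)


def parse_worker_death(message: str) -> dict:
    """Extract the identifying fields from a raylet 'worker died' message.

    Hypothetical helper for illustration; it only does text matching.
    """
    # The pasted log may be hard-wrapped mid-token, so drop the newlines
    # before matching to rejoin values split across lines.
    flat = message.replace("\n", "")
    info = {}
    for label in FIELDS:
        # Each value is a single whitespace-free token after "<label>: ".
        match = re.search(re.escape(label) + r":\s*(\S+)", flat)
        if match:
            info[label] = match.group(1)
    return info


# Shortened sample using values from the message above, including a
# line break inside the IP address as in the pasted log.
SAMPLE = (
    "Node ID: 4d3b63937bd85e078738478cc923e6247d68a8ac7aa6c2b78c093a39 "
    "Worker IP address: 172.31.137\n.112 Worker port: 10023 "
    "Worker PID: 59378 Worker exit type: SYSTEM_ERROR"
)

if __name__ == "__main__":
    for key, value in parse_worker_death(SAMPLE).items():
        print(f"{key}: {value}")
```

The PID and IP narrow the search to one process on one node, which is useful because the message alone does not distinguish between the three listed causes.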
