[Core/UX] Improve the display of returncode for multi-node #4232

Open
@Michaelvll


When a user's job runs on multiple nodes and one node fails with a return code (e.g., 1), SkyPilot kills the processes on the other nodes, which then exit with return code 137. Users find it confusing to see a list of return codes like the following:

ERROR: Job 1 failed with return code list: [1, 137, 137]

Instead, we should show a message like the following:

ERROR: Job 1 failed with returncode: 1 on one node worker-2, SkyPilot cleaned the processes on other nodes with returncode 137
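
A minimal sketch of how such a message could be built from the per-node return codes. The helper name `format_returncode_message`, its parameters, and the exact wording are hypothetical and not part of SkyPilot's codebase; the sketch also assumes that node names are available in the same order as the return codes.

```python
from typing import Sequence

# 137 = 128 + 9 (SIGKILL): the code reported above for processes
# that were killed on the remaining nodes.
KILLED_RETURNCODE = 137


def format_returncode_message(job_id: int, returncodes: Sequence[int],
                              node_names: Sequence[str]) -> str:
    """Hypothetical helper: summarize a failed multi-node job."""
    # Nodes that failed on their own (neither successful nor killed).
    failed = [(name, rc) for name, rc in zip(node_names, returncodes)
              if rc not in (0, KILLED_RETURNCODE)]
    num_killed = sum(1 for rc in returncodes if rc == KILLED_RETURNCODE)

    if not failed:
        # No clear culprit: fall back to the raw list.
        return (f'ERROR: Job {job_id} failed with return code list: '
                f'{list(returncodes)}')

    details = ', '.join(
        f'returncode: {rc} on node {name}' for name, rc in failed)
    msg = f'ERROR: Job {job_id} failed with {details}'
    if num_killed:
        msg += (f'; SkyPilot cleaned up the processes on {num_killed} '
                f'other node(s) with returncode {KILLED_RETURNCODE}')
    return msg


print(format_returncode_message(1, [137, 137, 1],
                                ['head', 'worker-1', 'worker-2']))
```

One limitation of this approach: a user process can itself exit with 137 (for example, when the kernel's OOM killer sends SIGKILL), so treating every 137 as a SkyPilot cleanup is a heuristic. That is why the sketch falls back to the raw list when no other non-zero code is present.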

Version & Commit info:

  • sky -v: PLEASE_FILL_IN
  • sky -c: PLEASE_FILL_IN
