Description
When a user's job runs on multiple nodes and one node fails with a nonzero return code, e.g. 1, SkyPilot kills the processes on the other nodes, which then exit with return code 137. It is confusing for users to see a list of return codes like the following:
ERROR: Job 1 failed with return code list: [1, 137, 137]
Instead, we should show a message like the following:
ERROR: Job 1 failed with returncode: 1 on one node worker-2, SkyPilot cleaned the processes on other nodes with returncode 137
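A minimal sketch of how such a message could be built from the per-node return codes. This is not SkyPilot's actual code: the function name `format_failure_message`, its parameters, and the exact wording are hypothetical, and it assumes that return code 137 (128 + SIGKILL) always marks a process that was killed during cleanup rather than one that failed on its own.

```python
from typing import List, Optional

# Assumption: 137 (128 + SIGKILL) means the process was killed during cleanup.
CLEANUP_RETURNCODE = 137


def format_failure_message(job_id: int, returncodes: List[int],
                           node_names: List[str]) -> Optional[str]:
    """Hypothetical helper: summarize a multi-node failure in one line."""
    if all(rc == 0 for rc in returncodes):
        return None  # The job succeeded on every node.

    pairs = list(zip(node_names, returncodes))
    # Nodes that failed on their own (neither success nor the cleanup code).
    failed = [(name, rc) for name, rc in pairs
              if rc not in (0, CLEANUP_RETURNCODE)]
    # Nodes whose processes were killed as part of the cleanup.
    killed = [name for name, rc in pairs if rc == CLEANUP_RETURNCODE]

    if not failed:
        # The originating node cannot be identified; keep the raw list.
        return (f'ERROR: Job {job_id} failed with return code list: '
                f'{returncodes}')

    failed_str = ', '.join(f'returncode {rc} on node {name}'
                           for name, rc in failed)
    message = f'ERROR: Job {job_id} failed with {failed_str}'
    if killed:
        killed_str = ', '.join(killed)
        message += ('; SkyPilot cleaned up the processes on the other nodes '
                    f'({killed_str}) with returncode {CLEANUP_RETURNCODE}')
    return message


print(format_failure_message(1, [137, 1, 137],
                             ['worker-1', 'worker-2', 'worker-3']))
```

The fallback to the raw list covers the case where every nonzero code is 137, since the node that triggered the failure cannot be told apart from the ones that were cleaned up.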
Version & Commit info:
sky -v: PLEASE_FILL_IN
sky -c: PLEASE_FILL_IN