Skip to content

Print error to job log when instance is shut down prematurely #295

Open
@sarahhodne

Description

If a VM is shut down (either because of shutdown command in the script, or gcloud-cleanup cleaning it up, or some other reason), the job appears to just suddenly end with no information about why. We should probably print an error to the logs in this case. Here's an example log output from a worker where this happened (gcloud-cleanup cleaned up the VM since this repository had a timeout that's longer than gcloud-cleanup's timeout).

time="2017-03-22T18:52:50Z" level=info msg="received amqp delivery" job=[reacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c  
time="2017-03-22T18:52:50Z" level=info msg="starting job" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T18:52:50Z" level=info msg="generated script" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T18:52:50Z" level=info msg="starting instance" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T18:52:50Z" level=info msg="starting instance" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T18:53:12Z" level=info msg="started instance" boot_time=21.853888377s job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T18:53:52Z" level=info msg="uploaded script" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T18:53:52Z" level=info msg="running script" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T21:52:58Z" level=error msg="couldn't run script" completed=true err="error running script: error running script: wait: remote command exited without exit status or exit signal" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T21:52:58Z" level=info msg="finished script" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T21:52:58Z" level=info msg="deleting instance" instance=[redacted] job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T21:52:58Z" level=info msg="stopped instance" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T21:52:58Z" level=info msg="finished job" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  

The exact error is this one:

time="2017-03-22T21:52:58Z" level=error msg="couldn't run script" completed=true err="error running script: error running script: wait: remote command exited without exit status or exit signal" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  

I'm a little confused about this error too—looking at backend/gce.RunScript I don't see any way for Completed to be true while also returning an error…so there's something else odd going on here too…

Activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

    Labels

    backend/gceapplies to the gce backend providerbug

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions