Verify bootstrap success and instance health

Instances sometimes fail to bootstrap:

* Start hooks can fail to download [[1](https://github.com/travis-pro/team-blue/issues/704#issuecomment-327759580)]
* SSH public keys can fail to download

Instances sometimes become unhealthy in ways that aren't measured by our health checks:

* Certain Docker commands hang forever [[2](https://github.com/travis-pro/team-blue/issues/704#issuecomment-327755711)]
* Abusive jobs can hog CPU (to be addressed in https://github.com/travis-ci/worker/pull/366)

We need a way to ensure instance health at bootstrap and on an ongoing basis. I'd like to use this issue as a place to brainstorm on design. (If a similar issue already exists somewhere, please point me to it!)

I think if I needed such a check on a bunch of my own servers, I'd use an approach like the following:
* Create a `/tmp/health` directory
* Make the cloud-init script write its result to this directory, e.g. `/tmp/health/cloud-init.ok` if everything completed successfully, or `/tmp/health/cloud-init.nok` if any errors were encountered
* Use a cron job to periodically check the status of required services (`docker`, `travis-worker`) and take appropriate action (e.g. restarting Docker, imploding the instance)
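
To make the marker-file idea concrete, here is a rough sketch of what the bootstrap script and the cron job could share. The function names `record_result` and `failed_checks` are made up for illustration, and Python is only one option; the same thing could be done in a few lines of shell.

```python
from pathlib import Path
from typing import List

HEALTH_DIR = Path("/tmp/health")  # directory proposed in this issue


def record_result(name: str, ok: bool, health_dir: Path = HEALTH_DIR) -> Path:
    """Write <name>.ok or <name>.nok, removing the opposite marker if present."""
    health_dir.mkdir(parents=True, exist_ok=True)
    suffix, stale_suffix = (".ok", ".nok") if ok else (".nok", ".ok")
    stale = health_dir / (name + stale_suffix)
    if stale.exists():
        stale.unlink()
    marker = health_dir / (name + suffix)
    marker.touch()
    return marker


def failed_checks(health_dir: Path = HEALTH_DIR) -> List[str]:
    """Return the names of all checks that currently have a .nok marker."""
    if not health_dir.is_dir():
        return []
    return sorted(p.stem for p in health_dir.glob("*.nok"))
```

The cloud-init script would call something like `record_result("cloud-init", ok)` as its last step, and the cron job would act on whatever `failed_checks()` returns.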

One problem: The only way I know to confirm that `docker` isn't working as expected is to try a command, e.g. `docker ps`, and observe that it just hangs forever. I'm not sure how to check this in a script without making the script hang forever, too. Maybe we could:
* Run `docker ps &`, wait a few seconds, then check whether a process with that PID is still running?
* Check the modification time of the Docker log file?
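
One possible variant of the first idea is to run the command with a timeout, so the check itself can never hang. The sketch below uses Python's `subprocess.run`, which kills the child when the timeout expires; the helper name `command_responds` and the 10-second default are arbitrary choices, and this assumes the hung `docker` client can actually be killed.

```python
import subprocess
from typing import Sequence


def command_responds(cmd: Sequence[str], timeout: float = 10.0) -> bool:
    """Return True if cmd exits with status 0 within `timeout` seconds."""
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return False  # the command hung
    except FileNotFoundError:
        return False  # the binary is not installed
    return result.returncode == 0
```

A cron job could then call `command_responds(["docker", "ps"])` and restart Docker or implode the instance when it returns `False`.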

Thoughts?

Verify bootstrap success and instance health #368

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development