Instances sometimes fail to bootstrap:
- Start hooks can fail to download [1]
- SSH public keys can fail to download
Instances sometimes become unhealthy in ways that aren't measured by our health checks:
- Certain Docker commands hang forever [2]
- Abusive jobs can hog CPU (to be addressed in #366, "Properly allocate CPU sets to containers")
We need a way to ensure instance health at bootstrap and on an ongoing basis. I'd like to use this issue as a place to brainstorm on design. (If a similar issue already exists somewhere, please point me to it!)
I think if I needed such a check on a bunch of my own servers, I'd use an approach like the following:
- Create a `/tmp/health` directory.
- Make the cloud-init script write results to this directory, e.g. `/tmp/health/cloud-init.ok` if everything completed successfully, `/tmp/health/cloud-init.nok` if any errors were encountered.
- Use a cron job to occasionally check the status of required services (`docker`, `travis-worker`) and take appropriate action (e.g. restarting Docker, imploding the instance).
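To make the marker-file idea concrete, here is a minimal sketch in POSIX shell. The function names `record_result` and `bootstrap_ok` and the `HEALTH_DIR` variable are made up for illustration; only the `/tmp/health` directory and the `cloud-init.ok` / `cloud-init.nok` file names come from the proposal above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any existing script.

# Directory that holds the health marker files (overridable for testing).
HEALTH_DIR="${HEALTH_DIR:-/tmp/health}"

# usage: record_result NAME COMMAND [ARGS...]
# Runs COMMAND and leaves exactly one marker behind:
# NAME.ok if it exited 0, NAME.nok otherwise.
record_result() {
  name="$1"; shift
  mkdir -p "$HEALTH_DIR"
  rm -f "$HEALTH_DIR/$name.ok" "$HEALTH_DIR/$name.nok"
  if "$@"; then
    touch "$HEALTH_DIR/$name.ok"
  else
    touch "$HEALTH_DIR/$name.nok"
    return 1
  fi
}

# Succeeds only if bootstrap left an .ok marker and no .nok marker.
# A cron job could call this before checking the individual services.
bootstrap_ok() {
  [ -e "$HEALTH_DIR/cloud-init.ok" ] && [ ! -e "$HEALTH_DIR/cloud-init.nok" ]
}
```

The cloud-init script would wrap its steps in something like `record_result cloud-init <bootstrap command>`, and the cron job would treat a failing `bootstrap_ok` as a reason to take action. Removing both markers before each run means a stale `.ok` file can't mask a later failure.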
One problem: The only way I know to confirm that `docker` isn't working as expected is to try a command, e.g. `docker ps`, and observe that it just hangs forever. I'm not sure how to check this in a script without making the script hang forever, too. Maybe we could:
- run `docker ps &`, wait a few seconds, then check if a process with that PID is still running?
- check the modification date on the Docker log file?
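The first option (background the command, wait, then check the PID) could look roughly like the sketch below. The function names `run_with_deadline` and `docker_responsive`, the 5-second default, and the choice of exit status 124 for "still running" are assumptions for illustration, not anything that exists in the codebase.

```shell
#!/bin/sh
# Sketch only: names and the 124 "timed out" status are illustrative choices.

# usage: run_with_deadline SECONDS COMMAND [ARGS...]
# Starts COMMAND in the background and polls its PID once per second.
# Returns COMMAND's exit status if it finishes in time, or 124 if it was
# still running at the deadline (in which case it is killed).
run_with_deadline() {
  deadline="$1"; shift
  "$@" >/dev/null 2>&1 &
  pid=$!
  elapsed=0
  # kill -0 sends no signal; it only tests whether the PID still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$elapsed" -ge "$deadline" ]; then
      kill -9 "$pid" 2>/dev/null
      wait "$pid" 2>/dev/null || true   # reap the killed child
      return 124
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  wait "$pid"   # collect the real exit status
}

# Treat Docker as healthy only if `docker ps` answers within the deadline.
docker_responsive() {
  run_with_deadline "${1:-5}" docker ps
}
```

A cron script could then run `docker_responsive || <take action>` without risking a hang itself. On hosts with GNU coreutils, `timeout 5 docker ps` is a shorter way to get similar behavior; it exits with status 124 when the command times out.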
Thoughts?