Verify bootstrap success and instance health #368

Open
@soulshake

Description

Instances sometimes fail to bootstrap:

  • Start hooks can fail to download [1]
  • SSH public keys can fail to download

Instances also sometimes become unhealthy in ways that our health checks don't measure.

We need a way to ensure instance health at bootstrap and on an ongoing basis. I'd like to use this issue as a place to brainstorm on design. (If a similar issue already exists somewhere, please point me to it!)

I think if I needed such a check on a bunch of my own servers, I'd use an approach like the following:

  • Create a /tmp/health directory
  • Make the cloud-init script write results to this directory, e.g. /tmp/health/cloud-init.ok if everything completed successfully, or /tmp/health/cloud-init.nok if any errors were encountered
  • Use a cron job to occasionally check the status of required services (docker, travis-worker) and take appropriate action (e.g. restarting Docker, imploding the instance)
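To make the marker-file idea concrete, here is a rough sketch of what the bootstrap side and the cron side could share. The function names (record_result, is_healthy) and the HEALTH_DIR variable are made up for illustration; nothing like this exists in our scripts yet, and the real cloud-init script may need to report results differently.

```shell
#!/bin/sh
# Hypothetical sketch: HEALTH_DIR and both function names are illustrative only.
HEALTH_DIR="${HEALTH_DIR:-/tmp/health}"

# Run a bootstrap step and record its outcome as <name>.ok or <name>.nok.
# Usage: record_result <name> <command> [args...]
record_result() {
  name="$1"
  shift
  mkdir -p "$HEALTH_DIR"
  # Clear any stale marker from a previous run before recording a new one.
  rm -f "$HEALTH_DIR/$name.ok" "$HEALTH_DIR/$name.nok"
  if "$@"; then
    : > "$HEALTH_DIR/$name.ok"
  else
    : > "$HEALTH_DIR/$name.nok"
    return 1
  fi
}

# Succeed only if the step left an .ok marker and no .nok marker.
# Usage: is_healthy <name>
is_healthy() {
  [ -f "$HEALTH_DIR/$1.ok" ] && [ ! -f "$HEALTH_DIR/$1.nok" ]
}
```

The cloud-init script could wrap its steps with record_result, and a script run from cron could call is_healthy cloud-init before deciding whether to restart a service or implode the instance. A missing marker is treated as unhealthy, which also covers the case where bootstrap never got far enough to write anything.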

One problem: the only way I know to confirm that Docker isn't working as expected is to run a command, e.g. docker ps, and observe that it hangs forever. I'm not sure how to check this in a script without making the script hang forever, too. Maybe we could:

  • Run docker ps &, wait a few seconds, then check whether a process with that PID is still running?
  • Check the modification date on the Docker log file?
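The first option could look roughly like the sketch below: start the command in the background, poll its PID once a second, and kill it if it is still around after a deadline. The function name run_with_deadline and the exit code 124 for "had to be killed" are my own choices for illustration, and I haven't tried this against a Docker daemon that is actually hung.

```shell
#!/bin/sh
# Hypothetical sketch: run_with_deadline is an illustrative name.
# Runs a command in the background and gives up if it has not exited
# after the given number of seconds.
# Returns the command's own exit status, or 124 if it had to be killed.
# Usage: run_with_deadline <seconds> <command> [args...]
run_with_deadline() {
  deadline="$1"
  shift
  "$@" >/dev/null 2>&1 &
  pid=$!
  elapsed=0
  # kill -0 sends no signal; it only tests whether the PID still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$elapsed" -ge "$deadline" ]; then
      kill -9 "$pid" 2>/dev/null
      wait "$pid" 2>/dev/null || true
      return 124
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  # The command finished on its own; report its exit status.
  wait "$pid"
}
```

A cron-driven check could then call run_with_deadline 5 docker ps and treat a 124 as "Docker is hung", without the check itself hanging. Where GNU coreutils is installed, timeout 5 docker ps should give similar behavior with less code, since timeout also exits with 124 when the time limit is reached.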

Thoughts?
