Open
Description
Component: docs, kubernetes, etcd, systemd, containers, ...
Why this is needed:
Recently, a failed expansion in production led to a very broken cluster, and wiping and reinstalling new machines was out of the question, so we needed a manual clean-up procedure.
Such a procedure doesn't exist in our documentation today: that would have saved both developers and support teams much time to have it somewhere.
What should be done:
Describe procedures for:
- rolling back a failed expansion on a node (remove manifests, certificates, disable services, reboot...)
- resetting a cluster back to bootstrap-stage
- removing a failed etcd member
- troubleshoot
Unauthorized
in kubelet journal (and more examples of logs when something is broken)
Implementation proposal (strongly recommended):
Write all this in a Troubleshooting guide, reference it throughout Installation and Operation guides.