docs: Provide troubleshooting information and remediation instructions for failed expansions #2603

Open
@gdemonet

Description

Component: docs, kubernetes, etcd, systemd, containers, ...

Why this is needed:

Recently, a failed expansion in production left a cluster badly broken. Wiping and reinstalling the machines was out of the question, so we needed a manual clean-up procedure.
No such procedure exists in our documentation today; having one would have saved both developers and support teams a lot of time.

What should be done:

Describe procedures for:

  • rolling back a failed expansion on a node (remove manifests, certificates, disable services, reboot...)
  • resetting a cluster back to the bootstrap stage
  • removing a failed etcd member
  • troubleshooting "Unauthorized" errors in the kubelet journal (plus more examples of the logs seen when something is broken)
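
As a rough illustration of the last item, here is a minimal Python sketch that filters journal lines for the error. It assumes the failure appears as the literal word "Unauthorized" in the kubelet log line; the function name `find_unauthorized` is hypothetical and not part of MetalK8s.

```python
import re
from typing import Iterable, List

# Assumption: an authentication failure shows up as the literal word
# "Unauthorized" somewhere in the kubelet log line.
UNAUTHORIZED = re.compile(r"\bUnauthorized\b")


def find_unauthorized(lines: Iterable[str]) -> List[str]:
    """Return the lines that mention "Unauthorized", without trailing newlines."""
    return [line.rstrip("\n") for line in lines if UNAUTHORIZED.search(line)]
```

The function accepts any iterable of lines, so a small wrapper could pass it `sys.stdin` and be fed from `journalctl -u kubelet --no-pager`, assuming the kubelet runs as a systemd unit named `kubelet` on the node.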

Implementation proposal (strongly recommended):

Write all of this in a Troubleshooting guide, and reference it throughout the Installation and Operation guides.

Metadata

Assignees

No one assigned

Labels

  • complexity:medium - Something that requires one or few days to fix
  • kind:enhancement - New feature or request
  • priority:medium - Medium priority issues, should only be postponed if no other option
  • topic:deployment - Bugs in or enhancements to deployment stages
  • topic:docs - Documentation
  • topic:etcd - Anything related to etcd
  • topic:lifecycle - Issues related to upgrade or downgrade of MetalK8s
  • topic:operations - Operations-related issues