Upgrading a clean cluster 1.27 to 1.28 - one of the nodes stuck in emergency mode #1362

asafbennatan · 2024-05-22T07:50:02Z

asafbennatan
May 22, 2024

Description

I installed the cluster on 1.27. Once that finished, without changing anything else, I upgraded it to 1.28 (~3:42 UTC).
All nodes came back up and running except one (see screenshot).
I used the Hetzner console to look at that node; the terminal was stuck in emergency mode (see screenshot 2).
I pressed 'Enter', everything started, and the node is now online again and upgraded (this was at 6:41 UTC).

Following the OS's recommendation, I looked at journalctl -xb (output_reducted.txt attached). As far as I can gather, the root cause is that /boot/writable could not be mounted.

Any idea why this would happen?

Kube.tf file

## All values are referenced from here - https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/blob/master/kube.tf.example

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }

  source = "kube-hetzner/kube-hetzner/hcloud"
  
  hcloud_token = var.hcloud_token

  rancher_install_channel = "latest"
  initial_k3s_channel = "v1.28"

  version = "2.13.5"
  # ssh_port = 2222
  base_domain = "${replace(var.app_name,"-",".")}.XXX.XX"
  cluster_name = "${var.app_name}"
  # rancher_hostname = "XX.XX.XX"

  enable_cert_manager = false
  enable_rancher = false
  enable_longhorn = false
  # enable_traefik = false
  enable_klipper_metal_lb = "false"
  control_plane_lb_enable_public_interface = true
  # enable_nginx = true

  load_balancer_disable_public_network = false

  ssh_public_key = file("./ssh-key/id_rsa.pub")

  # For more details on SSH see https://github.com/kube-hetzner/kube-hetzner/blob/master/docs/ssh.md
  ssh_private_key = file("./ssh-key/id_rsa")
  network_region = "eu-central" # change to `us-east` if location is ash

  control_plane_nodepools = [
    {
      name        = "control-plane-nbg1",
      server_type = "cx21",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 2
    },
    {
      name        = "control-plane-hel1",
      server_type = "cx21",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "workload-agent-0",
      server_type = "cx41",
      location    = "nbg1",
      labels = [
        "node.kubernetes.io/pool=workload-agent-cx41"
      ],
      taints      = [],
      count       = 3,
      # longhorn_volume_size = 50
    },
    {
      name        = "longhorn-agent-0",
      server_type = "cx41",
      location    = "nbg1",
      labels = [
        "node.kubernetes.io/server-usage=storage",
        "node.kubernetes.io/pool=longhorn-agent-0"
      ],
      taints      = [],
      count       = 3,
      longhorn_volume_size = 50
    }
  ]

  # * LB location and type, the latter will depend on how much load you want it to handle, see https://www.hetzner.com/cloud/load-balancer
  load_balancer_type     = "lb11"
  load_balancer_location = "nbg1"

  ### The following values are entirely optional (and can be removed from this if unused)

  # You can define a base domain name to be used in the form nodename.base_domain for setting the reverse DNS inside Hetzner
  

  # To use local storage on the nodes, you can enable Longhorn, default is "false".

  # The file system type for Longhorn, if enabled (ext4 is the default, otherwise you can choose xfs)
  # longhorn_fstype = "xfs"

  # how many replica volumes should longhorn create (default is 3)
  longhorn_replica_count = 1
  disable_hetzner_csi = false

  kured_options = {
    "concurrency": 3
  }

  # If you want to disable the Traefik ingress controller, to use the Nginx ingress controller for instance, you can set this to "false". Default is "true".


  # We give you the possibility to use letsencrypt directly with Traefik because it's an easy setup, however it's not optimal,
  # as the free version of Traefik causes a little bit of downtime when the certificates get renewed. For proper SSL management,
  # we instead recommend you to use cert-manager, that you can easily deploy with helm; see https://cert-manager.io/.
  # traefik_acme_tls = true
  ingress_controller = "none"  
  automatically_upgrade_os = true

  
  allow_scheduling_on_control_plane = false  
  automatically_upgrade_k3s = true

  cni_plugin = "cilium"
  cilium_version = "v1.15.4"
  cilium_routing_mode = "native"
}

Screenshots

status after upgrade:

stuck at emergency:

output_reducted.txt

Platform

Linux

mysticaltech · 2024-05-23T18:33:55Z

mysticaltech
May 23, 2024
Maintainer

@asafbennatan You are running in HA, so just turn the node off and back on. Normally that should work.

1 reply

asafbennatan May 29, 2024
Author

Hi @mysticaltech, is this a known issue?
Do we know what causes it?

asafbennatan · 2024-07-02T08:03:04Z

asafbennatan
Jul 2, 2024
Author

@mysticaltech I just discovered this is a more serious issue, as it happens when nodes are automatically upgraded.
Do you have any technical info on why this happens? Out of 8 upgraded nodes, 2 did not reboot.

2 replies

mysticaltech Jul 3, 2024
Maintainer

@asafbennatan I have not heard of any other instance where this is happening; it could have been a temporary fluke on the MicroOS side. Just reboot and hopefully it will not happen again. If it does, you can turn automatic upgrades off and upgrade manually from time to time.

asafbennatan Jul 4, 2024
Author

Just found out more info: I think this is mostly related to kured. I have opened an issue there, which perhaps could clarify things:
kubereboot/kured#950

Answer selected by mysticaltech

mysticaltech · 2024-07-24T08:42:43Z

mysticaltech
Jul 24, 2024
Maintainer

Thanks @asafbennatan. As suggested in the issue you created, increasing the TTL for kured is the recommended path.

Would you mind sharing the values you used? A PR to update the default is also very welcome.

3 replies

asafbennatan Jul 24, 2024
Author

I have yet to set it, but I think a 300s default makes sense?

asafbennatan Jul 24, 2024
Author

The current terraform-hcloud-kube-hetzner docs suggest 30m in the example, so let's go with that.
Come to think of it, if I am not mistaken, this TTL also covers draining the node, which might take time (for example, if it's a Longhorn node).
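
For reference, a minimal sketch of what that setting might look like in kube.tf. It assumes the module forwards each kured_options key to kured as a command-line flag, so that "lock-ttl" ends up as kured's --lock-ttl; that pass-through is an assumption here, not something confirmed in this thread. The 30m value is the one from the example file mentioned above, not a tested recommendation, and "concurrency" is carried over from the kube.tf posted at the top.

```hcl
  kured_options = {
    # Carried over from the kube.tf above: allow up to 3 nodes to reboot at once.
    "concurrency" : 3,
    # Assumption: forwarded to kured as --lock-ttl, so a reboot lock held by a
    # node that never came back is released after 30 minutes. The window has to
    # be long enough to cover draining the node as well as the reboot itself.
    "lock-ttl" : "30m"
  }
```

With a TTL that is too short, a slow drain (for example on a Longhorn node) could let the lock expire while the node is still rebooting, so it is probably safer to err on the long side.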

mysticaltech Jul 24, 2024
Maintainer

makes sense @asafbennatan, thanks for the PR
