Upgrading a clean cluster 1.27 to 1.28 - one of the nodes stuck in emergency mode #1362
-
Description
I installed the cluster on 1.27. Once that was done, without changing anything else, I upgraded it to 1.28 (~3:42 UTC), and one of the nodes got stuck in emergency mode. I followed the recommendation shown by the OS and looked into it. Any idea why this would happen?
Kube.tf file
## All values are referenced from here - https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/blob/master/kube.tf.example
module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  source = "kube-hetzner/kube-hetzner/hcloud"
  hcloud_token = var.hcloud_token
  rancher_install_channel = "latest"
  initial_k3s_channel = "v1.28"
  version = "2.13.5"
  # ssh_port = 2222
  base_domain = "${replace(var.app_name,"-",".")}.XXX.XX"
  cluster_name = "${var.app_name}"
  # rancher_hostname = "XX.XX.XX"
  enable_cert_manager = false
  enable_rancher = false
  enable_longhorn = false
  # enable_traefik = false
  enable_klipper_metal_lb = "false"
  control_plane_lb_enable_public_interface = true
  # enable_nginx = true
  load_balancer_disable_public_network = false
  ssh_public_key = file("./ssh-key/id_rsa.pub")
  # For more details on SSH see https://github.com/kube-hetzner/kube-hetzner/blob/master/docs/ssh.md
  ssh_private_key = file("./ssh-key/id_rsa")
  network_region = "eu-central" # change to `us-east` if location is ash
  control_plane_nodepools = [
    {
      name = "control-plane-nbg1",
      server_type = "cx21",
      location = "nbg1",
      labels = [],
      taints = [],
      count = 2
    },
    {
      name = "control-plane-hel1",
      server_type = "cx21",
      location = "hel1",
      labels = [],
      taints = [],
      count = 1
    }
  ]
  agent_nodepools = [
    {
      name = "workload-agent-0",
      server_type = "cx41",
      location = "nbg1",
      labels = [
        "node.kubernetes.io/pool=workload-agent-cx41"
      ],
      taints = [],
      count = 3,
      # longhorn_volume_size = 50
    },
    {
      name = "longhorn-agent-0",
      server_type = "cx41",
      location = "nbg1",
      labels = [
        "node.kubernetes.io/server-usage=storage",
        "node.kubernetes.io/pool=longhorn-agent-0"
      ],
      taints = [],
      count = 3,
      longhorn_volume_size = 50
    }
  ]
  # * LB location and type, the latter will depend on how much load you want it to handle, see https://www.hetzner.com/cloud/load-balancer
  load_balancer_type = "lb11"
  load_balancer_location = "nbg1"
  ### The following values are entirely optional (and can be removed from this if unused)
  # You can define a base domain name to be used in the form nodename.base_domain for setting the reverse DNS inside Hetzner
  # To use local storage on the nodes, you can enable Longhorn, default is "false".
  # The file system type for Longhorn, if enabled (ext4 is the default, otherwise you can choose xfs)
  # longhorn_fstype = "xfs"
  # how many replica volumes should longhorn create (default is 3)
  longhorn_replica_count = 1
  disable_hetzner_csi = false
  kured_options = {
    "concurrency": 3
  }
  # If you want to disable the Traefik ingress controller, to use the Nginx ingress controller for instance, you can set this to "false". Default is "true".
  # We give you the possibility to use letsencrypt directly with Traefik because it's an easy setup, however it's not optimal,
  # as the free version of Traefik causes a little bit of downtime when the certificates get renewed. For proper SSL management,
  # we instead recommend you to use cert-manager, that you can easily deploy with helm; see https://cert-manager.io/.
  # traefik_acme_tls = true
  ingress_controller = "none"
  automatically_upgrade_os = true
  allow_scheduling_on_control_plane = false
  automatically_upgrade_k3s = true
  cni_plugin = "cilium"
  cilium_version = "v1.15.4"
  cilium_routing_mode = "native"
}
Screenshots
stuck at emergency:
Platform
Linux
Replies: 3 comments 6 replies
-
@asafbennatan You are running in HA, so just turn the node off and back on. Normally that should work.
-
@mysticaltech I just discovered this is a more serious issue, as it happens when nodes are automatically upgraded.
-
Thanks @asafbennatan. As suggested in the issue you created, increasing the TTL for kured is the recommended path. Would you mind sharing the values you used? A PR to update the default is also very welcome.
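For anyone landing here, a rough sketch of what raising the kured lock TTL might look like in the kube.tf above. This assumes the module passes `kured_options` entries through to kured as command-line flags and that kured's `lock-ttl` flag is the TTL being referred to. The `60m` value is a placeholder for illustration only; it is not the value anyone in this thread reported using.

```hcl
kured_options = {
  # Kept from the kube.tf above: number of nodes kured may reboot concurrently.
  "concurrency" : 3
  # Assumption: expire the kured reboot lock after this duration.
  # "60m" is a hypothetical value, not one confirmed in this discussion.
  "lock-ttl" : "60m"
}
```

Check the module's current defaults and the linked kured issue before settling on a value.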
Just found out more info. I think this is mostly related to kured. I have opened an issue there, which could perhaps clarify the problem:
kubereboot/kured#950