-
Notifications
You must be signed in to change notification settings - Fork 815
Description
This morning a close-to-production cluster fell over after snap's auto-refresh "feature" failed on 3 of 4 worker nodes - looks like it hanged at the Copy snap "microk8s" data step. microk8s could be restarted after aborting the auto-refresh, but this only worked after manually killing snapd.. For a production-ready Kubernetes distribution I really think this is a far from acceptable default.. Perhaps until snapd allows disabling auto-refreshes microk8s scripts could recommend running sudo snap set system refresh.hold=2050-01-01T15:04:05Z or similar. Also a kubernetes-native integration with snapd refreshes could be considered (e.g. a prometheus/grafana dashboard/alert) to prompt manual updates - presumably one node at a time to begin with.
Otherwise microk8s is working rather well so thank you very much.
More details about the outage:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
10.aa.aa.aaa Ready <none> 38d v1.17.3
10.aa.aa.aaa NotReady <none> 18d v1.17.2
10.aa.aa.aaa NotReady <none> 38d v1.17.2
10.aa.aa.aaa NotReady <none> 18d v1.17.2
aaa-master Ready <none> 59d v1.17.3
microk8s is disabled..
root@wk3:/home# snap list
Name Version Rev Tracking Publisher Notes
core 16-2.43.3 8689 stable canonical✓ core
kubectl 1.17.3 1424 1.17 canonical✓ classic
microk8s v1.17.2 1176 1.17 canonical✓ disabled,classic
root@wk3:/home# snap changes microk8s
ID Status Spawn Ready Summary
20 Doing today at 09:56 AEDT - Auto-refresh snap "microk8s"
Data copy appears hanged
root@wk3:/home# snap tasks --last=auto-refresh
Status Spawn Ready Summary
Done today at 09:56 AEDT today at 09:56 AEDT Ensure prerequisites for "microk8s" are available
Done today at 09:56 AEDT today at 09:56 AEDT Download snap "microk8s" (1254) from channel "1.17/stable"
Done today at 09:56 AEDT today at 09:56 AEDT Fetch and check assertions for snap "microk8s" (1254)
Done today at 09:56 AEDT today at 09:56 AEDT Mount snap "microk8s" (1254)
Done today at 09:56 AEDT today at 09:56 AEDT Run pre-refresh hook of "microk8s" snap if present
Done today at 09:56 AEDT today at 09:57 AEDT Stop snap "microk8s" services
Done today at 09:56 AEDT today at 09:57 AEDT Remove aliases for snap "microk8s"
Done today at 09:56 AEDT today at 09:57 AEDT Make current revision for snap "microk8s" unavailable
Doing today at 09:56 AEDT - Copy snap "microk8s" data
Do today at 09:56 AEDT - Setup snap "microk8s" (1254) security profiles
Do today at 09:56 AEDT - Make snap "microk8s" (1254) available to the system
Do today at 09:56 AEDT - Automatically connect eligible plugs and slots of snap "microk8s"
Do today at 09:56 AEDT - Set automatic aliases for snap "microk8s"
Do today at 09:56 AEDT - Setup snap "microk8s" aliases
Do today at 09:56 AEDT - Run post-refresh hook of "microk8s" snap if present
Do today at 09:56 AEDT - Start snap "microk8s" (1254) services
Do today at 09:56 AEDT - Clean up "microk8s" (1254) install
Do today at 09:56 AEDT - Run configure hook of "microk8s" snap if present
Do today at 09:56 AEDT - Run health check of "microk8s" snap
Doing today at 09:56 AEDT - Consider re-refresh of "microk8s"
There doesn't seem to be much to copy anyway:
root@wk3 /v/l/snapd# du -sh /var/lib/snapd/ /var/snap/ /snap
527M /var/lib/snapd/
74G /var/snap/
2.0G /snap
root@wk3 /s/microk8s# du -sh /snap/microk8s/*
737M /snap/microk8s/1176
737M /snap/microk8s/1254
root@wk3 /s/microk8s# du -sh /var/snap/microk8s/*
232K /var/snap/microk8s/1176
74G /var/snap/microk8s/common
Starting microk8s fails
user@wk3 /s/m/1254> sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress
root@wk3:/home# snap enable microk8s
error: snap "microk8s" has "auto-refresh" change in progress
Fails to abort..
root@wk3:/home# snap abort 20
root@wk3:/home# snap changes
ID Status Spawn Ready Summary
20 Abort today at 09:56 AEDT - Auto-refresh snap "microk8s"
user@wk3 /s/m/1254> sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress
root@wk3:/home# snap enable microk8s
error: snap "microk8s" has "auto-refresh" change in progress
snapd service hangs when trying to stop it...
root@wk2 ~# systemctl stop snapd.service
(hangs)
have to resort to manually stopping the process
killall snapd
finally change is undone..
root@wk3:/home# snap changes
ID Status Spawn Ready Summary
20 Undone today at 09:56 AEDT today at 10:41 AEDT Auto-refresh snap "microk8s"
root@wk3:/home# snap tasks --last=auto-refresh
Status Spawn Ready Summary
Done today at 09:56 AEDT today at 10:41 AEDT Ensure prerequisites for "microk8s" are available
Undone today at 09:56 AEDT today at 10:41 AEDT Download snap "microk8s" (1254) from channel "1.17/stable"
Done today at 09:56 AEDT today at 10:41 AEDT Fetch and check assertions for snap "microk8s" (1254)
Undone today at 09:56 AEDT today at 10:41 AEDT Mount snap "microk8s" (1254)
Undone today at 09:56 AEDT today at 10:41 AEDT Run pre-refresh hook of "microk8s" snap if present
Undone today at 09:56 AEDT today at 10:41 AEDT Stop snap "microk8s" services
Undone today at 09:56 AEDT today at 10:41 AEDT Remove aliases for snap "microk8s"
Undone today at 09:56 AEDT today at 10:41 AEDT Make current revision for snap "microk8s" unavailable
Undone today at 09:56 AEDT today at 10:41 AEDT Copy snap "microk8s" data
Hold today at 09:56 AEDT today at 10:30 AEDT Setup snap "microk8s" (1254) security profiles
Hold today at 09:56 AEDT today at 10:30 AEDT Make snap "microk8s" (1254) available to the system
Hold today at 09:56 AEDT today at 10:30 AEDT Automatically connect eligible plugs and slots of snap "microk8s"
Hold today at 09:56 AEDT today at 10:30 AEDT Set automatic aliases for snap "microk8s"
Hold today at 09:56 AEDT today at 10:30 AEDT Setup snap "microk8s" aliases
Hold today at 09:56 AEDT today at 10:30 AEDT Run post-refresh hook of "microk8s" snap if present
Hold today at 09:56 AEDT today at 10:30 AEDT Start snap "microk8s" (1254) services
Hold today at 09:56 AEDT today at 10:30 AEDT Clean up "microk8s" (1254) install
Hold today at 09:56 AEDT today at 10:30 AEDT Run configure hook of "microk8s" snap if present
Hold today at 09:56 AEDT today at 10:30 AEDT Run health check of "microk8s" snap
Hold today at 09:56 AEDT today at 10:30 AEDT Consider re-refresh of "microk8s
root@wk3:/home# snap list
Name Version Rev Tracking Publisher Notes
core 16-2.43.3 8689 stable canonical✓ core
kubectl 1.17.3 1424 1.17 canonical✓ classic
microk8s v1.17.2 1176 1.17 canonical✓ classic
Nothing much in snapd logs except for a polkit error - unsure if related:
root@wk3:/home# journalctl -b -u snapd.service
...
Mar 09 06:11:34 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 09 16:11:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 09 16:11:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 09 19:06:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 09 19:06:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 10 02:51:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 10 02:51:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 10 09:56:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl"
Mar 10 10:12:18 wk3 snapd[15182]: daemon.go:208: polkit error: Authorization requires interaction
Mar 10 10:39:24 wk3 systemd[1]: Stopping Snappy daemon...
Mar 10 10:39:24 wk3 snapd[15182]: main.go:155: Exiting on terminated signal.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: State 'stop-sigterm' timed out. Killing.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Killing process 15182 (snapd) with signal SIGKILL.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Main process exited, code=killed, status=9/KILL
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Failed with result 'timeout'.
Mar 10 10:40:54 wk3 systemd[1]: Stopped Snappy daemon.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Triggering OnFailure= dependencies.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Found left-over process 16729 (sync) in control group while starting unit. Ignoring.
Mar 10 10:40:54 wk3 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 10 10:40:54 wk3 systemd[1]: Starting Snappy daemon...
Mar 10 10:40:54 wk3 snapd[18170]: AppArmor status: apparmor is enabled and all features are available
Mar 10 10:40:54 wk3 snapd[18170]: AppArmor status: apparmor is enabled and all features are available
Mar 10 10:40:54 wk3 snapd[18170]: daemon.go:346: started snapd/2.43.3 (series 16; classic) ubuntu/18.04 (amd64) linux/4.15.0-88-generic.
Mar 10 10:40:54 wk3 snapd[18170]: daemon.go:439: adjusting startup timeout by 45s (pessimistic estimate of 30s plus 5s per snap)
Mar 10 10:40:54 wk3 systemd[1]: Started Snappy daemon.