Skip to content

snap auto-refresh breaks cluster #1022

@eug48

Description

@eug48

This morning a close-to-production cluster fell over after snap's auto-refresh "feature" failed on 3 of 4 worker nodes - looks like it hanged at the Copy snap "microk8s" data step. microk8s could be restarted after aborting the auto-refresh, but this only worked after manually killing snapd.. For a production-ready Kubernetes distribution I really think this is a far from acceptable default.. Perhaps until snapd allows disabling auto-refreshes microk8s scripts could recommend running sudo snap set system refresh.hold=2050-01-01T15:04:05Z or similar. Also a kubernetes-native integration with snapd refreshes could be considered (e.g. a prometheus/grafana dashboard/alert) to prompt manual updates - presumably one node at a time to begin with.

Otherwise microk8s is working rather well so thank you very much.

More details about the outage:

kubectl get nodes
NAME           STATUS     ROLES    AGE   VERSION
10.aa.aa.aaa   Ready      <none>   38d   v1.17.3
10.aa.aa.aaa   NotReady   <none>   18d   v1.17.2
10.aa.aa.aaa   NotReady   <none>   38d   v1.17.2
10.aa.aa.aaa   NotReady   <none>   18d   v1.17.2
aaa-master     Ready      <none>   59d   v1.17.3

microk8s is disabled..

root@wk3:/home# snap list
Name      Version    Rev   Tracking  Publisher   Notes
core      16-2.43.3  8689  stable    canonical✓  core
kubectl   1.17.3     1424  1.17      canonical✓  classic
microk8s  v1.17.2    1176  1.17      canonical✓  disabled,classic
root@wk3:/home# snap changes microk8s
ID   Status  Spawn                Ready  Summary
20   Doing   today at 09:56 AEDT  -      Auto-refresh snap "microk8s"

Data copy appears hanged

root@wk3:/home# snap tasks --last=auto-refresh
Status  Spawn                Ready                Summary
Done    today at 09:56 AEDT  today at 09:56 AEDT  Ensure prerequisites for "microk8s" are available
Done    today at 09:56 AEDT  today at 09:56 AEDT  Download snap "microk8s" (1254) from channel "1.17/stable"
Done    today at 09:56 AEDT  today at 09:56 AEDT  Fetch and check assertions for snap "microk8s" (1254)
Done    today at 09:56 AEDT  today at 09:56 AEDT  Mount snap "microk8s" (1254)
Done    today at 09:56 AEDT  today at 09:56 AEDT  Run pre-refresh hook of "microk8s" snap if present
Done    today at 09:56 AEDT  today at 09:57 AEDT  Stop snap "microk8s" services
Done    today at 09:56 AEDT  today at 09:57 AEDT  Remove aliases for snap "microk8s"
Done    today at 09:56 AEDT  today at 09:57 AEDT  Make current revision for snap "microk8s" unavailable
Doing   today at 09:56 AEDT  -                    Copy snap "microk8s" data
Do      today at 09:56 AEDT  -                    Setup snap "microk8s" (1254) security profiles
Do      today at 09:56 AEDT  -                    Make snap "microk8s" (1254) available to the system
Do      today at 09:56 AEDT  -                    Automatically connect eligible plugs and slots of snap "microk8s"
Do      today at 09:56 AEDT  -                    Set automatic aliases for snap "microk8s"
Do      today at 09:56 AEDT  -                    Setup snap "microk8s" aliases
Do      today at 09:56 AEDT  -                    Run post-refresh hook of "microk8s" snap if present
Do      today at 09:56 AEDT  -                    Start snap "microk8s" (1254) services
Do      today at 09:56 AEDT  -                    Clean up "microk8s" (1254) install
Do      today at 09:56 AEDT  -                    Run configure hook of "microk8s" snap if present
Do      today at 09:56 AEDT  -                    Run health check of "microk8s" snap
Doing   today at 09:56 AEDT  -                    Consider re-refresh of "microk8s"

There doesn't seem to be much to copy anyway:

root@wk3 /v/l/snapd# du -sh /var/lib/snapd/ /var/snap/ /snap
527M	/var/lib/snapd/
74G	/var/snap/
2.0G	/snap

root@wk3 /s/microk8s# du -sh /snap/microk8s/*
737M	/snap/microk8s/1176
737M	/snap/microk8s/1254

root@wk3 /s/microk8s# du -sh /var/snap/microk8s/*
232K	/var/snap/microk8s/1176
74G	/var/snap/microk8s/common

Starting microk8s fails

user@wk3 /s/m/1254> sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress

root@wk3:/home# snap enable microk8s
error: snap "microk8s" has "auto-refresh" change in progress

Fails to abort..

root@wk3:/home# snap abort 20
root@wk3:/home# snap changes
ID   Status  Spawn                Ready  Summary
20   Abort   today at 09:56 AEDT  -      Auto-refresh snap "microk8s"

user@wk3 /s/m/1254> sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress

root@wk3:/home# snap enable microk8s
error: snap "microk8s" has "auto-refresh" change in progress

snapd service hangs when trying to stop it...

root@wk2 ~# systemctl stop snapd.service
(hangs)

have to resort to manually stopping the process

killall snapd

finally change is undone..

root@wk3:/home# snap changes
ID   Status  Spawn                Ready                Summary
20   Undone  today at 09:56 AEDT  today at 10:41 AEDT  Auto-refresh snap "microk8s"

root@wk3:/home# snap tasks --last=auto-refresh
Status  Spawn                Ready                Summary
Done    today at 09:56 AEDT  today at 10:41 AEDT  Ensure prerequisites for "microk8s" are available
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Download snap "microk8s" (1254) from channel "1.17/stable"
Done    today at 09:56 AEDT  today at 10:41 AEDT  Fetch and check assertions for snap "microk8s" (1254)
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Mount snap "microk8s" (1254)
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Run pre-refresh hook of "microk8s" snap if present
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Stop snap "microk8s" services
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Remove aliases for snap "microk8s"
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Make current revision for snap "microk8s" unavailable
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Copy snap "microk8s" data
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Setup snap "microk8s" (1254) security profiles
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Make snap "microk8s" (1254) available to the system
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Automatically connect eligible plugs and slots of snap "microk8s"
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Set automatic aliases for snap "microk8s"
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Setup snap "microk8s" aliases
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Run post-refresh hook of "microk8s" snap if present
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Start snap "microk8s" (1254) services
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Clean up "microk8s" (1254) install
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Run configure hook of "microk8s" snap if present
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Run health check of "microk8s" snap
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Consider re-refresh of "microk8s

root@wk3:/home# snap list
Name      Version    Rev   Tracking  Publisher   Notes
core      16-2.43.3  8689  stable    canonical✓  core
kubectl   1.17.3     1424  1.17      canonical✓  classic
microk8s  v1.17.2    1176  1.17      canonical✓  classic

Nothing much in snapd logs except for a polkit error - unsure if related:

root@wk3:/home# journalctl -b -u snapd.service

...
Mar 09 06:11:34 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 09 16:11:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 09 16:11:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 09 19:06:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 09 19:06:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 10 02:51:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 10 02:51:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 10 09:56:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl"
Mar 10 10:12:18 wk3 snapd[15182]: daemon.go:208: polkit error: Authorization requires interaction
Mar 10 10:39:24 wk3 systemd[1]: Stopping Snappy daemon...
Mar 10 10:39:24 wk3 snapd[15182]: main.go:155: Exiting on terminated signal.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: State 'stop-sigterm' timed out. Killing.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Killing process 15182 (snapd) with signal SIGKILL.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Main process exited, code=killed, status=9/KILL
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Failed with result 'timeout'.
Mar 10 10:40:54 wk3 systemd[1]: Stopped Snappy daemon.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Triggering OnFailure= dependencies.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Found left-over process 16729 (sync) in control group while starting unit. Ignoring.
Mar 10 10:40:54 wk3 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 10 10:40:54 wk3 systemd[1]: Starting Snappy daemon...
Mar 10 10:40:54 wk3 snapd[18170]: AppArmor status: apparmor is enabled and all features are available
Mar 10 10:40:54 wk3 snapd[18170]: AppArmor status: apparmor is enabled and all features are available
Mar 10 10:40:54 wk3 snapd[18170]: daemon.go:346: started snapd/2.43.3 (series 16; classic) ubuntu/18.04 (amd64) linux/4.15.0-88-generic.
Mar 10 10:40:54 wk3 snapd[18170]: daemon.go:439: adjusting startup timeout by 45s (pessimistic estimate of 30s plus 5s per snap)
Mar 10 10:40:54 wk3 systemd[1]: Started Snappy daemon.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions