Skip to content

K3s exits due to containerd exiting with failed to recover state: failed to get metadata for stored sandbox error #11973

Closed
@denysvitali

Description

@denysvitali

See also containerd/containerd#10848 (as this what causes it)

Quick fix: merge containerd/containerd#10853 into your containerd fork :)

Environmental Info:
K3s Version:

k3s version v1.32.1+k3s1 (AlpineLinux)
go version go1.23.6

Node(s) CPU architecture, OS, and Version:

  • arm64
  • postmarketOS edge
  • Linux k3s-node-instantnoodlep 6.13.0 #2-postmarketos-qcom-sm8250 SMP Thu Jan 30 04:56:19 UTC 2025 aarch64 GNU/Linux

Cluster Configuration:

Probably not relevant here, but 3 nodes (1 master, 2 workers).
Full details in my blog post

  • dumpling: v1.32.1-k3s1
  • instantnoodlep: v1.32.1-k3s1
  • master-1: v1.32.0-k3s1

Describe the bug:
For some reason containerd started to fail-start on one of my k3s nodes. This seems to be exactly equal to containerd/containerd#10848. Removing /var/lib/rancher/k3s/agent/containerd's directory works for a while (pulling all the images and such) until containerd decides to crash again.

Steps To Reproduce:

N/A

Expected behavior:
containerd doesn't crash

Actual behavior:
containerd crashes`

Additional context / logs:

k3s

time="2025-03-15T12:06:46+01:00" level=info msg="Found ip 192.168.10.13,2a02:XXXXXXX from iface wlan0"
time="2025-03-15T12:06:46+01:00" level=info msg="Starting k3s agent v1.32.1+k3s1 (AlpineLinux)"
time="2025-03-15T12:06:46+01:00" level=info msg="Updated load balancer k3s-agent-load-balancer default server: 192.168.10.11:6443"
time="2025-03-15T12:06:46+01:00" level=info msg="Server 192.168.10.11:6443@STANDBY*->UNCHECKED from add to load balancer k3s-agent-load-balancer"
time="2025-03-15T12:06:46+01:00" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [192.168.10.11:6443] [default: 192.168.10.11:6443]"
time="2025-03-15T12:06:46+01:00" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [192.168.10.11:6443] [default: 192.168.10.11:6443]"
time="2025-03-15T12:06:46+01:00" level=info msg="Server 192.168.10.11:6443@UNCHECKED*->RECOVERING from successful dial"
time="2025-03-15T12:06:49+01:00" level=warning msg="Host resolv.conf includes loopback or multicast nameservers - kubelet will use autogenerated resolv.conf with nameserver 8.8.8.8"
time="2025-03-15T12:06:49+01:00" level=info msg="Module overlay was already loaded"
time="2025-03-15T12:06:49+01:00" level=info msg="Module nf_conntrack was already loaded"
time="2025-03-15T12:06:49+01:00" level=info msg="Module br_netfilter was already loaded"
time="2025-03-15T12:06:49+01:00" level=info msg="Module iptable_nat was already loaded"
time="2025-03-15T12:06:49+01:00" level=info msg="Module iptable_filter was already loaded"
time="2025-03-15T12:06:49+01:00" level=warning msg="Failed to load kernel module nft-expr-counter with modprobe"
time="2025-03-15T12:06:49+01:00" level=info msg="Logging containerd to /var/lib/rancher/k3s/agent/containerd/containerd.log"
time="2025-03-15T12:06:49+01:00" level=info msg="Running containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd"
W0315 12:06:49.251883     311 logging.go:55] [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "/run/k3s/containerd/containerd.sock", ServerName: "%2Frun%2Fk3s%2Fcontainerd%2Fcontainerd.sock", }. Err: connection error: desc = "transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: connection refused"
time="2025-03-15T12:06:50+01:00" level=info msg="Waiting for containerd startup: rpc error: code = Unavailable desc = server is not initialized yet: unavailable"
time="2025-03-15T12:06:50+01:00" level=error msg="containerd exited: exit status 1"

containerd

time="2025-03-15T12:07:55.654247488+01:00" level=info msg="loading plugin" id=io.containerd.service.v1.tasks-service type=io.containerd.service.v1
time="2025-03-15T12:07:55.654688062+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.containers type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.654750927+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.content type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.654797854+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.diff type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.654852489+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.events type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.654896396+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.images type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.654945719+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.introspection type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.654987333+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.leases type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.655036032+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.namespaces type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.655090719+01:00" level=info msg="loading plugin" id=io.containerd.sandbox.store.v1.local type=io.containerd.sandbox.store.v1
time="2025-03-15T12:07:55.655136032+01:00" level=info msg="loading plugin" id=io.containerd.cri.v1.images type=io.containerd.cri.v1
time="2025-03-15T12:07:55.655764315+01:00" level=info msg="Get image filesystem path \"/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs\" for snapshotter \"overlayfs\""
time="2025-03-15T12:07:55.655941138+01:00" level=info msg="Start snapshots syncer"
time="2025-03-15T12:07:55.656224472+01:00" level=info msg="loading plugin" id=io.containerd.cri.v1.runtime type=io.containerd.cri.v1
time="2025-03-15T12:07:55.657134370+01:00" level=info msg="starting cri plugin" config="{\"containerd\":{\"defaultRuntimeName\":\"runc\",\"runtimes\":{\"runc\":{\"runtimeType\":\"io.containerd.runc.v2\",\"runtimePath\":\"\",\"PodAnnotations\":null,\"ContainerAnnotations\":null,\"options\":{\"BinaryName\":\"\",\"CriuImagePath\":\"\",\"CriuWorkPath\":\"\",\"IoGid\":0,\"IoUid\":0,\"NoNewKeyring\":false,\"Root\":\"\",\"ShimCgroup\":\"\",\"SystemdCgroup\":false},\"privileged_without_host_devices\":false,\"privileged_without_host_devices_all_devices_allowed\":false,\"baseRuntimeSpec\":\"\",\"cniConfDir\":\"\",\"cniMaxConfNum\":0,\"snapshotter\":\"\",\"sandboxer\":\"podsandbox\",\"io_type\":\"\"}},\"ignoreBlockIONotEnabledErrors\":false,\"ignoreRdtNotEnabledErrors\":false},\"cni\":{\"binDir\":\"/usr/libexec/cni\",\"confDir\":\"/var/lib/rancher/k3s/agent/etc/cni/net.d\",\"maxConfNum\":1,\"setupSerially\":false,\"confTemplate\":\"\",\"ipPref\":\"\",\"useInternalLoopback\":false},\"enableSelinux\":false,\"selinuxCategoryRange\":1024,\"maxContainerLogSize\":16384,\"disableApparmor\":false,\"restrictOOMScoreAdj\":false,\"disableProcMount\":false,\"unsetSeccompProfile\":\"\",\"tolerateMissingHugetlbController\":true,\"disableHugetlbController\":true,\"device_ownership_from_security_context\":false,\"ignoreImageDefinedVolumes\":false,\"netnsMountsUnderStateDir\":false,\"enableUnprivilegedPorts\":true,\"enableUnprivilegedICMP\":true,\"enableCDI\":true,\"cdiSpecDirs\":[\"/etc/cdi\",\"/var/run/cdi\"],\"drainExecSyncIOTimeout\":\"0s\",\"ignoreDeprecationWarnings\":null,\"containerdRootDir\":\"/var/lib/rancher/k3s/agent/containerd\",\"containerdEndpoint\":\"/run/k3s/containerd/containerd.sock\",\"rootDir\":\"/var/lib/rancher/k3s/agent/containerd/io.containerd.grpc.v1.cri\",\"stateDir\":\"/run/k3s/containerd/io.containerd.grpc.v1.cri\"}"
time="2025-03-15T12:07:55.657427236+01:00" level=info msg="loading plugin" id=io.containerd.podsandbox.controller.v1.podsandbox type=io.containerd.podsandbox.controller.v1
time="2025-03-15T12:07:55.657837706+01:00" level=info msg="loading plugin" id=io.containerd.sandbox.controller.v1.shim type=io.containerd.sandbox.controller.v1
time="2025-03-15T12:07:55.658192342+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.sandbox-controllers type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.658418540+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.sandboxes type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.658598385+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.snapshots type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.658758125+01:00" level=info msg="loading plugin" id=io.containerd.streaming.v1.manager type=io.containerd.streaming.v1
time="2025-03-15T12:07:55.658905312+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.streaming type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.659028125+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.tasks type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.659148490+01:00" level=info msg="loading plugin" id=io.containerd.transfer.v1.local type=io.containerd.transfer.v1
time="2025-03-15T12:07:55.659320626+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.transfer type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.659444376+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.version type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.659570158+01:00" level=info msg="loading plugin" id=io.containerd.monitor.container.v1.restart type=io.containerd.monitor.container.v1
time="2025-03-15T12:07:55.659786721+01:00" level=info msg="loading plugin" id=io.containerd.tracing.processor.v1.otlp type=io.containerd.tracing.processor.v1
time="2025-03-15T12:07:55.661676518+01:00" level=info msg="skip loading plugin" error="skip plugin: tracing endpoint not configured" id=io.containerd.tracing.processor.v1.otlp type=io.containerd.tracing.processor.v1
time="2025-03-15T12:07:55.661715893+01:00" level=info msg="loading plugin" id=io.containerd.internal.v1.tracing type=io.containerd.internal.v1
time="2025-03-15T12:07:55.661753341+01:00" level=info msg="skip loading plugin" error="skip plugin: tracing endpoint not configured" id=io.containerd.internal.v1.tracing type=io.containerd.internal.v1
time="2025-03-15T12:07:55.661784487+01:00" level=info msg="loading plugin" id=io.containerd.ttrpc.v1.otelttrpc type=io.containerd.ttrpc.v1
time="2025-03-15T12:07:55.661822195+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.healthcheck type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.661863601+01:00" level=info msg="loading plugin" id=io.containerd.nri.v1.nri type=io.containerd.nri.v1
time="2025-03-15T12:07:55.661920997+01:00" level=info msg="runtime interface created"
time="2025-03-15T12:07:55.661940997+01:00" level=info msg="created NRI interface"
time="2025-03-15T12:07:55.661972977+01:00" level=info msg="loading plugin" id=io.containerd.grpc.v1.cri type=io.containerd.grpc.v1
time="2025-03-15T12:07:55.662135321+01:00" level=info msg="Connect containerd service"
time="2025-03-15T12:07:55.662251936+01:00" level=info msg="using experimental NRI integration - disable nri plugin to prevent this"
time="2025-03-15T12:07:55.689402788+01:00" level=info msg="Start subscribing containerd event"
time="2025-03-15T12:07:55.689623101+01:00" level=info msg="Start recovering state"
time="2025-03-15T12:07:55.692493265+01:00" level=info msg=serving... address=/run/k3s/containerd/containerd.sock.ttrpc
time="2025-03-15T12:07:55.692630921+01:00" level=info msg=serving... address=/run/k3s/containerd/containerd.sock
time="2025-03-15T12:07:55.734128270+01:00" level=fatal msg="Failed to run CRI service" error="failed to recover state: failed to get metadata for stored sandbox \"7e86d8c2543124b203e2db139a10f31d505ad1abb0a129cee23a1a3ae47e51d2\": not found"

df

Disk seems fine, so this doesn't look like the source of my issue.

Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/sda14p2   14G  2.1G   11G  16% /
/dev/sda23           216G   47G  158G  23% /var/lib/rancher

Metadata

Metadata

Labels

kind/upstream-issueThis issue appears to be caused by an upstream bug

Type

No type

Projects

Status

Done Issue

Relationships

None yet

Development

No branches or pull requests

Issue actions