
The resource daemonset.apps/speaker went down after upgrading the operator and metallb with the manifest v0.13.11 #379

@elevesque-sfr

Description


MetalLB Version
operator v0.13.11
metallb v0.13.10

OS : Talos 1.3.7
Kubernetes : 1.24.9
CNI : Cilium 1.12.4

After upgrading from operator v0.13.4/metallb v0.13.5 to operator v0.13.11/metallb v0.13.10, the resource daemonset.apps/speaker went down, with its pods crash-looping and restarting every few minutes.

[eric@macross ~]$ kubectl get all
NAME                                                      READY   STATUS             RESTARTS         AGE
pod/controller-db6f6ff7d-zjfcr                            1/1     Running            0                70s
pod/metallb-operator-controller-manager-6fd4d656f-tx2hj   1/1     Running            0                15m
pod/metallb-operator-webhook-server-588bbdf874-g2jsd      1/1     Running            0                2m53s
pod/speaker-2tvk6                                         0/1     CrashLoopBackOff   33 (3m3s ago)    3h36m
pod/speaker-5v2sp                                         0/1     CrashLoopBackOff   33 (2m18s ago)   3h36m
pod/speaker-p7spx                                         0/1     CrashLoopBackOff   33 (3m59s ago)   20h
pod/speaker-wrs8n                                         0/1     CrashLoopBackOff   33 (3m59s ago)   3h37m
pod/speaker-xfj7v                                         0/1     CrashLoopBackOff   33 (3m32s ago)   3h36m

Looking at the logs of one of the pods, errors on listing and watching configmaps appear before the speaker pod goes down.

W0825 11:41:31.682290       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
E0825 11:41:31.682339       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
W0825 11:41:33.520445       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
E0825 11:41:33.520473       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
W0825 11:41:39.101431       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
E0825 11:41:39.101463       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
W0825 11:41:46.581417       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
E0825 11:41:46.581469       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
W0825 11:42:03.218915       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
E0825 11:42:03.219009       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
[...]
W0825 11:42:37.744778       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
E0825 11:42:37.744806       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
{"level":"error","ts":"2023-08-25T11:43:30Z","msg":"Could not wait for Cache to sync","controller":"node","controllerGroup":"","controllerKind":"Node","error":"failed to wait for node caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:211\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:242\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/runnable_group.go:219"}
{"level":"info","ts":"2023-08-25T11:43:30Z","msg":"Stopping and waiting for non leader election runnables"}
{"level":"error","ts":"2023-08-25T11:43:30Z","msg":"Could not wait for Cache to sync","controller":"service","controllerGroup":"","controllerKind":"Service","error":"failed to wait for service caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:211\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:242\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/runnable_group.go:219"}
{"level":"error","ts":"2023-08-25T11:43:30Z","msg":"Could not wait for Cache to sync","controller":"bgppeer","controllerGroup":"metallb.io","controllerKind":"BGPPeer","error":"failed to wait for bgppeer caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:211\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:242\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/runnable_group.go:219"}
{"level":"info","ts":"2023-08-25T11:43:30Z","msg":"Stopping and waiting for leader election runnables"}
{"level":"error","ts":"2023-08-25T11:43:30Z","msg":"error received after stop sequence was engaged","error":"failed to wait for service caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/internal.go:555"}
{"level":"error","ts":"2023-08-25T11:43:30Z","msg":"error received after stop sequence was engaged","error":"failed to wait for bgppeer caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/internal.go:555"}
{"level":"info","ts":"2023-08-25T11:43:30Z","msg":"Stopping and waiting for caches"}
{"level":"error","ts":"2023-08-25T11:43:30Z","logger":"controller-runtime.source","msg":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.ConfigMap Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/source/source.go:148\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.26.0/pkg/util/wait/wait.go:235\nk8s.io/apimachinery/pkg/util/wait.poll\n\t/go/pkg/mod/k8s.io/apimachinery@v0.26.0/pkg/util/wait/wait.go:582\nk8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.26.0/pkg/util/wait/wait.go:547\nsigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/source/source.go:136"}
{"level":"info","ts":"2023-08-25T11:43:30Z","msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":"2023-08-25T11:43:30Z","msg":"Wait completed, proceeding to shutdown the manager"}
{"caller":"main.go:201","error":"failed to wait for node caches to sync: timed out waiting for cache to be synced","level":"error","msg":"failed to run k8s client","op":"startup","ts":"2023-08-25T11:43:30Z"}
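
The forbidden errors in the logs can be checked directly by impersonating the speaker service account (a hypothetical verification step, assuming kubectl access to the cluster):

# While the RBAC rule is missing, this check should answer "no"
kubectl auth can-i list configmaps \
  --as=system:serviceaccount:metallb-system:speaker \
  -n metallb-system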

Initial installation and upgrade were both done using the manifest.

As a workaround, we added get/list/watch permissions on the configmaps resource to the clusterrole metallb-system:speaker.
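
For reference, one way to apply that change is a JSON patch (a sketch only; it assumes the core-API-group rule is the first entry in the rules list, as in the ClusterRole dump below):

# Append "configmaps" to the resources of the first rule (apiGroups: [""])
kubectl patch clusterrole metallb-system:speaker --type=json \
  -p='[{"op":"add","path":"/rules/0/resources/-","value":"configmaps"}]'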

[eric@macross ~]$ kubectl get clusterrole metallb-system:speaker -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"labels":{"app":"metallb"},"name":"metallb-system:speaker"},"rules":[{"apiGroups":[""],"resources":["services","endpoints","nodes","namespaces"],"verbs":["get","list","watch"]},{"apiGroups":["discovery.k8s.io"],"resources":["endpointslices"],"verbs":["get","list","watch"]},{"apiGroups":[""],"resources":["events"],"verbs":["create","patch"]},{"apiGroups":["policy"],"resourceNames":["speaker"],"resources":["podsecuritypolicies"],"verbs":["use"]}]}
  creationTimestamp: "2022-09-13T07:16:45Z"
  labels:
    app: metallb
  name: metallb-system:speaker
  resourceVersion: "132426474"
  uid: 12d48a2c-8274-49f7-8e51-aed128a7b112
rules:
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  - nodes
  - namespaces
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
- apiGroups:
  - policy
  resourceNames:
  - speaker
  resources:
  - podsecuritypolicies
  verbs:
  - use

After this modification and a full restart of the speaker pods, everything is now working perfectly.
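
The full restart mentioned above can be triggered by rolling the daemonset (a minimal sketch; any mechanism that recreates the speaker pods would do):

# Recreate all speaker pods so they pick up the updated RBAC permissions
kubectl -n metallb-system rollout restart daemonset speaker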

[eric@macross ~]$ kubectl get po -o wide -w
NAME                                                  READY   STATUS    RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
controller-db6f6ff7d-zjfcr                            1/1     Running   0          24m   10.19.3.207    kw905-vso-pr   <none>           <none>
metallb-operator-controller-manager-6fd4d656f-tx2hj   1/1     Running   0          39m   10.19.3.131    kw905-vso-pr   <none>           <none>
metallb-operator-webhook-server-588bbdf874-g2jsd      1/1     Running   0          26m   10.19.3.208    kw905-vso-pr   <none>           <none>
speaker-5vqsf                                         1/1     Running   0          15m   10.4.205.104   kw902-vso-pr   <none>           <none>
speaker-8jjhv                                         1/1     Running   0          14m   10.4.205.103   kw901-vso-pr   <none>           <none>
speaker-jlz9b                                         1/1     Running   0          15m   10.4.205.107   kw905-vso-pr   <none>           <none>
speaker-jtcxx                                         1/1     Running   0          15m   10.4.205.106   kw904-vso-pr   <none>           <none>
speaker-nlwxq                                         1/1     Running   0          15m   10.4.205.105   kw903-vso-pr   <none>           <none>
[eric@macross ~]$ kubectl logs speaker-jtcxx
[...]
{"level":"info","ts":"2023-08-25T11:47:09Z","msg":"Starting workers","controller":"service","controllerGroup":"","controllerKind":"Service","worker count":1}
{"caller":"service_controller_reload.go:61","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2023-08-25T11:47:09Z"}
{"level":"info","ts":"2023-08-25T11:47:09Z","msg":"Starting workers","controller":"node","controllerGroup":"","controllerKind":"Node","worker count":1}
{"level":"info","ts":"2023-08-25T11:47:09Z","msg":"Starting workers","controller":"bgppeer","controllerGroup":"metallb.io","controllerKind":"BGPPeer","worker count":1}
{"caller":"node_controller.go:46","controller":"NodeReconciler","level":"info","start reconcile":"/km901-vso-pr","ts":"2023-08-25T11:47:09Z"}
{"caller":"config_controller.go:59","controller":"ConfigReconciler","level":"info","start reconcile":"/kw905-vso-pr","ts":"2023-08-25T11:47:09Z"}
{"caller":"node_controller.go:69","controller":"NodeReconciler","end reconcile":"/km901-vso-pr","level":"info","ts":"2023-08-25T11:47:09Z"}
[...]
{"caller":"config_controller.go:59","controller":"ConfigReconciler","level":"info","start reconcile":"/km902-vso-pr","ts":"2023-08-25T11:47:09Z"}
{"caller":"speakerlist.go:310","level":"info","msg":"node event - forcing sync","node addr":"10.4.205.105","node event":"NodeJoin","node name":"kw903-vso-pr","ts":"2023-08-25T11:47:09Z"}
{"caller":"main.go:374","event":"serviceAnnounced","ips":["10.4.207.211"],"level":"info","msg":"service has IP, announcing","pool":"vip-pool","protocol":"layer2","ts":"2023-08-25T11:47:09Z"}
{"caller":"service_controller_reload.go:104","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2023-08-25T11:47:09Z"}
[...]
{"caller":"speakerlist.go:310","level":"info","msg":"node event - forcing sync","node addr":"10.4.205.103","node event":"NodeJoin","node name":"kw901-vso-pr","ts":"2023-08-25T11:47:40Z"}
{"caller":"service_controller_reload.go:61","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2023-08-25T11:47:40Z"}
{"caller":"main.go:418","event":"serviceWithdrawn","ip":["10.4.207.209"],"ips":["10.4.207.209"],"level":"info","msg":"withdrawing service announcement","pool":"vip-pool","protocol":"layer2","reason":"notOwner","ts":"2023-08-25T11:47:40Z"}
{"caller":"main.go:374","event":"serviceAnnounced","ips":["10.4.207.211"],"level":"info","msg":"service has IP, announcing","pool":"vip-pool","protocol":"layer2","ts":"2023-08-25T11:47:40Z"}
{"caller":"service_controller_reload.go:104","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2023-08-25T11:47:40Z"}
[eric@macross ~]$ curl -Is http://argocd.tooling-nms-preprod.valentine.sfr.com/ | head -n 1
HTTP/1.1 200 OK

Below is the diff between the original manifest and the one we used for the upgrade.

[eric@macross metallb]$ diff metallb-operator.yaml metallb-operator-0.13.10.yaml 
3587c3587
<           value: quay.io/metallb/speaker:v0.13.9
---
>           value: quay.io/metallb/speaker:v0.13.10
3589c3589
<           value: quay.io/metallb/controller:v0.13.9
---
>           value: quay.io/metallb/controller:v0.13.10
3664c3664
<         image: quay.io/metallb/controller:v0.13.9
---
>         image: quay.io/metallb/controller:v0.13.10
4212a4213
>   - configmaps

Labels: bug