RBAC: nodes resource missing watch verb; causes informer error loop #349

@Verolop

Description

Describe the bug

The kubebuilder RBAC marker for nodes grants only list, not watch. When validateCellTopology in pkg/resolver/validation.go calls r.Client.List(ctx, &nodeList), controller-runtime's cache creates an informer for corev1.Node. An informer needs both list and watch to function. Without the watch verb, the informer enters a retry loop that spams the operator logs and blocks reconciliation.

The marker at api/v1alpha1/multigrescluster_types.go:44:

// +kubebuilder:rbac:groups="",resources=nodes,verbs=list

Should be:

// +kubebuilder:rbac:groups="",resources=nodes,verbs=list;watch

To Reproduce

  1. Deploy the operator (v0.4.0 or v0.4.1) to an EKS cluster using the generated config/rbac/role.yaml
  2. Create a MultigresCluster with cells that have zone topology (e.g., zone: us-east-1d)
  3. Trigger any reconciliation (scale-up, scale-down, or annotate to force reconcile)
  4. Check operator logs: kubectl logs -n multigres-operator deployment/multigres-operator-controller-manager | grep "nodes is forbidden"

You'll see repeated errors:

nodes is forbidden: User "system:serviceaccount:multigres-operator:multigres-operator-controller-manager" cannot watch resource "nodes" in API group "" at the cluster scope

Scale-down operations may time out because the failed informer blocks the reconciliation loop.

Expected behaviour

The operator should be able to list and watch nodes without RBAC errors. The validateCellTopology function should work correctly, and reconciliation loops should not be blocked by informer failures.

System information

  • Operator version: v0.4.0, v0.4.1
  • Kubernetes: EKS 1.31

Additional context

The validateCellTopology function itself handles the failure gracefully: it returns nil and skips topology validation if it can't list nodes. The problem is the underlying informer retry loop, which pollutes the logs and can block reconciliation.

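Workaround: manually patch the ClusterRole. The path below assumes the nodes rule sits at index 2 of the rules list, so check its position in your deployed ClusterRole before applying: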

kubectl patch clusterrole multigres-operator-manager-role --type=json \
  -p '[{"op":"add","path":"/rules/2/verbs/-","value":"watch"}]'
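Because the patch path hard-codes a rule index, it may be safer to derive the index from the deployed role. The following is a minimal sketch, assuming the ClusterRole has been exported as JSON (for example with kubectl get clusterrole multigres-operator-manager-role -o json). The function name watchPatchForNodes and the sample role in main are hypothetical, and the struct covers only the fields needed here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// policyRule and clusterRole mirror only the ClusterRole fields used below.
type policyRule struct {
	APIGroups []string `json:"apiGroups"`
	Resources []string `json:"resources"`
	Verbs     []string `json:"verbs"`
}

type clusterRole struct {
	Rules []policyRule `json:"rules"`
}

func contains(items []string, want string) bool {
	for _, item := range items {
		if item == want {
			return true
		}
	}
	return false
}

// watchPatchForNodes returns a JSON patch that appends "watch" to the first
// core-group rule covering nodes. It returns an empty patch if that rule
// already grants watch, and an error if no such rule exists.
func watchPatchForNodes(roleJSON []byte) (string, error) {
	var role clusterRole
	if err := json.Unmarshal(roleJSON, &role); err != nil {
		return "", err
	}
	for i, rule := range role.Rules {
		if !contains(rule.APIGroups, "") || !contains(rule.Resources, "nodes") {
			continue
		}
		if contains(rule.Verbs, "watch") {
			return "", nil
		}
		return fmt.Sprintf(`[{"op":"add","path":"/rules/%d/verbs/-","value":"watch"}]`, i), nil
	}
	return "", fmt.Errorf("no core-group rule for nodes found")
}

func main() {
	// Hypothetical role content for illustration; real output will differ.
	sample := []byte(`{"rules":[
		{"apiGroups":["apps"],"resources":["statefulsets"],"verbs":["get","list","watch"]},
		{"apiGroups":[""],"resources":["nodes"],"verbs":["list"]}
	]}`)

	patch, err := watchPatchForNodes(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(patch)
}
```

The printed patch can be passed to kubectl patch with --type=json in place of the hard-coded one. If the matching rule also lists other resources, the added verb applies to all of them, so a dedicated rule for nodes may be preferable in that case.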


Labels: bug (Something isn't working)
