UX: DX: VolumeClaimTemplate overrides without a spec cause permanent reconciliation failures #2023

@jmealo

Describe the bug

  • If a user mistakes override.statefulSet.spec.volumeClaimTemplates for a merge operation rather than a replace, the resulting cluster can never reconcile.
  • If the user omits spec.resources.requests.storage, the operator interprets it as 0.
  • The error logged by the operator is: shrinking persistent volumes is not supported
  • The error doesn't help debug this configuration mistake, and troubleshooting isn't straightforward if you only inspect the StatefulSet and PVC: you also need to check the Helm output and/or the Cluster CR.
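
To make the failure mode concrete, here is a minimal sketch of how a missing storage request could end up reported as a shrink. It is an illustrative model under stated assumptions, not the operator's code: parse_quantity, requested_storage and check_resize are hypothetical names, the quantity parser handles only a few binary suffixes, and the template name "persistence" and the annotation are made up for the example.

```python
# Illustrative model only; not the operator's implementation.
# parse_quantity, requested_storage and check_resize are hypothetical names.

BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(value):
    """Parse a small subset of Kubernetes quantities into bytes.

    A missing value (None) is treated as 0, mirroring the behaviour
    described in this issue.
    """
    if value is None:
        return 0
    for suffix, factor in BINARY_SUFFIXES.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)


def requested_storage(template):
    """Read spec.resources.requests.storage from one claim template."""
    spec = template.get("spec") or {}
    requests = (spec.get("resources") or {}).get("requests") or {}
    return parse_quantity(requests.get("storage"))


def check_resize(existing_capacity, template):
    """Return an error string if the template would shrink the PVC."""
    if requested_storage(template) < parse_quantity(existing_capacity):
        return "shrinking persistent volumes is not supported"
    return None


# An override written as if it were merged into the defaults:
# it carries metadata only and no spec (assumed example values).
incomplete = {"metadata": {"name": "persistence", "annotations": {"team": "mq"}}}

# An override that restates the full spec, as a replace requires.
complete = {
    "metadata": {"name": "persistence"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "10Gi"}},
    },
}

print(check_resize("10Gi", incomplete))
print(check_resize("10Gi", complete))
```

In this model the incomplete override requests 0 bytes against an existing 10Gi claim, so the only message the user sees is about shrinking, with no mention of the missing field.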

Symptoms:

  • The operator reconciliation loop fails continuously (roughly every 15 minutes, based on the operator logs)
  • Any changes to the RabbitmqCluster CR won't be applied (the operator can't reconcile)
  • Scaling (adding/removing nodes) would likely fail or behave unexpectedly
  • Helm upgrades might appear successful but some changes won't take effect

Fixes suggested:

  • Implement validation at the CRD level to prevent incomplete VolumeClaimTemplate overrides
  • Make the documentation explicit that override is a replace rather than a merge (yes, this is implied by the name, but LLMs are gonna LLM, and devs are going to use them 🙃)
  • Add helpful error messages to the operator logs to aid in troubleshooting configuration errors.
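
As a rough sketch of what the suggested validation and error messages might look like, the check below walks a cluster manifest (as a plain dict) and reports incomplete VolumeClaimTemplate overrides. The function name and message wording are hypothetical, and the full field path spec.override.statefulSet.spec.volumeClaimTemplates is an assumption based on the path quoted in this issue; a real fix would more likely live in the CRD schema or an admission webhook.

```python
# Hypothetical validation sketch; the function name and the message
# wording are assumptions, not part of the operator.


def validate_claim_templates(cluster):
    """Return human-readable errors for incomplete claim template overrides."""
    errors = []
    override = (cluster.get("spec") or {}).get("override") or {}
    sts_spec = (override.get("statefulSet") or {}).get("spec") or {}
    templates = sts_spec.get("volumeClaimTemplates") or []

    for index, template in enumerate(templates):
        path = f"spec.override.statefulSet.spec.volumeClaimTemplates[{index}]"
        spec = template.get("spec")
        if not spec:
            errors.append(
                f"{path}: spec is required; the override replaces the "
                "default template and is not merged with it"
            )
            continue
        requests = (spec.get("resources") or {}).get("requests") or {}
        if not requests.get("storage"):
            errors.append(
                f"{path}.spec.resources.requests.storage: must be set; "
                "a missing value would be treated as 0"
            )
    return errors


# Example manifest with a metadata-only override (assumed values).
cluster = {
    "spec": {
        "override": {
            "statefulSet": {
                "spec": {
                    "volumeClaimTemplates": [
                        {"metadata": {"name": "persistence"}}
                    ]
                }
            }
        }
    }
}

for error in validate_claim_templates(cluster):
    print(error)
```

Reporting the full field path in each message points the user at the Cluster CR, which is where the mistake actually lives, instead of at the PVC.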

Fixes applied:

Logs

{
    "container": "operator",
    "controller": "rabbitmqcluster",
    "controllerGroup": "rabbitmq.com",
    "controllerKind": "RabbitmqCluster",
    "error": "shrinking persistent volumes is not supported",
    "level": "error",
    "msg": "Reconciler error",
    "name": "rabbitmq",
    "namespace": "rabbitmq-system",
    "pod": "rabbitmq-cluster-operator-5f8dc96c76-855k6",
    "reconcileID": "aaa60dae-fb09-4ea9-a10a-9924c4e7da15",
    "service_name": "rabbitmq-cluster-operator",
    "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:353\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:300\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:202",
    "stream": "stderr",
    "ts": "2025-12-09T21:08:26Z"
}
{
    "container": "operator",
    "controller": "rabbitmqcluster",
    "controllerGroup": "rabbitmq.com",
    "controllerKind": "RabbitmqCluster",
    "error": "hit an error while scaling PVC capacity: shrinking persistent volumes is not supported",
    "level": "error",
    "msg": "Failed to scale PVCs: shrinking persistent volumes is not supported",
    "name": "rabbitmq",
    "namespace": "rabbitmq-system",
    "pod": "rabbitmq-cluster-operator-5f8dc96c76-855k6",
    "reconcileID": "aaa60dae-fb09-4ea9-a10a-9924c4e7da15",
    "service_name": "rabbitmq-cluster-operator",
    "stacktrace": "github.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).reconcilePVC\n\t/workspace/controllers/reconcile_persistence.go:21\ngithub.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).Reconcile\n\t/workspace/controllers/rabbitmqcluster_controller.go:225\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:340\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:300\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:202",
    "stream": "stderr",
    "ts": "2025-12-09T21:08:26Z"
}
{
    "container": "operator",
    "controller": "rabbitmqcluster",
    "controllerGroup": "rabbitmq.com",
    "controllerKind": "RabbitmqCluster",
    "error": "unsupported operation",
    "level": "error",
    "msg": "shrinking persistent volumes is not supported",
    "name": "rabbitmq",
    "namespace": "rabbitmq-system",
    "pod": "rabbitmq-cluster-operator-5f8dc96c76-855k6",
    "reconcileID": "aaa60dae-fb09-4ea9-a10a-9924c4e7da15",
    "service_name": "rabbitmq-cluster-operator",
    "stacktrace": "github.com/rabbitmq/cluster-operator/v2/internal/scaling.PersistenceScaler.Scale\n\t/workspace/internal/scaling/scaling.go:52\ngithub.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).reconcilePVC\n\t/workspace/controllers/reconcile_persistence.go:18\ngithub.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).Reconcile\n\t/workspace/controllers/rabbitmqcluster_controller.go:225\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:340\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:300\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:202",
    "stream": "stderr",
    "ts": "2025-12-09T21:08:26Z"
}
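
Until the operator reports something more specific, a small filter over its logs can at least flag which clusters are stuck on this error. The sketch below assumes the logs are available as one JSON object per line (the entries above are pretty-printed for readability) and relies only on the error, msg, name and namespace fields visible in them; find_shrink_errors is a hypothetical helper.

```python
import json

SHRINK_ERROR = "shrinking persistent volumes is not supported"


def find_shrink_errors(lines):
    """Return the (namespace, name) pairs whose log entries mention the shrink error."""
    affected = set()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log entry
        if not isinstance(entry, dict):
            continue
        text = f"{entry.get('error', '')} {entry.get('msg', '')}"
        if SHRINK_ERROR in text:
            affected.add((entry.get("namespace"), entry.get("name")))
    return affected


# Abbreviated entries modelled on the logs above.
sample = [
    '{"error": "unsupported operation", '
    '"msg": "shrinking persistent volumes is not supported", '
    '"name": "rabbitmq", "namespace": "rabbitmq-system"}',
    "not a json line",
]

for namespace, name in sorted(find_shrink_errors(sample)):
    print(f"{namespace}/{name}: check volumeClaimTemplates overrides for a missing storage request")
```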

Expected behavior

  • Refuse invalid cluster specs at deploy time, rather than logging errors during reconciliation.
  • Helpful error messages in the case of misconfiguration not caught by CRDs.

Version and environment information

  • RabbitMQ: 4.1.3
  • RabbitMQ Cluster Operator: 2.16.1
  • Kubernetes: 1.33.5
  • Cloud provider or hardware configuration: Azure AKS

Labels

  • bug: Something isn't working
  • closed-stale: Issue or PR closed due to long period of inactivity
  • stale: Issue or PR with long period of inactivity
