cass-operator seed handling issue - endless loop #911

@auzunhan

What happened?

When we update k8ssandra-operator or the cass-management-api version, the first pod restarted by the rolling restart gets stuck at 1/2 ready (DOWN), because it does not have the cassandra.datastax.com/seed-node=true label. Its address is also missing from the seed endpoints list. If I add the label to the pod manually like this:
kubectl label pod abc-eu-central-eu-central-1b-sts-0 cassandra.datastax.com/seed-node=true --overwrite
then the node comes UP by itself after a short time.
The cass-operator logs, filtered with grep "seed", show that it is stuck in an endless seed-handling loop. Here are two consecutive lines:

2026-03-30T12:24:10.405Z INFO calling Management API reload seeds - POST /api/v0/ops/seeds/reload {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"eu-central","namespace":"default"}, "namespace": "default", "name": "eu-central", "reconcileID": "874f0571-e3bf-4342-93fb-944dbf2cc1d4", "pod": "abc-eu-central-eu-central-1a-sts-0"}
2026-03-30T12:24:10.437Z INFO calling Management API reload seeds - POST /api/v0/ops/seeds/reload {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"eu-central","namespace":"default"}, "namespace": "default", "name": "eu-central", "reconcileID": "874f0571-e3bf-4342-93fb-944dbf2cc1d4", "pod": "abc-eu-central-eu-central-1c-sts-0"}

As you can see, the pod "abc-eu-central-eu-central-1b-sts-0" never appears in these calls.
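To make the gap easier to spot in a longer log, the check above can be sketched as a small shell filter. This is only an illustration: the function names `reload_seed_pods` and `pods_never_reloaded` are hypothetical helpers, and the sketch assumes the log lines keep the `"pod": "<name>"` field shown above.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helpers, not part of cass-operator).

# Read operator log lines on stdin and print each pod that received a
# "reload seeds" call, one name per line, sorted and de-duplicated.
reload_seed_pods() {
  grep 'calling Management API reload seeds' \
    | grep -o '"pod": *"[^"]*"' \
    | sed 's/.*"\([^"]*\)"$/\1/' \
    | sort -u
}

# Read operator log lines on stdin and print every expected pod
# (passed as arguments) that never appears in a "reload seeds" call.
pods_never_reloaded() {
  local seen pod
  seen="$(reload_seed_pods)"
  for pod in "$@"; do
    printf '%s\n' "$seen" | grep -qxF -- "$pod" || printf '%s\n' "$pod"
  done
}
```

With the operator log saved to a file (the file name is up to you), something like `pods_never_reloaded abc-eu-central-eu-central-1a-sts-0 abc-eu-central-eu-central-1b-sts-0 abc-eu-central-eu-central-1c-sts-0 < operator.log` should print only the pods that were skipped.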

This is the K8ssandraCluster CRD manifest:

apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: k8ss-cluster
  namespace: default
  #annotations:
    #kustomize.toolkit.fluxcd.io/prune: disabled
spec:
  auth: false
  reaper:
    #cassandra-jaas.config is only required if remote JMX authentication is enabled, but it is no longer used by default between k8ssandra-operator 1.11 and 1.26.
    #Enabling HTTP management removes the need for cassandra-jaas.config and allows pods to run with a read-only root filesystem.
    httpManagement:
      enabled: true
    containerImage:
      name: cassandra-reaper
      registry: docker.io
      repository: thelastpickle
      tag: 4.1.1
    initContainerImage:
      name: cassandra-reaper
      registry: docker.io
      repository: thelastpickle
      tag: 4.1.1
    autoScheduling:
      enabled: true
  cassandra:
    telemetry: 
      prometheus:
        enabled: true
      vector:
        enabled: true
      mcac:
        enabled: false
    clusterName: "ABC DEV"
    serverVersion: "4.1.10"
    resources:
      requests:
        memory: "6G"
        cpu: "1"
      limits:
        memory: "8G"
    datacenters:
      - metadata:
          name: eu-central
        size: 3
        racks:
        - name: eu-central-1a
          nodeAffinityLabels:
            topology.kubernetes.io/zone: eu-central-1a
        - name: eu-central-1b
          nodeAffinityLabels:
            topology.kubernetes.io/zone: eu-central-1b
        - name: eu-central-1c
          nodeAffinityLabels:
            topology.kubernetes.io/zone: eu-central-1c
        storageConfig:
          cassandraDataVolumeClaimSpec:
            storageClassName: gp3-retain
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 400Gi
        config:
          cassandraYaml:
            batch_size_fail_threshold: 5000KiB
            batch_size_warn_threshold: 1000KiB
            num_tokens: 256
            partitioner: org.apache.cassandra.dht.RandomPartitioner
          jvmOptions:
            gc: ZGC
            heapSize: 4048M
    serviceAccount: k8ssandra
 

K8ssandra-operator manifest:

---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: k8ssandra
  namespace: default
spec:
  interval: 1m0s
  url: https://helm.k8ssandra.io/stable
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: k8ssandra-operator
  namespace: default
spec:
  chart:
    spec:
      chart: k8ssandra-operator
      reconcileStrategy: ChartVersion
      sourceRef:
        kind: HelmRepository
        name: k8ssandra
        namespace: default
      version: 1.30.2
  interval: 1m0s

What did you expect to happen?

I expect the operator to add the restarted node to the seed endpoint list so that it comes UP. We should not have to add the label manually.
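One way to check this expectation is to list the pods that currently lack the seed label. The sketch below is only an illustration: `pods_without_seed_label` is a hypothetical helper that reads the output of `kubectl get pods --show-labels --no-headers` on stdin. Depending on how the operator chooses seeds, some pods may legitimately lack the label, so treat the output as a hint rather than a verdict.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper, not part of cass-operator).

# Read `kubectl get pods --show-labels --no-headers` output on stdin and
# print the name (first column) of every pod whose line does not contain
# the label cassandra.datastax.com/seed-node=true.
pods_without_seed_label() {
  awk '!/cassandra\.datastax\.com\/seed-node=true/ { print $1 }'
}

# Example invocation (requires cluster access, so it is left commented out):
#   kubectl get pods -n default --show-labels --no-headers | pods_without_seed_label
```

Without a label selector, the example invocation also lists non-Cassandra pods in the namespace, such as the operator itself, so narrowing the `kubectl get pods` call to the Cassandra pods gives cleaner output.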

How can we reproduce it (as minimally and precisely as possible)?

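Update k8ssandra-operator or the cass-management-api version and watch the first pod that the rolling restart touches. The operator should handle the nodes by itself. We should not have to add the label manually with a command like kubectl label pod abc-eu-central-eu-central-1b-sts-0 cassandra.datastax.com/seed-node=true --overwrite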

cass-operator version

image: docker.io/k8ssandra/cass-operator:v1.28.1

Kubernetes version

Client Version: v1.31.1
Kustomize Version: v5.4.2
Server Version: v1.33.8-eks-3a10415

Method of installation

Kustomize, Helm by k8ssandra-operator

Anything else we need to know?

Did I miss something?

Labels: bug
