Bug: Bootstrap taking longer than the startup probe failure threshold #3370

@nkolev-mparticle

Summary

The startup probe configuration is currently hardcoded in the operator and cannot be customized through the ScyllaCluster CRD. This causes issues during node replacement on large clusters, where the default startup budget (failureThreshold × periodSeconds = 400s, about 6.7 minutes) is insufficient.

Current Behavior

The startup probe is hardcoded in pkg/controller/scylladbdatacenter/resource.go:

StartupProbe: &corev1.Probe{
    TimeoutSeconds:   int32(30),
    FailureThreshold: int32(40),   // 40 × 10s = 400s = ~6.7 min
    PeriodSeconds:    int32(10),
    // ...
}

Problem

During node replacement (--replace-node-first-boot) on clusters with large data volumes (e.g., 17TB+), the node needs to:

  1. Scan and verify existing data directories
  2. Reshape SSTables if needed
  3. Join gossip
  4. Stream data from replicas

This process can take significantly longer than 6.7 minutes, causing the startup probe to fail and the pod to restart repeatedly in a crash loop.

Proposed Solution

Expose startup probe configuration in the ScyllaCluster CRD, similar to how resources and other pod-level settings are configurable. For example:

apiVersion: scylla.scylladb.com/v1
kind: ScyllaCluster
spec:
  datacenter:
    racks:
      - name: rack1
        startupProbe:
          failureThreshold: 360    # 1 hour
          periodSeconds: 10
          timeoutSeconds: 30

Or at the cluster level:

spec:
  startupProbe:
    failureThreshold: 360
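One possible way for the operator to combine such overrides with its current defaults is sketched below. Everything here is an assumption made for illustration: `ProbeSettings`, `ProbeOverrides`, and `mergeProbe` are hypothetical names, not existing operator types, and the real implementation would work on `corev1.Probe` directly. The sketch assumes rack-level values take precedence over cluster-level ones, and that unset fields keep the hardcoded defaults.

```go
package main

import "fmt"

// ProbeSettings mirrors the three probe fields discussed in this issue.
type ProbeSettings struct {
	TimeoutSeconds   int32
	FailureThreshold int32
	PeriodSeconds    int32
}

// ProbeOverrides is a hypothetical CRD fragment; a nil field means
// "keep whatever was set before".
type ProbeOverrides struct {
	TimeoutSeconds   *int32
	FailureThreshold *int32
	PeriodSeconds    *int32
}

// mergeProbe applies cluster-level overrides first and rack-level
// overrides second, so the more specific level wins.
func mergeProbe(defaults ProbeSettings, cluster, rack *ProbeOverrides) ProbeSettings {
	out := defaults
	for _, o := range []*ProbeOverrides{cluster, rack} {
		if o == nil {
			continue
		}
		if o.TimeoutSeconds != nil {
			out.TimeoutSeconds = *o.TimeoutSeconds
		}
		if o.FailureThreshold != nil {
			out.FailureThreshold = *o.FailureThreshold
		}
		if o.PeriodSeconds != nil {
			out.PeriodSeconds = *o.PeriodSeconds
		}
	}
	return out
}

func main() {
	defaults := ProbeSettings{TimeoutSeconds: 30, FailureThreshold: 40, PeriodSeconds: 10}
	threshold := int32(360)
	got := mergeProbe(defaults, &ProbeOverrides{FailureThreshold: &threshold}, nil)
	fmt.Printf("%+v\n", got) // {TimeoutSeconds:30 FailureThreshold:360 PeriodSeconds:10}
}
```

With this shape, clusters that set nothing would keep today's behavior unchanged.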

Workaround

Currently, the only workaround is to manually patch the StatefulSet after deployment:

kubectl patch statefulset <sts-name> -n <namespace> --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/startupProbe/failureThreshold", "value": 360}]'

This is error-prone and may be overwritten by the operator during reconciliation.

Environment

  • Operator version: v1.18.0
  • Scylla version: 2025.4.5-enterprise
  • Cluster size: 24 nodes, ~17TB per node
  • Use case: Node replacement after hardware failure

Additional Context

Related issue: #844 (probe timeouts during overload)

Thank you for considering this feature request!

Metadata

Assignees

No one assigned

Labels

  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
