Bug: Bootstrap taking longer than the startup probe failure threshold #3370
Open
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Summary
The startup probe configuration is currently hardcoded in the operator and cannot be customized through the ScyllaCluster CRD. This causes issues during node replacement operations on large clusters where the default timeout (~6.7 minutes) is insufficient.
Current Behavior
The startup probe is hardcoded in pkg/controller/scylladbdatacenter/resource.go.
Problem
During node replacement (--replace-node-first-boot) on clusters with large data volumes (e.g., 17TB+), the node has to finish bootstrapping before the startup probe can pass. This process can take significantly longer than 6.7 minutes, causing the startup probe to fail and the pod to restart repeatedly in a crash loop.
Proposed Solution
Expose startup probe configuration in the ScyllaCluster CRD, similar to how resources and other pod-level settings are configurable. The setting could also be offered at the cluster level.
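One way such an option could be wired in is sketched below in Go: an optional override struct whose unset fields fall back to the operator's defaults, with a more specific scope winning over the cluster-wide one. Everything here is hypothetical: the type names, the field names, the rack-level scope, and the default values (40, 10, 30) are assumptions for illustration and are not taken from the operator's API.

```go
package main

import "fmt"

// StartupProbeOverrides is a hypothetical CRD field. Nil pointers mean
// "keep whatever the less specific level decided".
type StartupProbeOverrides struct {
	FailureThreshold *int32 `json:"failureThreshold,omitempty"`
	PeriodSeconds    *int32 `json:"periodSeconds,omitempty"`
	TimeoutSeconds   *int32 `json:"timeoutSeconds,omitempty"`
}

// ProbeSettings is the fully resolved configuration.
type ProbeSettings struct {
	FailureThreshold int32
	PeriodSeconds    int32
	TimeoutSeconds   int32
}

// resolve layers overrides on top of defaults: cluster first, then rack,
// so the more specific level wins field by field.
func resolve(defaults ProbeSettings, cluster, rack *StartupProbeOverrides) ProbeSettings {
	out := defaults
	for _, o := range []*StartupProbeOverrides{cluster, rack} {
		if o == nil {
			continue
		}
		if o.FailureThreshold != nil {
			out.FailureThreshold = *o.FailureThreshold
		}
		if o.PeriodSeconds != nil {
			out.PeriodSeconds = *o.PeriodSeconds
		}
		if o.TimeoutSeconds != nil {
			out.TimeoutSeconds = *o.TimeoutSeconds
		}
	}
	return out
}

// ptr is a small helper for building optional fields.
func ptr(v int32) *int32 { return &v }

func main() {
	// Assumed operator defaults, for illustration only.
	defaults := ProbeSettings{FailureThreshold: 40, PeriodSeconds: 10, TimeoutSeconds: 30}
	cluster := &StartupProbeOverrides{FailureThreshold: ptr(4320)}
	fmt.Printf("%+v\n", resolve(defaults, cluster, nil))
}
```

Using pointer fields keeps existing manifests valid: a ScyllaCluster that sets nothing resolves to exactly the current behavior.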
Workaround
Currently, the only workaround is to manually patch the StatefulSet after deployment. This is error-prone and may be overwritten by the operator during reconciliation.
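For illustration, a manual patch of this kind could be built as a strategic merge patch body. The Go sketch below only generates the JSON; it does not talk to the cluster. The container name "scylla" and the numbers are assumptions, and the function name is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// startupProbePatch builds a strategic merge patch that changes only the
// startup probe thresholds of the named container in a StatefulSet.
// Strategic merge matches list entries in "containers" by their "name".
func startupProbePatch(container string, failureThreshold, periodSeconds int32) ([]byte, error) {
	patch := map[string]any{
		"spec": map[string]any{
			"template": map[string]any{
				"spec": map[string]any{
					"containers": []map[string]any{{
						"name": container,
						"startupProbe": map[string]any{
							"failureThreshold": failureThreshold,
							"periodSeconds":    periodSeconds,
						},
					}},
				},
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	// "scylla" is an assumed container name; check the actual StatefulSet.
	body, err := startupProbePatch("scylla", 4320, 10)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

The printed JSON could then be passed to `kubectl patch statefulset <name> --type strategic -p '<json>'`. As noted above, the operator may revert the change on its next reconciliation, so this is a stopgap at best.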
Environment
Additional Context
Related issue: #844 (probe timeouts during overload)
Thank you for considering this feature request!