Bug: Bootstrap taking longer than the startup probe failure threshold

### Summary

The startup probe configuration is currently hardcoded in the operator and cannot be customized through the ScyllaCluster CRD. This causes issues during node replacement operations on large clusters where the default timeout (~6.7 minutes) is insufficient.

### Current Behavior

The startup probe is hardcoded in `pkg/controller/scylladbdatacenter/resource.go`:

```go
StartupProbe: &corev1.Probe{
    TimeoutSeconds:   int32(30),
    FailureThreshold: int32(40),   // 40 × 10s = 400s = ~6.7 min
    PeriodSeconds:    int32(10),
    // ...
}
```

### Problem

During node replacement (`--replace-node-first-boot`) on clusters with large data volumes (e.g., 17TB+), the node needs to:
1. Scan and verify existing data directories
2. Reshape SSTables if needed
3. Join gossip
4. Stream data from replicas

This process can take significantly longer than 6.7 minutes, causing the startup probe to fail and the pod to restart repeatedly in a crash loop.

### Proposed Solution

Expose startup probe configuration in the ScyllaCluster CRD, similar to how `resources` and other pod-level settings are configurable. For example:

```yaml
apiVersion: scylla.scylladb.com/v1
kind: ScyllaCluster
spec:
  datacenter:
    racks:
      - name: rack1
        startupProbe:
          failureThreshold: 360    # 1 hour
          periodSeconds: 10
          timeoutSeconds: 30
```

Or at the cluster level:

```yaml
spec:
  startupProbe:
    failureThreshold: 360
```

### Workaround

Currently, the only workaround is to manually patch the StatefulSet after deployment:

```bash
kubectl patch statefulset <sts-name> -n <namespace> --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/startupProbe/failureThreshold", "value": 360}]'
```

This is error-prone and may be overwritten by the operator during reconciliation.

### Environment

- Operator version: v1.18.0
- Scylla version: 2025.4.5-enterprise
- Cluster size: 24 nodes, ~17TB per node
- Use case: Node replacement after hardware failure

### Additional Context

Related issue: #844 (probe timeouts during overload)

Thank you for considering this feature request!


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: Bootstrap taking longer than the startup probe failure threshold #3370

Summary

Current Behavior

Problem

Proposed Solution

Workaround

Environment

Additional Context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Bug: Bootstrap taking longer than the startup probe failure threshold #3370

Description

Summary

Current Behavior

Problem

Proposed Solution

Workaround

Environment

Additional Context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions