Summary
The detector controller frequently fails to create ResourceBindings with "already exists" errors, indicating a race condition where multiple reconciliations attempt to create the same ResourceBinding concurrently.
Problem Description
When Deployments are created, the detector controller reconciles them and creates ResourceBindings using controllerutil.CreateOrUpdate(). However, we observe frequent "already exists" errors:
Error Pattern:
E0118 02:08:38.005102 detector.go:496] Failed to apply policy(deployment-policy) for object: apps/v1, kind=Deployment, default/nginx-deployment-1. error: resourcebindings.work.karmada.io "nginx-deployment-1-deployment" already exists
Frequency: 218 errors in 5 minutes for 3 Deployments (~73 errors per deployment)
Sample Logs
The logs show two reconciliations for the same Deployment starting within microseconds of each other; one succeeds while the other fails:
I0118 02:08:37.988786 detector.go:242] Reconciling object: apps/v1, kind=Deployment, default/nginx-deployment-1
I0118 02:08:37.988852 detector.go:242] Reconciling object: apps/v1, kind=Deployment, default/nginx-deployment-1
I0118 02:08:38.001183 detector.go:502] Create ResourceBinding(default/nginx-deployment-1-deployment) successfully.
E0118 02:08:38.005102 detector.go:496] Failed to apply policy(deployment-policy) for object: apps/v1, kind=Deployment, default/nginx-deployment-1. error: resourcebindings.work.karmada.io "nginx-deployment-1-deployment" already exists
Why CreateOrUpdate Should Not Return AlreadyExists
The controllerutil.CreateOrUpdate() function internally performs:
- GET the resource
- If NotFound → CREATE
- If exists → UPDATE
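For reference, here is a simplified sketch of that Get → Create/Update flow (an illustration only, not the verbatim controller-runtime source; apierrors is k8s.io/apimachinery/pkg/api/errors and client is sigs.k8s.io/controller-runtime/pkg/client):
// Simplified illustration of controllerutil.CreateOrUpdate's internal logic.
func createOrUpdateSketch(ctx context.Context, c client.Client, obj client.Object, mutate func() error) error {
	if err := c.Get(ctx, client.ObjectKeyFromObject(obj), obj); err != nil {
		if !apierrors.IsNotFound(err) {
			return err
		}
		if err := mutate(); err != nil {
			return err
		}
		// CREATE path: fails with AlreadyExists if another reconciliation created the object first.
		return c.Create(ctx, obj)
	}
	if err := mutate(); err != nil {
		return err
	}
	// UPDATE path: fails with Conflict if the cached resourceVersion is stale.
	return c.Update(ctx, obj)
}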
Normal (non-concurrent) behavior:
- GET → NotFound → CREATE → Success ✅
- GET → Found → UPDATE → Success ✅
The AlreadyExists error can ONLY occur if:
Reconciliation A: GET → NotFound
Reconciliation B: GET → NotFound ← Happens concurrently
Reconciliation A: CREATE → Success ✅
Reconciliation B: CREATE → AlreadyExists ❌ ← Race condition!
This indicates that two reconciliations are running concurrently for the same Deployment.
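This two-Create interleaving is easy to reproduce outside Karmada. Below is a minimal, hypothetical sketch using controller-runtime's fake client (a ConfigMap stands in for the ResourceBinding; the two Create calls play reconciliations A and B after both observed NotFound):
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"
)

func main() {
	ctx := context.TODO()
	c := fake.NewClientBuilder().Build()

	newObj := func() *corev1.ConfigMap {
		return &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "nginx-deployment-1-deployment"}}
	}

	// Reconciliation A: saw NotFound, its CREATE succeeds.
	fmt.Println(c.Create(ctx, newObj())) // <nil>

	// Reconciliation B: also saw NotFound, its CREATE loses the race.
	err := c.Create(ctx, newObj())
	fmt.Println(apierrors.IsAlreadyExists(err)) // true
}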
Root Cause
The race window exists in the ApplyPolicy function:
File: pkg/detector/detector.go:481-517
err = retry.RetryOnConflict(retry.DefaultRetry, func() (err error) {
	operationResult, err = controllerutil.CreateOrUpdate(context.TODO(), d.Client, bindingCopy, func() error {
		// ... update binding fields ...
	})
	return err
})
The problem:
- Both Conflict and AlreadyExists are HTTP 409 errors, but with different StatusReason values:
  - Conflict: Code=409, Reason="Conflict" (stale resourceVersion on UPDATE)
  - AlreadyExists: Code=409, Reason="AlreadyExists" (duplicate CREATE)
- retry.RetryOnConflict only retries on errors with StatusReason="Conflict"
- It does NOT retry on errors with StatusReason="AlreadyExists"
- When concurrent reconciliations race to CREATE, one fails with AlreadyExists and the error propagates
Proposed Fix
Replace retry.RetryOnConflict with retry.OnError, using a predicate that retries on both Conflict and AlreadyExists:
err = retry.OnError(retry.DefaultRetry, func(err error) bool {
	// Retry on both Conflict and AlreadyExists errors
	return apierrors.IsConflict(err) || apierrors.IsAlreadyExists(err)
}, func() (err error) {
	operationResult, err = controllerutil.CreateOrUpdate(context.TODO(), d.Client, bindingCopy, func() error {
		// ... update binding fields ...
		return nil
	})
	return err
})
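Note that in k8s.io/client-go/util/retry, RetryOnConflict is just OnError with apierrors.IsConflict as the retriable predicate, so this change keeps the existing backoff (retry.DefaultRetry) and only widens the predicate to also cover AlreadyExists.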
Affected Code Locations
The same pattern needs to be fixed in 3 locations:
- detector.go:481-517 - ApplyPolicy() for PropagationPolicy → ResourceBinding
- detector.go:573-602 - ApplyClusterPolicy() for ClusterPropagationPolicy → ResourceBinding (namespaced)
- detector.go:624-651 - ApplyClusterPolicy() for ClusterPropagationPolicy → ClusterResourceBinding (cluster-scoped)
SLO Impact
This issue directly affects the policy-apply-availability SLO through the metrics.ObserveApplyPolicyAttemptAndLatency() metric recorded at detector.go:452:
defer func() {
	metrics.ObserveApplyPolicyAttemptAndLatency(err, start)
	if err != nil {
		d.EventRecorder.Eventf(object, corev1.EventTypeWarning, events.EventReasonApplyPolicyFailed, ...)
	}
}()
Each AlreadyExists error is counted as a failed policy application attempt, causing:
SLO Violation:
- Total attempts: 345 (in 5 minutes)
- Failures: 218 AlreadyExists errors (63.2%)
- Success rate: 36.8%
- SLO target: 99.9%
- Status: ❌ VIOLATED (63% failure rate vs 0.1% threshold)
Prometheus query:
# Policy apply failure rate
sum(rate(policy_apply_attempts_total{result="error"}[5m]))
/
sum(rate(policy_apply_attempts_total[5m]))
# Shows: 0.632 (63.2%) - CRITICAL VIOLATION
Expected Impact
Before fix:
- AlreadyExists errors: 218 in 5 minutes
- Policy apply success rate: 36.8%
- SLO status: ❌ VIOLATED (63% failure rate)
- User impact: Warning events, failed reconciliations
After fix:
- AlreadyExists errors: 0 (handled gracefully via retry)
- Policy apply success rate: 99.9%+
- SLO status: ✅ COMPLIANT (meets 99.9% target)
- User impact: Transparent retry, clean success