[Observability] Detector Controller Concurrent Reconciliation Issues Trigger SLO violations #7120

@jabellard

Description
Summary

The detector controller frequently fails to create ResourceBindings with "already exists" errors, indicating a race condition where multiple reconciliations attempt to create the same ResourceBinding concurrently.

Problem Description

When creating Deployments, the detector controller reconciles them and creates ResourceBindings using controllerutil.CreateOrUpdate(). However, we observe frequent "already exists" errors:

Error Pattern:

E0118 02:08:38.005102 detector.go:496] Failed to apply policy(deployment-policy) for object: apps/v1, kind=Deployment, default/nginx-deployment-1. error: resourcebindings.work.karmada.io "nginx-deployment-1-deployment" already exists

Frequency: 218 errors in 5 minutes for 3 Deployments (~73 errors per deployment)

Sample Logs

The logs show two reconciliations starting within microseconds and one succeeding while the other fails:

I0118 02:08:37.988786 detector.go:242] Reconciling object: apps/v1, kind=Deployment, default/nginx-deployment-1
I0118 02:08:37.988852 detector.go:242] Reconciling object: apps/v1, kind=Deployment, default/nginx-deployment-1 

I0118 02:08:38.001183 detector.go:502] Create ResourceBinding(default/nginx-deployment-1-deployment) successfully.
E0118 02:08:38.005102 detector.go:496] Failed to apply policy(deployment-policy) for object: apps/v1, kind=Deployment, default/nginx-deployment-1. error: resourcebindings.work.karmada.io "nginx-deployment-1-deployment" already exists

Why CreateOrUpdate Should Not Return AlreadyExists

The controllerutil.CreateOrUpdate() function internally performs:

  1. GET the resource
  2. If NotFound → CREATE
  3. If exists → UPDATE

Normal (non-concurrent) behavior:

  • GET → NotFound → CREATE → Success ✅
  • GET → Found → UPDATE → Success ✅

The AlreadyExists error can ONLY occur if:

Reconciliation A: GET → NotFound
Reconciliation B: GET → NotFound  ← Happens concurrently
Reconciliation A: CREATE → Success ✅
Reconciliation B: CREATE → AlreadyExists ❌  ← Race condition!

This indicates that two reconciliations are running concurrently for the same Deployment.

Root Cause

The race window exists in the ApplyPolicy function:

File: pkg/detector/detector.go:481-517

err = retry.RetryOnConflict(retry.DefaultRetry, func() (err error) {
    operationResult, err = controllerutil.CreateOrUpdate(context.TODO(), d.Client, bindingCopy, func() error {
        // ... update binding fields ...
    })
    return err
})

The problem:

  • Both Conflict and AlreadyExists are HTTP 409 errors, but with different StatusReason values:
    • Conflict: Code=409, Reason="Conflict" (stale resourceVersion on UPDATE)
    • AlreadyExists: Code=409, Reason="AlreadyExists" (duplicate CREATE)
  • retry.RetryOnConflict only retries on errors with StatusReason="Conflict"
  • It does NOT retry on errors with StatusReason="AlreadyExists"
  • When concurrent reconciliations race to CREATE, one fails with AlreadyExists and the error propagates

Proposed Fix

Replace retry.RetryOnConflict with retry.OnError that handles both Conflict and AlreadyExists:

err = retry.OnError(retry.DefaultRetry, func(err error) bool {
    // Retry on both Conflict and AlreadyExists errors
    return apierrors.IsConflict(err) || apierrors.IsAlreadyExists(err)
}, func() (err error) {
    operationResult, err = controllerutil.CreateOrUpdate(context.TODO(), d.Client, bindingCopy, func() error {
        // ... update binding fields ...
        return nil
    })
    return err
})

On retry, CreateOrUpdate re-runs its GET, finds the ResourceBinding created by the winning reconciliation, and takes the UPDATE path instead of failing.

Affected Code Locations

The same pattern needs to be fixed in 3 locations:

  1. detector.go:481-517 - ApplyPolicy() for PropagationPolicy → ResourceBinding
  2. detector.go:573-602 - ApplyClusterPolicy() for ClusterPropagationPolicy → ResourceBinding (namespaced)
  3. detector.go:624-651 - ApplyClusterPolicy() for ClusterPropagationPolicy → ClusterResourceBinding (cluster-scoped)

SLO Impact

This issue directly affects the policy-apply-availability SLO via the metrics.ObserveApplyPolicyAttemptAndLatency() call at detector.go:452, which records every apply attempt and its outcome:

defer func() {
    metrics.ObserveApplyPolicyAttemptAndLatency(err, start)
    if err != nil {
        d.EventRecorder.Eventf(object, corev1.EventTypeWarning, events.EventReasonApplyPolicyFailed, ...)
    }
}()

Each AlreadyExists error is counted as a failed policy application attempt, causing:

SLO Violation:

  • Total attempts: 345 (in 5 minutes)
  • Failures: 218 AlreadyExists errors (63.2%)
  • Success rate: 36.8%
  • SLO target: 99.9%
  • Status: ❌ VIOLATED (63% failure rate vs 0.1% threshold)

Prometheus query:

# Policy apply failure rate
sum(rate(policy_apply_attempts_total{result="error"}[5m]))
/
sum(rate(policy_apply_attempts_total[5m]))

# Shows: 0.632 (63.2%) - CRITICAL VIOLATION

Expected Impact

Before fix:

  • AlreadyExists errors: 218 in 5 minutes
  • Policy apply success rate: 36.8%
  • SLO status: ❌ VIOLATED (63% failure rate)
  • User impact: Warning events, failed reconciliations

After fix:

  • AlreadyExists errors: 0 (handled gracefully via retry)
  • Policy apply success rate: 99.9%+
  • SLO status: ✅ COMPLIANT (meets 99.9% target)
  • User impact: Transparent retry, clean success

Metadata

Labels: kind/feature