[Observability] Detector Controller Concurrent Reconciliation Issues Trigger SLO violations #7120

@jabellard

Description
Summary

The detector controller frequently fails to create ResourceBindings with "already exists" errors, indicating a race condition where multiple reconciliations attempt to create the same ResourceBinding concurrently.

Problem Description

When creating Deployments, the detector controller reconciles them and creates ResourceBindings using controllerutil.CreateOrUpdate(). However, we observe frequent "already exists" errors:

Error Pattern:

E0118 02:08:38.005102 detector.go:496] Failed to apply policy(deployment-policy) for object: apps/v1, kind=Deployment, default/nginx-deployment-1. error: resourcebindings.work.karmada.io "nginx-deployment-1-deployment" already exists

Frequency: 218 errors in 5 minutes for 3 Deployments (~73 errors per deployment)

Sample Logs

The logs show two reconciliations starting within microseconds and one succeeding while the other fails:

I0118 02:08:37.988786 detector.go:242] Reconciling object: apps/v1, kind=Deployment, default/nginx-deployment-1
I0118 02:08:37.988852 detector.go:242] Reconciling object: apps/v1, kind=Deployment, default/nginx-deployment-1 

I0118 02:08:38.001183 detector.go:502] Create ResourceBinding(default/nginx-deployment-1-deployment) successfully.
E0118 02:08:38.005102 detector.go:496] Failed to apply policy(deployment-policy) for object: apps/v1, kind=Deployment, default/nginx-deployment-1. error: resourcebindings.work.karmada.io "nginx-deployment-1-deployment" already exists

Why CreateOrUpdate Should Not Return AlreadyExists

The controllerutil.CreateOrUpdate() function internally performs:

  1. GET the resource
  2. If NotFound → CREATE
  3. If exists → UPDATE

Normal (non-concurrent) behavior:

  • GET → NotFound → CREATE → Success ✅
  • GET → Found → UPDATE → Success ✅

The AlreadyExists error can ONLY occur if:

Reconciliation A: GET → NotFound
Reconciliation B: GET → NotFound  ← Happens concurrently
Reconciliation A: CREATE → Success ✅
Reconciliation B: CREATE → AlreadyExists ❌  ← Race condition!

This indicates that two reconciliations are running concurrently for the same Deployment.

Root Cause

The race window exists in the ApplyPolicy function:

File: pkg/detector/detector.go:481-517

err = retry.RetryOnConflict(retry.DefaultRetry, func() (err error) {
    operationResult, err = controllerutil.CreateOrUpdate(context.TODO(), d.Client, bindingCopy, func() error {
        // ... update binding fields ...
    })
    return err
})

The problem:

  • Both Conflict and AlreadyExists are HTTP 409 errors, but with different StatusReason values:
    • Conflict: Code=409, Reason="Conflict" (stale resourceVersion on UPDATE)
    • AlreadyExists: Code=409, Reason="AlreadyExists" (duplicate CREATE)
  • retry.RetryOnConflict only retries on errors with StatusReason="Conflict"
  • It does NOT retry on errors with StatusReason="AlreadyExists"
  • When concurrent reconciliations race to CREATE, one fails with AlreadyExists and the error propagates

Proposed Fix

Replace retry.RetryOnConflict with retry.OnError that handles both Conflict and AlreadyExists:

err = retry.OnError(retry.DefaultRetry, func(err error) bool {
    // Retry on both Conflict and AlreadyExists errors
    return apierrors.IsConflict(err) || apierrors.IsAlreadyExists(err)
}, func() (err error) {
    operationResult, err = controllerutil.CreateOrUpdate(context.TODO(), d.Client, bindingCopy, func() error {
        // ... update binding fields ...
        return nil
    })
    return err
})

On retry, CreateOrUpdate re-runs its GET, finds the ResourceBinding created by the winning reconciliation, and takes the UPDATE path instead of failing.

Affected Code Locations

The same pattern needs to be fixed in 3 locations:

  1. detector.go:481-517 - ApplyPolicy() for PropagationPolicy → ResourceBinding
  2. detector.go:573-602 - ApplyClusterPolicy() for ClusterPropagationPolicy → ResourceBinding (namespaced)
  3. detector.go:624-651 - ApplyClusterPolicy() for ClusterPropagationPolicy → ClusterResourceBinding (cluster-scoped)

SLO Impact

This issue directly affects the policy-apply-availability SLO via the metrics.ObserveApplyPolicyAttemptAndLatency() call at detector.go:452, which records every apply attempt and its outcome:

defer func() {
    metrics.ObserveApplyPolicyAttemptAndLatency(err, start)
    if err != nil {
        d.EventRecorder.Eventf(object, corev1.EventTypeWarning, events.EventReasonApplyPolicyFailed, ...)
    }
}()

Each AlreadyExists error is counted as a failed policy application attempt, causing:

SLO Violation:

  • Total attempts: 345 (in 5 minutes)
  • Failures: 218 AlreadyExists errors (63.2%)
  • Success rate: 36.8%
  • SLO target: 99.9%
  • Status: ❌ VIOLATED (63% failure rate vs 0.1% threshold)

Prometheus query:

# Policy apply failure rate
sum(rate(policy_apply_attempts_total{result="error"}[5m]))
/
sum(rate(policy_apply_attempts_total[5m]))

# Shows: 0.632 (63.2%) - CRITICAL VIOLATION

Expected Impact

Before fix:

  • AlreadyExists errors: 218 in 5 minutes
  • Policy apply success rate: 36.8%
  • SLO status: ❌ VIOLATED (63% failure rate)
  • User impact: Warning events, failed reconciliations

After fix:

  • AlreadyExists errors: 0 (handled gracefully via retry)
  • Policy apply success rate: 99.9%+
  • SLO status: ✅ COMPLIANT (meets 99.9% target)
  • User impact: Transparent retry, clean success

Metadata

Labels: kind/feature