Description
What would you like to be added:
The work_sync_workload_duration_seconds metric should not count Kubernetes conflict errors (HTTP 409) as failures. These are expected retriable errors in distributed systems and should not be recorded with result="error".
Why is this needed:
Problem: Conflict Errors Cause False SLO Violations
In our environment, this metric shows a roughly 15% error rate (16.23x burn rate), which triggers critical alerts even though the system is actually healthy. Nearly all of these "errors" are Kubernetes conflict errors that the controller successfully retries.
Sample Logs
E0115 02:56:14.736986 Failed to update resource(kind=Deployment, default/nginx-deployment)
in cluster test-cluster-region1, err: Operation cannot be fulfilled on
deployments.apps "nginx-deployment": the object has been modified
E0115 02:56:19.000869 Failed to update resource(kind=Deployment, default/nginx-deployment)
in cluster test-cluster-region1, err: Operation cannot be fulfilled on
deployments.apps "nginx-deployment": the object has been modified
[Pattern repeats - controller retries and eventually succeeds]
These are HTTP 409 Conflict errors from Kubernetes optimistic concurrency control - normal, expected, and automatically retried.
Current Metrics
# Error rate: 15.8% (target is 99% success)
sum(rate(work_sync_workload_duration_seconds_count{result="error"}[5m])) /
sum(rate(work_sync_workload_duration_seconds_count[5m]))
# Burn rate: 16.23x (burning error budget 16x faster than allowed)
slo:current_burn_rate:ratio{sloth_slo="work-sync-workload-availability"}
Nearly all these "errors" are conflict errors that successfully retry.
Why This Matters
Conflict errors are not failures:
- Temporary and automatically retried (controller succeeds on next attempt)
- Expected in distributed systems (Kubernetes optimistic concurrency control)
- Not user-facing (workloads eventually sync successfully)
Impact:
- False critical alerts (teams paged for "healthy" system)
- Misleading dashboards (85% availability shown when actual is >99%)
- Alert fatigue and wasted engineering time
Current: Metric measures retry rate ("Did first attempt succeed?")
Expected: Metric should measure availability ("Did workload eventually sync?")
Proposed Change
File: pkg/controllers/execution/execution_controller.go
Skip recording the metric:
func (c *Controller) syncWork(...) (controllerruntime.Result, error) {
    start := time.Now()
    err := c.syncToClusters(ctx, clusterName, work)
    // Don't count conflict errors - they are retriable
    ...
Impact: Error rate 15% → <1%, burn rate 16x → <1x
Iteration Tasks
- execution-controller (@RainbowMango, retry on conflict when syncing workload to member clusters #7106)
- binding-controller (@AnupamSingh2004, retry on conflict when syncing work for binding controllers #7121)
- cluster-binding-controller (@AnupamSingh2004, retry on conflict when syncing work for binding controllers #7121)