[Observability] Stop recording optimistic concurrency conflict errors that are retried in metrics #7111

@jabellard

Description

What would you like to be added:

The work_sync_workload_duration_seconds metric should not count Kubernetes conflict errors (HTTP 409) as failures. These are expected retriable errors in distributed systems and should not be recorded with result="error".

Why is this needed:

Problem: Conflict Errors Cause False SLO Violations

In our environment, this metric shows a 15% error rate (16.23x burn rate), which triggers critical alerts even though the system is healthy. Nearly all of the "errors" are Kubernetes conflict errors that the controller retries successfully.

Sample Logs

E0115 02:56:14.736986 Failed to update resource(kind=Deployment, default/nginx-deployment)
in cluster test-cluster-region1, err: Operation cannot be fulfilled on
deployments.apps "nginx-deployment": the object has been modified

E0115 02:56:19.000869 Failed to update resource(kind=Deployment, default/nginx-deployment)
in cluster test-cluster-region1, err: Operation cannot be fulfilled on
deployments.apps "nginx-deployment": the object has been modified

[Pattern repeats - controller retries and eventually succeeds]

These are HTTP 409 Conflict errors from Kubernetes optimistic concurrency control - normal, expected, and automatically retried.
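To illustrate why a conflict is a retry signal rather than a failure, here is a minimal sketch of optimistic concurrency using only the Go standard library. The `store`, `object`, `errConflict`, and `updateWithRetry` names are hypothetical stand-ins for the API server and the controller's retry path; they are not taken from the Karmada code base.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the HTTP 409 returned on a stale write (assumption).
var errConflict = errors.New("conflict: the object has been modified")

// object is a hypothetical resource carrying a version counter.
type object struct {
	resourceVersion int
	replicas        int
}

// store is a hypothetical in-memory stand-in for the API server.
type store struct{ current object }

func (s *store) get() object { return s.current }

// update accepts the write only if the caller read the latest version.
func (s *store) update(o object) error {
	if o.resourceVersion != s.current.resourceVersion {
		return errConflict
	}
	o.resourceVersion++
	s.current = o
	return nil
}

// updateWithRetry re-reads the object and retries whenever the write conflicts.
// It returns the number of attempts made and the final error.
func updateWithRetry(s *store, maxAttempts int, mutate func(*object)) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		o := s.get()
		mutate(&o)
		err = s.update(o)
		if !errors.Is(err, errConflict) {
			return attempt, err // success or a non-retriable error
		}
	}
	return maxAttempts, err
}

func main() {
	s := &store{current: object{resourceVersion: 1, replicas: 1}}
	first := true
	attempts, err := updateWithRetry(s, 3, func(o *object) {
		if first {
			// Simulate another writer changing the object after our read.
			first = false
			s.current.resourceVersion++
		}
		o.replicas = 3
	})
	fmt.Println(attempts, err, s.current.replicas)
}
```

The first attempt is rejected because another writer got in first, and the second attempt succeeds against the fresh version. Real controllers usually get this behaviour from requeueing or from a helper such as `retry.RetryOnConflict` in `k8s.io/client-go/util/retry`.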

Current Metrics

# Error rate: 15.8% (target is 99% success)
sum(rate(work_sync_workload_duration_seconds_count{result="error"}[5m])) /
sum(rate(work_sync_workload_duration_seconds_count[5m]))

# Burn rate: 16.23x (burning error budget 16x faster than allowed)
slo:current_burn_rate:ratio{sloth_slo="work-sync-workload-availability"}

Nearly all these "errors" are conflict errors that successfully retry.
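For readers less familiar with burn rates, the relationship between the two numbers above can be sketched as follows. This assumes the common definition of burn rate as the observed error ratio divided by the error budget (1 minus the SLO target); the function names are illustrative, and the exact windows used by the SLO rule are not shown in this issue.

```go
package main

import "fmt"

// errorRatio returns the fraction of observations that failed.
func errorRatio(failed, total float64) float64 {
	if total == 0 {
		return 0
	}
	return failed / total
}

// burnRate returns how many times faster than allowed the error budget
// is being consumed, given an error ratio and an SLO target such as 0.99.
func burnRate(ratio, sloTarget float64) float64 {
	return ratio / (1 - sloTarget)
}

func main() {
	// With a 99% target the budget is 1%, so a 15.8% error ratio
	// burns the budget roughly 15.8 times faster than allowed.
	fmt.Printf("%.1fx\n", burnRate(0.158, 0.99))
	// The same formula applied to raw counts (hypothetical numbers).
	fmt.Printf("%.1fx\n", burnRate(errorRatio(158, 1000), 0.99))
}
```

Under this definition, the reported 16.23x burn rate corresponds to an error ratio of about 16.2%, so it was presumably evaluated over a different window than the 15.8% query.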

Why This Matters

Conflict errors are not failures:

  • Temporary and automatically retried (controller succeeds on next attempt)
  • Expected in distributed systems (Kubernetes optimistic concurrency control)
  • Not user-facing (workloads eventually sync successfully)

Impact:

  • False critical alerts (teams paged for a system that is healthy)
  • Misleading dashboards (85% availability shown when actual is >99%)
  • Alert fatigue and wasted engineering time

Current: Metric measures retry rate ("Did first attempt succeed?")
Expected: Metric should measure availability ("Did workload eventually sync?")


Proposed Change

File: pkg/controllers/execution/execution_controller.go

Skip recording the metric when the sync fails with a conflict error:

func (c *Controller) syncWork(...) (controllerruntime.Result, error) {
	start := time.Now()
	err := c.syncToClusters(ctx, clusterName, work)

	// Don't count conflict errors: they are retriable
	...

Expected impact: Error rate 15% → <1%, Burn rate 16x → <1x
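The elided part of the snippet above could be built around a small classification step. The following is a sketch only, using the standard library so it runs on its own: `statusError`, `isConflict`, and `metricResult` are hypothetical names, and the `"success"` label value is an assumption (the issue only shows `result="error"`). In the real controller the conflict check would most likely be `apierrors.IsConflict` from `k8s.io/apimachinery/pkg/api/errors`.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// statusError is a hypothetical stand-in for a Kubernetes API status error.
type statusError struct {
	code int
	msg  string
}

func (e *statusError) Error() string { return e.msg }

// isConflict reports whether err wraps an HTTP 409 status error.
// It plays the role of apierrors.IsConflict in this sketch.
func isConflict(err error) bool {
	var se *statusError
	return errors.As(err, &se) && se.code == http.StatusConflict
}

// metricResult decides which result label to record for a sync attempt,
// and whether to record the observation at all.
func metricResult(err error) (result string, record bool) {
	switch {
	case err == nil:
		return "success", true
	case isConflict(err):
		return "", false // retriable: skip the observation
	default:
		return "error", true
	}
}

func main() {
	conflict := fmt.Errorf("update failed: %w",
		&statusError{code: http.StatusConflict, msg: "the object has been modified"})
	for _, err := range []error{nil, conflict, errors.New("connection refused")} {
		result, record := metricResult(err)
		fmt.Printf("record=%t result=%q\n", record, result)
	}
}
```

One trade-off to consider: dropping conflicts entirely would hide a workload that conflicts on every attempt. Recording them under a separate label value instead of skipping them is an alternative that keeps the signal without counting it against the SLO.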

Labels

kind/feature: Categorizes issue or PR as related to a new feature.
