Improve o11y coverage for reconcilers #1323

Draft
wbrowne wants to merge 2 commits into main from wb/retry-processor-metrics

@wbrowne wbrowne commented Apr 7, 2026

What Changed? Why?

The reconciliation pipeline had two observability gaps:

  • The retry processor is a black box with no metrics: you can't tell how many retries are pending, how long they wait, or what proportion succeed versus exhaust their policy.
  • Errors that occur before retry queuing only become visible at retry_processor_enqueued_total, which misses the first failure entirely if a retry succeeds and gives no signal when retries are disabled.

This PR adds:

  • reconcilerErrors and watcherErrors count failures at the point they occur, before any retry logic, so you get an accurate error rate regardless of retry behaviour. Alerting on these catches regressions immediately rather than waiting for retries to exhaust.
  • The retry processor metrics (enqueued_total, executions_total, pending_total, queue_wait_duration_seconds) give you the full retry lifecycle: how much work is queued, how long it's delayed, and whether it's ultimately succeeding, being explicitly requeued, retrying on error, or failing permanently.

operator/retry_processor.go

  • Adds MetricsConfig metrics.Config to RetryProcessorConfig (inherited from the informer so all related reconciler metrics share the same namespace)
  • 4 Prometheus metrics: enqueued_total, executions_total (action+result labels), pending_total (GaugeFunc), queue_wait_duration_seconds

operator/informer_controller.go

  • reconcilerErrors *prometheus.CounterVec — incremented on first reconciler error, before retry queuing
  • watcherErrors *prometheus.CounterVec — incremented in each watcher closure (add/update/delete) on error

How was it tested?

Added unit tests

Where did you document your changes?

N/A (AFAIK 😅)

Notes to Reviewers

@wbrowne wbrowne self-assigned this Apr 7, 2026

wbrowne commented Apr 8, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd95db3ee3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@wbrowne wbrowne force-pushed the wb/retry-processor-metrics branch from 211b007 to da384f4 April 8, 2026 19:23
@wbrowne wbrowne moved this from 📬 Triage to 🧑‍💻 In development in Grafana Catalog Team Apr 10, 2026