Skip to content

prometheus metrics for http webhook publishers#1149

Open
adonthi-fws wants to merge 10 commits into
conductor-oss:mainfrom
adonthi-fws:issue-1073
Open

prometheus metrics for http webhook publishers#1149
adonthi-fws wants to merge 10 commits into
conductor-oss:mainfrom
adonthi-fws:issue-1073

Conversation

@adonthi-fws

@adonthi-fws adonthi-fws commented Jun 3, 2026

Copy link
Copy Markdown

Summary

Fixes #1073

Adds Prometheus/Micrometer metrics for HTTP webhook publishers (workflow_publisher and task_publisher), following the same Monitors pattern used by archive listeners (e.g. recordWorkflowArchived).

New metrics:

  • webhook_publish_success — HTTP 200/202 after publish
  • webhook_publish_failure — exception during publish
  • webhook_enqueue_failure — offer() returns false when buffer full
  • webhook_queue_depth — in-memory notification queue size after enqueue

Test plan

  • Enable conductor.workflow-status-listener.type=workflow_publisher and conductor.task-status-listener.type=task_publisher with a reachable webhook URL
  • Run a workflow to completion
  • Confirm metrics at /actuator/prometheus:
    • webhook_publish_success_total
    • webhook_queue_depth
  • (Optional) Point webhook URL at unreachable host and confirm webhook_publish_failure_total

@adonthi-fws

Copy link
Copy Markdown
Author

Hello @v1r3n please review this

@nthmost-orkes nthmost-orkes left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the submission! There are a few things to clean up first...

First, a bit of a consistency improvement to make here:

recordWebhookQueueDepth takes (notificationType, size) but the other three methods all carry a name parameter (workflow/task definition name). That means you can slice success and failure by name but not queue depth. Either add name to recordWebhookQueueDepth to match, or drop it from the others — whichever feels right for how you'd actually query these in Grafana.

@nthmost-orkes nthmost-orkes left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick question: can you double-check that webhook_enqueue_failure will actually fire in practice?

Both publishers use blockingQueue.put(), which blocks when the queue is full rather than throwing — so the only exception it can raise is InterruptedException (thread interrupted). If the queue fills up under load, put() will stall the caller but the failure metric will stay silent.

If the intent is to detect "queue full, item dropped," offer() is the right call — it returns false immediately when the queue is at capacity, giving you a clean place to record the metric and log a warning. Happy to be corrected if there's a different failure mode you had in mind.

@nthmost-orkes nthmost-orkes left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more observation, non-blocking but worth doing right: the webhook_queue_depth gauge records a snapshot at enqueue time via AtomicDouble.set(). If the publisher thread drains items between enqueues, the gauge can read much higher than the actual depth at scrape time — you'd see a sawtooth that never fully reflects the live queue state.

A pull-style registration in the constructor gives you a live reading at every Prometheus scrape instead:

Gauge.builder("webhook_queue_depth", blockingQueue, Collection::size)
     .tag("notificationType", NOTIFICATION_TYPE)
     .register(Metrics.globalRegistry);

Then no call needed at enqueue time at all. The existing event_queue_depth gauge in Monitors has the same limitation, so this PR is at least consistent — but since you're adding new infrastructure here it's a good opportunity to set a better pattern.

@adonthi-fws

adonthi-fws commented Jun 15, 2026

Copy link
Copy Markdown
Author

Hi @nthmost-orkes
Thanks for the review!

  1. Queue depth is one shared queue per publisher (TASK vs WORKFLOW), so a per-name depth tag doesn’t apply.
    Kept name on success/failure/enqueue counters for per-workflow/task drill-down. Queue depth is tagged by notificationType only since each publisher has one shared queue; added Javadoc to explain.
  2. enqueue failure — Switched put() → offer() so a full queue records webhook_enqueue_failure and logs a warning instead of blocking silently.
  3. queue depth gauge — Replaced snapshot set() with pull-style Gauge.builder(..., queue, BlockingQueue::size) registered in the constructor.

@nthmost-orkes nthmost-orkes changed the title prometheus metrics for http webhook publisheres prometheus metrics for http webhook publishers Jun 22, 2026
@nthmost-orkes

Copy link
Copy Markdown
Contributor

Hi @nthmost-orkes Thanks for the review!

  1. Queue depth is one shared queue per publisher (TASK vs WORKFLOW), so a per-name depth tag doesn’t apply.
    Kept name on success/failure/enqueue counters for per-workflow/task drill-down. Queue depth is tagged by notificationType only since each publisher has one shared queue; added Javadoc to explain.
  2. enqueue failure — Switched put() → offer() so a full queue records webhook_enqueue_failure and logs a warning instead of blocking silently.
  3. queue depth gauge — Replaced snapshot set() with pull-style Gauge.builder(..., queue, BlockingQueue::size) registered in the constructor.

Thanks for the update -- can you increase the test coverage on this feature? Then we should be good to go.

The three counters (webhook_publish_success / _failure / _enqueue_failure) don't have tests yet — only the gauge does.

The counter ones are cheap to add the same way the gauge test works: add a SimpleMeterRegistry probe, call the method, assert the tagged counter incremented (including the defaultIfBlank → "unknown" fallback). And TaskStatusPublisher has no test file at all today, so all of its new wiring is uncovered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Add Prometheus Metrics for Workflow and Task Webhook Publishers

2 participants