
[FEATURE] Add Prometheus Metrics for Workflow and Task Webhook Publishers #1073

Description

@akhilpathivada

Problem

When using the HTTP webhook publishers (workflow_publisher for workflows, task_publisher for tasks), there is no observability into publish success or failure. All outcomes are only logged via SLF4J — no Prometheus/Micrometer counters, timers, or gauges are exposed.

This means operators cannot:

  • Alert on webhook delivery failures
  • Track publish latency or throughput
  • Monitor notification queue depth or saturation

Meanwhile, the archive listeners in the same module already integrate with Monitors:

Archive listener — has metrics:

// ArchivingWorkflowStatusListener.java ✅
Monitors.recordWorkflowArchived(workflow.getWorkflowName(), workflow.getStatus());

Webhook publisher — no metrics:

// StatusChangePublisher.java ❌
LOGGER.debug("Workflow {} publish is successful.", statusChangeNotification.getWorkflowId());
// ... catch block: only LOGGER.error

Proposed Solution

Reuse the existing Monitors static facade pattern (already used by the archive listeners) in the webhook publishers. This is the same approach already proven in ArchivingWorkflowStatusListener and ArchivingWithTTLWorkflowStatusListener.

New methods in Monitors.java

// Counters
public static void recordWebhookPublishSuccess(String notificationType, String name, String status) { ... }
public static void recordWebhookPublishFailure(String notificationType, String name, String errorType) { ... }
public static void recordWebhookEnqueueFailure(String notificationType, String name) { ... }

// Gauge
public static void recordWebhookQueueDepth(String notificationType, int size) { ... }

| Metric | Where | Trigger |
|---|---|---|
| webhook_publish_success | StatusChangePublisher.ConsumerThread.run(), after successful publishStatusChangeNotification() | HTTP 200/202 from webhook endpoint |
| webhook_publish_success | TaskStatusPublisher.ConsumerThread.run(), after successful publishTaskNotification() | HTTP 200/202 from webhook endpoint |
| webhook_publish_failure | Same two locations, in the catch blocks | Any exception during publish |
| webhook_enqueue_failure | StatusChangePublisher.enqueueWorkflow() and TaskStatusPublisher.enqueueTask() | BlockingQueue.put() failure |
| webhook_queue_depth | StatusChangePublisher and TaskStatusPublisher, after enqueue | Queue size change |
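
The four proposed methods could be sketched roughly as follows. This is a dependency-free stand-in rather than the actual Monitors code: the class name WebhookMonitorsSketch, the tag keys (type, name, status, errorType), and the in-memory maps are assumptions made for illustration. A real implementation would presumably delegate to whatever registry Monitors already wraps.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical, dependency-free stand-in for the proposed Monitors additions. */
final class WebhookMonitorsSketch {
    // Counter values keyed by "metricName{tag=value,...}".
    static final Map<String, LongAdder> COUNTERS = new ConcurrentHashMap<>();
    // Latest observed queue depth, one series per notification type.
    static final Map<String, AtomicInteger> GAUGES = new ConcurrentHashMap<>();

    private WebhookMonitorsSketch() {}

    static void recordWebhookPublishSuccess(String notificationType, String name, String status) {
        increment("webhook_publish_success", "type", notificationType, "name", name, "status", status);
    }

    static void recordWebhookPublishFailure(String notificationType, String name, String errorType) {
        increment("webhook_publish_failure", "type", notificationType, "name", name, "errorType", errorType);
    }

    static void recordWebhookEnqueueFailure(String notificationType, String name) {
        increment("webhook_enqueue_failure", "type", notificationType, "name", name);
    }

    static void recordWebhookQueueDepth(String notificationType, int size) {
        GAUGES.computeIfAbsent(key("webhook_queue_depth", "type", notificationType),
                k -> new AtomicInteger()).set(size);
    }

    /** Returns the current value of a counter series, or 0 if it was never incremented. */
    static long count(String metric, String... tags) {
        LongAdder adder = COUNTERS.get(key(metric, tags));
        return adder == null ? 0L : adder.sum();
    }

    private static void increment(String metric, String... tags) {
        COUNTERS.computeIfAbsent(key(metric, tags), k -> new LongAdder()).increment();
    }

    // Builds a stable series key from alternating tag names and values.
    static String key(String metric, String... tags) {
        StringBuilder sb = new StringBuilder(metric).append('{');
        for (int i = 0; i + 1 < tags.length; i += 2) {
            if (i > 0) sb.append(',');
            sb.append(tags[i]).append('=').append(tags[i + 1]);
        }
        return sb.append('}').toString();
    }
}
```

Tagging failures by exception class keeps the number of series small, whereas tagging by exception message would not. The workflow or task name tag is bounded by the number of registered definitions.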

Example: StatusChangePublisher (before → after)
Before (current — log only):

try {
    workflow = blockingQueue.take();
    statusChangeNotification = new StatusChangeNotification(workflow.toWorkflow());
    publishStatusChangeNotification(statusChangeNotification);
    LOGGER.debug("Workflow {} publish is successful.", statusChangeNotification.getWorkflowId());
} catch (Exception e) {
    LOGGER.error("Error on publishing workflow", e);
}

After (with metrics):

try {
    workflow = blockingQueue.take();
    statusChangeNotification = new StatusChangeNotification(workflow.toWorkflow());
    publishStatusChangeNotification(statusChangeNotification);
    LOGGER.debug("Workflow {} publish is successful.", statusChangeNotification.getWorkflowId());
    Monitors.recordWebhookPublishSuccess("WORKFLOW", workflow.getWorkflowName(), workflow.getStatus().name());
} catch (Exception e) {
    LOGGER.error("Error on publishing workflow", e);
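    // workflow may still be null if take() failed before any assignment; "unknown" is a placeholder tag value
    String workflowName = workflow != null ? workflow.getWorkflowName() : "unknown";
    Monitors.recordWebhookPublishFailure("WORKFLOW", workflowName, e.getClass().getSimpleName());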
}
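
The enqueue side is listed in the metrics table but not shown. A minimal sketch, assuming the publisher hands items to a bounded BlockingQueue via put(): the class EnqueueMetricsSketch, its fields, and the plain atomics standing in for Monitors.recordWebhookEnqueueFailure and Monitors.recordWebhookQueueDepth are all hypothetical, and the real enqueueWorkflow() may differ in signature and error handling.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of the enqueue path with failure and depth metrics. */
final class EnqueueMetricsSketch {
    // Stand-in for Monitors.recordWebhookEnqueueFailure(...).
    final AtomicLong enqueueFailures = new AtomicLong();
    // Stand-in for Monitors.recordWebhookQueueDepth(...).
    final AtomicInteger queueDepth = new AtomicInteger();

    private final BlockingQueue<String> queue;

    EnqueueMetricsSketch(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    /** Returns true if the item was queued; records a failure otherwise. */
    boolean enqueue(String workflowName) {
        try {
            queue.put(workflowName); // blocks while the queue is full
            queueDepth.set(queue.size()); // depth is sampled after each successful enqueue
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            enqueueFailures.incrementAndGet();
            return false;
        }
    }
}
```

Because put() blocks rather than rejecting when the queue is full, the failure counter in this sketch only moves on interruption. Saturation would show up in the depth gauge instead, which is one reason to expose both.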

Why This Approach

  1. Proven pattern: Archive listeners already use Monitors the same way; this is not a new integration pattern
  2. Zero new dependencies: Both workflow-event-listener and task-status-listener already depend on conductor-core, which contains Monitors and transitively brings in the Micrometer stack
  3. Consistent: Webhook publishers will use the same observability mechanism as archive listeners

Implementation Impact

Files to modify:

  • core/src/main/java/com/netflix/conductor/metrics/Monitors.java: Add the new recordWebhook* static methods (publish success, publish failure, enqueue failure, queue depth)
  • workflow-event-listener/.../statuschange/StatusChangePublisher.java: Add Monitors calls in the consumer loop success/failure paths, on enqueue failure, and for queue depth after enqueue
  • task-status-listener/.../TaskStatusPublisher.java: Add Monitors calls in the consumer loop success/failure paths, on enqueue failure, and for queue depth after enqueue

No build.gradle changes required.

Backward Compatible

Yes — fully backward compatible:

  1. Additive only: New methods added to Monitors, new call sites added to publishers. No existing method signatures change
  2. No config changes: No new properties or flags required
  3. Lazy metrics: Counters and gauges only appear in the Prometheus endpoint once they are first recorded (a counter incremented, a gauge set). If webhook listeners are not enabled, the call sites never run and these metrics never appear
  4. No API changes: No REST endpoints, payloads, or wire formats change
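
The lazy-materialization point can be illustrated with a small stand-in registry. This is only a sketch of the general idea under the assumption that series are created on first use; LazyCounterSketch and its scrape() output format are hypothetical and are not the Prometheus exposition format or the Monitors implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical registry whose series exist only after the first increment. */
final class LazyCounterSketch {
    // Sorted map so the rendered output has a stable order.
    private final Map<String, LongAdder> series = new ConcurrentSkipListMap<>();

    void increment(String seriesName) {
        // The series is created here, on first use, not at startup.
        series.computeIfAbsent(seriesName, k -> new LongAdder()).increment();
    }

    /** Renders one "name value" line per existing series. */
    String scrape() {
        StringBuilder sb = new StringBuilder();
        series.forEach((name, value) -> sb.append(name).append(' ').append(value.sum()).append('\n'));
        return sb.toString();
    }
}
```

With this shape, a deployment that never enables the webhook listeners never calls increment(), so scrape() stays empty and existing dashboards see no new series.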
