Add pluggable business tracing to Conductor core (opt-in, no-op by default) #1113

@anveshreddydonthi

Title

Feature: Opt-in OpenTelemetry tracing for workflow start with async correlation

Labels

enhancement, observability


Problem

Workflow execution in Conductor continues long after POST /api/workflow returns (decider queue, sweeper, task scheduling). Operators need a standard way to follow a single workflow's server-side path in a tracing backend: more than the HTTP ingress span, and without a flood of unrelated background traces.

Proposed approach

Use a two-layer model:

  1. Platform (deployer-owned): the OpenTelemetry Java agent instruments HTTP/DB/Redis and exports via OTLP. Sampling and collector filters handle volume (e.g. reconciler noise).
  2. Conductor business tracing (this feature): a small set of named spans for orchestration steps, off by default, behind a pluggable facade.
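
To make the "pluggable facade, off by default" idea concrete, here is a minimal sketch. The names BusinessTracer, Scope, RecordingTracer, and BusinessTracers are hypothetical and not taken from the Conductor codebase; an OTel-backed implementation would sit behind the same interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical facade; the name and shape are illustrative, not Conductor's actual API. */
interface BusinessTracer {
    /** Opens a named span; closing the returned scope ends it. */
    Scope startSpan(String name, Map<String, String> tags);

    /** AutoCloseable without a checked exception, so try-with-resources stays clean. */
    interface Scope extends AutoCloseable {
        @Override
        void close();
    }

    /** Shared no-op instance: returns a scope that does nothing. */
    BusinessTracer NOOP = (name, tags) -> () -> { };
}

/** Test double that records span boundaries instead of exporting them. */
final class RecordingTracer implements BusinessTracer {
    final List<String> events = new ArrayList<>();

    @Override
    public Scope startSpan(String name, Map<String, String> tags) {
        events.add("start " + name);
        return () -> events.add("end " + name);
    }
}

final class BusinessTracers {
    private BusinessTracers() { }

    /** Returns the real tracer only when the feature flag is on. */
    static BusinessTracer select(boolean enabled, BusinessTracer real) {
        return enabled ? real : BusinessTracer.NOOP;
    }
}
```

Call sites would wrap an orchestration step in try (BusinessTracer.Scope s = tracer.startSpan("workflow.start", tags)) { ... }, so the disabled path costs only a no-op call.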

Async correlation: persist W3C trace context in workflow runtime variables (_trace_context), not in queue payloads. Restore on later decider/system-task work so sweeper cycles stay linked to the originating start.

workflow.start → enqueue decider → decide → enqueue task
        ↓ (_trace_context persisted)
   later: workflow.decide (sweeper) → workflow.status when terminal
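
The persist/restore step can be sketched roughly as below. This is a dependency-free illustration, not the proposed implementation: TraceContextStore is a hypothetical helper, and storing the value as a small map with a "traceparent" key is an assumption about the layout of _trace_context. A real implementation would more likely delegate serialization to W3CTraceContextPropagator from opentelemetry-api.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical helper: keeps a W3C traceparent in the workflow's runtime variables. */
final class TraceContextStore {
    static final String KEY = "_trace_context";

    // Version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags.
    private static final Pattern TRACEPARENT =
            Pattern.compile("00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}");

    private TraceContextStore() { }

    /** Accepts only well-formed values; all-zero ids are invalid per the W3C spec. */
    static boolean isValid(String traceparent) {
        if (traceparent == null || !TRACEPARENT.matcher(traceparent).matches()) {
            return false;
        }
        String[] parts = traceparent.split("-");
        return !parts[1].equals("0".repeat(32)) && !parts[2].equals("0".repeat(16));
    }

    /** Called on workflow.start: stores the context; invalid input is ignored. */
    static void persist(Map<String, Object> variables, String traceparent) {
        if (isValid(traceparent)) {
            Map<String, String> ctx = new HashMap<>();
            ctx.put("traceparent", traceparent);
            variables.put(KEY, ctx); // assumed layout, see lead-in
        }
    }

    /** Called before later decider/system-task work: empty means "start a new root". */
    static Optional<String> restore(Map<String, Object> variables) {
        Object stored = variables.get(KEY);
        if (stored instanceof Map) {
            Object tp = ((Map<?, ?>) stored).get("traceparent");
            if (tp instanceof String && isValid((String) tp)) {
                return Optional.of((String) tp);
            }
        }
        return Optional.empty();
    }
}
```

Because the context travels with the workflow state rather than the queue message, a sweeper-triggered decide can call restore on the loaded workflow and parent its span on the original start.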

Design constraints

  • Default off (conductor.tracing.manual.enabled=false)
  • No mandatory OTel SDK in the server JAR (agent + opentelemetry-api only)
  • v1 scope: workflow start and server-side orchestration only; task.poll, task.update, and search are excluded to avoid span storms in production
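
As a sketch of the default-off constraint, the flag could be read so that a missing or unrecognized value always disables tracing. TracingFlag is a hypothetical helper using plain java.util.Properties; since Conductor is a Spring Boot application, the actual wiring would more likely be a conditional bean keyed on the same property.

```java
import java.util.Properties;

/** Hypothetical helper: resolves the opt-in flag, defaulting to disabled. */
final class TracingFlag {
    static final String PROPERTY = "conductor.tracing.manual.enabled";

    private TracingFlag() { }

    /** True only for the literal value "true" (case-insensitive); anything else is off. */
    static boolean isEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(PROPERTY, "false"));
    }
}
```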

v1 scope

In

  • Business spans: workflow.start, workflow.enqueue_decider, workflow.decide, task.enqueue, system_task.execute
  • Tags: workflow.id, workflow.name, workflow.version, correlation.id; workflow.status on terminal decide
  • Docs and a local/staging validation path (OTLP → collector → Jaeger)
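
A possible shape for the tag set, with workflow.status attached only on a terminal decide. WorkflowSpanTags and its Status enum are hypothetical stand-ins rather than Conductor types, and omitting correlation.id when it is null is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that builds the v1 tag set for a workflow.decide span. */
final class WorkflowSpanTags {
    /** Local stand-in for the workflow status; not Conductor's own enum. */
    enum Status {
        RUNNING(false), PAUSED(false),
        COMPLETED(true), FAILED(true), TIMED_OUT(true), TERMINATED(true);

        final boolean terminal;

        Status(boolean terminal) {
            this.terminal = terminal;
        }
    }

    private WorkflowSpanTags() { }

    static Map<String, String> forDecide(String workflowId, String name, int version,
                                         String correlationId, Status status) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("workflow.id", workflowId);
        tags.put("workflow.name", name);
        tags.put("workflow.version", Integer.toString(version));
        if (correlationId != null) {
            tags.put("correlation.id", correlationId); // assumption: omit when absent
        }
        if (status.terminal) {
            tags.put("workflow.status", status.name()); // only on a terminal decide
        }
        return tags;
    }
}
```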

Out

  • Worker poll/update propagation, sub-workflow/event start, full lifecycle tracing, hiding _trace_context from API (later phases)

Acceptance criteria

  • Feature disabled by default; no behavior change when off
  • When enabled with OTel agent: one start request shows correlated spans through decide/enqueue (as applicable)
  • Async decider work correlates via stored context (not orphan roots in Jaeger)
  • Terminal runs are identifiable by the workflow.status tag on the final workflow.decide span
  • Documentation for enablement, agent requirements, noise control, and rollback

Follow-up (separate issues)

  • Task update + worker traceparent contract
  • Sub-workflow / event ingress
  • Optional: strip _trace_context from workflow API responses
