Title
Feature: Opt-in OpenTelemetry tracing for workflow start with async correlation
Labels
enhancement, observability
Problem
Workflow execution in Conductor continues long after POST /api/workflow returns (decider queue, sweeper, task scheduling). Operators need a standard way to follow one workflow’s server-side path in a tracing backend—not only HTTP ingress and not a flood of unrelated background traces.
Proposed approach
Use a two-layer model:
- Platform (deployer-owned) — OpenTelemetry Java agent for HTTP/DB/Redis export via OTLP. Sampling and collector filters handle volume (e.g. reconciler noise).
- Conductor business tracing (this feature) — A small set of named spans for orchestration steps, off by default, behind a pluggable facade.
Async correlation: persist W3C trace context in workflow runtime variables (_trace_context), not in queue payloads. Restore on later decider/system-task work so sweeper cycles stay linked to the originating start.
workflow.start → enqueue decider → decide → enqueue task
↓ (_trace_context persisted)
later: workflow.decide (sweeper) → workflow.status when terminal
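A minimal sketch of the persist/restore step, to make the correlation idea concrete. It uses only the JDK so it stays self-contained; a real implementation would more likely go through the opentelemetry-api propagators. The class name TraceContextVariables, the method names, and the stored shape (a small map holding a traceparent entry) are assumptions for illustration, not part of this proposal. Only the _trace_context key and the W3C traceparent layout (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags) come from the issue and the spec.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

/** Stdlib-only sketch: store and restore a W3C traceparent in workflow variables. */
final class TraceContextVariables {
    // Variable key named in this issue.
    static final String KEY = "_trace_context";

    // W3C Trace Context version 00: version-traceid-parentid-flags, lowercase hex.
    private static final Pattern TRACEPARENT =
            Pattern.compile("00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}");
    private static final String ZERO_TRACE_ID = "0".repeat(32);
    private static final String ZERO_SPAN_ID = "0".repeat(16);

    private TraceContextVariables() {}

    /** At workflow.start: remember the active trace context if it is valid. */
    static void persist(Map<String, Object> variables, String traceparent) {
        if (isValid(traceparent)) {
            // Assumed storage shape: a carrier map, so tracestate could be added later.
            Map<String, String> carrier = new HashMap<>();
            carrier.put("traceparent", traceparent);
            variables.put(KEY, carrier);
        }
    }

    /** On later decider or system-task work: recover the stored context, if any. */
    static Optional<String> restore(Map<String, Object> variables) {
        Object stored = variables.get(KEY);
        if (stored instanceof Map<?, ?>) {
            Object value = ((Map<?, ?>) stored).get("traceparent");
            if (value instanceof String && isValid((String) value)) {
                return Optional.of((String) value);
            }
        }
        return Optional.empty();
    }

    static boolean isValid(String traceparent) {
        if (traceparent == null || !TRACEPARENT.matcher(traceparent).matches()) {
            return false;
        }
        String[] parts = traceparent.split("-");
        // All-zero trace or parent IDs are invalid per the W3C spec.
        return !parts[1].equals(ZERO_TRACE_ID) && !parts[2].equals(ZERO_SPAN_ID);
    }
}
```

Because the context travels with the workflow's own state rather than with a queue message, a sweeper-triggered decide can call restore without the queue implementation knowing anything about tracing.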
Design constraints
- Default off (conductor.tracing.manual.enabled=false)
- No mandatory OTel SDK in the server JAR (agent + opentelemetry-api only)
- v1 scope: start + server orchestration only, not task.poll / task.update / search (avoid prod span storms)
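One way the "off by default, behind a pluggable facade" constraint could look, as a rough sketch. The interface WorkflowTracer, the helper WorkflowTracers.select, and the idea of passing the OTel-backed implementation as a supplier are hypothetical names and choices for illustration; only the property conductor.tracing.manual.enabled and its false default come from this issue.

```java
import java.util.Properties;
import java.util.function.Supplier;

/** Hypothetical facade: orchestration code depends only on this interface. */
interface WorkflowTracer {
    /** Runs the step inside a span named spanName, or directly when tracing is off. */
    <T> T inSpan(String spanName, Supplier<T> step);

    /** No-op implementation used when manual tracing is disabled. */
    WorkflowTracer NOOP = new WorkflowTracer() {
        @Override
        public <T> T inSpan(String spanName, Supplier<T> step) {
            return step.get();
        }
    };
}

final class WorkflowTracers {
    // Property name from this issue; absent or unparsable values mean "off".
    static final String ENABLED_PROPERTY = "conductor.tracing.manual.enabled";

    private WorkflowTracers() {}

    /**
     * Picks the tracer. The OTel-backed implementation is supplied lazily so it is
     * never constructed (and its classes never touched) while the feature is off.
     */
    static WorkflowTracer select(Properties config, Supplier<WorkflowTracer> otelBacked) {
        boolean enabled =
                Boolean.parseBoolean(config.getProperty(ENABLED_PROPERTY, "false"));
        return enabled ? otelBacked.get() : WorkflowTracer.NOOP;
    }
}
```

With this shape, call sites such as the start path would wrap a step as tracer.inSpan("workflow.start", () -> ...) and pay only a direct method call when the flag is left at its default.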
v1 scope
In
- Business spans: workflow.start, workflow.enqueue_decider, workflow.decide, task.enqueue, system_task.execute
- Tags: workflow.id, workflow.name, workflow.version, correlation.id; workflow.status on terminal decide
- Docs and a local/staging validation path (OTLP → collector → Jaeger)
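The span names and tags listed above could be pinned down in one place, roughly as follows. This is an illustrative sketch only: the class WorkflowSpanAttributes and the method forDecide are hypothetical, and the set of terminal statuses is an assumption that would need checking against Conductor's own status enum.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Sketch: v1 span names plus the tags attached to a workflow.decide span. */
final class WorkflowSpanAttributes {
    // Span names from the v1 scope of this issue.
    static final String WORKFLOW_START = "workflow.start";
    static final String WORKFLOW_ENQUEUE_DECIDER = "workflow.enqueue_decider";
    static final String WORKFLOW_DECIDE = "workflow.decide";
    static final String TASK_ENQUEUE = "task.enqueue";
    static final String SYSTEM_TASK_EXECUTE = "system_task.execute";

    // Assumed terminal statuses; verify against the server's status enum.
    private static final Set<String> TERMINAL =
            Set.of("COMPLETED", "FAILED", "TIMED_OUT", "TERMINATED");

    private WorkflowSpanAttributes() {}

    /** Builds decide-span tags; workflow.status is added only on a terminal decide. */
    static Map<String, String> forDecide(
            String workflowId, String name, int version, String correlationId, String status) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("workflow.id", workflowId);
        attrs.put("workflow.name", name);
        attrs.put("workflow.version", Integer.toString(version));
        if (correlationId != null) {
            attrs.put("correlation.id", correlationId);
        }
        if (status != null && TERMINAL.contains(status)) {
            attrs.put("workflow.status", status);
        }
        return attrs;
    }
}
```

Keeping the names as constants makes the v1 scope auditable: a span that is not in this list is, by construction, not part of the feature.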
Out
- Worker poll/update propagation, sub-workflow/event start, full lifecycle tracing, hiding _trace_context from API (later phases)
Acceptance criteria
Follow-up (separate issues)
- Task update + worker traceparent contract
- Sub-workflow / event ingress
- Optional: strip _trace_context from workflow API responses