feat(telemetry): cover qwen serve daemon end-to-end with OpenTelemetry #4554

@doudouOUC

Background

Qwen Code's OpenTelemetry implementation is increasingly complete for the interactive/runtime path, but qwen serve still has a daemon-specific observability gap.

Today the serve daemon process handles HTTP routing, session lifecycle APIs, bridge queueing, ACP child process management, prompt dispatch, cancellation, SSE/EventBus fan-out, and bridge error translation. Most of those daemon-layer operations are not represented in OpenTelemetry. The ACP child can initialize telemetry after loadCliConfig(...) and may emit agent-internal model/tool logs or spans, but that does not cover the full daemon path from HTTP request to bridge to child to response/events.

Current findings:

  • qwen serve starts from packages/cli/src/commands/serve.ts and packages/cli/src/serve/runQwenServe.ts; it calls the serve runner directly and does not construct a Config for the daemon process, so initializeTelemetry(...) is not run in the daemon itself.
  • Config initializes telemetry from packages/core/src/config/config.ts, so telemetry exists mainly in paths that build a normal runtime config.
  • ACP sessions call loadCliConfig(...) in packages/cli/src/acp-integration/acpAgent.ts, so child processes can have telemetry if settings enable it.
  • The ACP session path logs user prompts/tool calls, but it does not currently provide the same top-level interaction span coverage as the interactive client.ts path.
  • sendBridgeError(...) and the bridge lifecycle are primarily observable through daemon stderr today, not OTel traces/logs.

Related but distinct work:

This issue is narrower than #3731 and #4548. Its goal is to make daemon-mode execution reconstructable as a coherent set of OpenTelemetry traces, logs, and metrics.

Problem

When a daemon client sees an error such as POST /session/:id/prompt returning HTTP 500, operators cannot reconstruct the complete path from telemetry alone:

  1. inbound HTTP request to the daemon
  2. route validation and client/session lookup
  3. bridge channel selection or child spawn/reuse
  4. prompt queue wait and dispatch
  5. ACP child prompt handling
  6. model request and tool execution
  7. SSE/EventBus output fan-out
  8. cancellation, close, child exit, and error translation

Some lower-level model/tool telemetry may exist in the child, but the parent daemon span, bridge span, queue timing, lifecycle events, and error mapping are missing. This leaves gaps between client-visible HTTP failures and agent-internal telemetry.

There is also a multi-session concern: the current telemetry SDK is process-level, while daemon mode may serve multiple sessions over time. Any daemon/ACP telemetry work must avoid stale session root context and must attribute spans/logs to the correct workspace, session, prompt, and client.

Proposal

Add OpenTelemetry coverage for the qwen serve daemon path.

Suggested scope:

1. Initialize telemetry in the daemon process

  • Initialize OTel before the HTTP server starts when telemetry is enabled for the daemon workspace/config.
  • Reuse existing exporter, shutdown, diagnostic suppression, resource-attribute, and bounded flush semantics from the core telemetry SDK.
  • Ensure the daemon process does not emit exporter diagnostics to stdout/stderr in structured/non-interactive contexts.
  • Flush/shutdown telemetry during serve shutdown/drain.

2. Add daemon HTTP/request spans

Create a span per relevant daemon request, using route templates rather than raw URLs. At minimum cover:

  • POST /session
  • POST /session/:id/load
  • POST /session/:id/prompt
  • POST /session/:id/cancel
  • DELETE /session/:id
  • GET /workspace/:id/sessions
  • SSE/EventBus subscription routes if applicable

Recommended attributes:

  • HTTP method, route template, status code
  • workspace id/path hash where safe
  • session id when known
  • prompt id when known
  • client id when known
  • request duration
  • error code/type and sanitized error message for failures
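As a rough illustration of "route templates rather than raw URLs", the sketch below maps a raw request path onto one of the templates listed above and builds a span attribute bag from it. The helper names `toRouteTemplate` and `requestSpanAttributes` are hypothetical and do not exist in the codebase. The `http.*` keys follow the OpenTelemetry HTTP semantic conventions; `session.id` is assumed here as the session attribute key.

```typescript
// Route templates taken from the list in this issue; ":id" marks a path parameter.
const ROUTE_TEMPLATES = [
  "/session",
  "/session/:id/load",
  "/session/:id/prompt",
  "/session/:id/cancel",
  "/session/:id",
  "/workspace/:id/sessions",
] as const;

type SpanAttributes = Record<string, string | number>;

/** Returns the matching template for a raw path, or undefined if none match. */
function toRouteTemplate(rawPath: string): string | undefined {
  // Drop the query string so it never ends up in a span attribute.
  const path = rawPath.split("?")[0];
  const parts = path.split("/").filter((p) => p.length > 0);
  return ROUTE_TEMPLATES.find((template) => {
    const templateParts = template.split("/").filter((p) => p.length > 0);
    return (
      templateParts.length === parts.length &&
      templateParts.every((t, i) => t.startsWith(":") || t === parts[i])
    );
  });
}

/** Hypothetical helper: builds low-cardinality attributes for a request span. */
function requestSpanAttributes(
  method: string,
  rawPath: string,
  statusCode: number,
  sessionId?: string,
): SpanAttributes {
  const attrs: SpanAttributes = {
    "http.request.method": method.toUpperCase(),
    "http.response.status_code": statusCode,
  };
  const route = toRouteTemplate(rawPath);
  if (route !== undefined) attrs["http.route"] = route;
  if (sessionId !== undefined) attrs["session.id"] = sessionId;
  return attrs;
}
```

With this sketch, `requestSpanAttributes("post", "/session/abc123/prompt?stream=1", 500, "abc123")` records `http.route` as `/session/:id/prompt`, so the session id appears only in its own attribute and never in the route.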

3. Add bridge and child-process spans/events

Instrument the daemon bridge around operations that are invisible from ACP child telemetry:

  • session create/load/close/cancel
  • child process spawn/reuse/exit
  • bridge channel lookup
  • prompt queue wait time
  • prompt dispatch duration
  • cancel propagation to ACP child
  • pending permission cancellation
  • EventBus/SSE publish/fan-out failures
  • bridge transport close/errors

This should make a prompt trace show where time was spent before the ACP child began model/tool work.
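Prompt queue wait time, one of the items above, could be measured along the following lines. This is a minimal sketch rather than the bridge's actual queue: `PromptQueueTimer` and its methods are hypothetical names, and the clock is injected so the timing logic can be tested without real delays. The resulting `queueWaitMs` value would be recorded on the bridge span or in a histogram.

```typescript
type Clock = () => number;

interface QueuedPrompt {
  promptId: string;
  enqueuedAtMs: number;
}

interface DispatchTiming {
  promptId: string;
  queueWaitMs: number;
}

/** Hypothetical FIFO wrapper that reports how long each prompt waited. */
class PromptQueueTimer {
  private readonly pending: QueuedPrompt[] = [];
  private readonly now: Clock;

  constructor(now: Clock = () => Date.now()) {
    this.now = now;
  }

  /** Records the enqueue timestamp for a prompt. */
  enqueue(promptId: string): void {
    this.pending.push({ promptId, enqueuedAtMs: this.now() });
  }

  /** Removes the oldest prompt and returns its wait time, or undefined if empty. */
  dispatchNext(): DispatchTiming | undefined {
    const next = this.pending.shift();
    if (next === undefined) return undefined;
    return {
      promptId: next.promptId,
      queueWaitMs: this.now() - next.enqueuedAtMs,
    };
  }
}
```

Keeping the wait time as a separate measurement from dispatch duration lets a trace distinguish "the prompt sat behind other work" from "the child was slow to respond".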

4. Propagate trace context across daemon and ACP child

Define a W3C trace context boundary between daemon request spans and ACP child work.

Possible approaches:

  • pass traceparent/tracestate through an ACP request metadata field if the protocol allows it;
  • pass a daemon-generated trace context in an internal envelope field that is not exposed as user prompt content;
  • fall back to OTel links if strict parent-child context is unsafe for queued or long-lived work.

The child-side prompt/interaction span should be parented to, or linked from, the daemon prompt/bridge span so the trace is navigable end to end.
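To make the second approach concrete, here is a sketch of formatting and parsing a W3C `traceparent` value and carrying it in an internal envelope. The `PromptEnvelope` shape and its `traceparent` field are assumptions for illustration only; whether ACP request metadata can carry this is one of the points the issue leaves open. The parser handles only version `00` and rejects all-zero ids, as the W3C Trace Context specification requires.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// Hypothetical internal envelope; the trace context travels next to the
// prompt text, never inside it.
interface PromptEnvelope {
  promptId: string;
  text: string;
  traceparent?: string;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

/** Serializes a context as a version-00 traceparent value. */
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

/** Parses a version-00 traceparent value; returns undefined if it is invalid. */
function parseTraceparent(header: string): TraceContext | undefined {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (match === null) return undefined;
  const [, traceId, spanId, flags] = match;
  // All-zero trace or span ids are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

/** Daemon side: attach the current span's context to the outgoing envelope. */
function withTraceContext(envelope: PromptEnvelope, ctx: TraceContext): PromptEnvelope {
  return { ...envelope, traceparent: formatTraceparent(ctx) };
}

/** Child side: recover the parent (or link target) context, if any was sent. */
function extractTraceContext(envelope: PromptEnvelope): TraceContext | undefined {
  return envelope.traceparent === undefined
    ? undefined
    : parseTraceparent(envelope.traceparent);
}
```

The child can use the extracted context either as the parent of its prompt span or as a span link. The envelope is the same in both cases, so the parent-versus-link decision can be made later without changing the wire format.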

5. Align ACP session tracing with interactive tracing

Bring ACP prompt handling closer to the interactive client.ts trace tree:

  • create a top-level interaction/prompt span for each ACP prompt;
  • ensure child LLM spans and tool spans attach under the correct prompt span;
  • preserve existing prompt/tool log events;
  • avoid global session-root leakage across multiple sessions in one long-lived process.
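One way to avoid session-root leakage is to key root context by session id instead of holding a single process-global value. The sketch below illustrates that idea under assumed names: `SessionRootRegistry` is hypothetical, and so are the `qwen.workspace` and `qwen.prompt.id` attribute keys. A real implementation would hold a span context per session rather than plain strings.

```typescript
interface SessionRoot {
  sessionId: string;
  workspace: string;
}

/** Hypothetical registry: one root context per live session, none global. */
class SessionRootRegistry {
  private readonly roots = new Map<string, SessionRoot>();

  /** Registers (or replaces) the root context for a session. */
  open(root: SessionRoot): void {
    this.roots.set(root.sessionId, root);
  }

  /** Drops the root so a closed session cannot be attributed later. */
  close(sessionId: string): void {
    this.roots.delete(sessionId);
  }

  /** Builds attributes for a prompt span; throws if the session is unknown. */
  attributesFor(sessionId: string, promptId: string): Record<string, string> {
    const root = this.roots.get(sessionId);
    if (root === undefined) {
      throw new Error(`no root context for session ${sessionId}`);
    }
    return {
      "session.id": root.sessionId,
      "qwen.workspace": root.workspace, // hypothetical attribute key
      "qwen.prompt.id": promptId, // hypothetical attribute key
    };
  }
}
```

Failing loudly on an unknown session is a deliberate choice in this sketch: silently falling back to the most recent session's context is exactly the stale-root behavior the list above warns against.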

6. Add daemon metrics/log records where useful

Metrics/logs should complement traces without creating high-cardinality explosions.

Useful low-cardinality metrics may include:

  • request count/latency by route and status class
  • active sessions by workspace
  • prompt queue wait duration
  • child process spawn/restart count
  • bridge error count by error code/type
  • cancellation/close count
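The first item, request count by route and status class, can stay low-cardinality if labels are restricted to the route template and a bucketed status class. The sketch below uses an in-memory counter to show the label shape only. `statusClass` and `RequestCounter` are hypothetical names, and a real implementation would record to an OTel counter instrument instead.

```typescript
/** Buckets an HTTP status code into "1xx".."5xx", or "other" if out of range. */
function statusClass(code: number): string {
  if (!Number.isInteger(code) || code < 100 || code > 599) return "other";
  return `${Math.floor(code / 100)}xx`;
}

/** Hypothetical in-memory stand-in for a counter instrument. */
class RequestCounter {
  private readonly counts = new Map<string, number>();

  /** Increments the series for (route template, status class). */
  record(routeTemplate: string, statusCode: number): void {
    const key = `${routeTemplate}|${statusClass(statusCode)}`;
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  /** Returns the count for one series, or 0 if it was never recorded. */
  get(routeTemplate: string, cls: string): number {
    return this.counts.get(`${routeTemplate}|${cls}`) ?? 0;
  }

  /** Number of distinct label combinations seen so far. */
  seriesCount(): number {
    return this.counts.size;
  }
}
```

Because neither session ids nor exact status codes enter the key, the number of series is bounded by the number of route templates times the number of status classes.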

Log records should include trace/span ids where possible, especially for bridge errors and child stderr correlation.

Acceptance criteria

  • With telemetry enabled, POST /session/:id/prompt produces a trace that starts at the daemon HTTP route and continues through bridge dispatch into ACP child prompt handling, LLM requests, and tool execution where applicable.
  • A generic daemon HTTP 500 sets an error status on the relevant span and emits a correlated log record with route, session id, prompt id if known, and sanitized error details.
  • Session lifecycle APIs (create, load, cancel, close, list) emit useful spans/events or metrics.
  • Child process spawn/reuse/exit and bridge transport failures are observable.
  • Trace context is propagated or linked across the daemon-to-ACP-child boundary.
  • Multiple sessions handled by one daemon do not share stale session root context; spans/logs are attributed to the correct session/workspace/prompt.
  • Existing interactive/TUI telemetry behavior remains unchanged.
  • OTel shutdown is bounded and runs during daemon shutdown/drain.
  • Tests cover daemon telemetry initialization, route span attributes, bridge error span status/logging, context propagation/linking, and multi-session attribution.

Non-goals

Open questions

  • What config should control daemon telemetry initialization when qwen serve has not created a normal session Config yet?
  • Should daemon HTTP route spans be implemented manually, through HTTP/Express instrumentation, or both?
  • Should daemon-to-child context use parent/child propagation or OTel links for queued prompt work?
  • Should ACP child telemetry be one process per session, one process per workspace, or explicitly multi-session-safe with refreshed session context?
