feat(observability): Structured Observability Enhancement — Rich Events, OTel Trace Correlation, and Bridge Refactoring#7233

Closed
FTDGRT wants to merge 3 commits into zeroclaw-labs:master from FTDGRT:feat/obs-enhancement

@FTDGRT FTDGRT commented Jun 5, 2026

Contributor

RFC: #7232

Summary

  • What changed and why: ZeroClaw's observability surface had three gaps — sparse event context (no channel/agent attribution, no LLM I/O, no structured token breakdown), flat uncorrelated OTel spans (one independent span per event, no parent-child linking), and inconsistent emission architecture (two divergent paths with partial bridge coverage). This PR addresses all three in a coordinated change.

  • Scope boundary: This PR does not migrate loop_.rs and tool_execution.rs to the record! emission path — they continue to emit ObserverEvent directly with turn_id: None. That migration is planned for a follow-up PR.

  • Blast radius: Enriched event context gives every downstream observer channel attribution, agent identity, and turn correlation — making dashboards, alerts, and debugging significantly more actionable. OTel backends now receive correlated traces per turn instead of orphaned spans, enabling proper latency analysis and span navigation. The agent runtime's direct observer.record_event() calls in agent.rs are migrated to the record! macro path, making observer_bridge the single authoritative projection from log events to typed observer events. The ObserverEvent schema expands with Option<_> fields, preserving backward compatibility — downstream observers using .. in match arms require no changes.

Changes

Part 1: Enriched ObserverEvent Schema

Six core event variants gain three common attribution fields plus variant-specific content fields:

  • channel: Option<String>, agent_alias: Option<String>, turn_id: Option<String> added to AgentStart, LlmRequest, LlmResponse, AgentEnd, ToolCallStart, ToolCall
  • TurnTokenUsage { input_tokens, output_tokens } replaces the flat tokens_used: Option<u64> in AgentEnd, plus cost_usd: Option<f64>
  • LlmRequest gains user_message: Option<String>
  • LlmResponse gains response_content: Option<String>
  • ToolCallStart gains tool_call_id, arguments: Option<String>
  • ToolCall gains tool_call_id, arguments, result: Option<String>
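The schema change above can be sketched as follows. This is a trimmed, hypothetical version showing two of the six variants; the `model` field and the `token_usage` field name are assumptions made for illustration, not taken from the diff. It shows why a match arm that uses `..` keeps compiling when fields are added, and how an observer might derive a total from the new breakdown.

```rust
/// Per-turn token breakdown, replacing a flat `tokens_used` count.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TurnTokenUsage {
    pub input_tokens: u64,
    pub output_tokens: u64,
}

/// Hypothetical, trimmed event enum: only two variants, illustrative fields.
#[derive(Debug, Clone)]
pub enum ObserverEvent {
    LlmRequest {
        model: String, // assumed pre-existing field
        channel: Option<String>,
        agent_alias: Option<String>,
        turn_id: Option<String>,
        user_message: Option<String>,
    },
    AgentEnd {
        channel: Option<String>,
        agent_alias: Option<String>,
        turn_id: Option<String>,
        token_usage: Option<TurnTokenUsage>, // field name is an assumption
        cost_usd: Option<f64>,
    },
}

/// An observer written before the new fields existed. The `..` rest pattern
/// ignores every field it does not name, so added fields do not break it.
pub fn legacy_label(event: &ObserverEvent) -> String {
    match event {
        ObserverEvent::LlmRequest { model, .. } => format!("llm.request model={model}"),
        ObserverEvent::AgentEnd { .. } => "agent.end".to_string(),
    }
}

/// A newer observer can opt in to the breakdown, for example to feed a total gauge.
pub fn total_tokens(event: &ObserverEvent) -> Option<u64> {
    match event {
        ObserverEvent::AgentEnd { token_usage: Some(usage), .. } => {
            Some(usage.input_tokens + usage.output_tokens)
        }
        _ => None,
    }
}
```

Note that `..` only protects arms that never named the replaced `tokens_used` field; an observer that bound it by name would still need updating.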

Part 2: Content Policy Enforcement

A new ContentProcessor module governs what content enters observability events:

  • Two independent policies: LlmIoPolicy (Off / Redacted / Full) and ToolIoPolicy (Off / Redacted / Full)
  • LeakDetector::scan() runs on all captured content before emission
  • Redacted mode: leak-scan + truncate to configurable char limit
  • Full mode: leak-scan but no truncation
  • Default is Off — no LLM I/O captured unless operator explicitly opts in

New config keys:

[observability]
log_llm_io = "off"             # "off" | "redacted" | "full"
log_llm_io_max_chars = 200     # truncation limit for redacted mode

Part 3: OTel Trace Correlation via turn_id

The OtelObserver is rewritten to produce a single trace per agent turn:

  • AgentStart opens a parent span (gen_ai.agent.invoke), stored in active_agent_spans keyed by turn_id
  • LlmRequest, LlmResponse, ToolCallStart, ToolCall create child spans under the parent via parent_cx_for(turn_id)
  • AgentEnd closes the parent span with cached I/O, token usage, cost, and duration
  • All span attributes follow OpenTelemetry Gen-AI semantic conventions (gen_ai.provider.name, gen_ai.usage.input_tokens, etc.)
  • When turn_id is absent, spans degrade gracefully to orphans (backward compatible)
  • flush() cleans up orphaned spans from unclosed turns
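The parent/child bookkeeping described above can be modelled without the OpenTelemetry SDK. The sketch below is not the PR's OtelObserver: `SpanRecord` and `TurnCorrelator` are hypothetical stand-ins that use plain structs instead of real spans, and only the `active_agent_spans` map keyed by `turn_id` mirrors the description.

```rust
use std::collections::HashMap;

/// Stand-in for a span; a real observer would hold OTel span objects instead.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SpanRecord {
    pub id: u64,
    pub parent: Option<u64>,
    pub name: String,
}

#[derive(Default)]
pub struct TurnCorrelator {
    next_id: u64,
    /// Open parent spans, keyed by turn_id.
    active_agent_spans: HashMap<String, SpanRecord>,
    /// Spans that have ended and would be handed to an exporter.
    pub finished: Vec<SpanRecord>,
}

impl TurnCorrelator {
    fn new_span(&mut self, name: &str, parent: Option<u64>) -> SpanRecord {
        self.next_id += 1;
        SpanRecord { id: self.next_id, parent, name: name.to_string() }
    }

    /// AgentStart: open the parent span for a turn. Without a turn_id there is
    /// no key to store it under, so nothing is opened.
    pub fn agent_start(&mut self, turn_id: Option<&str>) {
        if let Some(id) = turn_id {
            let span = self.new_span("gen_ai.agent.invoke", None);
            self.active_agent_spans.insert(id.to_string(), span);
        }
    }

    /// LlmRequest, ToolCall and similar: a child of the turn's parent when one
    /// is open, otherwise an orphan root span.
    pub fn child_event(&mut self, name: &str, turn_id: Option<&str>) {
        let parent = turn_id
            .and_then(|id| self.active_agent_spans.get(id))
            .map(|p| p.id);
        let span = self.new_span(name, parent);
        self.finished.push(span);
    }

    /// AgentEnd: close the parent span for the turn, if one was opened.
    pub fn agent_end(&mut self, turn_id: Option<&str>) {
        if let Some(span) = turn_id.and_then(|id| self.active_agent_spans.remove(id)) {
            self.finished.push(span);
        }
    }

    /// flush: end any parents left open by turns that never reported AgentEnd.
    pub fn flush(&mut self) {
        let leftover: Vec<SpanRecord> =
            self.active_agent_spans.drain().map(|(_, span)| span).collect();
        self.finished.extend(leftover);
    }
}
```

Calling `agent_start(Some("t1"))`, then `child_event("llm.request", Some("t1"))`, then `agent_end(Some("t1"))` yields a child whose `parent` is the turn's span id, while the same child event with `None` becomes a root span.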

Part 4: Observer Bridge Refactoring

observer_bridge::project() expanded to a complete projection layer:

  • Now handles all six agent event types (previously a subset)
  • Extracts channel, agent_alias, turn_id from LogEvent span attribution and attributes
  • Projects ToolCallStart and LlmRequest variants from log events to typed ObserverEvent
  • All existing variants now forward attribution fields

Part 5: Agent Runtime Overhaul

The largest change in this PR — agent.rs is refactored to integrate the new observability surface:

  • Added agent_alias field to Agent struct and AgentBuilder; builder method agent_alias() sets it, from_config_* reads it from config
  • New turn_id() helper generates a UUID per turn, serving as the OTel correlation key
  • Observer is now created via observability::create_observer() and installed into the observer bridge via zeroclaw_log::set_observer_bridge()
  • AgentStart and AgentEnd are now recorded on all exit paths in turn methods, wrapped by attribution_span! to ensure agent alias propagates through the span tree

Part 6: Gateway Channel Attribution

  • WebSocket sessions inject a tracing::info_span! with agent_alias, model_provider, model, channel = "wss" and wrap the turn in .instrument(span)
  • agent_alias is now a required query parameter on the WS endpoint
  • SSE broadcast observer handles ToolCallStart variant and expanded AgentEnd (with TurnTokenUsage + cost_usd)

Observer Implementations Updated

  • PrometheusObserver: token gauge computed as input_tokens + output_tokens from TurnTokenUsage
  • LogObserver: all match arms updated with new attribution fields (channel, agent_alias, turn_id)
  • VerboseObserver: match arms updated to accept new fields with ..
  • NoopObserver: match arms updated to accept expanded variants

Validation Evidence

$ cargo fmt --all -- --check
# (no output — clean)

$ cargo clippy --locked --all-targets -- -D clippy::correctness
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 4m 35s
# (no errors)

$ cargo test --locked --verbose
test result: ok. 159 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
test result: ok. 9 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
test result: ok. 0 passed; 0 failed; 7 ignored; 0 measured; 0 filtered out
  • Beyond CI: Verified end-to-end by starting zeroclaw daemon, chatting via http://127.0.0.1:42617/agents, and confirming enriched trace data (turn correlation, token breakdown, channel attribution) appears on the configured Langfuse backend.
[Two screenshots: the Langfuse trace structure and an llm.response example]

Security & Privacy Impact

  • New permissions, capabilities, or file system access scope? No
  • New external network calls? No — OTel endpoint is existing config; no new outbound calls
  • Secrets / tokens / credentials handling changed? No. otel_headers is marked #[secret] for encryption at rest, and ContentProcessor with LeakDetector::scan() redacts detected secrets before capture
  • PII or personal data in diff, tests, fixtures, or docs? No

Compatibility

  • Backward compatible? Yes — all new ObserverEvent fields are Option<_> with None defaults; new config keys have safe defaults (log_llm_io = "off")
  • Config / env / CLI surface changed? Yes — new keys in [observability] section (log_llm_io, log_llm_io_max_chars). All optional with safe defaults. No upgrade steps required — existing configs work unchanged.

Rollback

risk: high

git revert 3dc516f6edec5e2ec6ccd759c9d801511d1518a3 c677d366d408d0f934c2a778ab903245fced6364

@github-actions github-actions Bot added docs Auto scope: docs/markdown/template files changed. agent Auto scope: src/agent/** changed. channel Auto scope: src/channels/** changed. config Auto scope: src/config/** changed. gateway Auto scope: src/gateway/** changed. observability Auto scope: src/observability/** changed. runtime Auto scope: src/runtime/** changed. labels Jun 5, 2026
@FTDGRT FTDGRT changed the title feat(observability):Structured Observability Enhancement — Rich Events, OTel Trace Correlation, and Bridge Refactoring feat(observability): Structured Observability Enhancement — Rich Events, OTel Trace Correlation, and Bridge Refactoring Jun 5, 2026
@FTDGRT FTDGRT force-pushed the feat/obs-enhancement branch from 69cad80 to 3dc516f Compare June 5, 2026 03:13

@singlerider singlerider left a comment

Collaborator

RFC #7232 is solid and most of this is well-built — the ContentProcessor policy/leak-scan ordering is careful, the OTel turn-correlation design is right, and the gateway WS attribution span is correctly implemented. But the attribution story is incomplete on the most common local path, the PR introduces duplicate AgentStart emissions, and there's unrelated scope creep. These need to be resolved before it lands.

🔴 New AgentStart/AgentEnd records are un-attributed on the CLI path

agent.rs contains no attribution_span! / .instrument(...) anywhere, yet the body claims AgentStart/AgentEnd are "wrapped by attribution_span! to ensure agent alias propagates." That holds only for the gateway WS path, which correctly opens the span (ws.rs:824, carrying agent_alias/model_provider/model/channel = "wss" via attribution_fields() and .instrument(span) — nicely done). But turn() itself opens no span, and its two real callers don't either:

  • run_single() (agent.rs:3310) → self.turn(...) with no surrounding span.
  • run_interactive() (agent.rs:3329) → self.turn(...) in the REPL loop, no span.
  • The CLI dispatch fn that calls them (around agent.rs:3401) emits its own AgentStart and also opens no attribution_span!.

So on zeroclaw agent -a <alias> -m ... (and interactive), the new AgentStart/AgentEnd record! calls fire with model/model_provider/turn_id in attrs but no agent_alias in the attribution span. That is exactly the "NEVER log un-attributed" case AGENTS.md calls out, and it means the feature's headline (per-event agent identity) silently doesn't hold for the most common local invocation. The fix is to open an attribution_span!/scope-bearing turn span inside turn() (or at the run_single/run_interactive/dispatch entry) using attribution_fields(), mirroring what the WS path already does, so all entry points carry attribution rather than just the gateway.

🔴 Duplicate AgentStart per turn

master's agent.rs emits zero AgentStart records; this PR adds three. On the CLI path both the dispatch fn (≈3401) and turn() (2040) now emit AgentStart for a single user turn, so downstream observers (SSE/dashboard/OTel) will see two starts per turn. Pick one authoritative emission site — most likely turn(), with the dispatch-level one removed — so the event stream stays one-start-per-turn. This pairs with the attribution fix above: whichever site survives is the one that needs the span.

🔴 Unrelated Ollama change — drop it from this PR

The diff touches crates/zeroclaw-providers/src/ollama.rs with two changes that have nothing to do with observability: removing let temperature = temperature.unwrap_or(self.default_temperature()); from chat(), and replacing a test's tokio::spawn with zeroclaw_spawn::spawn!. This overlaps the contested Option<f64> temperature area (#7095/#7231/#7213) and doesn't belong in an observability feature. Please pull it out so this PR's blast radius matches its title; if the Ollama change is wanted, it's its own focused PR.

🟢 ContentProcessor security ordering is correct

content_processing.rs does this right: off-policy short-circuits to None (no capture), content is pre-capped to PRE_SANITIZATION_CAP_CHARS before scanning, LeakDetector::scan runs on the capped content, and the display-truncation for Redacted happens on the already-redacted string — so truncation can't bisect a secret and leak a fragment, and Full is still leak-scanned (never raw). LlmIoPolicy::from_raw / ToolIoPolicy::from_raw default unknown input to the safe option (Off / Redacted). Good secure-by-default posture.
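The ordering praised above (short-circuit, pre-cap, scan, then truncate) can be sketched as below. This is an illustration only: `scan` is a toy stand-in for LeakDetector::scan that masks tokens starting with `sk-`, the value of PRE_SANITIZATION_CAP_CHARS is assumed, and the `process` signature is hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LlmIoPolicy {
    Off,
    Redacted,
    Full,
}

/// Assumed value; the real constant's value does not appear in this thread.
const PRE_SANITIZATION_CAP_CHARS: usize = 16_384;

/// Toy leak scan: masks any space-separated token that starts with "sk-".
/// The real detector is far more thorough; this only makes the ordering visible.
fn scan(content: &str) -> String {
    content
        .split(' ')
        .map(|tok| if tok.starts_with("sk-") { "[REDACTED]" } else { tok })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Truncate by characters rather than bytes so multi-byte text is never split.
fn truncate_chars(s: &str, max: usize) -> String {
    s.chars().take(max).collect()
}

/// Hypothetical entry point mirroring the described pipeline.
pub fn process(content: &str, policy: LlmIoPolicy, max_chars: usize) -> Option<String> {
    match policy {
        // Off short-circuits: nothing is captured at all.
        LlmIoPolicy::Off => None,
        // Full is still scanned, just not truncated for display.
        LlmIoPolicy::Full => Some(scan(&truncate_chars(content, PRE_SANITIZATION_CAP_CHARS))),
        // Redacted truncates the already-scanned string, so the display cut
        // lands inside a mask rather than inside a secret.
        LlmIoPolicy::Redacted => {
            let scanned = scan(&truncate_chars(content, PRE_SANITIZATION_CAP_CHARS));
            Some(truncate_chars(&scanned, max_chars))
        }
    }
}
```

With a limit of 8, the input `key sk-abcdef123456 end` comes back as `key [RED` under Redacted: the cut falls inside the mask, not inside the secret.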

🟢 observer_bridge projection is robust

project() extracts channel (with channel_type fallback), agent_alias, and turn_id from log attributes via unwrap_or_default() and converts empties to None — no unwraps that can panic, graceful degradation when attribution is absent. It forwards the three attribution fields across all the expanded variants. This is the right shape for the single-projection-layer goal.
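A minimal sketch of that extraction pattern follows, assuming the log attributes are available as a string map. The `Attribution` struct and `extract_attribution` function are hypothetical names; only the attribute keys and the `channel_type` fallback come from the description above.

```rust
use std::collections::HashMap;

#[derive(Debug, Default, PartialEq, Eq)]
pub struct Attribution {
    pub channel: Option<String>,
    pub agent_alias: Option<String>,
    pub turn_id: Option<String>,
}

/// Read one attribute, treating a missing key and an empty string the same way.
/// No unwrap() that can panic: a missing key becomes "" and then None.
fn non_empty(attrs: &HashMap<String, String>, key: &str) -> Option<String> {
    let value = attrs.get(key).cloned().unwrap_or_default();
    if value.is_empty() {
        None
    } else {
        Some(value)
    }
}

/// Hypothetical helper: pull the three attribution fields out of log attributes.
pub fn extract_attribution(attrs: &HashMap<String, String>) -> Attribution {
    Attribution {
        // Prefer "channel", fall back to "channel_type" when it is absent or empty.
        channel: non_empty(attrs, "channel").or_else(|| non_empty(attrs, "channel_type")),
        agent_alias: non_empty(attrs, "agent_alias"),
        turn_id: non_empty(attrs, "turn_id"),
    }
}
```

An empty map yields `Attribution::default()`, which is the graceful-degradation case: the event is still projected, just without attribution.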

Two smaller notes

  • log_llm_io / log_tool_io are String config fields parsed into the LlmIoPolicy/ToolIoPolicy enums at the boundary. The match logic uses the enums (good), and it follows the existing tool_io precedent, so it's consistent — but a typo'd value ("redactd") silently resolves to the safe default with no warning. A one-line "unknown log_llm_io value, defaulting to off" warning at config resolution would save an operator a confusing "why is nothing captured" debugging session. Non-blocking.
  • The two commit messages are headline-only (no body) for a ~1100-line net feature; a short body per commit would help the eventual squash. The first headline is also missing a space after the colon (feat(observability):Structured). Non-blocking.

The architecture and the security-sensitive pieces are good — the blockers are attribution completeness on the CLI path, the duplicate AgentStart, and the unrelated Ollama edit. Resolve those (and ideally the warning + commit-body nits) and I'll take another pass.

let user_msg_content = self.content_processor.process_user_message(user_message);
::zeroclaw_log::record!(
INFO,
::zeroclaw_log::Event::new(module_path!(), ::zeroclaw_log::Action::AgentStart)

Collaborator

🔴 This new AgentStart (and the paired AgentEnd on the exit paths of turn()) is un-attributed on the CLI path. agent.rs opens no attribution_span! anywhere, and turn()'s two real callers — run_single (≈3310) and run_interactive (≈3329) — don't either, so on zeroclaw agent -a <alias> -m ... these records fire with model/model_provider/turn_id in attrs but no agent_alias on the span. That's the AGENTS.md "never log un-attributed" case, and it means the feature's per-event agent identity silently doesn't hold for the most common local path.

The gateway WS path already does this correctly at ws.rs:824 — mirror it on the CLI entry points using the existing attribution_fields() helper. In run_single:

pub async fn run_single(&mut self, message: &str) -> Result<String> {
    use ::zeroclaw_log::Instrument as _;
    let (alias, provider, model) = self.attribution_fields();
    let span = ::zeroclaw_log::info_span!(
        target: "zeroclaw_log_internal_scope",
        "zeroclaw_scope",
        agent_alias = %alias,
        model_provider = %provider,
        model = %model,
        channel = "cli",
    );
    self.turn(message).instrument(span).await
}

Do the same per-turn inside run_interactive's loop (build the span from self.attribution_fields() each iteration, then self.turn(&msg.content).instrument(span).await). That carries agent_alias + channel = "cli" into every record this turn() emits, matching what the WS span already provides.

Contributor Author

Fixed. run_single and run_interactive now both open an info_span! built from self.attribution_fields() and wrap the turn() call with .instrument(span), exactly mirroring the WS path. channel = "cli" is set on both. As noted above, the actual zeroclaw agent -a <alias> -m ... CLI path goes through loop_.rs::run rather than these methods, so fuller CLI observability will come in the follow-up PR — but every caller that does reach agent.rs::turn is now attributed.

});
::zeroclaw_log::record!(
INFO,
::zeroclaw_log::Event::new(module_path!(), ::zeroclaw_log::Action::AgentStart)

Collaborator

🔴 Duplicate AgentStart. master's agent.rs emits zero AgentStart records; this PR adds three, and on the CLI path both this dispatch-level emission and the new one inside turn() (line 2040) fire for a single user turn — so SSE/dashboard/OTel observers see two starts per turn.

Pick one authoritative site. turn() is the better home (it owns the turn_id correlation key and runs on every entry point), so I'd drop this dispatch-level AgentStart and its paired AgentEnd below. Once the run_single/run_interactive spans from the other comment are in place, turn()'s records carry full attribution and this one is redundant.

Contributor Author

Removed. The dispatch-level AgentStart and AgentEnd records (and the start = Instant::now() timer and provider/model resolution that existed only to feed them) are gone. turn() is the single emission site.

@FTDGRT FTDGRT force-pushed the feat/obs-enhancement branch from 3dc516f to 2719490 Compare June 5, 2026 08:11
FTDGRT commented Jun 5, 2026

Contributor Author

Hi @singlerider, thanks for the detailed review and the clear blocker descriptions; they made the fixes straightforward.

🔴 Blocker 1 — CLI path attribution

Addressed in the latest commit (fix(observability): carry CLI attribution through turn events).

run_single() now opens an info_span! carrying agent_alias, model_provider, model, and channel = "cli" before calling self.turn(message).instrument(span). run_interactive() does the same per-turn inside the REPL loop. The duplicate AgentStart/AgentEnd at the dispatch level in agent.rs::run has been removed — turn() is now the single authoritative emission site for those events, and both entry points carry the attribution span.

One thing worth noting: run_single and run_interactive are called from agent.rs::run, which is a standalone function not currently reachable from the CLI. The actual zeroclaw agent -a <alias> -m ... path goes through loop_.rs::run (re-exported as zeroclaw_runtime::agent::run via mod.rs), which calls run_tool_call_loop directly without going through agent.rs::turn. So the CLI path still doesn't carry this attribution today — fuller observability coverage for that path is planned for a follow-up PR. The fixes in this commit ensure that any code path that does go through agent.rs::turn carries proper attribution.

🔴 Blocker 2 — Duplicate AgentStart per turn

Also addressed in the same commit. The AgentStart and AgentEnd records at the dispatch level in agent.rs::run (along with the start = Instant::now() timer and the provider/model resolution block that existed only to feed them) have been removed entirely. turn() remains the single emission site.

🔴 Blocker 3 — Unrelated Ollama change

This is a misunderstanding — crates/zeroclaw-providers/src/ollama.rs is not in this PR's diff. The Ollama changes described in the review are not part of this PR.

Non-blocking notes

The log_llm_io / log_tool_io unknown-value warning is also in the latest commit — warn_unknown_llm_policy and warn_unknown_tool_policy are now called at ContentProcessor::from_observability, so a misconfigured value logs a warning and falls back to the safe default rather than silently doing nothing.
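The warn-and-fall-back behaviour can be sketched like this. It is not the PR's warn_unknown_llm_policy: `parse_llm_policy` is a hypothetical function that returns the warning text instead of logging it, and the trimming and case-insensitive matching are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LlmIoPolicy {
    Off,
    Redacted,
    Full,
}

/// Parse the raw config string into a policy.
/// Returns the policy plus an optional warning the caller should log.
/// Unknown values resolve to the safe default (Off) but are no longer silent.
pub fn parse_llm_policy(raw: &str) -> (LlmIoPolicy, Option<String>) {
    // Trimming and lowercasing are assumptions, not confirmed by the diff.
    match raw.trim().to_ascii_lowercase().as_str() {
        "off" => (LlmIoPolicy::Off, None),
        "redacted" => (LlmIoPolicy::Redacted, None),
        "full" => (LlmIoPolicy::Full, None),
        other => (
            LlmIoPolicy::Off,
            Some(format!("unknown log_llm_io value {other:?}, defaulting to off")),
        ),
    }
}
```

Returning the warning rather than logging inside the parser keeps the function pure and easy to test; the caller decides which logging macro to use.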

@FTDGRT FTDGRT requested a review from singlerider June 5, 2026 08:31
@Audacity88 Audacity88 added enhancement New feature or request size: XL Auto size: >1000 non-doc changed lines. risk: high Auto risk: security/runtime/gateway/tools/workflows. labels Jun 5, 2026

@Audacity88 Audacity88 left a comment

Collaborator

Reviewed current head 2719490 against the PR body, RFC #7232, the prior singlerider review, the latest author replies, and the public diff, checks, labels, and review metadata. I agree that the duplicate AgentStart issue is fixed, the agent.rs run_single / run_interactive wrappers now carry a CLI attribution span, and the Ollama file is no longer in this diff. I am still requesting changes because the PR still has an unsafe direct tool-I/O path, and because the merge plan needs a clearer boundary between the trace-correlation slice, the content-capture policy, and the overlapping open observability work.

✅ Resolved — duplicate AgentStart and the agent.rs wrapper attribution

The dispatch-level AgentStart / AgentEnd records are gone, so turn() is now the single emission site on that agent.rs path. run_single() and run_interactive() also now wrap calls to turn() in a span carrying agent_alias, provider, model, and channel = "cli". That resolves the duplicate-start finding and fixes attribution for callers that actually reach agent.rs::turn.

🔴 Blocking — direct tool execution still bypasses the content policy

tool_execution.rs now serializes full tool arguments once and sends them directly through ObserverEvent::ToolCallStart / ObserverEvent::ToolCall with arguments: Some(full_args.clone()). On success and failure it also sends the tool result/error text through ObserverEvent::ToolCall. Those values do not pass through ContentProcessor, do not honor log_tool_io = "off", and do not apply the denylist. The result text gets the older scrub_credentials() treatment before the typed observer event, but arguments are not leak-scanned at all.

That matters because otel.rs copies those same fields into trace attributes such as gen_ai.tool.arguments, input.value, gen_ai.tool.result, and output.value. So enabling OTel can still expose full tool input/output from this path even when the PR body says captured content is governed by ContentProcessor and LeakDetector::scan().

Please either keep the direct ObserverEvent tool paths metadata-only until the follow-up migration, or route both tool arguments and tool results through the same log_tool_io policy, leak detector, truncation, and denylist before they are emitted. This path also still emits channel: None, agent_alias: None, and turn_id: None, so it does not participate in the new correlation story.

There is a smaller version of the same issue in agent.rs: execute_tool_call() processes the tool result with process_tool_result(), but ToolCallStart still records args_json directly. Tool arguments need the same policy gate as results.
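One way to close this gap is to build the event through a single constructor that always applies the policy, so a call site cannot pass raw arguments. The sketch below assumes a simplified `ToolCallStart` struct and a toy `mask_secrets` function standing in for the leak detector; none of these names or signatures are taken from the diff.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToolIoPolicy {
    Off,
    Redacted,
    Full,
}

/// Simplified, hypothetical event payload.
#[derive(Debug, PartialEq, Eq)]
pub struct ToolCallStart {
    pub tool: String,
    pub tool_call_id: String,
    pub arguments: Option<String>,
}

/// Toy stand-in for the leak scan: masks one known secret value.
fn mask_secrets(raw: &str) -> String {
    raw.replace("hunter2", "[REDACTED]")
}

/// Apply the tool I/O policy to raw content before it can reach an event.
fn gate_tool_io(raw: &str, policy: ToolIoPolicy, max_chars: usize) -> Option<String> {
    match policy {
        // Off keeps the event metadata-only.
        ToolIoPolicy::Off => None,
        ToolIoPolicy::Full => Some(mask_secrets(raw)),
        // Scan first, then truncate the scanned text by characters.
        ToolIoPolicy::Redacted => Some(mask_secrets(raw).chars().take(max_chars).collect()),
    }
}

/// The only way to build the event: arguments always pass through the gate.
pub fn tool_call_start(
    tool: &str,
    tool_call_id: &str,
    args_json: &str,
    policy: ToolIoPolicy,
    max_chars: usize,
) -> ToolCallStart {
    ToolCallStart {
        tool: tool.to_string(),
        tool_call_id: tool_call_id.to_string(),
        arguments: gate_tool_io(args_json, policy, max_chars),
    }
}
```

Under `ToolIoPolicy::Off` the event still carries the tool name and call id, which matches the "metadata-only until the follow-up migration" option.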

🔴 Blocking — the PR still needs a staged routing decision for #7232

In #7232, the safest routing was to keep the first slice focused on metadata attribution, turn_id, bridge projection, and OTel correlation, with content capture policy and broader runtime migration handled separately unless this PR explicitly supersedes or sequences the related work. That split is not just process overhead: the content-capture policy has a different privacy review surface than trace correlation, and the open observability PRs/issues already cover parts of this same area. This PR still combines content capture policy, expanded ObserverEvent schema, OTel trace rewrite, bridge projection, runtime changes, gateway changes, and docs in one high-risk XL diff.

The body also now contains an important caveat: loop_.rs and tool_execution.rs intentionally keep direct ObserverEvent emission with turn_id: None, and the author notes that the actual zeroclaw agent -a <alias> -m ... path still goes through loop_.rs. That is a valid follow-up boundary only if the PR narrows its headline claims and routing. As written, it still says every downstream observer gets channel attribution, agent identity, and turn correlation.

Please either narrow this PR to the first safe slice, or update the body to make the dependency / supersede / follow-up plan explicit against the overlapping work: #6641, #6642, #6966, #6190, #7151, and #7221. The merge target should be clear enough that reviewers can tell what lands now, what remains intentionally out of scope, and which open PRs/issues this PR replaces or depends on.

🟡 Warning — docs default does not match the schema

The schema and PR body set log_llm_io_max_chars to the safe default of 200, but docs/book/src/ops/observability.md shows log_llm_io_max_chars = 10000, and the defaults paragraph below it omits the new LLM defaults. Please align the docs with the shipped defaults so operators do not copy a much larger capture limit by accident.

🟢 What looks good — the policy shape is solid where it is actually used

The ContentProcessor design itself is good: LLM capture defaults off, redacted mode scans before truncating, full mode still leak-scans, and unknown policy values now warn before falling back to safe defaults. The bridge and OTel parent/child shape are also pointed in the right direction. The remaining blockers are about making the actual emission paths match those guarantees and staging the PR so the high-risk observability surface can be reviewed safely.

@WareWolf-MoonWall WareWolf-MoonWall left a comment

Collaborator

Reviewed head commit 2719490 against the PR body, RFC #7232, both prior reviews from @singlerider and @Audacity88, author responses in inline threads, the diff, CI status, and FND-006 §4.6 (Observability as Debuggability). This PR makes meaningful progress on turn correlation and OTel trace hierarchy, but two blocking issues remain: the tool execution path bypasses the content policy entirely, and the PR's scope still mixes multiple high-risk observability concerns without a clear staging plan.

🔴 Blocking — tool I/O bypasses ContentProcessor and policy enforcement

tool_execution.rs lines 57–61 and 173–232 emit ObserverEvent::ToolCallStart and ObserverEvent::ToolCall with full arguments: Some(full_args.clone()) and result: Some(...) fields. These go directly to the observer with zero policy gating — no ContentProcessor, no log_tool_io check, no denylist filter, no LeakDetector::scan(). The only sanitization is the legacy scrub_credentials() on results, which predates the new leak detector.

That matters because otel.rs (lines 200–350 in this PR's version) copies those same arguments and result fields into OTel span attributes like gen_ai.tool.arguments, input.value, gen_ai.tool.result, and output.value. So when an operator enables OTel but leaves log_tool_io = "off" (the safe default), tool I/O still flows to the OTel backend unredacted. The PR body claims "captured content is governed by ContentProcessor and LeakDetector::scan()" — that's true for LLM I/O routed through agent.rs, but false for this tool path.

The fix: either keep tool_execution.rs metadata-only (arguments: None, result: None) until the follow-up migration the PR body mentions, or route both tool arguments and results through the same log_tool_io policy, leak detector, truncation, and denylist before emitting ObserverEvent. The agent.rs path at line ~1850 also has a smaller version of this: execute_tool_call() processes the result via process_tool_result(), but ToolCallStart still records args_json directly without policy gating.

This isn't theoretical — per FND-006 §4.6, observability infrastructure only translates into diagnosable systems when contributors use it with the discipline to protect what goes out. The content policy exists for a reason; the tool path can't silently bypass it.

🔴 Blocking — scope needs explicit staging against overlapping work

RFC #7232 proposed splitting this into safe slices: metadata attribution + turn_id + bridge projection + OTel correlation first, then content capture policy separately. This PR still combines expanded schema, content policy, OTel rewrite, bridge refactor, runtime changes, gateway changes, and docs in a single high-risk XL diff. The body now acknowledges that loop_.rs and tool_execution.rs keep direct ObserverEvent emission with turn_id: None, and that the actual zeroclaw agent -a <alias> -m ... path still goes through loop_.rs rather than the newly-attributed agent.rs::turn() wrapper.

That caveat is load-bearing, but it contradicts the headline claim that "every downstream observer gets channel attribution, agent identity, and turn correlation." If the most common local path (CLI via loop_.rs) isn't covered, either narrow the PR to what actually ships now, or update the body with an explicit dependency/supersede/follow-up plan against the overlapping observability work: #6641 (LLM I/O observability), #6642 (turn-level metadata), #6966 (OTel integration gaps), #6190 (observer unification), #7151 (trace context propagation), and #7221 (attribution completeness).

The merge target should be clear enough that reviewers can tell what lands, what remains out of scope, and which open issues/PRs this replaces or depends on. Right now it's ambiguous whether this PR supersedes the others or whether they're all meant to land in sequence, and the mixed scope makes the privacy/security review surface harder to bound.

🟡 Warning — docs default contradicts schema

docs/book/src/ops/observability.md lines 36–43 show log_llm_io_max_chars = 10000 as an example config block, but the schema default at schema.rs:157 and the PR body both set it to 200. Operators who copy-paste the docs example will capture 50× more content than the intended safe default. Please align the docs with the shipped default (200), and add the two new LLM-related keys to the "Defaults:" summary paragraph (currently lines 45–49) so they're discoverable.

✅ Resolved — CLI attribution spans are in place where agent.rs is reached

The third commit (2719490) added run_single() and run_interactive() wrappers that open an info_span! built from self.attribution_fields() and wrap the turn() call with .instrument(span), carrying agent_alias, provider, model, and channel = "cli". That resolves @singlerider's first blocker for the subset of callers that reach agent.rs::turn. The author correctly notes that the actual CLI path (zeroclaw agent -a <alias> -m ...) still goes through loop_.rs::run, so fuller CLI observability is deferred to the follow-up — that's a valid boundary if the PR narrows its claims accordingly (see staging blocker above).

✅ Resolved — duplicate AgentStart removed

The dispatch-level AgentStart and AgentEnd records (and their supporting timer/provider resolution) are gone. turn() is now the single emission site for the agent.rs path. That resolves @singlerider's second blocker.

🟢 What looks good — OTel parent-child correlation design

The OtelObserver rewrite (lines 180–350) produces a single trace per agent turn with correct parent-child linking: AgentStart opens a parent span keyed by turn_id in active_agent_spans, LlmRequest/LlmResponse/ToolCallStart/ToolCall create child spans via parent_cx_for(turn_id), and AgentEnd closes the parent with cached I/O, token usage, cost, and duration. When turn_id is absent, spans degrade gracefully to orphans (backward compatible). All span attributes follow OTel Gen-AI semantic conventions. This is the right shape for trace correlation — the blockers are about ensuring the content that flows into those spans respects the policy boundaries the PR claims to enforce.
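
For readers following along, the correlation logic described above can be sketched without any OpenTelemetry types. This is a simplified model under stated assumptions: `TurnCorrelator` and `SpanRecord` are hypothetical names, span IDs are plain counters, and only the `active_agent_spans` map and the parent lookup by `turn_id` mirror what the review describes.

```rust
use std::collections::HashMap;

/// Hypothetical minimal span record; the real observer uses OpenTelemetry spans.
#[derive(Debug, PartialEq)]
struct SpanRecord {
    span_id: u64,
    parent_id: Option<u64>,
    name: String,
}

#[derive(Default)]
struct TurnCorrelator {
    next_id: u64,
    /// turn_id -> span id of the currently open parent (agent) span.
    active_agent_spans: HashMap<String, u64>,
}

impl TurnCorrelator {
    fn fresh_id(&mut self) -> u64 {
        self.next_id += 1;
        self.next_id
    }

    /// AgentStart: open a parent span and remember it under its turn_id, if any.
    fn agent_start(&mut self, turn_id: Option<&str>) -> u64 {
        let span_id = self.fresh_id();
        if let Some(t) = turn_id {
            self.active_agent_spans.insert(t.to_string(), span_id);
        }
        span_id
    }

    /// LLM and tool events: create a child span under the open parent for this turn.
    /// Without a turn_id (or without an open parent) the span is an orphan.
    fn child(&mut self, name: &str, turn_id: Option<&str>) -> SpanRecord {
        let parent_id = turn_id.and_then(|t| self.active_agent_spans.get(t).copied());
        let span_id = self.fresh_id();
        SpanRecord {
            span_id,
            parent_id,
            name: name.to_string(),
        }
    }

    /// AgentEnd: close the parent; returns whether a parent was actually open.
    fn agent_end(&mut self, turn_id: Option<&str>) -> bool {
        turn_id.map_or(false, |t| self.active_agent_spans.remove(t).is_some())
    }
}
```

The orphan fallback is what keeps emitters that still pass `turn_id: None` working unchanged.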

🟢 What looks good — ContentProcessor security ordering (where it's actually used)

ContentProcessor (would be at content_processing.rs if the file existed in this commit) does the policy layering correctly: off-policy short-circuits to None, content is pre-capped before scanning, LeakDetector::scan() runs on the capped content, and display truncation for redacted mode happens after leak scanning — so truncation can't bisect a secret and leak a fragment. Full mode is still leak-scanned (never raw). Unknown policy values now warn before falling back to safe defaults. The praise here is qualified: this design is solid where it is invoked, but the tool path bypasses it entirely (see first blocker).
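
The ordering argument is easier to see with a toy example. The sketch below is not the PR's ContentProcessor: `ContentPolicy`, `process`, and the single-secret `scan` are assumptions made for illustration, and a real leak detector would use patterns rather than one known string. It only demonstrates why display truncation has to come after scanning.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum ContentPolicy {
    Off,
    Redacted,
    Full,
}

/// Toy leak scanner (assumption): masks every occurrence of one known secret.
fn scan(content: &str, secret: &str) -> String {
    content.replace(secret, "[REDACTED]")
}

/// Hypothetical processing pipeline following the order described in the review.
fn process(
    policy: ContentPolicy,
    raw: &str,
    hard_cap: usize,
    display_chars: usize,
    secret: &str,
) -> Option<String> {
    // 1. Off short-circuits before any content is touched.
    if policy == ContentPolicy::Off {
        return None;
    }
    // 2. Pre-cap the input so the scanner never sees unbounded content.
    let capped: String = raw.chars().take(hard_cap).collect();
    // 3. Leak-scan the capped content; this runs in Full mode too.
    let scanned = scan(&capped, secret);
    // 4. Only now apply display truncation, and only in Redacted mode.
    Some(match policy {
        ContentPolicy::Redacted => scanned.chars().take(display_chars).collect(),
        ContentPolicy::Off | ContentPolicy::Full => scanned,
    })
}
```

With the input `key=sk-12345 tail` and a display limit of 8, truncating first would emit `key=sk-1`, a fragment of the secret that the scanner can no longer recognize. Scanning first yields `key=[RED` instead.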

🟢 What looks good — observer_bridge projection robustness

observer_bridge::project() extracts channel, agent_alias, and turn_id from log attributes via unwrap_or_default() and converts empties to None — no panicking unwraps, graceful degradation when attribution is absent. It forwards the three attribution fields across all expanded variants. This is the right shape for the single-projection-layer goal.
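
As an illustration of that extraction pattern, here is a small sketch that assumes log attributes arrive as a string map. The `Attribution` struct and both function names are hypothetical; only the `unwrap_or_default()` then empty-to-`None` step reflects what the bridge is described as doing.

```rust
use std::collections::HashMap;

/// Hypothetical container for the three attribution fields.
#[derive(Debug, PartialEq)]
struct Attribution {
    channel: Option<String>,
    agent_alias: Option<String>,
    turn_id: Option<String>,
}

/// Missing keys and empty strings both collapse to None; nothing here can panic.
fn non_empty(attrs: &HashMap<String, String>, key: &str) -> Option<String> {
    let value = attrs.get(key).cloned().unwrap_or_default();
    if value.is_empty() {
        None
    } else {
        Some(value)
    }
}

/// Project the attribution fields out of a log event's attributes.
fn project_attribution(attrs: &HashMap<String, String>) -> Attribution {
    Attribution {
        channel: non_empty(attrs, "channel"),
        agent_alias: non_empty(attrs, "agent_alias"),
        turn_id: non_empty(attrs, "turn_id"),
    }
}
```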


Summary: The OTel correlation architecture is sound, the CLI attribution wrappers are in place where they apply, and the duplicate AgentStart is resolved. But the tool execution path still bypasses the content policy entirely (exposing full tool I/O to OTel backends even when log_tool_io = "off"), and the PR's scope remains ambiguously staged against the overlapping observability work. Resolve those two blockers (policy enforcement on the tool path, and explicit scoping/sequencing plan) and address the docs default mismatch, and this will be ready for another pass.


FTDGRT commented Jun 6, 2026

Contributor Author

Hi @singlerider, @Audacity88, @WareWolf-MoonWall. I agree with the review direction that this PR is too broad to merge in its current shape. Rather than trying to keep shrinking this branch in place, I plan to close or supersede this PR and open a new PR from current master for the first reviewable slice.

The new first PR will be metadata/correlation only:

  • keep the existing runtime observer.record_event emission model;
  • add channel, agent_alias, and turn_id metadata to the typed observer events;
  • generate one turn_id per covered agent turn;
  • preserve/structure token usage accounting as trace metadata, including input/output token counts where available;
  • keep and improve observer_bridge projection coverage for the affected events so log-event projections stay aligned with the new typed schema;
  • update OtelObserver so events with the same turn_id appear under one turn-level parent trace;
  • keep LLM prompt/response capture out of scope;
  • keep tool input/output behavior unchanged from master.
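
A rough sketch of what the metadata-only slice could look like at the type level follows. The variant shapes are reduced and hypothetical, and the counter-based `new_turn_id` is an assumption made so the example stays within the standard library; the real implementation may well use UUIDs or another scheme.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static TURN_COUNTER: AtomicU64 = AtomicU64::new(0);

/// Assumed ID scheme: a process-local counter, one fresh value per agent turn.
fn new_turn_id() -> String {
    format!("turn-{}", TURN_COUNTER.fetch_add(1, Ordering::Relaxed))
}

/// Reduced, hypothetical event enum: the three attribution fields are optional,
/// so emitters that cannot supply them yet simply pass None.
enum ObserverEvent {
    AgentStart {
        provider: String,
        channel: Option<String>,
        agent_alias: Option<String>,
        turn_id: Option<String>,
    },
    ToolCall {
        tool: String,
        channel: Option<String>,
        agent_alias: Option<String>,
        turn_id: Option<String>,
    },
}

/// An existing observer that matches with `..` is unaffected by the new fields.
fn describe(event: &ObserverEvent) -> String {
    match event {
        ObserverEvent::AgentStart { provider, .. } => format!("agent start via {}", provider),
        ObserverEvent::ToolCall { tool, .. } => format!("tool call {}", tool),
    }
}

/// A correlation-aware observer reads only the field it needs.
fn turn_of(event: &ObserverEvent) -> Option<&str> {
    match event {
        ObserverEvent::AgentStart { turn_id, .. } | ObserverEvent::ToolCall { turn_id, .. } => {
            turn_id.as_deref()
        }
    }
}
```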

A clarification on the tool I/O blocker: the fact that tool arguments/results are present on typed ObserverEvent::ToolCall and exported by the current OTel observer is not introduced by #7233; that behavior already exists on master. I agree the privacy/security boundary is important, but for the replacement Phase 1 PR I do not want to silently redesign that behavior while also introducing turn correlation. The first PR will either leave that behavior exactly as-is, or — if reviewers want the content-processing policy to cover tool calls in Phase 1 — I can include that explicitly in the Phase 1 plan and call it out as part of the scope before implementing it.

What I will not carry into the replacement Phase 1 PR:

  • no record! migration for runtime observer events;
  • no claim that observer_bridge is the single authoritative runtime emission path;
  • no LLM input/output content capture;
  • no user_message / response_content event fields;
  • no new LLM I/O config surface;
  • no broad claim that every runtime path is correlated.

If this boundary sounds right, I will prepare the smaller master-based PR and update this PR accordingly so the broad branch no longer blocks review on multiple concerns at once.

FTDGRT closed this Jun 8, 2026
Audacity88 pushed a commit that referenced this pull request Jun 11, 2026
…te OTel spans by turn_id (#7385)

Supersedes #7233. RFC: #7232.

- add optional channel, agent_alias, and turn_id metadata to observer events
- replace AgentEnd token totals with structured TurnTokenUsage
- assign agent_alias and turn_id in agent turn execution
- enrich observer_bridge projection with turn metadata
- correlate OTel agent, LLM, and tool spans by turn_id

This is the metadata/correlation-only slice; it does not add LLM content capture, ContentProcessor, LeakDetector, or record! migration.

Labels

  • agent (Auto scope: src/agent/** changed)
  • channel (Auto scope: src/channels/** changed)
  • config (Auto scope: src/config/** changed)
  • docs (Auto scope: docs/markdown/template files changed)
  • enhancement (New feature or request)
  • gateway (Auto scope: src/gateway/** changed)
  • observability (Auto scope: src/observability/** changed)
  • risk: high (Auto risk: security/runtime/gateway/tools/workflows)
  • runtime (Auto scope: src/runtime/** changed)
  • size: XL (Auto size: >1000 non-doc changed lines)
