feat(observability): Structured Observability Enhancement — Rich Events, OTel Trace Correlation, and Bridge Refactoring#7233
Conversation
69cad80 to
3dc516f
Compare
singlerider
left a comment
There was a problem hiding this comment.
RFC #7232 is solid and most of this is well-built — the ContentProcessor policy/leak-scan ordering is careful, the OTel turn-correlation design is right, and the gateway WS attribution span is correctly implemented. But the attribution story is incomplete on the most common local path, the PR introduces duplicate AgentStart emissions, and there's unrelated scope creep. These need to be resolved before it lands.
🔴 New AgentStart/AgentEnd records are un-attributed on the CLI path
agent.rs contains no attribution_span! / .instrument(...) anywhere, yet the body claims AgentStart/AgentEnd are "wrapped by attribution_span! to ensure agent alias propagates." That holds only for the gateway WS path, which correctly opens the span (ws.rs:824, carrying agent_alias/model_provider/model/channel = "wss" via attribution_fields() and .instrument(span) — nicely done). But turn() itself opens no span, and its two real callers don't either:
run_single()(agent.rs:3310) →self.turn(...)with no surrounding span.run_interactive()(agent.rs:3329) →self.turn(...)in the REPL loop, no span.- The CLI dispatch fn that calls them (around agent.rs:3401) emits its own
AgentStartand also opens noattribution_span!.
So on zeroclaw agent -a <alias> -m ... (and interactive), the new AgentStart/AgentEnd record! calls fire with model/model_provider/turn_id in attrs but no agent_alias in the attribution span — exactly the "NEVER log un-attributed" case AGENTS.md calls out, and it means the feature's headline (per-event agent identity) silently doesn't hold for the most common local invocation. The fix is to open a attribution_span!/scope-bearing turn span inside turn() (or at the run_single/run_interactive/dispatch entry) using attribution_fields(), mirroring what the WS path already does, so all entry points carry attribution rather than just the gateway.
🔴 Duplicate AgentStart per turn
master's agent.rs emits zero AgentStart records; this PR adds three. On the CLI path both the dispatch fn (≈3401) and turn() (2040) now emit AgentStart for a single user turn, so downstream observers (SSE/dashboard/OTel) will see two starts per turn. Pick one authoritative emission site — most likely turn(), with the dispatch-level one removed — so the event stream stays one-start-per-turn. This pairs with the attribution fix above: whichever site survives is the one that needs the span.
🔴 Unrelated Ollama change — drop it from this PR
The diff touches crates/zeroclaw-providers/src/ollama.rs with two changes that have nothing to do with observability: removing let temperature = temperature.unwrap_or(self.default_temperature()); from chat(), and swapping a test's tokio::spawn → zeroclaw_spawn::spawn!. This overlaps the contested Option<f64> temperature area (#7095/#7231/#7213) and doesn't belong in an observability feature. Please pull it out so this PR's blast radius matches its title; if the Ollama change is wanted, it's its own focused PR.
🟢 ContentProcessor security ordering is correct
content_processing.rs does this right: off-policy short-circuits to None (no capture), content is pre-capped to PRE_SANITIZATION_CAP_CHARS before scanning, LeakDetector::scan runs on the capped content, and the display-truncation for Redacted happens on the already-redacted string — so truncation can't bisect a secret and leak a fragment, and Full is still leak-scanned (never raw). LlmIoPolicy::from_raw / ToolIoPolicy::from_raw default unknown input to the safe option (Off / Redacted). Good secure-by-default posture.
🟢 observer_bridge projection is robust
project() extracts channel (with channel_type fallback), agent_alias, and turn_id from log attributes via unwrap_or_default() and converts empties to None — no unwraps that can panic, graceful degradation when attribution is absent. It forwards the three attribution fields across all the expanded variants. This is the right shape for the single-projection-layer goal.
Two smaller notes
log_llm_io/log_tool_ioareStringconfig fields parsed into theLlmIoPolicy/ToolIoPolicyenums at the boundary. The match logic uses the enums (good), and it follows the existingtool_ioprecedent, so it's consistent — but a typo'd value ("redactd") silently resolves to the safe default with no warning. A one-line "unknown log_llm_io value, defaulting to off" warning at config resolution would save an operator a confusing "why is nothing captured" debugging session. Non-blocking.- The two commit messages are headline-only (no body) for a ~1100-line net feature; a short body per commit would help the eventual squash. The first headline is also missing a space after the colon (
feat(observability):Structured). Non-blocking.
The architecture and the security-sensitive pieces are good — the blockers are attribution completeness on the CLI path, the duplicate AgentStart, and the unrelated Ollama edit. Resolve those (and ideally the warning + commit-body nits) and I'll take another pass.
| let user_msg_content = self.content_processor.process_user_message(user_message); | ||
| ::zeroclaw_log::record!( | ||
| INFO, | ||
| ::zeroclaw_log::Event::new(module_path!(), ::zeroclaw_log::Action::AgentStart) |
There was a problem hiding this comment.
🔴 This new AgentStart (and the paired AgentEnd on the exit paths of turn()) is un-attributed on the CLI path. agent.rs opens no attribution_span! anywhere, and turn()'s two real callers — run_single (≈3310) and run_interactive (≈3329) — don't either, so on zeroclaw agent -a <alias> -m ... these records fire with model/model_provider/turn_id in attrs but no agent_alias on the span. That's the AGENTS.md "never log un-attributed" case, and it means the feature's per-event agent identity silently doesn't hold for the most common local path.
The gateway WS path already does this correctly at ws.rs:824 — mirror it on the CLI entry points using the existing attribution_fields() helper. In run_single:
pub async fn run_single(&mut self, message: &str) -> Result<String> {
use ::zeroclaw_log::Instrument as _;
let (alias, provider, model) = self.attribution_fields();
let span = ::zeroclaw_log::info_span!(
target: "zeroclaw_log_internal_scope",
"zeroclaw_scope",
agent_alias = %alias,
model_provider = %provider,
model = %model,
channel = "cli",
);
self.turn(message).instrument(span).await
}Do the same per-turn inside run_interactive's loop (build the span from self.attribution_fields() each iteration, then self.turn(&msg.content).instrument(span).await). That carries agent_alias + channel = "cli" into every record this turn() emits, matching what the WS span already provides.
There was a problem hiding this comment.
Fixed. run_single and run_interactive now both open an info_span! built from self.attribution_fields() and wrap the turn() call with .instrument(span), exactly mirroring the WS path. channel = "cli" is set on both. As noted above, the actual zeroclaw agent -a <alias> -m ... CLI path goes through loop_.rs::run rather than these methods, so fuller CLI observability will come in the follow-up PR — but every caller that does reach agent.rs::turn is now attributed.
| }); | ||
| ::zeroclaw_log::record!( | ||
| INFO, | ||
| ::zeroclaw_log::Event::new(module_path!(), ::zeroclaw_log::Action::AgentStart) |
There was a problem hiding this comment.
🔴 Duplicate AgentStart. master's agent.rs emits zero AgentStart records; this PR adds three, and on the CLI path both this dispatch-level emission and the new one inside turn() (line 2040) fire for a single user turn — so SSE/dashboard/OTel observers see two starts per turn.
Pick one authoritative site. turn() is the better home (it owns the turn_id correlation key and runs on every entry point), so I'd drop this dispatch-level AgentStart and its paired AgentEnd below. Once the run_single/run_interactive spans from the other comment are in place, turn()'s records carry full attribution and this one is redundant.
There was a problem hiding this comment.
Removed. The dispatch-level AgentStart and AgentEnd records (and the start = Instant::now() timer and provider/model resolution that existed only to feed them) are gone. turn() is the single emission site.
…ts, OTel Trace Correlation, and Bridge Refactoring
3dc516f to
2719490
Compare
Hi,@singlerider Thanks for the detailed review and the clear blocker descriptions — they made the fixes straightforward. 🔴 Blocker 1 — CLI path attributionAddressed in the latest commit (
One thing worth noting: 🔴 Blocker 2 — Duplicate
|
Audacity88
left a comment
There was a problem hiding this comment.
Reviewed current head 2719490 against the PR body, RFC #7232, the prior singlerider review, the latest author replies, and the public diff, checks, labels, and review metadata. I agree that the duplicate AgentStart issue is fixed, the agent.rs run_single / run_interactive wrappers now carry a CLI attribution span, and the Ollama file is no longer in this diff. I am still requesting changes because the PR still has an unsafe direct tool-I/O path, and because the merge plan needs a clearer boundary between the trace-correlation slice, the content-capture policy, and the overlapping open observability work.
✅ Resolved — duplicate AgentStart and the agent.rs wrapper attribution
The dispatch-level AgentStart / AgentEnd records are gone, so turn() is now the single emission site on that agent.rs path. run_single() and run_interactive() also now wrap calls to turn() in a span carrying agent_alias, provider, model, and channel = "cli". That resolves the duplicate-start finding and fixes attribution for callers that actually reach agent.rs::turn.
🔴 Blocking — direct tool execution still bypasses the content policy
tool_execution.rs now serializes full tool arguments once and sends them directly through ObserverEvent::ToolCallStart / ObserverEvent::ToolCall with arguments: Some(full_args.clone()). On success and failure it also sends the tool result/error text through ObserverEvent::ToolCall. Those values do not pass through ContentProcessor, do not honor log_tool_io = "off", and do not apply the denylist. The result text gets the older scrub_credentials() treatment before the typed observer event, but arguments are not leak-scanned at all.
That matters because otel.rs copies those same fields into trace attributes such as gen_ai.tool.arguments, input.value, gen_ai.tool.result, and output.value. So enabling OTel can still expose full tool input/output from this path even when the PR body says captured content is governed by ContentProcessor and LeakDetector::scan().
Please either keep the direct ObserverEvent tool paths metadata-only until the follow-up migration, or route both tool arguments and tool results through the same log_tool_io policy, leak detector, truncation, and denylist before they are emitted. This path also still emits channel: None, agent_alias: None, and turn_id: None, so it does not participate in the new correlation story.
There is a smaller version of the same issue in agent.rs: execute_tool_call() processes the tool result with process_tool_result(), but ToolCallStart still records args_json directly. Tool arguments need the same policy gate as results.
🔴 Blocking — the PR still needs a staged routing decision for #7232
In #7232, the safest routing was to keep the first slice focused on metadata attribution, turn_id, bridge projection, and OTel correlation, with content capture policy and broader runtime migration handled separately unless this PR explicitly supersedes or sequences the related work. That split is not just process overhead: the content-capture policy has a different privacy review surface than trace correlation, and the open observability PRs/issues already cover parts of this same area. This PR still combines content capture policy, expanded ObserverEvent schema, OTel trace rewrite, bridge projection, runtime changes, gateway changes, and docs in one high-risk XL diff.
The body also now contains an important caveat: loop_.rs and tool_execution.rs intentionally keep direct ObserverEvent emission with turn_id: None, and the author notes that the actual zeroclaw agent -a <alias> -m ... path still goes through loop_.rs. That is a valid follow-up boundary only if the PR narrows its headline claims and routing. As written, it still says every downstream observer gets channel attribution, agent identity, and turn correlation.
Please either narrow this PR to the first safe slice, or update the body to make the dependency / supersede / follow-up plan explicit against the overlapping work: #6641, #6642, #6966, #6190, #7151, and #7221. The merge target should be clear enough that reviewers can tell what lands now, what remains intentionally out of scope, and which open PRs/issues this PR replaces or depends on.
🟡 Warning — docs default does not match the schema
The schema and PR body set log_llm_io_max_chars to the safe default of 200, but docs/book/src/ops/observability.md shows log_llm_io_max_chars = 10000, and the defaults paragraph below it omits the new LLM defaults. Please align the docs with the shipped defaults so operators do not copy a much larger capture limit by accident.
🟢 What looks good — the policy shape is solid where it is actually used
The ContentProcessor design itself is good: LLM capture defaults off, redacted mode scans before truncating, full mode still leak-scans, and unknown policy values now warn before falling back to safe defaults. The bridge and OTel parent/child shape are also pointed in the right direction. The remaining blockers are about making the actual emission paths match those guarantees and staging the PR so the high-risk observability surface can be reviewed safely.
WareWolf-MoonWall
left a comment
There was a problem hiding this comment.
Reviewed head commit 2719490 against the PR body, RFC #7232, both prior reviews from @singlerider and @Audacity88, author responses in inline threads, the diff, CI status, and FND-006 §4.6 (Observability as Debuggability). This PR makes meaningful progress on turn correlation and OTel trace hierarchy, but two blocking issues remain: the tool execution path bypasses the content policy entirely, and the PR's scope still mixes multiple high-risk observability concerns without a clear staging plan.
🔴 Blocking — tool I/O bypasses ContentProcessor and policy enforcement
tool_execution.rs lines 57–61 and 173–232 emit ObserverEvent::ToolCallStart and ObserverEvent::ToolCall with full arguments: Some(full_args.clone()) and result: Some(...) fields. These go directly to the observer with zero policy gating — no ContentProcessor, no log_tool_io check, no denylist filter, no LeakDetector::scan(). The only sanitization is the legacy scrub_credentials() on results, which predates the new leak detector.
That matters because otel.rs (lines 200–350 in this PR's version) copies those same arguments and result fields into OTel span attributes like gen_ai.tool.arguments, input.value, gen_ai.tool.result, and output.value. So when an operator enables OTel but leaves log_tool_io = "off" (the safe default), tool I/O still flows to the OTel backend unredacted. The PR body claims "captured content is governed by ContentProcessor and LeakDetector::scan()" — that's true for LLM I/O routed through agent.rs, but false for this tool path.
The fix: either keep tool_execution.rs metadata-only (arguments: None, result: None) until the follow-up migration the PR body mentions, or route both tool arguments and results through the same log_tool_io policy, leak detector, truncation, and denylist before emitting ObserverEvent. The agent.rs path at line ~1850 also has a smaller version of this: execute_tool_call() processes the result via process_tool_result(), but ToolCallStart still records args_json directly without policy gating.
This isn't theoretical — per FND-006 §4.6, observability infrastructure only translates into diagnosable systems when contributors use it with the discipline to protect what goes out. The content policy exists for a reason; the tool path can't silently bypass it.
🔴 Blocking — scope needs explicit staging against overlapping work
RFC #7232 proposed splitting this into safe slices: metadata attribution + turn_id + bridge projection + OTel correlation first, then content capture policy separately. This PR still combines expanded schema, content policy, OTel rewrite, bridge refactor, runtime changes, gateway changes, and docs in a single high-risk XL diff. The body now acknowledges that loop_.rs and tool_execution.rs keep direct ObserverEvent emission with turn_id: None, and that the actual zeroclaw agent -a <alias> -m ... path still goes through loop_.rs rather than the newly-attributed agent.rs::turn() wrapper.
That caveat is load-bearing, but it contradicts the headline claim that "every downstream observer gets channel attribution, agent identity, and turn correlation." If the most common local path (CLI via loop_.rs) isn't covered, either narrow the PR to what actually ships now, or update the body with an explicit dependency/supersede/follow-up plan against the overlapping observability work: #6641 (LLM I/O observability), #6642 (turn-level metadata), #6966 (OTel integration gaps), #6190 (observer unification), #7151 (trace context propagation), and #7221 (attribution completeness).
The merge target should be clear enough that reviewers can tell what lands, what remains out of scope, and which open issues/PRs this replaces or depends on. Right now it's ambiguous whether this PR supersedes the others or whether they're all meant to land in sequence, and the mixed scope makes the privacy/security review surface harder to bound.
🟡 Warning — docs default contradicts schema
docs/book/src/ops/observability.md lines 36–43 show log_llm_io_max_chars = 10000 as an example config block, but the schema default at schema.rs:157 and the PR body both set it to 200. Operators who copy-paste the docs example will capture 50× more content than the intended safe default. Please align the docs with the shipped default (200), and add the two new LLM-related keys to the "Defaults:" summary paragraph (currently lines 45–49) so they're discoverable.
✅ Resolved — CLI attribution spans are in place where agent.rs is reached
The third commit (2719490) added run_single() and run_interactive() wrappers that open an info_span! built from self.attribution_fields() and wrap the turn() call with .instrument(span), carrying agent_alias, provider, model, and channel = "cli". That resolves @singlerider's first blocker for the subset of callers that reach agent.rs::turn. The author correctly notes that the actual CLI path (zeroclaw agent -a <alias> -m ...) still goes through loop_.rs::run, so fuller CLI observability is deferred to the follow-up — that's a valid boundary if the PR narrows its claims accordingly (see staging blocker above).
✅ Resolved — duplicate AgentStart removed
The dispatch-level AgentStart and AgentEnd records (and their supporting timer/provider resolution) are gone. turn() is now the single emission site for the agent.rs path. That resolves @singlerider's second blocker.
🟢 What looks good — OTel parent-child correlation design
The OtelObserver rewrite (lines 180–350) produces a single trace per agent turn with correct parent-child linking: AgentStart opens a parent span keyed by turn_id in active_agent_spans, LlmRequest/LlmResponse/ToolCallStart/ToolCall create child spans via parent_cx_for(turn_id), and AgentEnd closes the parent with cached I/O, token usage, cost, and duration. When turn_id is absent, spans degrade gracefully to orphans (backward compatible). All span attributes follow OTel Gen-AI semantic conventions. This is the right shape for trace correlation — the blockers are about ensuring the content that flows into those spans respects the policy boundaries the PR claims to enforce.
🟢 What looks good — ContentProcessor security ordering (where it's actually used)
ContentProcessor (would be at content_processing.rs if the file existed in this commit) does the policy layering correctly: off-policy short-circuits to None, content is pre-capped before scanning, LeakDetector::scan() runs on the capped content, and display truncation for redacted mode happens after leak scanning — so truncation can't bisect a secret and leak a fragment. Full mode is still leak-scanned (never raw). Unknown policy values now warn before falling back to safe defaults. The praise here is qualified: this design is solid where it is invoked, but the tool path bypasses it entirely (see first blocker).
🟢 What looks good — observer_bridge projection robustness
observer_bridge::project() extracts channel, agent_alias, and turn_id from log attributes via unwrap_or_default() and converts empties to None — no panicking unwraps, graceful degradation when attribution is absent. It forwards the three attribution fields across all expanded variants. This is the right shape for the single-projection-layer goal.
Summary: The OTel correlation architecture is sound, the CLI attribution wrappers are in place where they apply, and the duplicate AgentStart is resolved. But the tool execution path still bypasses the content policy entirely (exposing full tool I/O to OTel backends even when log_tool_io = "off"), and the PR's scope remains ambiguously staged against the overlapping observability work. Resolve those two blockers (policy enforcement on the tool path, and explicit scoping/sequencing plan) and address the docs default mismatch, and this will be ready for another pass.
|
Hi,@singlerider @Audacity88 @WareWolf-MoonWall I agree with the review direction that this PR is too broad to merge in its current shape. Rather than trying to keep shrinking this branch in place, I plan to close or supersede this PR and open a new PR from current The new first PR will be metadata/correlation only:
A clarification on the tool I/O blocker: the fact that tool arguments/results are present on typed What I will not carry into the replacement Phase 1 PR:
If this boundary sounds right, I will prepare the smaller |
…te OTel spans by turn_id (#7385) Supersedes #7233. RFC: #7232. - add optional channel, agent_alias, and turn_id metadata to observer events - replace AgentEnd token totals with structured TurnTokenUsage - assign agent_alias and turn_id in agent turn execution - enrich observer_bridge projection with turn metadata - correlate OTel agent, LLM, and tool spans by turn_id This is the metadata/correlation-only slice; it does not add LLM content capture, ContentProcessor, LeakDetector, or record! migration.
RFC: #7232
Summary
What changed and why: ZeroClaw's observability surface had three gaps — sparse event context (no channel/agent attribution, no LLM I/O, no structured token breakdown), flat uncorrelated OTel spans (one independent span per event, no parent-child linking), and inconsistent emission architecture (two divergent paths with partial bridge coverage). This PR addresses all three in a coordinated change.
Scope boundary: This PR does not migrate
loop_.rsandtool_execution.rsto therecord!emission path — they continue to emitObserverEventdirectly withturn_id: None. That migration is planned for a follow-up PR.Blast radius: Enriched event context gives every downstream observer channel attribution, agent identity, and turn correlation — making dashboards, alerts, and debugging significantly more actionable. OTel backends now receive correlated traces per turn instead of orphaned spans, enabling proper latency analysis and span navigation. The agent runtime's direct
observer.record_event()calls inagent.rsare migrated to therecord!macro path, makingobserver_bridgethe single authoritative projection from log events to typed observer events. TheObserverEventschema expands withOption<_>fields, preserving backward compatibility — downstream observers using..in match arms require no changes.Changes
Part 1: Enriched ObserverEvent Schema
Six core event variants gain three common attribution fields plus variant-specific content fields:
channel: Option<String>,agent_alias: Option<String>,turn_id: Option<String>added toAgentStart,LlmRequest,LlmResponse,AgentEnd,ToolCallStart,ToolCallTurnTokenUsage { input_tokens, output_tokens }replaces the flattokens_used: Option<u64>inAgentEnd, pluscost_usd: Option<f64>LlmRequestgainsuser_message: Option<String>LlmResponsegainsresponse_content: Option<String>ToolCallStartgainstool_call_id,arguments: Option<String>ToolCallgainstool_call_id,arguments,result: Option<String>Part 2: Content Policy Enforcement
A new
ContentProcessormodule governs what content enters observability events:LlmIoPolicy(Off / Redacted / Full) andToolIoPolicy(Off / Redacted / Full)LeakDetector::scan()runs on all captured content before emissionRedactedmode: leak-scan + truncate to configurable char limitFullmode: leak-scan but no truncationOff— no LLM I/O captured unless operator explicitly opts inNew config keys:
Part 3: OTel Trace Correlation via turn_id
The
OtelObserveris rewritten to produce a single trace per agent turn:AgentStartopens a parent span (gen_ai.agent.invoke), stored inactive_agent_spanskeyed byturn_idLlmRequest,LlmResponse,ToolCallStart,ToolCallcreate child spans under the parent viaparent_cx_for(turn_id)AgentEndcloses the parent span with cached I/O, token usage, cost, and durationgen_ai.provider.name,gen_ai.usage.input_tokens, etc.)turn_idis absent, spans degrade gracefully to orphans (backward compatible)flush()cleans up orphaned spans from unclosed turnsPart 4: Observer Bridge Refactoring
observer_bridge::project()expanded to a complete projection layer:channel,agent_alias,turn_idfromLogEventspan attribution and attributesToolCallStartandLlmRequestvariants from log events to typedObserverEventPart 5: Agent Runtime Overhaul
The largest change in this PR —
agent.rsis refactored to integrate the new observability surface:agent_aliasfield toAgentstruct andAgentBuilder; builder methodagent_alias()sets it,from_config_*reads it from configturn_id()helper generates a UUID per turn, serving as the OTel correlation keyobservability::create_observer()and installed into the observer bridge viazeroclaw_log::set_observer_bridge()AgentStartandAgentEndare now recorded on all exit paths in turn methods, wrapped byattribution_span!to ensure agent alias propagates through the span treePart 6: Gateway Channel Attribution
tracing::info_span!withagent_alias,model_provider,model,channel = "wss"and wrap the turn in.instrument(span)agent_aliasis now a required query parameter on the WS endpointToolCallStartvariant and expandedAgentEnd(withTurnTokenUsage+cost_usd)Observer Implementations Updated
PrometheusObserver: token gauge computed fromTurnTokenUsage { input_tokens + output_tokens }LogObserver: all match arms updated with new attribution fields (channel,agent_alias,turn_id)VerboseObserver: match arms updated to accept new fields with..NoopObserver: match arms updated to accept expanded variantsValidation Evidence
zeroclaw daemon, chatting via http://127.0.0.1:42617/agents, and confirming enriched trace data (turn correlation, token breakdown, channel attribution) appears on the configured Langfuse backend. The trace structure andllm.responseexample shown in Langfuse are as follows:Security & Privacy Impact
otel_headersmarked#[secret]for encryption at rest;ContentProcessorwithLeakDetector::scan()redacts detected secrets before captureCompatibility
ObserverEventfields areOption<_>withNonedefaults; new config keys have safe defaults (log_llm_io = "off")[observability]section (log_llm_io,log_llm_io_max_chars). All optional with safe defaults. No upgrade steps required — existing configs work unchanged.Rollback
risk: high