[Discussion] Agent observability: tracing & evaluation beyond metrics + event log #710

addu390 · 2026-05-27T23:41:56Z

Context

Flink Agents today ships three observability surfaces:

Metrics: event/action counters, token usage per model, plus a custom-metrics API, via Flink metric reporter.
Event log: structured records per event, with log levels.
EventListener: per-event callback, configured via event-listeners.

That covers aggregates (metrics), audit records (event log), and hooks (listener). What it doesn't cover is reconstructing a single run as a causal tree:

InputEvent
 └─ action: classify
      └─ ChatRequest → ChatResponse
 └─ action: tool_use
      └─ ToolRequest → ToolResponse
      └─ ChatRequest → ChatResponse
 └─ OutputEvent

That tree shape is what LangSmith (and Langfuse, Phoenix, etc.) render for debugging, and it is where MLflow slots in for batch eval runs (one run = one trace, with metrics, prompts, and outputs tracked across versions of the agent). There are also the OpenTelemetry GenAI semantic conventions for describing such traces. A per-run trace view makes debugging and battle-testing non-trivial agent workflows tractable.

Scope

Not proposing this as a default for production streaming jobs. Full-fidelity per-event tracing at streaming QPS is too heavy; metrics + event log remain the right production defaults.

The question is whether the framework should provide first-class support for tracing where it actually pays off:

Local authoring loop
CI / batch eval and replay
Staging mini-cluster runs
Canaried / sampled production

In all four, you want a single run rendered end-to-end. Today you piece it together from event-log records.
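For the sampled-production case, one way to keep a run traced end-to-end is to make the sampling decision per run rather than per event. Below is a minimal sketch, assuming each run carries a stable string id; `should_trace`, `run_id`, and `sample_rate` are hypothetical names for illustration, not part of Flink Agents.

```python
import hashlib


def should_trace(run_id: str, sample_rate: float) -> bool:
    """Decide deterministically whether a run is traced.

    Hypothetical helper: hashing the run id means every event of the same
    run gets the same decision, so a sampled run is captured completely.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be within [0.0, 1.0]")
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto [0, 2**64) and compare against
    # the threshold as integers, so 0.0 never samples and 1.0 always does.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(sample_rate * 2**64)


# The decision is stable for a given run id, regardless of which task asks.
print(should_trace("run-42", 0.1) == should_trace("run-42", 0.1))
```

Because the decision depends only on the run id, no shared sampling state is needed across parallel tasks.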

Open question

Is this worth the framework solving? Curious if others hit this in their own dev loop.

xintongsong · 2026-05-28T10:20:00Z

xintongsong
May 28, 2026
Collaborator

Hi @addu390,

+1 on this. Being able to reconstruct and visualize the trace of a single run is really helpful for observing and understanding agent behavior, and it's exactly the gap that neither metrics nor the event log fills today.

I'd suggest splitting this into two fairly independent problems:

Recording: keep enough information in the event log to reconstruct the causal tree.
Reconstruction & visualization: rebuild and render a single run from that information.

Once you split it this way, the overhead concern mostly lands on the first part, and I think that part is actually pretty light. We already record every event in the event log. To reconstruct the causal tree, we basically just need one extra field per event: which action emitted the event. With that, the whole run can be rebuilt from the event log.

The reverse edges (which actions an event triggered) don't even need to be recorded explicitly. They can be derived from the action trigger rules plus timestamps. So the increment on the recording side is small, and I think it can be kept separate from the concern about full-fidelity tracing being too heavy at streaming QPS.
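A rough sketch of the reconstruction described above, under stated assumptions: the record shape (`Record`, with `emitted_by` as the one extra field) and the `trigger_rules` mapping are hypothetical stand-ins, not the actual event log schema. Forward edges come from `emitted_by`; reverse edges are derived from the trigger rules, with timestamps deciding which invocation an event belongs to.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    """One event-log record (hypothetical shape)."""
    event_type: str
    timestamp: float
    emitted_by: Optional[str]  # emitting action; None for the run's input event


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def build_tree(records, trigger_rules):
    """Rebuild one run. trigger_rules maps action name -> triggering event types."""
    root = None
    latest_invocation = {}  # action name -> most recent invocation node
    for rec in sorted(records, key=lambda r: r.timestamp):
        node = Node(f"event: {rec.event_type}")
        if rec.emitted_by is None:
            if root is not None:
                raise ValueError("expected exactly one input event per run")
            root = node
        else:
            # Forward edge: attach to the latest invocation of the emitting action.
            parent = latest_invocation.get(rec.emitted_by)
            if parent is None:
                raise ValueError(f"no prior invocation of action {rec.emitted_by!r}")
            parent.children.append(node)
        # Reverse edges: derived from the trigger rules, never recorded.
        for action, event_types in trigger_rules.items():
            if rec.event_type in event_types:
                invocation = Node(f"action: {action}")
                node.children.append(invocation)
                latest_invocation[action] = invocation
    return root


def render(node, depth=0):
    """Flatten the tree into indented lines."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


rules = {
    "classify": {"InputEvent"},
    "chat_model": {"ChatRequest"},
    "finish": {"ChatResponse"},
}
log = [  # deliberately out of order; timestamps restore the sequence
    Record("ChatResponse", 3.0, "chat_model"),
    Record("InputEvent", 1.0, None),
    Record("OutputEvent", 4.0, "finish"),
    Record("ChatRequest", 2.0, "classify"),
]
print("\n".join(render(build_tree(log, rules))))
```

This sketch assumes invocations of the same action within a run do not overlap; if they can, timestamps alone are ambiguous and some invocation identifier would be needed as well.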

The second part can be fully on-demand rather than always-on. We only run it when needed, e.g. local debugging, CI/eval, or staging. The rendering form doesn't have to be a tree either; PlantUML or something else would work too, and we can discuss that separately. As long as the recording side captures the necessary info, there's a lot of flexibility in what we do with it later.
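To illustrate that flexibility, here is a small sketch that renders the same already-reconstructed run in two forms: an indented text tree and a PlantUML mindmap. The nested `(label, children)` tuple shape and the function names are assumptions made for this example only.

```python
def to_ascii(node, depth=0):
    """Render a (label, children) tree as indented text lines."""
    label, children = node
    prefix = "" if depth == 0 else "   " * (depth - 1) + "└─ "
    lines = [prefix + label]
    for child in children:
        lines.extend(to_ascii(child, depth + 1))
    return lines


def to_plantuml_mindmap(node):
    """Render the same tree as PlantUML mindmap source (one '*' per level)."""
    def walk(n, depth):
        label, children = n
        yield "*" * depth + " " + label
        for child in children:
            yield from walk(child, depth + 1)

    return "\n".join(["@startmindmap", *walk(node, 1), "@endmindmap"])


# Hypothetical reconstructed run, mirroring the tree from the original post.
run = ("InputEvent", [
    ("action: classify", [("ChatRequest -> ChatResponse", [])]),
    ("action: tool_use", [
        ("ToolRequest -> ToolResponse", []),
        ("ChatRequest -> ChatResponse", []),
    ]),
    ("OutputEvent", []),
])

print("\n".join(to_ascii(run)))
print(to_plantuml_mindmap(run))
```

Both renderers only read the tree, so adding another output form later would not touch the recording side.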

4 replies

addu390 May 28, 2026
Author

Hi @xintongsong. Agreed on the split.

The overhead (additional CoP) concern I raised was coming from also wanting to be OTel-compatible on the recording side, so existing AI observability tools can pick up our runs out of the box, e.g. how LangSmith does it: trace-with-opentelemetry.

But as you pointed out, we already track almost everything needed; a few additional fields on the event log would be enough to render the full run trace. If those land in the event log itself, the overhead concern goes away.

Any strong takes on OTel compatibility and using the GenAI semantic conventions, versus keeping it framework-native for now?

xintongsong May 28, 2026
Collaborator

Hi @addu390,

I don't have a strong opinion on OTel compatibility itself; it mostly comes down to whether there's clear demand from users.

On the implementation side, OTel could plug in as just another logger for the event log. The event log already writes through pluggable loggers (currently SLF4J and FILE), so OTel could sit alongside them as a third option. For users it would just be picking an output backend among otel / slf4j / file, which means the same data isn't recorded twice and runtime overhead shouldn't change much. One caveat: wiring OTel in might require some changes to the current plugin interface.
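A sketch of the "one more backend" idea, with loud caveats: none of these classes are the Flink Agents plugin interface, and no OTel SDK is called. `EventLogger`, `FileLikeLogger`, `SpanLikeLogger`, `make_logger`, and the record fields are all hypothetical. The only borrowed names are the attribute keys `gen_ai.operation.name` and `gen_ai.request.model` from the OpenTelemetry GenAI semantic conventions.

```python
import json
from typing import Protocol


class EventLogger(Protocol):
    """Hypothetical logger contract: each record is written exactly once."""
    def append(self, record: dict) -> None: ...


class FileLikeLogger:
    """Stand-in for a file backend: one JSON line per record."""
    def __init__(self):
        self.lines = []

    def append(self, record: dict) -> None:
        self.lines.append(json.dumps(record, sort_keys=True))


class SpanLikeLogger:
    """Stand-in for an OTel backend: maps each record to a span-shaped dict.

    A real implementation would hand these to an exporter instead.
    """
    def __init__(self):
        self.spans = []

    def append(self, record: dict) -> None:
        attributes = {"agent.action": record.get("emitted_by")}  # made-up key
        if record["event_type"] == "ChatRequest":
            attributes["gen_ai.operation.name"] = "chat"
            attributes["gen_ai.request.model"] = record.get("model")
        self.spans.append({
            "trace_id": record["run_id"],  # one run = one trace
            "name": record["event_type"],
            "attributes": attributes,
        })


LOGGERS = {"file": FileLikeLogger, "otel": SpanLikeLogger}


def make_logger(backend: str) -> EventLogger:
    """Pick a single output backend, so the same data is not recorded twice."""
    if backend not in LOGGERS:
        raise ValueError(f"unknown event log backend: {backend!r}")
    return LOGGERS[backend]()


logger = make_logger("otel")
logger.append({"run_id": "run-1", "event_type": "ChatRequest",
               "emitted_by": "classify", "model": "example-model"})
print(logger.spans[0])
```

The point of the shape is that the caller only ever sees `append`, so swapping backends changes where records go, not how many times they are produced.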

I slightly lean toward waiting until there's clear demand before building the OTel logger, but I'm also okay with you or anyone else picking this up now, as long as it doesn't affect existing functionality or performance.

addu390 May 29, 2026
Author

@xintongsong Sounds good, I don't have a strong take either. I was experimenting with migrating a few workloads to flink-agents, and tracing did come across as a bit of a gap.

A good starter would be to get the trace-tree per run, right within the logs. I'll create an issue for that, and I'm happy to pick it up too.

xintongsong May 31, 2026
Collaborator

Sounds great. Please feel free to move ahead.
