
Conversation


@Fant1k34 Fant1k34 commented Jan 19, 2026

Motivation and Context

This PR introduces metrics support based on the OpenTelemetry specification. These metrics let listeners observe the agent's behaviour at run time. All changes are made within the OpenTelemetry feature.

Link to issue

Added configuration:

  • addMeterExporter(MetricExporter) -- a function within the OpenTelemetry feature that accepts a MetricExporter and sends library metrics via that exporter

Example of OpenTelemetry feature configuration:

install(OpenTelemetry) {
    setServiceInfo(
        "calculator",
        "0.0.1"
    )
    addResourceAttributes(
        mapOf(AttributeKey.stringKey("service.instance.id") to "run-1")
    )
    addMeterExporter(
        OtlpGrpcMetricExporter.builder()
            .setEndpoint("http://localhost:17011")
            .setTimeout(1, TimeUnit.SECONDS)
            .build()
    )
}

Added metrics:

  • gen_ai.client.token.usage -- see here
  • gen_ai.client.operation.duration -- see here
  • gen_ai.client.tool.count -- a custom metric that allows observing tool calls and their statuses

Visualisation (via OpenTelemetry plugin):
(screenshots)

Breaking Changes


Type of changes

  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Tests improvement
  • Refactoring
  • CI/CD changes
  • Dependencies update

Checklist

  • The pull request has a description of the proposed change
  • I read the Contributing Guidelines before opening the pull request
  • The pull request uses develop as the base branch
  • Tests for the changes have been added
  • All new and existing tests passed
Additional steps for pull requests adding a new feature
  • An issue describing the proposed change exists
  • The pull request includes a link to the issue
  • The change was discussed and approved in the issue
  • Docs have been added / updated

@Fant1k34 Fant1k34 requested a review from sdubov January 20, 2026 16:44
- Description: Total token count per operation and token type
- When emitted: after an LLM call finishes; recorded separately for input and output tokens
- Key attributes:
  - `gen_ai.operation.name` (required)


(?) chat; generate_content; text_completion

- Recommended explicit bucket boundaries (seconds): 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24, 20.48, 40.96, 81.92
- Key attributes:
  - `gen_ai.operation.name` (required)


  • execute_tool

- When emitted: on tool completion, failure, or validation failure
- Key attributes:
  - `gen_ai.operation.name` (required) — `EXECUTE_TOOL`
  - `gen_ai.tool.name` (recommended)


We should be cautious with dynamic values since they can affect the cardinality and performance of the system, as well as the monitoring backend. A general recommendation is that cardinality is not critical if it remains below 100. The current metric cardinality is approximately $3 + 3n$, where $n$ is the number of tools.

To address this, let's add a method to filter tool names for the attribute, enabling us to limit the cardinality. Additionally, let's document this approach.
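
For illustration, a minimal sketch of what such a filter could look like (the toolNameAttributeFilter hook is hypothetical, not an actual Koog API):

install(OpenTelemetry) {
    // Hypothetical hook: map each tool name to a bounded set of values before
    // it is attached as a metric attribute. Unknown tools collapse into one
    // "other" bucket, capping the attribute's cardinality at |allowList| + 1.
    toolNameAttributeFilter { toolName ->
        if (toolName in setOf("calculator", "search")) toolName else "other"
    }
}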

@Fant1k34 Fant1k34 changed the title [WIP] Ndukin/add open telemetry metric Ndukin/add open telemetry metric Jan 21, 2026
@Fant1k34 Fant1k34 changed the title Ndukin/add open telemetry metric Add open telemetry metric support in Koog Jan 21, 2026

@sdubov sdubov left a comment


I like the consistency of the current metrics support implementation. Thank you. Please check my comments about some metrics implementation details.

override val value: HiddenString = HiddenString(result.toString())
}

// gen_ai.tool.status
Contributor:

Why not report this as a custom attribute, or a special koog attribute? I think it would be cleaner and easier to separate attributes if we keep the gen_ai attributes consistent with the ones defined in the OpenTelemetry documentation. WDYT?

Author:

Thanks, I added it as a custom koog attribute. Additionally, I renamed the metric gen_ai.tool.call.count to koog.tool.count as a custom metric name, to stay consistent with the OpenTelemetry documentation as well.
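
For illustration, the custom attribute could be shaped roughly like this (a sketch only; the real Koog attribute types differ):

// Hypothetical sketch of a koog.* attribute, kept outside the gen_ai.*
// namespace so standard attributes stay consistent with the OpenTelemetry spec.
enum class ToolCallStatus { SUCCESS, FAILED, VALIDATION_FAILED }

data class ToolStatusAttribute(val status: ToolCallStatus) {
    val key: String = "koog.tool.status"
    val value: String = status.name.lowercase()
}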

}
}

install(OpenTelemetry) {
Contributor:

This seems unrelated to the OpenTelemetry metrics updates to me. Is it a leftover from testing?

Author:

Yeah, this is unrelated to the OpenTelemetry metrics feature. But I thought it would be great to have at least one example with the metrics feature configured, to check that everything works as expected.
So yes, it's a leftover from testing, and I left it intentionally so there is one example to test with. Since it's in the examples project in Koog, I thought it was fine to enable the feature here.

If it's better to remove it from the Calculator example, I can do so!

* - gen_ai.tool.description (recommended)
* - gen_ai.tool.name (recommended)
* - gen_ai.token.type (required)
* - gen_ai.tool.status (custom)
Contributor:

Please see the comment about the custom attribute below.

message.metaInfo.inputTokensCount?.let { inputTokensCount ->
    tokensCounter.add(
        inputTokensCount.toLong(),
        Attributes.builder()
Contributor:

Wouldn't it be safer and easier to use the existing wrapper and converter from Koog Attributes to SDK attributes, with PII protection and automatic type conversion, here? Something like this:

...
listOf(
  GenAIAttributes.Operation.Name(GenAIAttributes.Operation.OperationNameType.TEXT_COMPLETION),
  GenAIAttributes.Provider.Name(provider),
  GenAIAttributes.Token.Type(GenAIAttributes.Token.TokenType.INPUT),
  GenAIAttributes.Response.Model(eventContext.model)
).toSdkAttributes(config.isVerbose)
...

Author:

Yes, it would!
Thank you, replaced Attributes.builder with the listOf and toSdkAttributes functions; additionally, I passed the isVerbose attribute there so as not to propagate the config further.

Author:

Surprisingly, this way of creating attributes (via listOf().toSdkAttributes(config.isVerbose)) leads to incorrect metric behaviour.
It can be observed on the screenshots (the first screenshot shows how it works with AttributesBuilder, the second via toSdkAttributes).

For some reason, the second option causes the metric to drop, even though it is a cumulative metric and cannot become less than its previous value.

After thoroughly exploring this issue myself and with the help of ChatGPT, I came to the conclusion that the reason for this behaviour is the custom wrapper around the Attributes class (the toSdkAttributes function). In the current implementation it is not optimized and creates a new Attributes object each time. For the tracing feature that's totally fine, but not for metrics: the SDK needs to understand what has changed since the previous timeframe, and since toSdkAttributes creates a new Attributes instance each time, it cannot properly link the change in the metric.

This behaviour can be observed by configuring LoggingMeterExporter. Each time it prints the metrics' state, the attributes have a different hashcode and object reference in the system.

Since the problem appears not in the plugin (which I use for visualization purposes) but in the library (the SDK cannot link the attributes), I suggest reverting the change back to AttributesBuilder.build().
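
For reference, the reverted variant is roughly the following fragment (the attribute keys and accessors are my best reading of the surrounding code, not verbatim):

// Building SDK attributes directly with Attributes.builder(), as reverted to.
// Per the diagnosis above, the SDK then links successive cumulative-counter
// adds correctly across export intervals.
tokensCounter.add(
    inputTokensCount.toLong(),
    Attributes.builder()
        .put("gen_ai.operation.name", "text_completion")
        .put("gen_ai.provider.name", provider.display)
        .put("gen_ai.token.type", "input")
        .put("gen_ai.response.model", model.id)
        .build()
)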


eventContext.responses.lastOrNull()?.let { message ->
    message.metaInfo.inputTokensCount?.let { inputTokensCount ->
        tokensCounter.add(
Contributor:

What do you think about extracting this logic into a separate extension method (similar to the builder methods that start and end spans), which would receive the necessary parameters from eventContext and create the attributes? It would make the feature code a bit cleaner. For example:

fun LongCounter.addInputTokensCounter(
    config: OpenTelemetryConfig,
    inputTokensCount: Long,
    provider: LLMProvider,
    model: LLModel
) {
    this.add(
        inputTokensCount,
        listOf<Attribute>(
            GenAIAttributes.Operation.Name(GenAIAttributes.Operation.OperationNameType.TEXT_COMPLETION),
            GenAIAttributes.Provider.Name(provider),
            GenAIAttributes.Token.Type(GenAIAttributes.Token.TokenType.INPUT),
            GenAIAttributes.Response.Model(model)
        ).toSdkAttributes(config.isVerbose)
    )
}

fun LongCounter.addOutputTokensCounter(...) { ... }

Author:

I guess that after implementing MetricCollector such an extraction could be excessive, as the new MetricCollector class calls the add or record function only once per metric (except operationDurationHistogram, which is called twice, but with a different set of attributes). Wdyt?

@@ -0,0 +1,24 @@
package ai.koog.agents.features.opentelemetry.metric
Contributor:

Would you expect other methods here? I would rename it to tokenCounter.kt to avoid confusion when new methods are added here, as this file is meant to aggregate different token-counter top-level methods.

Author:

Makes sense, thanks! I have renamed it!

@@ -0,0 +1,25 @@
package ai.koog.agents.features.opentelemetry.metric
Contributor:

Same comment about the file name.

@@ -0,0 +1,28 @@
package ai.koog.agents.features.opentelemetry.metric
Contributor:

Same comment about the file name.

get() = "Tool calls count"

override val unit: String
get() = "tool call"
Contributor:

unit is "tool call"?

Author:

Yeah, because this metric is a tool calls count. So, similarly to the token count, the unit here is a tool call.


failedToolCall?.let { toolCall ->
    toolCall.getDurationSec()?.let { sec ->
        toolCallDurationHistogram.record(
Contributor:

Maybe it's worth combining those two methods, end + record. Let me propose an idea. Here, you want to create a new record for a specific metric. Maybe, instead of having a storage of events, you could have a metrics collector class (similar to the spans collector that we already have). This metrics collector would also be responsible for storing the needed metrics and using this data to create the proper records. You would call the wrapper method record (or end), and it would create a record for the metric under the hood. So you would not need to perform it in two separate steps (end + record); you would do it in a single step in this collector instead. Wdyt?

Author:

Thank you, I implemented this approach as you described:

  • Introduced a MetricCollector class, which is responsible for storing the necessary metrics and recording them based on the MetricEvent that appears in the OpenTelemetry class, like this:

    metricCollector.recordEvent(
        ToolCallEnded(
            id = eventContext.eventId,
            timestamp = System.currentTimeMillis(),
            toolName = eventContext.toolName,
            status = ToolCallStatus.FAILED
        )
    )

  • Under the hood, MetricCollector works with a MetricEventStorage that is responsible for storing matching events via the startEvent and endEvent methods. This approach makes it possible to store paired events and compute some metrics based on the difference between them; see the sketch below.
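
A rough sketch of the pairing idea (the bodies below are illustrative, not the actual implementation):

// Illustrative shape of a metric event: events are paired by id and carry a
// timestamp, so deltas (e.g. duration) can be computed between the pair.
interface MetricEvent {
    val id: String
    val timestamp: Long
}

class MetricEventStorage {
    private val startedEvents = mutableMapOf<String, MetricEvent>()

    // Remember a starting event until its pair arrives.
    fun startEvent(event: MetricEvent) {
        startedEvents[event.id] = event
    }

    // Return and remove the matching start event, so the collector can record
    // e.g. (end.timestamp - start.timestamp) as an operation duration metric.
    fun endEvent(event: MetricEvent): MetricEvent? = startedEvents.remove(event.id)
}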

metricExporters.forEach { exporter ->
    val reader = PeriodicMetricReader
        .builder(exporter)
        .setInterval(Duration.ofSeconds(1))
Contributor:

Does it make sense to move this interval duration into the configuration as well?

Author:

It does! Thanks, I added it.
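
Presumably the configuration now allows something like this (the setMetricExportInterval name is an assumption, not the confirmed API):

install(OpenTelemetry) {
    addMeterExporter(
        OtlpGrpcMetricExporter.builder()
            .setEndpoint("http://localhost:17011")
            .build()
    )
    // Hypothetical option: how often the PeriodicMetricReader pushes metrics
    // to the registered exporters (previously hardcoded to 1 second).
    setMetricExportInterval(Duration.ofSeconds(5))
}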

)

// Store llm call
eventCallStorage.addEventCall(eventContext.eventId, eventContext.model.provider.display)
@sdubov sdubov Jan 22, 2026

I've also noticed that you use this EventCallStorage to get the start and end time of a metric through the getDuration() API. With the current implementation, the start and end time should match the timing of the corresponding span. Maybe you can re-use the duration from the span data instead. It would allow you to get rid of this storage class and the need to manage it at all. Like this:

(inferenceSpan.span as? ReadableSpan)?.toSpanData()?.let { spanData ->
    val duration =
        (spanData.endEpochNanos - spanData.startEpochNanos).toDuration(DurationUnit.NANOSECONDS)
    toolCallDurationHistogram.record(
        duration.inWholeSeconds.toDouble(),
        // attributes
    )
}

@Fant1k34 Fant1k34 Jan 27, 2026

Yeah, I agree we could get rid of the storage class in this case.

However, after implementing this approach I found that the OpenTelemetry class, which has almost 900 lines, is no longer much cluttered with metrics handling, thanks to the convenient recordEvent method.

Additionally, MetricCollector separates the logic of AI agent pipeline handling from metric recording. Moreover, it is not coupled with the SpanCollector class and works separately (metrics for metrics, traces for traces).

One more potential benefit is the possibility to monitor state between two events. In the future, we could save some value at the starting event (for example, total spent tokens) and the value at the ending event, and record the difference between these states as a metric (tokens spent during this LLM call), or something like that.

If we reuse the logic from the span collector:

  • we would need extra logic in the OpenTelemetry class to get the duration from the span, or we would have to pass the span as an argument to the MetricCollector, which leads to coupled, dependent classes. Ideally, metrics and spans should not depend on each other, as they provide different ways to observe system/agent behaviour
  • we would need to follow a strict "span" -> "metric" matching in the future; otherwise, we could not record the time property of an operation
