Add open telemetry metric support in Koog #1381
base: develop
Conversation
…butes and fix found issues
> - Description: Total token count per operation and token type
> - When emitted: after an LLM call finishes; recorded separately for input and output tokens
> - Key attributes:
>   - `gen_ai.operation.name` (required)
(?) `chat`; `generate_content`; `text_completion`
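For reference, a minimal sketch of how input and output tokens could be recorded separately under these attributes, using the OpenTelemetry Java API directly (attribute values are illustrative, not Koog's actual wiring):

```kotlin
import io.opentelemetry.api.common.Attributes
import io.opentelemetry.api.metrics.LongCounter

// Illustrative only: two data points of the same counter, distinguished
// by the gen_ai.token.type attribute.
fun recordTokenUsage(tokensCounter: LongCounter, inputTokens: Long, outputTokens: Long) {
    tokensCounter.add(
        inputTokens,
        Attributes.builder()
            .put("gen_ai.operation.name", "chat")
            .put("gen_ai.token.type", "input")
            .build()
    )
    tokensCounter.add(
        outputTokens,
        Attributes.builder()
            .put("gen_ai.operation.name", "chat")
            .put("gen_ai.token.type", "output")
            .build()
    )
}
```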
> - Recommended explicit bucket boundaries (seconds): 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12,
>   10.24, 20.48, 40.96, 81.92
> - Key attributes:
>   - `gen_ai.operation.name` (required)
- `execute_tool`
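A hedged sketch of declaring such a duration histogram with the recommended boundaries via the OpenTelemetry Java API; the metric name follows the gen_ai semantic conventions and may not match Koog's exact declaration:

```kotlin
import io.opentelemetry.api.metrics.Meter

// Illustrative declaration; the boundary values are the recommended ones quoted above.
fun declareOperationDurationHistogram(meter: Meter) =
    meter.histogramBuilder("gen_ai.client.operation.duration")
        .setDescription("GenAI operation duration")
        .setUnit("s")
        .setExplicitBucketBoundariesAdvice(
            listOf(
                0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28,
                2.56, 5.12, 10.24, 20.48, 40.96, 81.92
            )
        )
        .build()
```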
> - When emitted: on tool completion, failure, or validation failure
> - Key attributes:
>   - `gen_ai.operation.name` (required) — `EXECUTE_TOOL`
>   - `gen_ai.tool.name` (recommended)
We should be cautious with dynamic values, since they can affect the cardinality and performance of the system as well as the monitoring backend. A general recommendation is that cardinality is not critical if it remains below 100. The current metric cardinality is approximately …
To address this, let's add a method to filter tool names for the attribute, enabling us to limit the cardinality. Additionally, let's document this approach.
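A minimal sketch of what such a filter could look like, assuming a configurable hook on the feature; the `toolNameFilter` property and the tool names are hypothetical, not an existing Koog API:

```kotlin
// Hypothetical hook: map unbounded tool names to a closed set before they are
// attached as the gen_ai.tool.name metric attribute, capping cardinality.
val allowedToolNames = setOf("calculator", "web_search", "file_reader")

install(OpenTelemetry) {
    toolNameFilter = { toolName ->
        if (toolName in allowedToolNames) toolName else "other"
    }
}
```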
sdubov left a comment:
I like the consistency of the current metrics support implementation. Thank you. Please check my comments about some metrics implementation details.
>     override val value: HiddenString = HiddenString(result.toString())
> }
>
> // gen_ai.tool.status
Why not report this as a custom attribute, or a special koog attribute? I think it would be cleaner and easier to separate attributes if we keep gen_ai attributes consistent with the ones defined in the OpenTelemetry documentation. WDYT?
Thanks, I added it as a custom koog attribute. Additionally, I extracted the metric `gen_ai.tool.call.count` to `koog.tool.count` as a custom metric name, to be consistent with the OpenTelemetry documentation as well.
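For illustration, the resulting attribute set might look like this; the exact custom key name is an assumption based on the `koog.` prefix mentioned above:

```kotlin
import io.opentelemetry.api.common.Attributes

// Standard gen_ai.* attributes stay aligned with the OpenTelemetry semantic
// conventions; the status lives under a custom koog.* key (assumed name).
val toolCallAttributes: Attributes = Attributes.builder()
    .put("gen_ai.operation.name", "execute_tool")
    .put("gen_ai.tool.name", "calculator")   // example value
    .put("koog.tool.status", "success")      // custom Koog attribute (assumed key)
    .build()
```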
>     }
> }
>
> install(OpenTelemetry)
This seems to me unrelated to the OpenTelemetry metrics updates. Is it a leftover from testing?
Yep, this is unrelated to the OpenTelemetry metrics feature. But I thought it could be useful to have at least one example with the metrics feature configured, to check that everything works as expected.
So yes, it's a leftover from testing, and I kept it intentionally to have one example to test with. Since it's in the examples project in koog, I thought it was fine to enable the feature there.
If it's better to remove it from the Calculator example, I can do so!
> * - gen_ai.tool.description (recommended)
> * - gen_ai.tool.name (recommended)
> * - gen_ai.token.type (required)
> * - gen_ai.tool.status (custom)
Please see the comment about the custom attribute below.
> message.metaInfo.inputTokensCount?.let { inputTokensCount ->
>     tokensCounter.add(
>         inputTokensCount.toLong(),
>         Attributes.builder()
Wouldn't it be safer and easier to use the existing wrapper and converter from Koog Attributes to SDK attributes here, which gives PII protection and automatic type conversion? Something like this:

```kotlin
...
listOf(
    GenAIAttributes.Operation.Name(GenAIAttributes.Operation.OperationNameType.TEXT_COMPLETION),
    GenAIAttributes.Provider.Name(provider),
    GenAIAttributes.Token.Type(GenAIAttributes.Token.TokenType.INPUT),
    GenAIAttributes.Response.Model(eventContext.model)
).toSdkAttributes(config.isVerbose)
...
```
Yes, it would!
Thank you, I replaced Attributes.builder with the listOf and toSdkAttributes functions; additionally, I passed isVerbose there so as not to propagate the config further.
Surprisingly, this way of creating attributes (via listOf().toSdkAttributes(config.isVerbose)) leads to wrong metric behaviour.
It can be observed on the screenshots (the first screenshot shows how it works with AttributesBuilder, the second via toSdkAttributes). For some reason, the second option makes the metric drop, even though it's a cumulative metric and cannot become less than it was before.
After exploring this issue thoroughly myself, and with the help of ChatGPT, we came to the conclusion that the reason for this behaviour is the custom wrapper around the Attributes class (the toSdkAttributes function). Its current implementation is not optimized and creates a new Attributes object each time. For the tracing feature that's totally fine, but not for metrics: the SDK should understand what has changed since the previous timeframe, and since toSdkAttributes creates a new Attributes instance each time, it cannot properly link the changes of the metric.
This behaviour can be observed by configuring a LoggingMetricExporter: each time it prints the metrics' state, the attributes have a different hash code and object reference in the system.
Since the problem appears not in the plugin (which I use for visualization purposes) but in the library (the SDK cannot link the attributes), I suggest reverting to AttributesBuilder.build().
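A sketch of one safe pattern for this situation: build the SDK attribute set once and reuse the same instance for every add(), so the SDK aggregates the cumulative counter under a stable attribute identity. The surrounding names are illustrative, not the PR's actual code:

```kotlin
import io.opentelemetry.api.common.Attributes
import io.opentelemetry.api.metrics.LongCounter

// Built once and reused: every add() call reports under the same attribute
// set, letting the SDK accumulate the counter across collection intervals.
val inputTokenAttributes: Attributes = Attributes.builder()
    .put("gen_ai.operation.name", "text_completion")
    .put("gen_ai.token.type", "input")
    .build()

fun recordInputTokens(tokensCounter: LongCounter, inputTokensCount: Long) {
    tokensCounter.add(inputTokensCount, inputTokenAttributes)
}
```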
> eventContext.responses.lastOrNull()?.let { message ->
>     message.metaInfo.inputTokensCount?.let { inputTokensCount ->
>         tokensCounter.add(
What do you think about extracting this logic into a separate extension method (similar to the builder methods that start and end spans) that receives the necessary parameters from eventContext and creates the attributes? It would make the feature code a bit cleaner. For example:

```kotlin
fun LongCounter.addInputTokensCounter(
    config: OpenTelemetryConfig,
    inputTokensCount: Long,
    provider: LLMProvider,
    model: LLModel
) {
    this.add(
        inputTokensCount,
        listOf<Attribute>(
            GenAIAttributes.Operation.Name(GenAIAttributes.Operation.OperationNameType.TEXT_COMPLETION),
            GenAIAttributes.Provider.Name(provider),
            GenAIAttributes.Token.Type(GenAIAttributes.Token.TokenType.INPUT),
            GenAIAttributes.Response.Model(model)
        ).toSdkAttributes(config.isVerbose)
    )
}
```
I guess that after implementing MetricCollector such extraction could be excessive, as the new MetricCollector class calls each add or record function only once (except operationDurationHistogram, which is called twice, but with a different set of attributes). Wdyt?
> @@ -0,0 +1,24 @@
> package ai.koog.agents.features.opentelemetry.metric
Would you expect other methods here? I would rename the file to tokenCounter.kt so it doesn't cause confusion when new methods are added, since this file aggregates different token counter top-level methods.
Makes sense, thanks! I have renamed it!
> @@ -0,0 +1,25 @@
> package ai.koog.agents.features.opentelemetry.metric
Same comment about the file name.
> @@ -0,0 +1,28 @@
> package ai.koog.agents.features.opentelemetry.metric
Same comment about the file name.
| get() = "Tool calls count" | ||
|
|
||
| override val unit: String | ||
| get() = "tool call" |
unit is "tool call"?
Yep, because this metric is a tool calls count. So, similarly to the token count, the unit of this metric is a tool call.
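For context, a sketch of how the counter and its unit might be declared with the OTel Meter API; the metric name and strings are taken from the diff above, while the exact Koog declaration may differ:

```kotlin
import io.opentelemetry.api.metrics.Meter

// Declares the custom tool-call counter; the unit string is the one
// under discussion in this thread.
fun declareToolCallCounter(meter: Meter) =
    meter.counterBuilder("koog.tool.count")
        .setDescription("Tool calls count")
        .setUnit("tool call")
        .build()
```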
> failedToolCall?.let { toolCall ->
>     toolCall.getDurationSec()?.let { sec ->
>         toolCallDurationHistogram.record(
Maybe it's worth combining those two steps, end + record. Let me propose an idea. Here, you want to create a new record with a specific metric. Maybe, instead of having a storage of events, you could have a metrics collector class (similar to the spans collector that we already have). This metrics collector would also be responsible for storing the needed metrics and using this data to create proper records. You would call the wrapper method record (or end), and it would create a record for the metric under the hood. So you would not need to perform it in two separate steps (end + record); you would do it as a single step in this collector instead. Wdyt?
Thank you, I implemented this approach as you described:
- Introduced a `MetricCollector` class, which is responsible for storing the necessary metrics and recording them based on the `MetricEvent`s that appear in the OpenTelemetry class, like this:

```kotlin
metricCollector.recordEvent(
    ToolCallEnded(
        id = eventContext.eventId,
        timestamp = System.currentTimeMillis(),
        toolName = eventContext.toolName,
        status = ToolCallStatus.FAILED
    )
)
```

- Under the hood, `MetricCollector` works with a `MetricEventStorage` that is responsible for storing matching events via the `startEvent` and `endEvent` methods. This approach allows storing paired events and computing some metrics based on the difference between them.
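A rough sketch of the pairing idea described above; the type and method names follow the reply, but the actual implementation in the PR may differ:

```kotlin
// Minimal pairing storage: start events are kept until a matching end event
// arrives, at which point a derived value (e.g. a duration) can be computed.
interface MetricEvent {
    val id: String
    val timestamp: Long
}

class MetricEventStorage {
    private val startedEvents = mutableMapOf<String, MetricEvent>()

    fun startEvent(event: MetricEvent) {
        startedEvents[event.id] = event
    }

    // Returns the matching start event, if any; the caller can then record
    // e.g. (end.timestamp - start.timestamp) as a duration metric.
    fun endEvent(event: MetricEvent): MetricEvent? = startedEvents.remove(event.id)
}
```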
> metricExporters.forEach { exporter ->
>     val reader = PeriodicMetricReader
>         .builder(exporter)
>         .setInterval(Duration.ofSeconds(1))
Does it make sense to move this interval duration into the configuration as well?
It does! Thanks, I added it.
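A sketch of the configurable reader, building on the diff above; the interval parameter name is an assumption, not the actual Koog config property:

```kotlin
import io.opentelemetry.sdk.metrics.export.MetricExporter
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader
import java.time.Duration

// metricsExportInterval is a hypothetical config property; it replaces the
// previously hard-coded Duration.ofSeconds(1).
fun buildMetricReader(
    exporter: MetricExporter,
    metricsExportInterval: Duration
): PeriodicMetricReader =
    PeriodicMetricReader.builder(exporter)
        .setInterval(metricsExportInterval)
        .build()
```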
> )
>
> // Store llm call
> eventCallStorage.addEventCall(eventContext.eventId, eventContext.model.provider.display)
I've also noticed that you use this EventCallStorage to get the start and end time of a metric through the getDuration() API. With the current implementation, the start and end time should match the timing of the corresponding span, so maybe you can reuse the duration from the span data instead. That would let you get rid of this storage class and the need to manage it entirely. Like this:

```kotlin
(inferenceSpan.span as? ReadableSpan)?.toSpanData()?.let { spanData ->
    val duration =
        (spanData.endEpochNanos - spanData.startEpochNanos).toDuration(DurationUnit.NANOSECONDS)
    toolCallDurationHistogram.record(
        duration.inWholeSeconds.toDouble(),
        // attributes
    )
}
```
Yep, I agree we could get rid of the storage class in this case.
However, after implementing this approach I figured out that the OpenTelemetry class, which has almost 900 lines, is no longer much cluttered with metrics handling, thanks to the convenient recordEvent method.
Additionally, MetricCollector separates the logic of AI agent pipeline handling from metric recording. Moreover, it's not coupled with the SpanCollector class and works separately (metrics for metrics, traces for traces).
One more potential benefit is the possibility to monitor state between two events. In the future, we could save some value at the starting event (for example, total spent tokens) and the value at the ending event, and record the difference between these states as a metric (tokens spent for this LLM call) or something like that.
If we reuse the logic from the span collector:
- we would need extra logic in the OpenTelemetry class to get the duration from the span, or we would have to add the span as an argument to the MetricCollector, which leads to coupled, dependent classes. Ideally, metrics and spans should not be dependent, as they provide different ways to observe system/agent behaviour
- we would need to follow a strict "span" -> "metric" matching in the future, otherwise we could not record the time property of an operation
…tegrate verbosity support
… status handling, and change attributes creation
…ests with MetricCollector validations
…elpers, and streamline attribute checks
Motivation and Context
This PR introduces metrics support based on the OpenTelemetry specification. These metrics allow listeners to observe the behaviour of an agent at run time. All changes are done within the OpenTelemetry feature.
Link to issue
Added configuration:
- `addMeterExporter(MetricExporter)` — a function within the OpenTelemetry feature that accepts a MetricExporter and sends library metrics via that exporter

Example of OpenTelemetry feature configuration:
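A minimal sketch, assuming the `install(OpenTelemetry)` DSL shown in the diff above and the new `addMeterExporter` function; `LoggingMetricExporter` is just an illustrative exporter choice:

```kotlin
import io.opentelemetry.exporter.logging.LoggingMetricExporter

install(OpenTelemetry) {
    // Send Koog's metrics through any OpenTelemetry MetricExporter.
    addMeterExporter(LoggingMetricExporter.create())
}
```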
Added metrics:
Visualisation (via the OpenTelemetry plugin): [two screenshots]
Breaking Changes
Type of the changes
Checklist
- `develop` as the base branch

Additional steps for pull requests adding a new feature