This repository has been archived by the owner on Sep 10, 2024. It is now read-only.

Generate span metrics at the collector level #33

Open
@phxnsharp

Description

I am using Jaeger from ephemeral but long-running (minutes to days) processes to trace execution of engineering workflows. The tracing part is working great.

I would like to additionally track metrics from these processes. Since Prometheus is notoriously bad at handling ephemeral processes, and since Jaeger already provides a high-performance, reliable, and scalable data path for the trace data, I would like to collect the metrics on the server side, much along the lines of https://medium.com/jaegertracing/data-analytics-with-jaeger-aka-traces-tell-us-more-973669e6f848 . However, I would prefer not to add the additional requirement of running and maintaining Kafka.

I have created a prototype gRPC storage plugin which accepts trace data but does not handle read operations. Since Jaeger allows multiple storage plugins but only reads from the first, it can be installed behind the Cassandra or Elasticsearch plugins.
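
(For example, the collector can be pointed at both backends with a comma-separated storage list such as `SPAN_STORAGE_TYPE=cassandra,grpc-plugin`; Jaeger writes to every listed backend but reads only from the first. Exact storage type names and plugin flags depend on the Jaeger version.)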

This plugin uses the Go Prometheus client to expose metrics about the spans that it sees. Currently it is hardcoded to collect the metrics that I particularly need and is not generic.

The metrics I am currently collecting do not require correlating multiple spans. The main ones we are looking to get are average duration, run count, and failure count for particular span types. Our durations are long, so latency effects between spans within a trace aren't that interesting. I am converting some, but not all, of the span tags into labels so that I can issue the required queries from Prometheus.
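
For illustration, here is a minimal sketch of that span-to-metrics path, assuming the Go Prometheus client and Jaeger's Go span model. The metric names, the `site` tag, the label set, and the listen port are made up for the example, and the `WriteSpan` signature is the one used by recent Jaeger releases; wiring the writer into the gRPC storage-plugin framework is omitted, since that interface has changed across Jaeger versions.

```go
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/jaegertracing/jaeger/model"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric names and label set; the real plugin hardcodes whatever
// the workflow needs. The histogram's _count doubles as the run count.
var (
	spanDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "workflow_span_duration_seconds",
		Help:    "Duration of selected spans.",
		Buckets: prometheus.ExponentialBuckets(1, 4, 10), // seconds; spans run minutes to days
	}, []string{"service", "operation", "site"})

	spanFailures = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "workflow_span_failures_total",
		Help: "Spans carrying an error=true tag.",
	}, []string{"service", "operation", "site"})
)

func init() {
	prometheus.MustRegister(spanDuration, spanFailures)
}

// metricsWriter plays the role of Jaeger's spanstore.Writer: it records
// metrics about each span and then drops it (reads stay with the primary backend).
type metricsWriter struct{}

func (w *metricsWriter) WriteSpan(ctx context.Context, span *model.Span) error {
	// Copy only selected tags into labels; "site" is a made-up example tag.
	site := "unknown"
	failed := false
	for _, kv := range span.Tags {
		switch kv.Key {
		case "site":
			site = kv.VStr
		case "error":
			failed = failed || kv.VBool
		}
	}

	labels := prometheus.Labels{
		"service":   span.Process.ServiceName,
		"operation": span.OperationName,
		"site":      site,
	}
	spanDuration.With(labels).Observe(span.Duration.Seconds())
	if failed {
		spanFailures.With(labels).Inc()
	}
	return nil // nothing is persisted
}

func main() {
	// Expose /metrics for Prometheus to scrape. Handing metricsWriter to the
	// gRPC storage-plugin framework is omitted here, since that interface has
	// changed across Jaeger versions.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```

At query time, average duration falls out of the histogram as `_sum / _count`, and `_count` itself serves as the run count.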

One difficulty with this solution is that Prometheus expects each scrape target to be authoritative. For scalability, the Jaeger collector can be replicated, but Prometheus assumes that a single scrape target contains all the values for a given time series/label combination. It has no ability to sum or aggregate matching values from different scrape targets, even with the honor_labels option (if you try, the series ends up flip-flopping between the values each target reports). Without honor_labels, you can easily write labels for the actual source instance/IP and write queries that sum the results however you want, but there is a significant implication for Prometheus' time-series storage: if I have n computers reporting traces and m replicas of the jaeger-collector, I end up with n*m time series in Prometheus' storage.
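
As a concrete illustration of that cost (using the hypothetical metric names above): 50 reporting computers and 3 collector replicas leave 150 distinct series per operation/site combination, and dashboards need a query along the lines of `sum without (instance) (rate(workflow_span_failures_total[5m]))` to fold the collector replicas back together.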
