
Commit 353133d

jaeger v2 otel exporter alerts (#552)
* feat(jaeger): add v2 OTEL-based alerts and keep v1 as legacy

  Jaeger v2 is built on OpenTelemetry Collector and no longer exposes
  jaeger_agent_* / jaeger_collector_* / jaeger_client_* metrics.

  - Add "Embedded exporter (v2+)" with 8 rules targeting:
    - jaeger_storage_requests_total (error rate, unavailability, no reads)
    - jaeger_storage_latency_seconds_bucket (p99 latency)
    - http_server_request_duration_seconds_* via otelhttp (search errors,
      search latency, single-trace retrieval latency, service discovery errors)
  - Rename existing exporter to "Embedded exporter (legacy, <v2)" with
    slug embedded-exporter-legacy and a v1 EOL notice (Dec 31 2025)

* chore: adding node version to github action
1 parent eccf556 commit 353133d

2 files changed

Lines changed: 76 additions & 2 deletions

.github/workflows/site.yml

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ jobs:
         with:
           cache: npm
           cache-dependency-path: site/package-lock.json
+          node-version: 'latest'
 
       - name: Install dependencies
         working-directory: site

_data/rules.yml

Lines changed: 75 additions & 2 deletions
@@ -5733,9 +5733,82 @@ groups:
 
       - name: Jaeger
         exporters:
-          - name: Embedded exporter
+          - name: Embedded exporter (v2+)
             slug: embedded-exporter
-            doc_url: https://www.jaegertracing.io/docs/latest/monitoring/
+            doc_url: https://www.jaegertracing.io/docs/2.dev/operations/monitoring/
+            comments: |
+              Jaeger v2 is built on OpenTelemetry Collector and exposes metrics on port 8888 (/metrics).
+              It emits standard otelcol_* pipeline metrics alongside Jaeger-specific storage and query metrics.
+              For span ingestion pipeline alerts (refused spans, export failures, queue saturation),
+              use the OpenTelemetry Collector rules instead.
+            rules:
+              - name: Jaeger high storage error rate
+                description: "Jaeger on {{ $labels.instance }} is experiencing {{ $value | humanize }}% storage errors on {{ $labels.operation }}."
+                query: '100 * sum(rate(jaeger_storage_requests_total{result="err"}[1m])) by (instance, job, namespace, operation) / sum(rate(jaeger_storage_requests_total[1m])) by (instance, job, namespace, operation) > 1 and sum(rate(jaeger_storage_requests_total[1m])) by (instance, job, namespace, operation) > 0'
+                severity: warning
+                for: 5m
+              - name: Jaeger slow storage operations
+                description: "Jaeger on {{ $labels.instance }} storage p99 latency for {{ $labels.operation }} is {{ $value | humanizeDuration }}."
+                query: 'histogram_quantile(0.99, sum(rate(jaeger_storage_latency_seconds_bucket[5m])) by (le, instance, job, namespace, operation)) > 1'
+                severity: warning
+                for: 5m
+                comments: |
+                  Threshold of 1s is a rough default. Adjust based on your storage backend and data volume.
+              - name: Jaeger query service high error rate
+                description: "Jaeger query service on {{ $labels.instance }} is returning {{ $value | humanize }}% HTTP 5xx errors."
+                query: '100 * sum(rate(http_server_request_duration_seconds_count{http_route="/api/traces",http_response_status_code=~"5.."}[1m])) by (instance, job, namespace) / sum(rate(http_server_request_duration_seconds_count{http_route="/api/traces"}[1m])) by (instance, job, namespace) > 1 and sum(rate(http_server_request_duration_seconds_count{http_route="/api/traces"}[1m])) by (instance, job, namespace) > 0'
+                severity: warning
+                for: 5m
+                comments: |
+                  Filters on http_route="/api/traces" (the trace search endpoint). The http_server_request_duration_seconds
+                  metric is emitted by the otelhttp middleware used by the Jaeger query service.
+              - name: Jaeger query service slow responses
+                description: "Jaeger query service on {{ $labels.instance }} p99 response latency is {{ $value | humanizeDuration }}."
+                query: 'histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{http_route="/api/traces"}[5m])) by (le, instance, job, namespace)) > 2'
+                severity: warning
+                for: 5m
+                comments: |
+                  Threshold of 2s is a rough default. Adjust based on your storage backend and data volume.
+              - name: Jaeger storage completely unavailable
+                description: "Jaeger on {{ $labels.instance }} has 100% storage errors for {{ $labels.operation }} — storage backend may be down."
+                query: 'sum(rate(jaeger_storage_requests_total{result="err"}[1m])) by (instance, job, namespace, operation) > 0 and sum(rate(jaeger_storage_requests_total{result="ok"}[1m])) by (instance, job, namespace, operation) == 0'
+                severity: critical
+                for: 2m
+                comments: |
+                  Fires when all storage operations for a given type are failing and none are succeeding.
+                  Indicates the storage backend (Cassandra, Elasticsearch, etc.) is likely unreachable or misconfigured.
+              - name: Jaeger slow single trace retrieval
+                description: "Jaeger on {{ $labels.instance }} p99 latency for single trace retrieval is {{ $value | humanizeDuration }}."
+                query: 'histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{http_route="/api/traces/{traceID}"}[5m])) by (le, instance, job, namespace)) > 5'
+                severity: warning
+                for: 5m
+                comments: |
+                  Single trace retrieval (/api/traces/{traceID}) can be slower than search, especially for large traces.
+                  Threshold of 5s is a rough default.
+              - name: Jaeger service discovery errors
+                description: "Jaeger on {{ $labels.instance }} is returning {{ $value | humanize }}% HTTP 5xx errors on the services endpoint."
+                query: '100 * sum(rate(http_server_request_duration_seconds_count{http_route="/api/services",http_response_status_code=~"5.."}[1m])) by (instance, job, namespace) / sum(rate(http_server_request_duration_seconds_count{http_route="/api/services"}[1m])) by (instance, job, namespace) > 1 and sum(rate(http_server_request_duration_seconds_count{http_route="/api/services"}[1m])) by (instance, job, namespace) > 0'
+                severity: warning
+                for: 5m
+                comments: |
+                  Errors on /api/services indicate the storage backend cannot return the list of instrumented services,
+                  which breaks the Jaeger UI service selector.
+              - name: Jaeger no storage reads succeeding
+                description: "Jaeger on {{ $labels.instance }} has no successful storage reads for {{ $labels.operation }} in the past 15 minutes."
+                query: 'sum(increase(jaeger_storage_requests_total{result="ok"}[15m])) by (instance, job, namespace, operation) == 0 and sum(increase(jaeger_storage_requests_total[15m])) by (instance, job, namespace, operation) > 0'
+                severity: warning
+                for: 5m
+                comments: |
+                  Fires when an operation (e.g. find_traces, get_services) has received requests but none succeeded.
+                  May indicate a persistent storage error or a backend that is slow to recover.
+          - name: Embedded exporter (legacy, <v2)
+            slug: embedded-exporter-legacy
+            doc_url: https://www.jaegertracing.io/docs/1.x/monitoring/
+            comments: |
+              These rules target Jaeger v1.x metrics (jaeger_* prefix).
+              Jaeger v1 reached end-of-life on December 31, 2025.
+              For Jaeger v2+, use the "Embedded exporter (v2+)" rules instead.
+              Note: jaeger-agent was deprecated in v1.35 and removed in v2.0.
             rules:
               - name: Jaeger agent HTTP server errors
                 description: "Jaeger agent on {{ $labels.instance }} is experiencing {{ $value | humanize }}% HTTP server errors."
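
The v2 exporter comment in the diff says Jaeger v2 exposes its metrics on port 8888 at /metrics. A minimal Prometheus scrape job for that endpoint might look like the sketch below. The job name and target host are placeholders and do not come from the commit; only the port and path are taken from the diff.

```yaml
# prometheus.yml (fragment): a sketch, not part of this commit.
scrape_configs:
  - job_name: jaeger-v2                       # assumed job name
    metrics_path: /metrics                    # path named in the exporter comment
    static_configs:
      - targets:
          - jaeger.example.internal:8888      # placeholder host; port 8888 per the diff
```

The new rules aggregate by `instance`, `job`, `namespace`, and `operation`. With a static target like this one, `namespace` would simply be empty unless it is attached through relabeling or service discovery.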

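
For readers who want to try one of the new rules outside the site, the "Jaeger slow storage operations" entry could be written as a standard Prometheus alerting rule roughly as follows. The group name and alert name are assumptions made for illustration; the expression, duration, severity, and description are copied from the diff.

```yaml
# jaeger-v2.rules.yml: a sketch; group and alert names are hypothetical.
groups:
  - name: jaeger-v2
    rules:
      - alert: JaegerSlowStorageOperations
        # p99 storage latency per operation, same query as in _data/rules.yml
        expr: 'histogram_quantile(0.99, sum(rate(jaeger_storage_latency_seconds_bucket[5m])) by (le, instance, job, namespace, operation)) > 1'
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "Jaeger on {{ $labels.instance }} storage p99 latency for {{ $labels.operation }} is {{ $value | humanizeDuration }}."
```

A file like this can be checked with `promtool check rules` before loading it. As the rule's own comment notes, the 1s threshold is a rough default and should be adjusted to the storage backend.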