-
Notifications
You must be signed in to change notification settings - Fork 3.9k
feat: implement otlp prom exporter #24158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
// Explicitly record the mid-point of the bucket as approximation: | ||
var value float64 | ||
if i == 0 { | ||
value = boundaries[0] / 2.0 |
Check notice
Code scanning / CodeQL
Floating point arithmetic Note
if i == 0 { | ||
value = boundaries[0] / 2.0 | ||
} else { | ||
value = (boundaries[i-1] + boundaries[i]) / 2.0 |
Check notice
Code scanning / CodeQL
Floating point arithmetic Note
if i == 0 { | ||
value = boundaries[0] / 2.0 | ||
} else { | ||
value = (boundaries[i-1] + boundaries[i]) / 2.0 |
Check notice
Code scanning / CodeQL
Floating point arithmetic Note
087ebdc
to
c373791
Compare
telemetry/otlp_exporter.go
Outdated
|
||
meterProvider := metric.NewMeterProvider( | ||
metric.WithReader(metric.NewPeriodicReader(exporter, | ||
metric.WithInterval(15*time.Second))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the interval be configurable here?
telemetry/otlp_exporter.go
Outdated
metric.WithResource(res), | ||
) | ||
otel.SetMeterProvider(meterProvider) | ||
meter := otel.Meter("cosmos-sdk-otlp-exporter") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: move to const
telemetry/otlp_exporter.go
Outdated
go func() { | ||
for { | ||
if err := scrapeAndPushMetrics(ctx, cfg.PrometheusEndpoint, meter, gauges, histograms); err != nil { | ||
log.Printf("error scraping metrics: %v", err) | ||
} | ||
time.Sleep(15 * time.Second) | ||
} | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the PeriodicReader
not do this when it calls run
https://github.com/open-telemetry/opentelemetry-go/blob/9e81492af155454dd6072d6134aa31b91193255b/sdk/metric/periodic_reader.go#L126
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is the periodicReader does the pushing to the collector, but this function does the reading prometheus metrics and converting them into otlp metrics.
I renamed the function to scrapePrometheusMetrics
for clarity
telemetry/otlp_exporter.go
Outdated
} | ||
|
||
func scrapeAndPushMetrics(ctx context.Context, promEndpoint string, meter otmetric.Meter, gauges map[string]otmetric.Float64Gauge, histograms map[string]otmetric.Float64Histogram) error { | ||
resp, err := http.Get(promEndpoint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there not a better way to hook up otlp and prometheus than making a get request to the local endpoint?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this to use the prometheus DefaultGatherer, but that will only gather the metrics registered with the DefaultRegisterer. But it seems like all comet metrics are there and the cosmos SDK does not actually expose any way to use another registerer, so it should be good.
telemetry/metrics.go
Outdated
// Otlp Exporter fields | ||
OtlpExporterEnabled bool `mapstructure:"otlp-exporter-enabled"` | ||
OtlpCollectorGrpcAddr string `mapstructure:"otlp-collector-grpc-addr"` | ||
PrometheusEndpoint string `mapstructure:"prometheus-endpoint"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like this will need to be automatically populated based on what the configured prometheus metrics endpoint is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, each chain should probably set their own defaults.
telemetry/otlp_exporter.go
Outdated
go func() { | ||
for { | ||
if err := scrapePrometheusMetrics(ctx, cfg.PrometheusEndpoint, meter, gauges, histograms); err != nil { | ||
log.Printf("error scraping metrics: %v", err) | ||
} | ||
time.Sleep(cfg.OtlpPushInterval) | ||
} | ||
}() |
Check notice
Code scanning / CodeQL
Spawning a Go routine Note
gauges := make(map[string]otmetric.Float64Gauge) | ||
histograms := make(map[string]otmetric.Float64Histogram) | ||
|
||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does this behave when shutting a node down? Do we want any kind of graceful shutdown here?
9162415
to
c05a3a5
Compare
Is there a way we can test / verify this works before merging? |
otlpmetrichttp.WithHeaders(map[string]string{ | ||
"Authorization": "Basic " + formatBasicAuth(cfg.OtlpUser, cfg.OtlpToken), | ||
}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels brittle in general. Couldn't some collectors expect auth in Http headers, some in grpc request directly, some without authz entirely?
res, _ := resource.New(ctx, resource.WithAttributes( | ||
semconv.ServiceName(cfg.OtlpServiceName), | ||
)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to check the ignored error here? if not, can we comment why we are able to ignore it?
for { | ||
if err := scrapePrometheusMetrics(ctx, meter, gauges, histograms); err != nil { | ||
log.Printf("error scraping metrics: %v", err) | ||
} | ||
time.Sleep(cfg.OtlpPushInterval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small nit
for { | |
if err := scrapePrometheusMetrics(ctx, meter, gauges, histograms); err != nil { | |
log.Printf("error scraping metrics: %v", err) | |
} | |
time.Sleep(cfg.OtlpPushInterval) | |
for ; ; time.Sleep(cfg.OtlpPushInterval){ | |
if err := scrapePrometheusMetrics(ctx, meter, gauges, histograms); err != nil { | |
log.Printf("error scraping metrics: %v", err) | |
} |
if math.IsInf(bucket.GetUpperBound(), +1) { | ||
continue // Skip +Inf bucket boundary explicitly | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we skip this? i know it says +Inf boundary, but as someone who is unfamiliar with OTL, i am not sure what the significance is
for j := uint64(0); j < countInBucket; j++ { | ||
hist.Record(ctx, value) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for j := uint64(0); j < countInBucket; j++ { | |
hist.Record(ctx, value) | |
} | |
for range countInBucket { | |
hist.Record(ctx, value) | |
} |
|
||
const meterName = "cosmos-sdk-otlp-exporter" | ||
|
||
func StartOtlpExporter(cfg Config) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets have a comment on what this does and how it works
@wllmshao looks like we got some conflicts - i would accept everything from main |
📝 WalkthroughWalkthroughThe changes introduce OTLP (OpenTelemetry Protocol) exporter support to the telemetry subsystem. The telemetry configuration struct is extended with OTLP-specific fields, and a new exporter implementation is added to periodically scrape Prometheus metrics and export them to an OTLP collector endpoint using HTTP and optional authentication. The server startup logic is updated to conditionally start the OTLP exporter based on configuration before initializing telemetry metrics. No public API signatures are changed, and all new functionality is encapsulated in the telemetry package. Changes
Sequence Diagram(s)sequenceDiagram
participant Server
participant TelemetryConfig
participant OTLPExporter
participant Prometheus
participant OTLPCollector
Server->>TelemetryConfig: Load configuration
alt OTLP Exporter enabled
Server->>OTLPExporter: StartOtlpExporter(cfg)
loop Every push interval
OTLPExporter->>Prometheus: Scrape metrics
OTLPExporter->>OTLPCollector: Export metrics via HTTP
end
end
Server->>TelemetryConfig: Initialize telemetry metrics
Tip ⚡💬 Agentic Chat (Pro Plan, General Availability)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
♻️ Duplicate comments (1)
telemetry/otlp_exporter.go (1)
53-60
: 🛠️ Refactor suggestionGoroutine runs forever — tie it to a context
Currently the loop cannot be cancelled, causing leaks in tests and on shutdown.
-go func() { - for { - if err := scrapePrometheusMetrics(ctx, meter, gauges, histograms); err != nil { - log.Printf("error scraping metrics: %v", err) - } - time.Sleep(cfg.OtlpPushInterval) - } -}() +go func() { + ticker := time.NewTicker(cfg.OtlpPushInterval) + defer ticker.Stop() + for { + select { + case <-ctx.Done(): + return + case <-ticker.C: + if err := scrapePrometheusMetrics(ctx, meter, gauges, histograms); err != nil { + log.Printf("error scraping metrics: %v", err) + } + } + } +}()
🧹 Nitpick comments (4)
telemetry/otlp_exporter.go (4)
27-33
: Skip auth header when credentials are emptySending an
Authorization: Basic
header with an empty payload can break collectors that expect either valid creds or no header at all.- otlpmetrichttp.WithHeaders(map[string]string{ - "Authorization": "Basic " + formatBasicAuth(cfg.OtlpUser, cfg.OtlpToken), - }), + otlpmetrichttp.WithHeaders(func() map[string]string { + if cfg.OtlpUser == "" && cfg.OtlpToken == "" { + return nil + } + return map[string]string{ + "Authorization": "Basic " + formatBasicAuth(cfg.OtlpUser, cfg.OtlpToken), + } + }()),
110-120
: Boundary construction: off‑by‑one risk & missing +Inf handling commentSkipping the
+Inf
bucket is correct, but if all buckets are+Inf
(rare but valid)boundaries
becomes empty and the histogram creation fails. Add a safeguard:if len(boundaries) == 0 { return // nothing to record }Also add a comment explaining why
+Inf
is omitted to help future maintainers.
136-153
: Potential CPU blow‑up when buckets contain large countsRecording each observation individually (
for j < countInBucket
) can be O(N) where N is the total number of observations since process start (hundreds of thousands). This may freeze the exporter.Instead, use the
metric.Int64Histogram
Record
once with an attribute representing the count, or keep a delta and record a single value per bucket:- for j := uint64(0); j < countInBucket; j++ { - hist.Record(ctx, value) - } + if countInBucket > 0 { + hist.Record(ctx, value, otmetric.WithAttributeSet( + attribute.Int("count", int(countInBucket)), + )) + }(Note: attribute aggregation semantics depend on backend support; alternatively track deltas and only record once per bucket.)
168-171
: Minor: clarify token terminology
formatBasicAuth
concatenatesusername:token
. If the intent is “password”, rename the param for clarity, or update the comment/config docs.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (10)
go.mod
is excluded by!**/*.mod
go.sum
is excluded by!**/*.sum
,!**/*.sum
simapp/go.mod
is excluded by!**/*.mod
simapp/go.sum
is excluded by!**/*.sum
,!**/*.sum
tests/go.mod
is excluded by!**/*.mod
tests/go.sum
is excluded by!**/*.sum
,!**/*.sum
tests/systemtests/go.mod
is excluded by!**/*.mod
tests/systemtests/go.sum
is excluded by!**/*.sum
,!**/*.sum
tools/benchmark/go.mod
is excluded by!**/*.mod
tools/benchmark/go.sum
is excluded by!**/*.sum
,!**/*.sum
📒 Files selected for processing (3)
server/start.go
(1 hunks)telemetry/metrics.go
(1 hunks)telemetry/otlp_exporter.go
(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
server/start.go (1)
telemetry/otlp_exporter.go (1)
StartOtlpExporter
(24-61)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Summary
|
||
// Otlp Exporter fields | ||
OtlpExporterEnabled bool `mapstructure:"otlp-exporter-enabled"` | ||
OtlpCollectorEndpoint string `mapstructure:"otlp-collector-endpoint"` | ||
OtlpCollectorMetricsURLPath string `mapstructure:"otlp-collector-metrics-url-path"` | ||
OtlpUser string `mapstructure:"otlp-user"` | ||
OtlpToken string `mapstructure:"otlp-token"` | ||
OtlpServiceName string `mapstructure:"otlp-service-name"` | ||
OtlpPushInterval time.Duration `mapstructure:"otlp-push-interval"` | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider documenting defaults & ensuring zero‑value safety for new OTLP fields
The new fields are great, but:
- A zero
OtlpPushInterval
will cause the exporter loop to busy‑spin (time.Sleep(0)
yields immediately). - If
OtlpCollectorEndpoint
orOtlpCollectorMetricsURLPath
are left empty,otlpmetrichttp.New
will fail at runtime.
Please:
• Set sensible defaults (e.g. time.Second * 15
).
• Add validation in New(...)
to early‑return a descriptive error when mandatory fields are missing.
if cfg.Telemetry.OtlpExporterEnabled { | ||
telemetry.StartOtlpExporter(cfg.Telemetry) | ||
} | ||
return telemetry.New(cfg.Telemetry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Exporter is started without lifecycle management or error propagation
StartOtlpExporter
(1) blocks fatal‑logging on failure and (2) launches a goroutine that never stops.
Starting it here means:
- No way to shut it down during node shutdown (
graceDuration
, tests, etc.). - Potential race: exporter scrapes metrics before
telemetry.New
registers the Prom sink.
Recommend returning a cancel/cleanup func and wiring it into the existing errgroup
, then starting after telemetry.New
:
- if cfg.Telemetry.OtlpExporterEnabled {
- telemetry.StartOtlpExporter(cfg.Telemetry)
- }
- return telemetry.New(cfg.Telemetry)
+ m, err := telemetry.New(cfg.Telemetry)
+ if err != nil {
+ return nil, err
+ }
+ if cfg.Telemetry.OtlpExporterEnabled {
+ cleanup, err := telemetry.StartOtlpExporter(ctx, cfg.Telemetry) // ctx from getCtx
+ if err != nil {
+ return nil, err
+ }
+ g.Go(func() error { <-ctx.Done(); cleanup(); return nil }) // tie to lifecycle
+ }
+ return m, nil
Committable suggestion skipped: line range outside the PR's diff.
res, _ := resource.New(ctx, resource.WithAttributes( | ||
semconv.ServiceName(cfg.OtlpServiceName), | ||
)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Handle and log the error returned by resource.New
The second return value is currently discarded. If the resource cannot be created, the exporter will run with incomplete metadata.
-res, _ := resource.New(ctx, resource.WithAttributes(
- semconv.ServiceName(cfg.OtlpServiceName),
-))
+res, rErr := resource.New(ctx, resource.WithAttributes(
+ semconv.ServiceName(cfg.OtlpServiceName),
+))
+if rErr != nil {
+ return fmt.Errorf("failed to initialise OTLP resource: %w", rErr)
+}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
res, _ := resource.New(ctx, resource.WithAttributes( | |
semconv.ServiceName(cfg.OtlpServiceName), | |
)) | |
res, rErr := resource.New(ctx, resource.WithAttributes( | |
semconv.ServiceName(cfg.OtlpServiceName), | |
)) | |
if rErr != nil { | |
return fmt.Errorf("failed to initialise OTLP resource: %w", rErr) | |
} |
if err != nil { | ||
log.Fatalf("OTLP exporter setup failed: %v", err) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid log.Fatalf
inside library code
log.Fatalf
terminates the entire node and makes graceful shutdown impossible. Surface the error instead:
- if err != nil {
- log.Fatalf("OTLP exporter setup failed: %v", err)
- }
+ if err != nil {
+ return fmt.Errorf("OTLP exporter setup failed: %w", err)
+ }
…and bubble it up to the caller as noted in the server/start.go
comment.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if err != nil { | |
log.Fatalf("OTLP exporter setup failed: %v", err) | |
} | |
if err != nil { | |
return fmt.Errorf("OTLP exporter setup failed: %w", err) | |
} |
Description
Closes: #XXXX
Add ability to push prometheus metrics to an OTLP collector.
Example new config fields in the [telemetry] section of app.go:
Author Checklist
All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.
I have...
!
in the type prefix if API or client breaking changeCHANGELOG.md
Reviewers Checklist
All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.
Please see Pull Request Reviewer section in the contributing guide for more information on how to review a pull request.
I have...
Summary by CodeRabbit