💥 Replace OTel Prometheus Exporter #942

Merged
Sushisource merged 19 commits into master from replace-otel-prom
Jul 3, 2025

Conversation

@Sushisource (Member) commented Jun 23, 2025

This PR makes some significant alterations to the metrics abstractions and replaces the OTel Prometheus exporter with a custom one, since theirs is no longer maintained. The custom exporter is essentially a copy-paste of Prom's own first-party library, with some modifications to allow registering metrics with overlapping labels. (Prom fully permits this at ingestion time, mind you; it is only prevented client side. They have legitimate reasons for that, but taking the option away would require fully breaking our API.)

Breaking

  • Metric handles that were Arc<dyn XXX> are now concrete types
  • with_attributes methods now return Results; there's just no good way around this. Dealing with it in lang bridges should be fairly easy, though, since the possibility of throwing something generally existed already.

New

  • All the metrics now have adds or records methods (as opposed to just add) which can be used (typically after with_attributes) to record without passing labels again. This is more efficient for the Prom backend. They needed a different name to avoid turbofish disambiguation nonsense.

I did some testing to ensure performance isn't materially different here, and in real-world situations it's not (e.g., the workflow_load integration test shows no difference in overall runtime). That said, the specific benchmark I wrote does show that if you record a bajillion Prom metrics as fast as you can from different threads, lock contention slows things down by as much as 50%. At the same time, OTel's implementation of a Prom scrape essentially recreates every single metric on demand from scratch and fills in the data, which is definitely slower than what happens now, so actual scraping should be quite a bit cheaper. TL;DR: I don't have any reason to believe this would have any real impact on anyone, unless they're using custom metrics with the Prom exporter and are absolutely hammering away.

If further perf improvements are desired in the future, we could add a bind_with_schema kind of API, which would bind a metric to a certain set of label keys and allow recording on that metric with different label values without going through a lock.

Closes #908
Closes #882

@Sushisource Sushisource force-pushed the replace-otel-prom branch 4 times, most recently from 1d4bafb to e089c11 Compare June 24, 2025 23:09
@Sushisource Sushisource marked this pull request as ready for review June 24, 2025 23:09
@Sushisource Sushisource requested a review from a team as a code owner June 24, 2025 23:09
@Sushisource Sushisource changed the title [DRAFT] Replace OTel Prometheus Exporter 💥 Replace OTel Prometheus Exporter Jun 24, 2025
```rust
    task_token: self.info.task_token.clone(),
    details,
})
if !self.info.is_local {
```
Contributor:
should we log if trying to heartbeat from LA? Seems like something we wanna signal if it's not something users should be doing. Maybe dbg_panic?

Member Author:
Yeah, we probably should - this is in the "sdk" rather than Core though, so not as important, but may as well add it yes.

@cretz (Contributor) left a comment:

LGTM. I think before merging we might want a draft PR in Python or .NET or something that updates the core submodule to this branch/commit and demonstrates the changes needed as a result of this. Maybe both, unsure, they do different things with metrics (Python uses metric buffer, but .NET uses the traits directly).

Comment thread client/Cargo.toml
```diff
 thiserror = { workspace = true }
 tokio = "1.1"
-tonic = { workspace = true, features = ["tls", "tls-roots"] }
+tonic = { workspace = true, features = ["tls-ring", "tls-native-roots"] }
```
Contributor:
Is this related to the changes for this PR? Any concerns changing this?

@Sushisource (Member Author), Jun 25, 2025:
It's part of the reason for all this: it lets us upgrade Tonic. The new feature flags should be the exact equivalent of what was there previously; they just got renamed.

Comment thread core-api/src/envconfig.rs

```diff
 #[error("Configuration loading error: {0}")]
-LoadError(anyhow::Error),
+LoadError(Box<dyn std::error::Error>),
```
Contributor:
This also seems unrelated?

Member Author:
It's just something I didn't catch in the review for envconfig: I don't want anyhow in the "public" core API.

```rust
if let Ok(c) = vector.get_metric_with(&labels.as_prom_labels()) {
    Ok(Box::new(CorePromCounter(c)))
} else {
    Err(self.label_mismatch_err(attributes).into())
```
Contributor:
Curious, is this error really possible to hit from a user/caller POV since you create the vector before using?

Member Author:
Not really, no, but if there was a bug in the SDK such that the labels were somehow constructed differently in different places, it could trigger it.

That did happen to me: at one point, in one place I was stripping out labels with empty values but not doing the equivalent thing in a different spot, and the mismatch meant a label with an empty value would trigger this error.

@Sushisource Sushisource merged commit 42cc51a into master Jul 3, 2025
30 of 31 checks passed
@Sushisource Sushisource deleted the replace-otel-prom branch July 3, 2025 21:47
@connyay (Contributor) commented Jul 4, 2025

Thank you @Sushisource!


Development

Successfully merging this pull request may close these issues.

  • [Feature Request] Update to tonic 0.13
  • [Bug] PrometheusExporterOptions.global_tags not working for all metrics

4 participants