💥 Replace OTel Prometheus Exporter #942

Merged
Sushisource merged 19 commits into master from replace-otel-prom
Jul 3, 2025

Conversation

@Sushisource (Member) commented Jun 23, 2025

This PR makes some significant alterations to the metrics abstractions and replaces the OTel Prometheus exporter with a custom one, since theirs is no longer maintained. The custom exporter is essentially a copy-paste of Prom's own first-party library, with some modifications to allow registering metrics with overlapping labels. (Prom fully permits this at ingestion time, mind you; it is only prevented client side. They have legitimate reasons for that, but taking the option away would require fully breaking our API.)

Breaking

  • Metric handles that were Arc<dyn XXX> are now concrete types
  • with_attributes methods now return Results; there's just no good way around this. Dealing with it in lang bridges should be fairly easy, though, since the possibility of throwing something generally existed already.

New

  • All the metrics now have adds or records methods (as opposed to just add) which can be used (typically after with_attributes) to record without passing labels again. This is more efficient for the Prom backend. They needed a different name to avoid turbofish disambiguation nonsense.

I did some testing to ensure performance isn't materially different here, and in real-world situations it's not (e.g., the workflow_load integration test shows no difference in overall runtime). That said, the specific benchmark I wrote does show that if you record a bajillion Prom metrics as fast as you can from different threads, lock contention slows things down by as much as 50%. At the same time, OTel's implementation of a Prom scrape essentially recreates every single metric on demand from scratch and fills in the data, which is definitely slower than what happens now, so actual scraping should be quite a bit cheaper. TL;DR: I don't have any reason to believe this would have any real impact on anyone, unless they're using custom metrics with the Prom exporter and are absolutely hammering away.

If further perf improvements are desired in the future, we could add a bind_with_schema kind of API, which would bind a metric to a certain set of label keys and allow recording on that metric with different label values without going through a lock.

Closes #908
Closes #882

@Sushisource Sushisource force-pushed the replace-otel-prom branch 4 times, most recently from 1d4bafb to e089c11 Compare June 24, 2025 23:09
@Sushisource Sushisource marked this pull request as ready for review June 24, 2025 23:09
@Sushisource Sushisource requested a review from a team as a code owner June 24, 2025 23:09
@Sushisource Sushisource changed the title [DRAFT] Replace OTel Prometheus Exporter 💥 Replace OTel Prometheus Exporter Jun 24, 2025
```rust
    task_token: self.info.task_token.clone(),
    details,
})
if !self.info.is_local {
```
Contributor:
should we log if trying to heartbeat from LA? Seems like something we wanna signal if it's not something users should be doing. Maybe dbg_panic?

Member Author:
Yeah, we probably should - this is in the "sdk" rather than Core though, so not as important, but may as well add it yes.

@cretz (Contributor) left a comment:

LGTM. I think before merging we might want a draft PR in Python or .NET or something that updates the core submodule to this branch/commit and demonstrates the changes needed as a result of this. Maybe both, unsure, they do different things with metrics (Python uses metric buffer, but .NET uses the traits directly).

Comment thread client/Cargo.toml
```diff
 thiserror = { workspace = true }
 tokio = "1.1"
-tonic = { workspace = true, features = ["tls", "tls-roots"] }
+tonic = { workspace = true, features = ["tls-ring", "tls-native-roots"] }
```
Contributor:
Is this related to the changes for this PR? Any concerns changing this?

@Sushisource (Member Author), Jun 25, 2025:
It's part of the reason for all this: it lets us upgrade Tonic. The new feature flags should be the exact equivalent of what was there previously; they just got renamed.

Comment thread core-api/src/envconfig.rs

```diff
 #[error("Configuration loading error: {0}")]
-LoadError(anyhow::Error),
+LoadError(Box<dyn std::error::Error>),
```
Contributor:
This also seems unrelated?

Member Author:
It's just something I didn't catch in the review for envconfig: I don't want anyhow in the "public" core API.

```rust
if let Ok(c) = vector.get_metric_with(&labels.as_prom_labels()) {
    Ok(Box::new(CorePromCounter(c)))
} else {
    Err(self.label_mismatch_err(attributes).into())
```
Contributor:
Curious, is this error really possible to hit from a user/caller POV since you create the vector before using?

Member Author:
Not really, no, but if there was a bug in the SDK such that the labels were somehow constructed differently in different places, it could trigger it.

That did happen to me: at one point, in one place I was stripping out labels with empty values but not doing the equivalent thing in a different spot, and the mismatch meant a label with an empty value would trigger this error.

@Sushisource Sushisource merged commit 42cc51a into master Jul 3, 2025
30 of 31 checks passed
@Sushisource Sushisource deleted the replace-otel-prom branch July 3, 2025 21:47
@connyay (Contributor) commented Jul 4, 2025

Thank you @Sushisource!


Development

Successfully merging this pull request may close these issues.

  • [Feature Request] Update to tonic 0.13
  • [Bug] PrometheusExporterOptions.global_tags not working for all metrics

4 participants