💥 Replace OTel Prometheus Exporter #942
Conversation
Force-pushed from 1d4bafb to e089c11
| task_token: self.info.task_token.clone(),
| details,
| })
| if !self.info.is_local {
should we log if trying to heartbeat from LA? Seems like something we wanna signal if it's not something users should be doing. Maybe dbg_panic?
Yeah, we probably should - this is in the "sdk" rather than Core though, so not as important, but may as well add it yes.
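For illustration, a minimal sketch of the kind of guard being discussed, using `debug_assert!` as a stand-in for a `dbg_panic!`-style signal; the struct and method names here are invented, not the actual sdk code:

```rust
// Hypothetical sketch only: the types and names below are illustrative,
// not sdk-core's real ones.
struct ActivityInfo {
    is_local: bool,
}

struct ActivityContext {
    info: ActivityInfo,
}

impl ActivityContext {
    fn record_heartbeat(&self, details: Vec<String>) {
        if !self.info.is_local {
            // ... build and send the heartbeat as in the diff above ...
            let _ = details;
        } else {
            // Heartbeating from a local activity is a no-op, so make the misuse
            // loud in debug builds while staying quiet in release builds.
            debug_assert!(false, "local activities cannot record heartbeats");
        }
    }
}
```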
cretz left a comment
LGTM. I think before merging we might want a draft PR in Python or .NET or something that updates the core submodule to this branch/commit and demonstrates the changes needed as a result of this. Maybe both, unsure, they do different things with metrics (Python uses metric buffer, but .NET uses the traits directly).
| thiserror = { workspace = true }
| tokio = "1.1"
| - tonic = { workspace = true, features = ["tls", "tls-roots"] }
| + tonic = { workspace = true, features = ["tls-ring", "tls-native-roots"] }
Is this related to the changes for this PR? Any concerns changing this?
It's part of the reason for all this, so that we could upgrade Tonic. The new feature flags should be the exact equivalent of what it was previously, they just got renamed
| #[error("Configuration loading error: {0}")] | ||
| LoadError(anyhow::Error), | ||
| LoadError(Box<dyn std::error::Error>), |
It's just something I didn't catch in the review for envconfig - I don't want anyhow in the "public" core API.
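For context, a rough sketch of the shape being discussed — keeping a boxed `std::error::Error` in the public variant rather than `anyhow::Error` (the enum name here is approximate; only the variant's payload type is the point):

```rust
use thiserror::Error;

// Approximation of the public config error type discussed above.
#[derive(Debug, Error)]
enum ConfigError {
    // A boxed trait object keeps `anyhow` out of the public core API while
    // still accepting arbitrary underlying error types.
    #[error("Configuration loading error: {0}")]
    LoadError(Box<dyn std::error::Error>),
}
```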
| if let Ok(c) = vector.get_metric_with(&labels.as_prom_labels()) {
|     Ok(Box::new(CorePromCounter(c)))
| } else {
|     Err(self.label_mismatch_err(attributes).into())
Curious, is this error really possible to hit from a user/caller POV since you create the vector before using?
Not really, no, but if there was a bug in the SDK such that the labels were somehow constructed differently in different places, it could trigger it.
That did happen to me: at one point I was stripping out labels with empty values in one place but not doing the equivalent thing in another spot, and the mismatch meant a label with an empty value would trigger this.
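To make that failure mode concrete, a small standalone example against the `prometheus` crate (not the sdk code) showing how a label stripped in one code path but not another makes `get_metric_with` fail:

```rust
use prometheus::{CounterVec, Opts};
use std::collections::HashMap;

fn main() {
    // Vector registered with two label names.
    let vec = CounterVec::new(
        Opts::new("requests", "number of requests"),
        &["service", "namespace"],
    )
    .unwrap();

    // Same label names as registration: lookup succeeds.
    let ok: HashMap<&str, &str> =
        [("service", "a"), ("namespace", "default")].into_iter().collect();
    assert!(vec.get_metric_with(&ok).is_ok());

    // If one code path strips a label (e.g. because its value was empty), the
    // label set no longer matches registration and the lookup errors — the
    // situation the label_mismatch_err branch above guards against.
    let mismatched: HashMap<&str, &str> = [("service", "a")].into_iter().collect();
    assert!(vec.get_metric_with(&mismatched).is_err());
}
```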
Force-pushed from 8dd750e to d4553e2
Thank you @Sushisource!
This PR makes some significant alterations to the metrics abstractions and replaces the OTel Prometheus exporter with a custom one, since theirs is no longer maintained. The custom one is essentially a copy-paste of Prom's own first-party lib, except with some modifications made to allow for the registration of metrics with overlapping labels (this is fully permitted by Prom for ingestion, mind you, just prevented client side. They have their reasons for that, which are legitimate, but taking that option away would require fully breaking our API).
Breaking

- `Arc<dyn XXX>` are now concrete types
- `with_attributes` methods now return Results, just no good way around this. Dealing with it in lang bridges should be fairly easy though, since the possibility of throwing something generally existed already.

New

- `adds` or `records` (as opposed to just `add`), which can be used (after `with_attributes`, typically) to record without passing labels again. This is more efficient for the prom backend. They needed a different name to avoid turbofish disambiguation nonsense.

I did some testing to ensure performance isn't materially different here, and it's not in real-world situations (ex: the `workflow_load` integration test has no difference in overall runtime). That said, the specific bench I made does show that if you record a bajillion prom metrics as fast as you can from different threads, lock contention slows things down by as much as 50%. At the same time, OTel's implementation of what actually happens when you scrape prom metrics is to basically create every single metric on demand from scratch and fill in the data, which is definitely slower than what happens now - so actual scraping should be quite a bit cheaper. TLDR: I don't have any reason to believe this would have any real impact for anyone unless they're using custom metrics with the prom exporter and are just absolutely hammering away.

If further perf improvements are desired in the future, we could add a `bind_with_schema` kind of API which would bind a metric to a certain set of labels and allow recording on that metric with different label values without going through a lock.

Closes #908
Closes #882
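To illustrate what the "`with_attributes` returns a Result, then record via `adds`/`records` without re-passing labels" shape might look like from a caller's side, here is a purely hypothetical sketch — the trait names, error type, and attribute representation are invented; only the call pattern comes from the description above:

```rust
use std::collections::HashMap;

type Attributes = HashMap<String, String>;

trait Counter {
    // Pre-existing path: labels passed on every call.
    fn add(&self, value: u64, attributes: &Attributes);
    // Binding attributes can now fail (e.g. on a label mismatch against the
    // underlying Prometheus vector), hence the Result.
    fn with_attributes(&self, attributes: &Attributes) -> Result<Box<dyn BoundCounter>, String>;
}

trait BoundCounter {
    // Named `adds` rather than `add` so bound and unbound recording methods
    // don't collide and force turbofish disambiguation at call sites.
    fn adds(&self, value: u64);
}

fn bump(counter: &dyn Counter, attrs: &Attributes) -> Result<(), String> {
    // Bind once, then record repeatedly without re-passing labels.
    let bound = counter.with_attributes(attrs)?;
    bound.adds(1);
    bound.adds(1);
    Ok(())
}
```

In a real lang bridge the error type would be whatever Core actually exposes; the sketch only shows the shape of the calls.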