-
Notifications
You must be signed in to change notification settings - Fork 822
fix: avoid Prometheus collisions via multi-registry scrape #5678
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: avoid Prometheus collisions via multi-registry scrape #5678
Conversation
Stop registering metrics into parent registries. Instead, track child registries in MetricsRegistry and merge them at scrape time, warning and dropping duplicate series while requiring consistent HELP/TYPE for each metric name. Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
WalkthroughThe changes introduce hierarchical metrics registry wiring across components and namespaces. Metrics creation is now localized to individual registries, with a new combined scraping mechanism that merges outputs across parent-child registry relationships for consolidated exposition. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@lib/runtime/src/component.rs`:
- Around line 483-498: The namespace() method registers the child's metrics
registry twice (via
child.drt().get_metrics_registry().register_child_registry(...) and again via
self.get_metrics_registry().register_child_registry(...)); remove the redundant
DRT registration or document intent. Fix by either deleting the call that
registers the child's registry through child.drt() (the
child.drt().get_metrics_registry().register_child_registry(...) invocation in
namespace()) or, if intentional, add a clear inline comment above that call
explaining why both registrations are required for future behavior, referencing
NamespaceBuilder, namespace(), drt(), get_metrics_registry(), and
register_child_registry() to locate the code.
🧹 Nitpick comments (2)
lib/runtime/src/metrics.rs (2)
668-681: Consider documenting thread-safety guarantees for register_child_registry.The deduplication by
Arc::as_ptris a good approach to avoid duplicate registrations via clones. However, this method acquires a write lock onchild_registries. Ifregister_child_registryis called from within an update callback that already holds a read lock on the same registry'schild_registries, this could deadlock.Consider adding a doc comment noting that this method should not be called from within update/expfmt callbacks to avoid potential lock inversion scenarios.
📝 Suggested documentation
/// Register a child registry to be included in combined /metrics output. /// /// Dedup is by underlying Prometheus registry pointer, so repeated registration via clones is safe. + /// + /// # Thread Safety + /// This method acquires a write lock on `child_registries`. Do not call from within + /// update callbacks or expfmt callbacks to avoid potential deadlocks. pub fn register_child_registry(&self, child: &MetricsRegistry) {
797-803: Consider pre-allocating buffer capacity for large deployments.For systems with many registries and metrics, the encoded output could be substantial. Pre-allocating the buffer based on an estimate could reduce reallocations.
♻️ Optional optimization
let encoder = prometheus::TextEncoder::new(); - let mut buffer = Vec::new(); + // Estimate ~200 bytes per metric family as a heuristic + let mut buffer = Vec::with_capacity(merged.len() * 200); encoder.encode(&merged, &mut buffer)?;
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
| drt_output_raw | ||
| ); | ||
|
|
||
| println!("✓ All Prometheus format outputs verified successfully!"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to add specific test cases to cover the name collision scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you're right! Just added 2 tests to illustrate the functions 1) adding same name via different static labels 2) adding same name AND same static labels to generate a warning message.
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Overview:
Work around Prometheus metric registration collisions by moving to a multi-registry scrape model. This allows the same metric name across endpoints as long as label sets differ.
Details:
MetricsRegistryto produce one combined exposition output./metricshandler to usedrt.metrics().prometheus_expfmt()(which now returns the combined output).Where should the reviewer start?
lib/runtime/src/metrics.rs(multi-registry merge logic + changed registration semantics)lib/runtime/src/component.rs(child registry wiring)lib/runtime/src/system_status_server.rs(scrape path)Related Issues: (use Closes / Fixes / Resolves / Relates to)
DIS-1339
/coderabbit profile chill
Summary by CodeRabbit
New Features
Bug Fixes
✏️ Tip: You can customize this high-level summary in your review settings.