Contributor

@keivenchang keivenchang commented Jan 27, 2026

Overview:

Work around Prometheus metric registration collisions by moving to a multi-registry scrape model. This allows the same metric name across endpoints as long as label sets differ.

Details:

  • Add child-registry tracking + recursive traversal + merge/dedupe logic in MetricsRegistry to produce one combined exposition output.
  • Update /metrics handler to use drt.metrics().prometheus_expfmt() (which now returns the combined output).
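The merge/dedupe step described above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in `lib/runtime/src/metrics.rs`: the `Family` and `Series` types and the `merge_families` function are hypothetical stand-ins for the Prometheus metric families gathered from each registry. It assumes that a duplicate series (same name and same label set) is dropped with a warning, and that a HELP/TYPE mismatch for the same metric name is treated as an error.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One time series: a label set plus its current value (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Series {
    pub labels: BTreeMap<String, String>,
    pub value: f64,
}

/// All series sharing one metric name, HELP text, and TYPE (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Family {
    pub name: String,
    pub help: String,
    pub kind: String, // e.g. "counter" or "gauge"
    pub series: Vec<Series>,
}

/// Merge the families gathered from several registries into one list.
///
/// Returns the merged families plus a list of warnings for dropped duplicates.
/// Fails if two registries disagree on HELP or TYPE for the same metric name.
pub fn merge_families(
    registries: &[Vec<Family>],
) -> Result<(Vec<Family>, Vec<String>), String> {
    let mut merged: BTreeMap<String, Family> = BTreeMap::new();
    // A series is identified by its metric name plus its full label set.
    let mut seen: BTreeSet<(String, BTreeMap<String, String>)> = BTreeSet::new();
    let mut warnings = Vec::new();

    for families in registries {
        for fam in families {
            let entry = merged.entry(fam.name.clone()).or_insert_with(|| Family {
                name: fam.name.clone(),
                help: fam.help.clone(),
                kind: fam.kind.clone(),
                series: Vec::new(),
            });
            if entry.help != fam.help || entry.kind != fam.kind {
                return Err(format!(
                    "inconsistent HELP/TYPE for metric `{}`",
                    fam.name
                ));
            }
            for s in &fam.series {
                if seen.insert((fam.name.clone(), s.labels.clone())) {
                    // First time this (name, labels) pair appears: keep it.
                    entry.series.push(s.clone());
                } else {
                    // Same name and same labels: warn and drop the repeat.
                    warnings.push(format!(
                        "dropping duplicate series `{}` with labels {:?}",
                        fam.name, s.labels
                    ));
                }
            }
        }
    }
    Ok((merged.into_values().collect(), warnings))
}
```

Under these assumptions, two endpoints that both export `requests_total` with different `endpoint` labels end up as two series in one family, while a second registry repeating an identical label set contributes only a warning.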

Where should the reviewer start?

  • lib/runtime/src/metrics.rs (multi-registry merge logic + changed registration semantics)
  • lib/runtime/src/component.rs (child registry wiring)
  • lib/runtime/src/system_status_server.rs (scrape path)

Related Issues: (use Closes / Fixes / Resolves / Relates to)

DIS-1339

/coderabbit profile chill

Summary by CodeRabbit

  • New Features

    • The /metrics endpoint now properly consolidates and exposes metrics from all components, namespaces, and endpoints, providing comprehensive system observability and visibility into overall health and performance.
  • Bug Fixes

    • Improved error handling for the metrics endpoint with clearer HTTP error responses when metric collection fails.


Stop registering metrics into parent registries. Instead, track child registries
in MetricsRegistry and merge them at scrape time, warning and dropping duplicate
series while requiring consistent HELP/TYPE for each metric name.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
@keivenchang keivenchang self-assigned this Jan 27, 2026
@keivenchang keivenchang requested a review from a team as a code owner January 27, 2026 06:34
@github-actions github-actions bot added the fix label Jan 27, 2026
Contributor

coderabbitai bot commented Jan 27, 2026

Walkthrough

The changes introduce hierarchical metrics registry wiring across components and namespaces. Metrics creation is now localized to individual registries, with a new combined scraping mechanism that merges outputs across parent-child registry relationships for consolidated exposition.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Component and Namespace Metrics Wiring: lib/runtime/src/component.rs | Added explicit metrics registry parent-child registration in endpoint(), Namespace::new(), Namespace::component(), and Namespace::namespace() to wire metrics hierarchies during construction. |
| Metrics Registry Hierarchy and Combined Scraping: lib/runtime/src/metrics.rs | Introduced register_child_registry() and prometheus_expfmt_combined() methods to support child registry tracking and multi-registry metric merging with deduplication and validation. Localized metric creation to individual registries instead of propagating to parents. Updated test assertions for output formatting. |
| Metrics Endpoint Error Handling: lib/runtime/src/system_status_server.rs | Added explicit error handling for Prometheus exposition format generation in the metrics handler, returning HTTP 500 on failures. Updated comments to reflect the multi-registry model. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hop through registries, parent and child,
Metrics now nested, organized and mild,
Combined scraping brings harmony true,
Each burrow of data shines bright and brand new! 📊✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: moving to a multi-registry scrape model to avoid Prometheus collisions. |
| Description check | ✅ Passed | The description includes Overview, Details with file-specific changes, Where to start guidance, and Related Issues sections, meeting the template requirements. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@lib/runtime/src/component.rs`:
- Around line 483-498: The namespace() method registers the child's metrics
registry twice (via
child.drt().get_metrics_registry().register_child_registry(...) and again via
self.get_metrics_registry().register_child_registry(...)); remove the redundant
DRT registration or document intent. Fix by either deleting the call that
registers the child's registry through child.drt() (the
child.drt().get_metrics_registry().register_child_registry(...) invocation in
namespace()) or, if intentional, add a clear inline comment above that call
explaining why both registrations are required for future behavior, referencing
NamespaceBuilder, namespace(), drt(), get_metrics_registry(), and
register_child_registry() to locate the code.
🧹 Nitpick comments (2)
lib/runtime/src/metrics.rs (2)

668-681: Consider documenting thread-safety guarantees for register_child_registry.

The deduplication by Arc::as_ptr is a good approach to avoid duplicate registrations via clones. However, this method acquires a write lock on child_registries. If register_child_registry is called from within an update callback that already holds a read lock on the same registry's child_registries, this could deadlock.

Consider adding a doc comment noting that this method should not be called from within update/expfmt callbacks to avoid potential lock inversion scenarios.

📝 Suggested documentation
     /// Register a child registry to be included in combined /metrics output.
     ///
     /// Dedup is by underlying Prometheus registry pointer, so repeated registration via clones is safe.
+    ///
+    /// # Thread Safety
+    /// This method acquires a write lock on `child_registries`. Do not call from within
+    /// update callbacks or expfmt callbacks to avoid potential deadlocks.
     pub fn register_child_registry(&self, child: &MetricsRegistry) {
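For readers unfamiliar with the pointer-based dedup mentioned in this comment, here is a minimal std-only sketch. It is not the PR's implementation: `InnerRegistry` is a hypothetical stand-in for the underlying Prometheus registry, and the field layout of `MetricsRegistry` and the `child_count` helper are assumptions made for illustration. The point is only that clones share one `Arc` allocation, so comparing `Arc` pointers under the write lock makes repeated registration harmless.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the underlying Prometheus registry.
#[derive(Debug, Default)]
pub struct InnerRegistry {
    pub name: String,
}

/// Simplified registry handle; clones share the same `inner` allocation.
#[derive(Debug, Clone, Default)]
pub struct MetricsRegistry {
    inner: Arc<InnerRegistry>,
    child_registries: Arc<RwLock<Vec<MetricsRegistry>>>,
}

impl MetricsRegistry {
    pub fn new() -> Self {
        Self::default()
    }

    /// Register a child to be included in combined output.
    ///
    /// Takes the write lock on `child_registries`, so it must not be called
    /// while the same thread already holds a read lock on that list.
    pub fn register_child_registry(&self, child: &MetricsRegistry) {
        let mut children = self
            .child_registries
            .write()
            .expect("child_registries lock poisoned");
        // Two handles refer to the same registry iff their Arcs point to the
        // same allocation, regardless of how many times they were cloned.
        let already_registered = children
            .iter()
            .any(|c| Arc::ptr_eq(&c.inner, &child.inner));
        if !already_registered {
            children.push(child.clone());
        }
    }

    /// Number of distinct children currently tracked (illustration helper).
    pub fn child_count(&self) -> usize {
        self.child_registries
            .read()
            .expect("child_registries lock poisoned")
            .len()
    }
}
```

With this shape, registering `child` and then `child.clone()` leaves one entry in the list, while a freshly constructed registry adds a second. The deadlock concern raised above applies here too: with `std::sync::RwLock`, requesting the write lock on a thread that still holds a read guard on the same lock can block forever or panic.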

797-803: Consider pre-allocating buffer capacity for large deployments.

For systems with many registries and metrics, the encoded output could be substantial. Pre-allocating the buffer based on an estimate could reduce reallocations.

♻️ Optional optimization
         let encoder = prometheus::TextEncoder::new();
-        let mut buffer = Vec::new();
+        // Estimate ~200 bytes per metric family as a heuristic
+        let mut buffer = Vec::with_capacity(merged.len() * 200);
         encoder.encode(&merged, &mut buffer)?;

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
drt_output_raw
);

println!("✓ All Prometheus format outputs verified successfully!");
Contributor

Is it possible to add specific test cases to cover the name collision scenario?

Contributor Author

Yes, you're right! I just added two tests to illustrate the behavior: 1) registering the same metric name with different static labels, and 2) registering the same name AND the same static labels, which generates a warning message.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>