Live check metrics #728

jerbly · 2025-05-05T00:28:23Z

Adds support for metrics to live-check. The nested nature of this data led to further internal refactoring and to a change in the Rego interface: input now has sample plus registry_attribute or registry_group, which allows the policy to work within the nested context. For example, a policy can give advice on an exemplar in a data point of a metric by referring to the metric's defined unit. This is documented in the README, and an example policy is included in the tests.

Includes support for Exemplars:

jsuereth · 2025-05-05T19:12:12Z

crates/weaver_live_check/src/lib.rs

+    /// A sample metric
+    Metric(&'a SampleMetric),
+    /// A sample number data point
+    NumberDataPoint(&'a NumberDataPoint),


Interesting - Is the convention going to be check at the highest level, but higher-level things don't include lower level details?

E.g. NumberDataPoint is part of a Metric and has metric identifying things, but SampleMetric wouldn't?

Also, can you rename to SampledNumberDataPoint?

This area is not quite right yet. A NumberDataPoint without the information from the Metric is not very useful, but I need a way to pass it around in a consistent way.

Yes, good call on the renaming. Will do.

I'm looking into the consistency of the framework. For each data_point, I want to be able to provide advice at that level (checking attributes required for example), but I need context from the parent group. I'm hoping to get this change done today. This will change rego so the input is a tuple of (sample,semconv).

Just pushed this with docs to explain and a rego example.

# This example shows how to use the registry_group provided in the input.
# If the metric's unit is "By" the value in this data-point must be an integer.
deny contains make_advice(advice_type, advice_level, value, message) if {
	input.sample.number_data_point
	value := input.sample.number_data_point.value
	input.registry_group.unit == "By"
	value != floor(value) # not a good type check, but serves as an example
	advice_type := "invalid_data_point_value"
	advice_level := "violation"
	message := "Value must be an integer when unit is 'By'"
}
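For readers more at home in Rust than Rego, the rule above can be sketched as a plain function. This is only an illustration of the logic: `RegistryGroup`, `NumberDataPoint`, and `check_byte_value` are hypothetical stand-ins, not the types used in weaver_live_check.

```rust
/// Hypothetical stand-in for the registry group passed to the policy.
struct RegistryGroup {
    unit: String,
}

/// Hypothetical stand-in for a sampled number data point.
struct NumberDataPoint {
    value: f64,
}

/// Returns an advice message when the metric's unit is "By" and the
/// data point value has a fractional part, mirroring the Rego rule.
/// Like the Rego version, this is not a real type check.
fn check_byte_value(point: &NumberDataPoint, group: &RegistryGroup) -> Option<&'static str> {
    if group.unit == "By" && point.value != point.value.floor() {
        Some("Value must be an integer when unit is 'By'")
    } else {
        None
    }
}
```

The point of the sketch is that the check needs both the sample (the value) and its parent definition (the unit), which is what the new input shape provides.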

crates/weaver_live_check/src/sample_metric.rs

jsuereth · 2025-05-06T11:45:38Z

defaults/policies/live_check_advice/otel.rego

@@ -40,6 +40,16 @@ deny contains make_advice(advice_type, advice_level, value, message) if {
 	message := "Does not match name formatting rules"
 }

+# checks metric name format
+deny contains make_advice(advice_type, advice_level, value, message) if {


Can you add a check to make sure attributes on the metric match attributes defined on the group?

This is already done internally in rust rather than in rego since I assumed this was a fundamental thing. See the attribute_required advice under each data_point in the screenshot above. This is at the data_point level since the attributes are provided with each point.

codecov · 2025-05-07T00:33:19Z

Codecov Report

Attention: Patch coverage is 91.02564% with 21 lines in your changes missing coverage. Please review.

Project coverage is 77.1%. Comparing base (9d8a5d7) to head (2607acc).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
crates/weaver_live_check/src/advice.rs	83.1%	13 Missing ⚠️
crates/weaver_live_check/src/lib.rs	80.9%	8 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##            main    #728     +/-   ##
=======================================
+ Coverage   76.7%   77.1%   +0.4%     
=======================================
  Files         65      66      +1     
  Lines       5036    5211    +175     
=======================================
+ Hits        3863    4019    +156     
- Misses      1173    1192     +19

lquerel

Thanks for this PR. Adding metrics support will significantly enhance the live-check feature. I’ve made several suggestions. There are certain types we can avoid duplicating (especially the leaf enums). For the other structures, I understand that you’re adding at least one custom field for the results. It’s frustrating to see all this duplicated code, but unfortunately, I don’t see a straightforward alternative solution.

It would also be good to remove instances of &Rc<T> or Rc<Rc<T>>, as they seem semantically redundant and not very idiomatic.

lquerel · 2025-05-21T21:45:55Z

crates/weaver_live_check/README.md

@@ -202,7 +211,9 @@ These should be self-explanatory, but:
 - `no_advice_count` is the number of samples that received no advice
 - `seen_registry_attributes` is a record of how many times each attribute in the registry was seen in the samples
 - `seen_non_registry_attributes` is a record of how many times each non-registry attribute was seen in the samples
- `registry_coverage` is the fraction of seen registry attributes over the total registry attributes
+- `seen_registry_metrics` is a record of how many times each metric in the registry was seen in the samples
+- `seen_non_registry_attributes` is a record of how many times each non-registry metric was seen in the samples


Shouldn't that be seen_non_registry_metrics? The description talks about metrics, not attributes.

lquerel · 2025-05-21T21:52:34Z

crates/weaver_live_check/src/advice.rs

+        registry_attribute: Option<&Rc<Attribute>>,
+        registry_group: Option<&Rc<ResolvedGroup>>,


Not a big deal, but having a function signature with Option<&Rc<Attribute>> feels a bit like having a double reference to something. It would seem more idiomatic to me to just have Option<Rc<Attribute>>.
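As a rough illustration of the alternatives, the sketch below shows a caller that holds an `Option<&Rc<Attribute>>` and passes it either as an owned `Option<Rc<Attribute>>` or as a plain `Option<&Attribute>`. The `Attribute` struct and the `lookup`, `describe_owned`, and `describe_borrowed` functions are hypothetical and exist only for this example.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for the registry attribute type.
#[derive(Debug)]
struct Attribute {
    name: String,
}

/// Takes shared ownership of the attribute, as suggested in the review.
fn describe_owned(registry_attribute: Option<Rc<Attribute>>) -> String {
    match registry_attribute {
        Some(attr) => format!("registry attribute: {}", attr.name),
        None => "no registry attribute".to_owned(),
    }
}

/// Alternative that only borrows the inner value and never touches the count.
fn describe_borrowed(registry_attribute: Option<&Attribute>) -> String {
    match registry_attribute {
        Some(attr) => format!("registry attribute: {}", attr.name),
        None => "no registry attribute".to_owned(),
    }
}

/// A lookup over shared attributes naturally yields `Option<&Rc<Attribute>>`.
fn lookup<'a>(store: &'a [Rc<Attribute>], name: &str) -> Option<&'a Rc<Attribute>> {
    store.iter().find(|attr| attr.name == name)
}
```

A caller would write `describe_owned(lookup(&store, "x").cloned())`, which bumps the reference count, or `describe_borrowed(lookup(&store, "x").map(|rc| rc.as_ref()))`, which does not. Which one fits depends on whether the callee needs to keep the value.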

lquerel · 2025-05-21T21:53:00Z

crates/weaver_live_check/src/advice.rs

+        registry_attribute: Option<&Rc<Attribute>>,
+        registry_group: Option<&Rc<ResolvedGroup>>,


See previous comment.

lquerel · 2025-05-21T21:59:18Z

crates/weaver_live_check/src/advice.rs

+    for required_attribute in required_attributes {
+        // Check if the attribute is present in the sample
+        if !attributes
+            .iter()
+            .any(|attribute| attribute.name == required_attribute.name)
+        {
+            advice_list.push(Advice {
+                advice_type: "attribute_required".to_owned(),
+                value: Value::String(required_attribute.name.clone()),
+                message: "Attribute is required".to_owned(),
+                advice_level: AdviceLevel::Violation,
+            });
+        }
+    }


This has n*m complexity, where n is the number of required attributes and m is the number of sample attributes. To protect the live-check function, it might be useful to 1) convert one of these two lists into a map, and 2) check the size of the list coming from the network to avoid any issues.
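A minimal sketch of both suggestions, assuming attribute names are plain strings: the sample names are collected into a `HashSet` once, so each required attribute is checked in roughly constant time, and oversized samples are rejected up front. `MAX_SAMPLE_ATTRIBUTES` and `missing_required` are hypothetical names, and the limit of 1024 is an arbitrary placeholder.

```rust
use std::collections::HashSet;

/// Hypothetical upper bound on attributes accepted per sample.
const MAX_SAMPLE_ATTRIBUTES: usize = 1024;

/// Returns the required attribute names that are absent from the sample.
/// Runs in O(n + m) instead of O(n * m) by building a set of sample names.
fn missing_required<'a>(
    required: &[&'a str],
    sample: &[&str],
) -> Result<Vec<&'a str>, String> {
    // Guard against unreasonably large inputs coming from the network.
    if sample.len() > MAX_SAMPLE_ATTRIBUTES {
        return Err(format!("too many attributes in sample: {}", sample.len()));
    }
    let seen: HashSet<&str> = sample.iter().copied().collect();
    Ok(required
        .iter()
        .copied()
        .filter(|name| !seen.contains(name))
        .collect())
}
```

For the small attribute lists typical of a single data point the nested loop may well be fast enough; the set mainly helps when the sample side is untrusted and potentially large.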

crates/weaver_live_check/src/advice.rs

lquerel · 2025-05-22T03:34:16Z

crates/weaver_live_check/src/live_checker.rs

-                        templates_by_length.push((attribute.name.clone(), attribute.clone()));
-                        let _ = semconv_templates.insert(attribute.name.clone(), attribute.clone());
+                        templates_by_length
+                            .push((attribute.name.clone(), Rc::clone(&attribute_rc)));


Suggested change

-                            .push((attribute.name.clone(), Rc::clone(&attribute_rc)));
+                            .push((attribute.name.clone(), attribute_rc.clone()));

attribute_rc is already an Rc, so I don't see the point of creating an Rc<&Rc<Attribute>>.
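For reference, a small sketch comparing the two spellings. In standard Rust, `Rc::clone(&x)` and `x.clone()` both return a new handle of the same `Rc<T>` type pointing at the same allocation, so the choice between them is stylistic (Clippy's optional `clone_on_ref_ptr` lint prefers the explicit `Rc::clone` form). `clone_both_ways` is a hypothetical helper, and `Rc<String>` stands in for `Rc<Attribute>`.

```rust
use std::rc::Rc;

/// Clones the handle with both spellings and returns the two results.
fn clone_both_ways(attribute_rc: &Rc<String>) -> (Rc<String>, Rc<String>) {
    // Fully qualified form: makes it obvious only the handle is cloned.
    let a: Rc<String> = Rc::clone(attribute_rc);
    // Method form: resolves to the same `Clone` impl on `Rc<String>`.
    let b: Rc<String> = attribute_rc.clone();
    (a, b)
}
```

Both results share the original allocation, and each call increments the strong count by one.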

lquerel · 2025-05-22T03:41:38Z

crates/weaver_live_check/src/sample_metric.rs

+pub enum SampleInstrument {
+    /// An up-down counter metric.
+    UpDownCounter,
+    /// A counter metric.
+    Counter,
+    /// A gauge metric.
+    Gauge,
+    /// A histogram metric.
+    Histogram,
+    /// A summary metric. This is no longer used and will cause a violation.
+    Summary,
+    /// Unspecified instrument type.
+    Unspecified,
+}


At the type level, do we have a way to keep the SampleInstrument and InstrumentSpec enums in sync? Why not use InstrumentSpec directly?

InstrumentSpec does not support Summary, but OTLP does.

I'd recommend creating a wrapper enum where it's either SupportedMetricType(InstrumentSpec) or UnsupportedMetricType(String)
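The wrapper could look roughly like the sketch below. `InstrumentSpec` here is a local stand-in with the four supported variants listed above, and the lowercase names matched in `parse_instrument` are an assumption made for the example, not the representation used by the crate.

```rust
/// Local stand-in for the existing semconv instrument enum.
#[derive(Debug, Clone, PartialEq)]
enum InstrumentSpec {
    UpDownCounter,
    Counter,
    Gauge,
    Histogram,
}

/// Wrapper that reuses `InstrumentSpec` and keeps anything else as text.
#[derive(Debug, Clone, PartialEq)]
enum SampleInstrument {
    SupportedMetricType(InstrumentSpec),
    UnsupportedMetricType(String),
}

/// Hypothetical parser: maps a name to the wrapper, preserving unknown
/// names (such as "summary") so a violation can report them.
fn parse_instrument(name: &str) -> SampleInstrument {
    match name {
        "updowncounter" => SampleInstrument::SupportedMetricType(InstrumentSpec::UpDownCounter),
        "counter" => SampleInstrument::SupportedMetricType(InstrumentSpec::Counter),
        "gauge" => SampleInstrument::SupportedMetricType(InstrumentSpec::Gauge),
        "histogram" => SampleInstrument::SupportedMetricType(InstrumentSpec::Histogram),
        other => SampleInstrument::UnsupportedMetricType(other.to_owned()),
    }
}
```

With this shape the supported variants cannot drift from `InstrumentSpec`, since they are the same type.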

lquerel · 2025-05-22T04:07:53Z

src/registry/otlp/conversion.rs

+            match value {
+                GrpcValue::StringValue(string) => Some(Value::String(string)),
+                GrpcValue::IntValue(int_value) => Some(Value::Number(int_value.into())),
+                GrpcValue::DoubleValue(double_value) => Some(json!(double_value)),


Just out of curiosity: why json!(double_value) and not Value::Number(double_value.into())?

the trait bound serde_json::Number: std::convert::From<f64> is not satisfied
the following other types implement trait std::convert::From<T>:
serde_json::Number implements std::convert::From<i16>
serde_json::Number implements std::convert::From<i32>
serde_json::Number implements std::convert::From<i64>
serde_json::Number implements std::convert::From<i8>
serde_json::Number implements std::convert::From<isize>
serde_json::Number implements std::convert::From<serde_json::de::ParserNumber>
serde_json::Number implements std::convert::From<u16>
serde_json::Number implements std::convert::From<u32>
and 3 others
required for f64 to implement std::convert::Into<serde_json::Number>
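The likely reason no `From<f64>` impl exists is that JSON cannot represent NaN or infinities, so a float-to-number conversion has to be fallible; serde_json offers `Number::from_f64`, which returns an `Option`. The std-only sketch below mirrors just that finiteness rule. `json_representable` is a hypothetical helper and does not call serde_json.

```rust
/// Std-only illustration: returns the value only when it is finite,
/// which is the condition a JSON number has to satisfy.
fn json_representable(value: f64) -> Option<f64> {
    if value.is_finite() {
        Some(value)
    } else {
        None
    }
}
```

An infallible `From<f64>` would have to panic or silently change non-finite values, which is presumably why the trait is only implemented for integer types.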

lquerel · 2025-05-22T04:13:35Z

src/registry/otlp/conversion.rs

+pub fn span_kind_from_otlp_kind(kind: i32) -> SpanKindSpec {
+    match kind {
+        2 => SpanKindSpec::Server,
+        3 => SpanKindSpec::Client,
+        4 => SpanKindSpec::Producer,
+        5 => SpanKindSpec::Consumer,
+        _ => SpanKindSpec::Internal,
+    }
+}


I’m surprised that a SpanKind enum doesn’t already exist in the code generated by Prost.
In general, when we can reuse existing definitions, we should do so to simplify the maintenance of the code.
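One way to avoid magic numbers is to go through a typed enum with a fallible conversion from `i32`. The sketch below hand-writes such an enum as a stand-in, using the numeric values from the OTLP trace proto (0 to 5). The actual Prost-generated type and its path in this codebase are not shown in the thread, so `OtlpSpanKind` and the local `SpanKindSpec` are assumptions for illustration.

```rust
use std::convert::TryFrom;

/// Hand-written stand-in for a generated OTLP span kind enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum OtlpSpanKind {
    Unspecified = 0,
    Internal = 1,
    Server = 2,
    Client = 3,
    Producer = 4,
    Consumer = 5,
}

impl TryFrom<i32> for OtlpSpanKind {
    type Error = i32;

    /// Returns the unknown raw value as the error.
    fn try_from(value: i32) -> Result<Self, Self::Error> {
        match value {
            0 => Ok(OtlpSpanKind::Unspecified),
            1 => Ok(OtlpSpanKind::Internal),
            2 => Ok(OtlpSpanKind::Server),
            3 => Ok(OtlpSpanKind::Client),
            4 => Ok(OtlpSpanKind::Producer),
            5 => Ok(OtlpSpanKind::Consumer),
            other => Err(other),
        }
    }
}

/// Local stand-in for the semconv span kind.
#[derive(Debug, PartialEq)]
enum SpanKindSpec {
    Internal,
    Server,
    Client,
    Producer,
    Consumer,
}

/// Same behavior as the reviewed function, but matching on named variants.
fn span_kind_from_otlp_kind(kind: i32) -> SpanKindSpec {
    match OtlpSpanKind::try_from(kind) {
        Ok(OtlpSpanKind::Server) => SpanKindSpec::Server,
        Ok(OtlpSpanKind::Client) => SpanKindSpec::Client,
        Ok(OtlpSpanKind::Producer) => SpanKindSpec::Producer,
        Ok(OtlpSpanKind::Consumer) => SpanKindSpec::Consumer,
        // Unspecified, Internal, and unknown values all fall back to Internal.
        Ok(OtlpSpanKind::Unspecified) | Ok(OtlpSpanKind::Internal) | Err(_) => {
            SpanKindSpec::Internal
        }
    }
}
```

If a generated enum is available, the hand-written `OtlpSpanKind` and its `TryFrom` impl would be dropped in favor of it, leaving only the final match.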

lquerel · 2025-05-22T04:13:55Z

src/registry/otlp/conversion.rs

+    if let Some(status) = status {
+        let code = match status.code {
+            1 => StatusCode::Ok,
+            2 => StatusCode::Error,
+            _ => StatusCode::Unset,
+        };


Same comment here.

jerbly · 2025-05-24T18:03:24Z

@lquerel - Thank you for your thorough review. I have addressed all your points. Please take another look so we can get this merged. Thanks!

lquerel

Thanks Jeremy for these updates.

jerbly added 8 commits May 4, 2025 14:07

Initial metric support

507fe1c

internal advisor support for basic metrics

0882e88

initial ansi output for metrics

1a9b3e1

number data points

da3112b

added histogram support

2115b99

improve histogram ansi

a6e7426

Merge branch 'main' of github.com:open-telemetry/weaver into live-che…

6fc3eb7

…ck-metrics

fix up after merge

7c62eba

jsuereth reviewed May 5, 2025

jerbly added 4 commits May 5, 2025 15:45

metrics test

23f0a0c

rename sample structs

f1f41eb

stats, rego and docs

63204e5

fix test

3cd7a8f

jsuereth reviewed May 6, 2025

crates/weaver_live_check/src/sample_metric.rs

jsuereth reviewed May 6, 2025

jerbly added 3 commits May 6, 2025 10:41

histogram attributes

8247a07

Rego registry inputs, example, test, docs

6eda519

Merge branch 'main' of github.com:open-telemetry/weaver into live-che…

5028a2b

…ck-metrics

jerbly added 10 commits May 11, 2025 15:30

feat: add schemars support for JSON schema generation in weaver_live_…

4416af5

…check and weaver_checker

added handling for exponentialhistogram and summary

728aabe

Merge branch 'main' of github.com:open-telemetry/weaver into live-che…

452df33

…ck-metrics

Refactor to more neatly support Exemplars

bf69e2e

Refactor to impl LiveCheckRunner for collections

2738c6d

Refactor: added Advisable trait to reduce duplication for simple Samp…

5236432

…le types.

Exemplar properties

1e6b535

exemplar custom rego test

84fc9e2

Merge branch 'main' of github.com:open-telemetry/weaver into live-che…

4369d08

…ck-metrics

fix(tests): set default role for test cases in live_checker

2b7f2bc

jerbly marked this pull request as ready for review May 18, 2025 18:55

jerbly requested a review from a team as a code owner May 18, 2025 18:55

jerbly requested a review from jsuereth May 18, 2025 18:55

jerbly changed the title ~~[WIP] Live check metrics~~ Live check metrics May 18, 2025

changelog

29b4f0a

jsuereth added this to OTel Weaver Project May 21, 2025

jsuereth moved this to To consider for the next release in OTel Weaver Project May 21, 2025

jsuereth moved this from To consider for the next release to Next Release in OTel Weaver Project May 21, 2025

lquerel requested changes May 22, 2025

jerbly added 2 commits May 24, 2025 13:40

Merge branch 'main' of github.com:open-telemetry/weaver into live-che…

e6f43ea

…ck-metrics

updates from PR review

b2e6752

jerbly requested a review from lquerel May 24, 2025 18:01

lquerel approved these changes May 24, 2025

Merge branch 'main' into live-check-metrics

2607acc

jerbly enabled auto-merge (squash) May 24, 2025 22:48

jerbly merged commit 05d9204 into open-telemetry:main May 24, 2025
21 checks passed

github-project-automation bot moved this from Next Release to Done in OTel Weaver Project May 24, 2025
