Skip to content

Live check metrics #728

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
May 24, 2025
Merged

Live check metrics #728

merged 29 commits into from
May 24, 2025

Conversation

jerbly
Copy link
Contributor

@jerbly jerbly commented May 5, 2025

Adds support for Metrics to live-check. The nested nature of this data brought about further refactoring internally but also a change to the rego interface. input now has sample and registry_attribute or registry_group - this allows the policy to work within the nested context. For example, providing some advice on an exemplar in a data-point in a metric by referring to the metric's defined unit. This is documented in the readme and an example policy is included in the tests.

image image

Includes support for Exemplars:

image

/// A sample metric
Metric(&'a SampleMetric),
/// A sample number data point
NumberDataPoint(&'a NumberDataPoint),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting - Is the convention going to be check at the highest level, but higher-level things don't include lower level details?

E.g. NumberDataPoint is part of a Metric and has metric identifying things, but SampleMetric wouldn't?

Also, can you rename to SampledNumberDataPoint?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This area is not quite right yet. A NumberDataPoint without the information from the Metric is not so useful, but I need a way to pass it in around in a consistent way.

Yes, good call on the renaming. Will do.

Copy link
Contributor Author

@jerbly jerbly May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking into the consistency of the framework. For each data_point, I want to be able to provide advice at that level (checking attributes required for example), but I need context from the parent group. I'm hoping to get this change done today. This will change rego so the input is a tuple of (sample,semconv).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed this with docs to explain and a rego example.

# This example shows how to use the registry_group provided in the input.
# If the metric's unit is "By" the value in this data-point must be an integer.
deny contains make_advice(advice_type, advice_level, value, message) if {
	input.sample.number_data_point
	value := input.sample.number_data_point.value
	input.registry_group.unit == "By"
	value != floor(value) # not a good type check, but serves as an example
	advice_type := "invalid_data_point_value"
	advice_level := "violation"
	message := "Value must be an integer when unit is 'By'"
}

@@ -40,6 +40,16 @@ deny contains make_advice(advice_type, advice_level, value, message) if {
message := "Does not match name formatting rules"
}

# checks metric name format
deny contains make_advice(advice_type, advice_level, value, message) if {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a check to make sure attributes on the metric match attributes defined on the group?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already done internally in rust rather than in rego since I assumed this was a fundamental thing. See the attribute_required advice under each data_point in the screenshot above. This is at the data_point level since the attributes are provided with each point.

Copy link

codecov bot commented May 7, 2025

Codecov Report

Attention: Patch coverage is 91.02564% with 21 lines in your changes missing coverage. Please review.

Project coverage is 77.1%. Comparing base (9d8a5d7) to head (2607acc).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
crates/weaver_live_check/src/advice.rs 83.1% 13 Missing ⚠️
crates/weaver_live_check/src/lib.rs 80.9% 8 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main    #728     +/-   ##
=======================================
+ Coverage   76.7%   77.1%   +0.4%     
=======================================
  Files         65      66      +1     
  Lines       5036    5211    +175     
=======================================
+ Hits        3863    4019    +156     
- Misses      1173    1192     +19     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jerbly jerbly marked this pull request as ready for review May 18, 2025 18:55
@jerbly jerbly requested a review from a team as a code owner May 18, 2025 18:55
@jerbly jerbly requested a review from jsuereth May 18, 2025 18:55
@jerbly jerbly changed the title [WIP] Live check metrics Live check metrics May 18, 2025
@jsuereth jsuereth moved this to To consider for the next release in OTel Weaver Project May 21, 2025
@jsuereth jsuereth moved this from To consider for the next release to Next Release in OTel Weaver Project May 21, 2025
Copy link
Contributor

@lquerel lquerel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR. Adding metrics support will significantly enhance the live-check feature. I’ve made several suggestions. There are certain types we can avoid duplicating (especially the leaf enums). For the other structures, I understand that you’re adding at least one custom field for the results. It’s frustrating to see all this duplicated code, but unfortunately, I don’t see a straightforward alternative solution.

It would also be good to remove instances of &Rc<T> or Rc<Rc<T>>, as they seem semantically redundant and not very idiomatic.

@@ -202,7 +211,9 @@ These should be self-explanatory, but:
- `no_advice_count` is the number of samples that received no advice
- `seen_registry_attributes` is a record of how many times each attribute in the registry was seen in the samples
- `seen_non_registry_attributes` is a record of how many times each non-registry attribute was seen in the samples
- `registry_coverage` is the fraction of seen registry attributes over the total registry attributes
- `seen_registry_metrics` is a record of how many times each metric in the registry was seen in the samples
- `seen_non_registry_attributes` is a record of how many times each non-registry metric was seen in the samples
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not that be seen_non_registry_metrics ? The comment talks about metrics and not attributes.

Comment on lines 43 to 44
registry_attribute: Option<&Rc<Attribute>>,
registry_group: Option<&Rc<ResolvedGroup>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, but having a function signature with Option<&Rc<Attribute>> feels a bit like having a double reference to something. It would seem more idiomatic to me to just have Option<Rc<Attribute>>.

Comment on lines 62 to 63
registry_attribute: Option<&Rc<Attribute>>,
registry_group: Option<&Rc<ResolvedGroup>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment.

Comment on lines 161 to 174
for required_attribute in required_attributes {
// Check if the attribute is present in the sample
if !attributes
.iter()
.any(|attribute| attribute.name == required_attribute.name)
{
advice_list.push(Advice {
advice_type: "attribute_required".to_owned(),
value: Value::String(required_attribute.name.clone()),
message: "Attribute is required".to_owned(),
advice_level: AdviceLevel::Violation,
});
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's n*m complexity when n is the number of required attributes and m is the number of sample attributes. It might be useful for protecting the live-check function to 1) convert one of these two lists into a map, and 2) check the size of the list coming from the network to avoid any issues.

templates_by_length.push((attribute.name.clone(), attribute.clone()));
let _ = semconv_templates.insert(attribute.name.clone(), attribute.clone());
templates_by_length
.push((attribute.name.clone(), Rc::clone(&attribute_rc)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.push((attribute.name.clone(), Rc::clone(&attribute_rc)));
.push((attribute.name.clone(), attribute_rc.clone()));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

attribute_rc is already an Rc, so I don't see the point of creating an Rc<&Rc<Attribute>>.

Comment on lines 25 to 38
pub enum SampleInstrument {
/// An up-down counter metric.
UpDownCounter,
/// A counter metric.
Counter,
/// A gauge metric.
Gauge,
/// A histogram metric.
Histogram,
/// A summary metric. This is no longer used and will cause a violation.
Summary,
/// Unspecified instrument type.
Unspecified,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the type level, do we have a way to keep in-sync the SampleInstrument and the InstrumentSpec enums? Why not using directly InstrumentSpec?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InstrumentSpec does not support Summary, but OTLP does.

I'd recommend creating a wrapper enum where it's either SupportedMetricType(InstrumentSpec) or UnsupportedMetricType(String)

match value {
GrpcValue::StringValue(string) => Some(Value::String(string)),
GrpcValue::IntValue(int_value) => Some(Value::Number(int_value.into())),
GrpcValue::DoubleValue(double_value) => Some(json!(double_value)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity. Why json!(double_value) and not Value::Number(double_value.into())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the trait bound serde_json::Number: std::convert::From<f64> is not satisfied
the following other types implement trait std::convert::From<T>:
serde_json::Number implements std::convert::From<i16>
serde_json::Number implements std::convert::From<i32>
serde_json::Number implements std::convert::From<i64>
serde_json::Number implements std::convert::From<i8>
serde_json::Number implements std::convert::From<isize>
serde_json::Number implements std::convert::From<serde_json::de::ParserNumber>
serde_json::Number implements std::convert::From<u16>
serde_json::Number implements std::convert::From<u32>
and 3 others
required for f64 to implement std::convert::Into<serde_json::Number>

Comment on lines 63 to 71
pub fn span_kind_from_otlp_kind(kind: i32) -> SpanKindSpec {
match kind {
2 => SpanKindSpec::Server,
3 => SpanKindSpec::Client,
4 => SpanKindSpec::Producer,
5 => SpanKindSpec::Consumer,
_ => SpanKindSpec::Internal,
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m surprised that a SpanKind enum doesn’t already exist in the code generated by Prost.
In general, when we can reuse existing definitions, we should do so to simplify the maintenance of the code.

Comment on lines 77 to 82
if let Some(status) = status {
let code = match status.code {
1 => StatusCode::Ok,
2 => StatusCode::Error,
_ => StatusCode::Unset,
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here.

@jerbly jerbly requested a review from lquerel May 24, 2025 18:01
@jerbly
Copy link
Contributor Author

jerbly commented May 24, 2025

@lquerel - Thank you for your thorough review. I have addressed all your points. Please take another look so we can get this merged. Thanks!

Copy link
Contributor

@lquerel lquerel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Jeremy for these updates.

@jerbly jerbly enabled auto-merge (squash) May 24, 2025 22:48
@jerbly jerbly merged commit 05d9204 into open-telemetry:main May 24, 2025
21 checks passed
@github-project-automation github-project-automation bot moved this from Next Release to Done in OTel Weaver Project May 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants