[WIP] Live check metrics #728

Draft: wants to merge 15 commits into base `main`
19 changes: 15 additions & 4 deletions crates/weaver_live_check/README.md
@@ -107,6 +107,9 @@ As mentioned, a list of `Advice` is returned in the report for each sample entit
}
```

+> **Note**
+> The `live_check_result` object augments the sample entity at the pertinent level in the structure. If the structure is `metric`->`[number_data_point]`->`[attribute]`, advice should be given at the `number_data_point` level for, say, required attributes that have not been supplied, whereas attribute advice, like `missing_attribute` in the JSON above, is given at the attribute level.
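
As a rough illustration of the placement rule in the note, the Python sketch below walks a `metric` sample and attaches a `live_check_result` at the level each finding belongs to. It is not Weaver's implementation: the `attach_advice` and `check_metric` helpers, the `required_attribute_not_present` advice type, and the `all_advice` key are assumptions made for this example. Only `missing_attribute` and the `metric`/`data_points`/`attributes` layout are taken from this PR.

```python
def make_advice(advice_type, advice_level, value, message):
    """Build an advice record (field names are assumed for this sketch)."""
    return {
        "advice_type": advice_type,
        "advice_level": advice_level,
        "value": value,
        "message": message,
    }


def attach_advice(entity, advice):
    """Append advice to the entity's own live_check_result (hypothetical shape)."""
    result = entity.setdefault("live_check_result", {"all_advice": []})
    result["all_advice"].append(advice)


def check_metric(metric, known_attributes, required_attributes):
    for point in metric["data_points"]:
        supplied = {attr["name"] for attr in point["attributes"]}
        # A required attribute that was never supplied has no attribute
        # entity to annotate, so the advice goes on the data point.
        for name in sorted(required_attributes - supplied):
            attach_advice(point, make_advice(
                "required_attribute_not_present", "violation", name,
                "Required attribute was not supplied"))
        # Advice about a supplied attribute goes on that attribute.
        for attr in point["attributes"]:
            if attr["name"] not in known_attributes:
                attach_advice(attr, make_advice(
                    "missing_attribute", "violation", attr["name"],
                    "Does not exist in the registry"))
    return metric


metric = {
    "name": "system.memory.usage",
    "data_points": [
        {"attributes": [{"name": "state", "value": "used"}],
         "value": 26963050496},
        {"attributes": [{"name": "system.memory.state", "value": "inactive"}],
         "value": 681053388.8},
    ],
}
checked = check_metric(metric, {"system.memory.state"}, {"system.memory.state"})
```

In this sketch the first data point ends up with advice at two levels (one on the data point, one on its `state` attribute), while the second data point is left unannotated.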

### Custom advisors

Use the `--advice-policies` command line option to provide a path to a directory containing Rego policies with the `live_check_advice` package name. Here's a very simple example that rejects any attribute name containing the string "test":
@@ -118,8 +121,8 @@ import rego.v1

# checks attribute name contains the word "test"
deny contains make_advice(advice_type, advice_level, value, message) if {
-    input.attribute
-    value := input.attribute.name
+    input.sample.attribute
+    value := input.sample.attribute.name
     contains(value, "test")
     advice_type := "contains_test"
     advice_level := "violation"
@@ -135,7 +138,13 @@ make_advice(advice_type, advice_level, value, message) := {
}
```

-`input` contains the sample entity for assessment wrapped in a type e.g. `input.attribute` or `input.span`. `data` contains a structure derived from the supplied `Registry`. A jq preprocessor takes the `Registry` (and maps for attributes and templates) to produce the `data` for the policy. If the jq is simply `.` this will passthrough as-is. Preprocessing is used to improve Rego performance and to simplify policy definitions. With this model `data` is processed once whereas the Rego policy runs for every sample entity as it arrives in the stream.
+`input.sample` contains the sample entity for assessment, wrapped in a type, e.g. `input.sample.attribute` or `input.sample.span`.
+
+`input.registry_attribute`, when present, contains the matching attribute definition from the supplied registry.
+
+`input.registry_group`, when present, contains the matching group definition from the supplied registry.
+
+`data` contains a structure derived from the supplied `Registry`. A jq preprocessor takes the `Registry` (and maps for attributes and templates) to produce the `data` for the policy. If the jq is simply `.`, the input passes through as-is. Preprocessing improves Rego performance and simplifies policy definitions. With this model, `data` is processed once, whereas the Rego policy runs for every sample entity as it arrives in the stream.

To override the default OTel jq preprocessor, provide a path to the jq file through the `--advice-preprocessor` option.
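
The split between preparing `data` once and evaluating a policy per sample can be pictured with a small Python sketch. It is only an analogy: in Weaver the preprocessing is a jq program and the per-sample check is a Rego policy. The `preprocess` and `check_sample` functions, the simplified registry layout, and the `missing_metric` advice type are hypothetical names introduced for illustration.

```python
def preprocess(registry):
    """Run once: reshape the registry into lookups that are cheap to query."""
    attributes = set()
    metrics = {}
    for group in registry["groups"]:
        for attr in group.get("attributes", []):
            if "id" in attr:  # skip `ref` entries, they point at an existing id
                attributes.add(attr["id"])
        if group.get("type") == "metric":
            metrics[group["metric_name"]] = group
    return {"attributes": attributes, "metrics": metrics}


def check_sample(sample, data):
    """Run for every sample: only lookups against the prepared data."""
    advice = []
    if "attribute" in sample:
        name = sample["attribute"]["name"]
        if name not in data["attributes"]:
            advice.append(("missing_attribute", name))
    if "metric" in sample:
        name = sample["metric"]["name"]
        if name not in data["metrics"]:
            advice.append(("missing_metric", name))  # hypothetical advice type
    return advice


registry = {"groups": [
    {"id": "registry.system.memory", "type": "attribute_group",
     "attributes": [{"id": "system.memory.state"}]},
    {"id": "metric.system.memory.usage", "type": "metric",
     "metric_name": "system.memory.usage", "unit": "By",
     "attributes": [{"ref": "system.memory.state"}]},
]}

data = preprocess(registry)  # computed once, before any samples arrive
stream = [
    {"attribute": {"name": "state"}},
    {"attribute": {"name": "system.memory.state"}},
    {"metric": {"name": "otelcol_process_memory_rss"}},
]
results = [check_sample(sample, data) for sample in stream]
```

The point of the design is that the per-sample work stays small: anything that depends only on the registry is computed ahead of the stream.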

@@ -202,7 +211,9 @@ These should be self-explanatory, but:
- `no_advice_count` is the number of samples that received no advice
- `seen_registry_attributes` is a record of how many times each attribute in the registry was seen in the samples
- `seen_non_registry_attributes` is a record of how many times each non-registry attribute was seen in the samples
-- `registry_coverage` is the fraction of seen registry attributes over the total registry attributes
+- `seen_registry_metrics` is a record of how many times each metric in the registry was seen in the samples
+- `seen_non_registry_metrics` is a record of how many times each non-registry metric was seen in the samples
+- `registry_coverage` is the fraction of seen registry entities over the total registry entities

These statistics could be parsed to make a more sophisticated pass/fail decision, for example in CI.
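
A minimal sketch of such a CI gate in Python follows. It assumes the statistics object has already been extracted from the report as JSON; the `ci_passes` function, its thresholds, and the sample values are hypothetical, and only the field names come from the list above.

```python
import json


def ci_passes(statistics, min_coverage=0.8, allow_non_registry=False):
    """Return True when the run meets the (hypothetical) CI thresholds."""
    if statistics["registry_coverage"] < min_coverage:
        return False
    # Any non-registry attribute fails the build unless explicitly allowed.
    if not allow_non_registry and statistics["seen_non_registry_attributes"]:
        return False
    return True


# Hypothetical statistics payload; the values are made up for illustration.
statistics = json.loads("""
{
  "no_advice_count": 3,
  "seen_registry_attributes": {"system.memory.state": 1},
  "seen_non_registry_attributes": {"state": 2},
  "registry_coverage": 1.0
}
""")

strict = ci_passes(statistics)
lenient = ci_passes(statistics, allow_non_registry=True)
```

A CI job could then exit non-zero when `ci_passes` returns `False`, for instance with `sys.exit(0 if strict else 1)`.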

98 changes: 98 additions & 0 deletions crates/weaver_live_check/data/metrics.json
@@ -0,0 +1,98 @@
[
  {
    "metric": {
      "data_points": [
        {
          "attributes": [
            {
              "name": "state",
              "value": "used"
            }
          ],
          "value": 26963050496
        },
        {
          "attributes": [
            {
              "name": "state",
              "value": "free"
            }
          ],
          "value": 586153984
        },
        {
          "attributes": [
            {
              "name": "system.memory.state",
              "value": "inactive"
            }
          ],
          "value": 681053388.8
        }
      ],
      "instrument": "updowncounter",
      "name": "system.memory.usage",
      "unit": "By"
    }
  },
  {
    "metric": {
      "data_points": [
        {
          "attributes": [],
          "bucket_counts": [
            0,
            0,
            1,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "count": 1,
          "max": 7.0015,
          "min": 7.0015,
          "sum": 7.0015
        }
      ],
      "instrument": "histogram",
      "name": "rpc.client.duration",
      "unit": "ms"
    }
  },
  {
    "metric": {
      "data_points": [
        {
          "attributes": [],
          "value": 151552000
        }
      ],
      "instrument": "gauge",
      "name": "otelcol_process_memory_rss",
      "unit": "By"
    }
  },
  {
    "metric": {
      "data_points": [
        {
          "attributes": [],
          "value": 39585616
        }
      ],
      "instrument": "counter",
      "name": "otelcol_process_runtime_total_alloc_bytes",
      "unit": "By"
    }
  }
]
54 changes: 54 additions & 0 deletions crates/weaver_live_check/data/model/metrics/metrics.yaml
@@ -0,0 +1,54 @@
groups:
  - id: registry.system.memory
    type: attribute_group
    display_name: System Memory Attributes
    brief: "Describes System Memory attributes"
    attributes:
      - id: system.memory.state
        type:
          members:
            - id: used
              value: "used"
              stability: development
            - id: free
              value: "free"
              stability: development
            - id: shared
              value: "shared"
              stability: development
              deprecated: "Removed, report shared memory usage with `metric.system.memory.shared` metric"
            - id: buffers
              value: "buffers"
              stability: development
            - id: cached
              value: "cached"
              stability: development
        stability: development
        brief: "The memory state"
        examples: ["free", "cached"]

  # system.* metrics
  - id: metric.system.uptime
    type: metric
    metric_name: system.uptime
    stability: development
    brief: "The time the system has been running"
    note: |
      Instrumentations SHOULD use a gauge with type `double` and measure uptime in seconds as a floating point number with the highest precision available.
      The actual accuracy would depend on the instrumentation and operating system.
    instrument: gauge
    unit: "s"

  # system.memory.* metrics
  - id: metric.system.memory.usage
    type: metric
    metric_name: system.memory.usage
    stability: development
    brief: "Reports memory in use by state."
    note: |
      The sum over all `system.memory.state` values SHOULD equal the total memory
      available on the system, that is `system.memory.limit`.
    instrument: updowncounter
    unit: "By"
    attributes:
      - ref: system.memory.state
Original file line number Diff line number Diff line change
@@ -4,8 +4,8 @@ import rego.v1

# checks attribute name contains the word "test"
deny contains make_advice(advice_type, advice_level, value, message) if {
-    input.attribute
-    value := input.attribute.name
+    input.sample.attribute
+    value := input.sample.attribute.name
     contains(value, "test")
     advice_type := "contains_test"
     advice_level := "violation"
@@ -14,8 +14,8 @@ deny contains make_advice(advice_type, advice_level, value, message) if {

# checks span name contains the word "test"
deny contains make_advice(advice_type, advice_level, value, message) if {
-    input.span
-    value := input.span.name
+    input.sample.span
+    value := input.sample.span.name
     contains(value, "test")
     advice_type := "contains_test"
     advice_level := "violation"
@@ -24,8 +24,8 @@ deny contains make_advice(advice_type, advice_level, value, message) if {

# checks span status message contains the word "test"
deny contains make_advice(advice_type, advice_level, value, message) if {
-    input.span
-    value := input.span.status.message
+    input.sample.span
+    value := input.sample.span.status.message
     contains(value, "test")
     advice_type := "contains_test_in_status"
     advice_level := "violation"
@@ -34,14 +34,26 @@ deny contains make_advice(advice_type, advice_level, value, message) if {

# checks span_event name contains the word "test"
deny contains make_advice(advice_type, advice_level, value, message) if {
-    input.span_event
-    value := input.span_event.name
+    input.sample.span_event
+    value := input.sample.span_event.name
     contains(value, "test")
     advice_type := "contains_test"
     advice_level := "violation"
     message := "Name must not contain 'test'"
 }

+# This example shows how to use the registry_group provided in the input.
+# If the metric's unit is "By", the value in this data point must be an integer.
+deny contains make_advice(advice_type, advice_level, value, message) if {
+    input.sample.number_data_point
+    value := input.sample.number_data_point.value
+    input.registry_group.unit == "By"
+    value != floor(value) # not a good type check, but serves as an example
+    advice_type := "invalid_data_point_value"
+    advice_level := "violation"
+    message := "Value must be an integer when unit is 'By'"
+}

make_advice(advice_type, advice_level, value, message) := {
"type": "advice",
"advice_type": advice_type,