Proposal: Add type and unit metadata labels #39
Signed-off-by: David Ashpole <[email protected]>
Great start! I think we need to discuss a bit more whether to handle `__*` labels in the PromQL engine or the UI. Maybe there's a chance we're doing a breaking change...?
Thank you very much for proposing this. I love it in general, and as said elsewhere, I had a very similar idea a while ago for completely different reasons (mostly to avoid mixed series with native histograms and float-based sample types). Finding a similar solution for unrelated problems seems like a signal that this could be actually useful. However, I have also run into some conundrums, which was the reason why I put the whole idea on the backburner. I'll try to sketch out my train of thought here:
The `__name__` label is the precedent here for "special labels", and as such it gives us a hint of what the issues might be. Its special power is not just about how to display it. It is treated specially for label matching and aggregation, and it is removed by most operations, following the argument that a rate of `process_cpu_seconds_total` is not `process_cpu_seconds_total` anymore. We have to ask ourselves the same questions for `__unit__` and `__type__`.
In the brave new world, `process_cpu_seconds_total` would probably look like `process_cpu{__type__="counter", __unit__="seconds"}`. If we rate that, the outcome is not a counter anymore (but a gauge), and it is actually unit-less (seconds per second, i.e. it is the CPU usage ratio). In wishful thinking, the outcome should be `process_cpu_usage{__type__="gauge", __unit__="ratio"}`. Of course, we cannot easily come up with a general procedure to make up new names (which is the reason why current PromQL simply removes the name), and even changing the unit is tough. Changing the type might actually be feasible (we explored that a bit with native histograms, which "know" if they are a counter histogram or a gauge histogram). So maybe we can come up with "type translation rules" for all PromQL operations, and then maybe drop the unit in doubt (when calculating a `rate` or multiplying) and keep it when it makes sense (a sum aggregation or adding). But you will notice how deep we are in the woods here already.
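To make the "type translation rules" idea a bit more tangible, here is a minimal Python sketch. It is not part of the proposal or of Prometheus: the rule table, the set of unit-preserving operations, and the function name `translate` are all hypothetical, chosen only to mirror the examples in the comment above (a `rate` turns a counter into a gauge and drops the unit; a sum keeps type and unit; the name is dropped in both cases).

```python
from typing import Dict

# Hypothetical translation table: (operation, input type) -> output type.
# Only the cases mentioned in the discussion are covered.
TYPE_AFTER = {
    ("rate", "counter"): "gauge",
    ("sum", "counter"): "counter",
    ("sum", "gauge"): "gauge",
}

# Hypothetical set of operations that keep the unit; everything else drops it "in doubt".
UNIT_PRESERVING = {"sum"}

SPECIAL = ("__name__", "__type__", "__unit__")


def translate(op: str, labels: Dict[str, str]) -> Dict[str, str]:
    """Return the output labels of applying `op` to a series with `labels`."""
    # The metric name is always dropped, as current PromQL does for these operations.
    out = {k: v for k, v in labels.items() if k not in SPECIAL}
    new_type = TYPE_AFTER.get((op, labels.get("__type__", "")))
    if new_type is not None:
        out["__type__"] = new_type
    if op in UNIT_PRESERVING and "__unit__" in labels:
        out["__unit__"] = labels["__unit__"]
    return out
```

Under these assumed rules, `translate("rate", ...)` on the `process_cpu` counter yields a gauge without a unit, which is the cautious outcome sketched above rather than the "wishful thinking" one with a `ratio` unit.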
The next thing is aggregation and label matching. As said, `__name__` is also special here, but we probably need to be special in different ways for `__unit__` and `__type__`. For example if we simply do `a + b`, we probably want to include unit and type in the label match (only add gauges to gauges, only add seconds to seconds etc.). However, `a * b` is already very different. `b` might be a scaling factor, i.e. a unit-less gauge. `a` could be counters or histograms. So in this case, we would like to exclude the unit and type from the label match.
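The difference between `a + b` and `a * b` can be sketched as a matching key that depends on the operator. This is only an illustration of the idea in the comment above, not how the PromQL engine works today; `match_key`, `can_match`, and the choice of which operators ignore type and unit are assumptions.

```python
from typing import Dict, FrozenSet, Tuple


def match_key(labels: Dict[str, str], op: str) -> FrozenSet[Tuple[str, str]]:
    """Build the label set used to pair series on both sides of a binary operator."""
    # The metric name never takes part in matching.
    ignored = {"__name__"}
    # Assumed rule: scaling-style operators ignore type and unit,
    # additive operators require them to agree.
    if op in ("*", "/"):
        ignored |= {"__type__", "__unit__"}
    return frozenset((k, v) for k, v in labels.items() if k not in ignored)


def can_match(lhs: Dict[str, str], rhs: Dict[str, str], op: str) -> bool:
    """True if the two series would be paired one-to-one for this operator."""
    return match_key(lhs, op) == match_key(rhs, op)
```

With a seconds counter on the left and a unit-less gauge on the right, `can_match(..., "+")` is false while `can_match(..., "*")` is true, which is the behaviour the comment argues for.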
It gets very confusing quickly. I wish we could just add the `__unit__` and `__type__` as an experiment without further special treatment and see how people cope with it. But right now my worry is it will hit too many road bumps…
Still in brainstorming mode, here is an idea how maybe a step zero could look like: Add the
However, aggregations and label matches would ignore From there, we could cautiously move forward with more ideas. A step one might then be to handle |
Hm, I think it would also be nice to display them when a collision actually occurs 🤔
Nice work! Looking good, some comments.
@beorn7 I've added "Aggregations and label matches ignore |
Quick note about federation: To my knowledge, federation still completely ignores metadata, so the exposition format it generates has all metric types as "untyped" and no units and help strings anyway, simply because Prometheus doesn't know about metadata post-ingestion. So being able to funnel unit and type through the Prometheus TSDB via labels would be a direct improvement for federation.
I think it would be better to enable metadata post-scrape/federation than adding type and unit as labels in the exposition format. See #39 (comment)
The problem is that federation breaks the assumption that all metrics of the same name (one metric family) all have the same metadata.
Yea, I wonder if we could lift this with OM 2.0. WDYT @beorn7 ? Related discussion: #39 (comment)
The current structure of both protobuf and text versions of OM doesn't really lend itself to multiple different type, unit, or help entries for metrics with the same name. So we would need a new structural way of specifying unit and type (let's ignore help for now) per metric (rather than per metric family). Ironically, we already have a place for things that are per metric. We call it labels. So let's put type and unit into labels? Wait… that's what this proposal actually proposes. 😁
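As a rough illustration of "putting type and unit into labels", the following Python sketch reads a tiny text exposition and attaches the family-level `# TYPE` and `# UNIT` entries to each sample as `__type__` and `__unit__`. It is deliberately simplified and is not the Prometheus parser: the function name `flatten_metadata` is hypothetical, label values with escaped quotes are not handled, and it assumes the sample name equals the metric family name (which does not hold for counters with a `_total` suffix, classic histograms, or summaries).

```python
import re
from typing import Dict, List, Tuple

# Simplified sample line: name, optional {labels}, value.
SAMPLE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def flatten_metadata(text: str) -> List[Tuple[Dict[str, str], float]]:
    """Turn per-family TYPE/UNIT lines into per-series __type__/__unit__ labels."""
    types: Dict[str, str] = {}
    units: Dict[str, str] = {}
    series: List[Tuple[Dict[str, str], float]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line.split(None, 3)
            if len(parts) == 4 and parts[1] == "TYPE":
                types[parts[2]] = parts[3]
            elif len(parts) == 4 and parts[1] == "UNIT":
                units[parts[2]] = parts[3]
            continue
        m = SAMPLE.match(line)
        if m is None:
            continue
        name = m.group("name")
        labels = dict(LABEL.findall(m.group("labels") or ""))
        labels["__name__"] = name
        # Assumption: sample name == family name, so metadata can be looked up directly.
        if name in types:
            labels["__type__"] = types[name]
        if name in units:
            labels["__unit__"] = units[name]
        series.append((labels, float(m.group("value"))))
    return series
```

Once flattened this way, each series carries its own type and unit, so two series with the same name but different metadata (the federation case mentioned above) no longer conflict structurally.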
Well yes, but we just discussed that type and unit per label will not be easily parsable in OM/text, so putting it somewhere else actually helps. There might be some big questions there:
I wouldn't say that this is actually the case. The way we parse the labels doesn't really create a lot of parsing cost if there are more labels. It does create more payload, but not so much in relative terms. The OM text format is designed to have a lot of repetitive information anyway. If we want to avoid repetitive information, we would need a fundamentally different structure.
It correctly models the per-scrape data model. It avoids (a part of the) repetition. Of course you can flatten that out, but then you get the repetition that was marked above as "not easily parseable". My claim is that "flattening that out" is pretty much equivalent to "adding that as labels".
Well, the text format originally was designed to not be ordered in any way. So every line was keyed by the metric family name, i.e. a metric line starts with the metric family name, and all the metadata lines contain the metric family name as the 2nd token after the hash. OM became somewhat dependent on order, but not consequently so (you could avoid a lot of repetition if you did that) (another of the design problems I see with OM). With the original idea of the text format structure, you needed to repeat not just the metric family name on each line, but the combination of metric family name, type, and unit (which, you guessed it, is equivalent to putting type and unit into a label). Alternatively, we could redesign the text format fundamentally and make it really depend on the order everywhere, but then we should do it thoroughly and avoid other repetitions, too (at the very least the metric family name). Protobuf is a bit different as it is structured, and all the family members are a repeated |
I thought it's more the fact that we usually need to know the metric type (and metric family name) ahead of time (e.g. histograms have a different flow). But you are right, perhaps we could make the label-based approach work fine. Then it's only a question of human readability, and of what's defined in the SDK versus what's queried.
I think I got it now. AFAIK this only makes a difference for unstructured types like classic histograms and summaries, where the metric name you define in SDKs (aka metric family name) is != the resulting metric name. In a world with native histograms, and perhaps native summaries one day, it would make no difference, right?
Agree.
Got it, thank you for detailed explanation! I think we should discuss those options in a clear proposal, make some decision on this in OM 2.0, will add issue for this.
Nice! Proposing as an intention in our OM 2.0 doc |
Experimental implementation of prometheus/proposals#39 Previous (unmerged) experiments: * main...dashpole:prometheus:type_and_unit_labels * #16025 Signed-off-by: bwplotka <[email protected]>
PR with the implementation is also available: prometheus/prometheus#16228 Signed-off-by: bwplotka <[email protected]>
Draft PR for the initial state of a I also went ahead and updated this proposal:
Refreshed goals:
I am not sure of the following statement:
Not going to implement this in the first iteration of the |
👍🏻 from me from the perspective of allowing Prometheus to expand to support Otel without sacrificing its specificity and preferred way of doing things.
The only concern I have with the proposal is the mention of extra PromQL functionality in the future. I'm not sure we want it and I'm a bit worried if this proposal implies that we'll do it at some point. I'm not blocking the proposal since you mentioned that we might work on it!!!
I've added a few nits here and there, but overall LGTM!
Co-authored-by: Arthur Silva Sens <[email protected]> Signed-off-by: Bartlomiej Plotka <[email protected]>
Co-authored-by: Fiona Liao <[email protected]> Signed-off-by: Bartlomiej Plotka <[email protected]>
Thanks all for comments, will update this week! |
Addressed comments and slightly adjusted the proposal. Bigger changes:
PTAL for the final (hopefully) review! |
Bigger changes:
* Postponed Prom UI changes to an extension. I don't think it's healthy to do this now.
* Clarified the classic histograms and summaries challenge.
* Clarified the rejection of PromQL short-syntax for now.
* Important: I noticed the elephant in the room: type and unit values are effectively undefined. I expanded on the challenge and the next steps in the proposal. I think PromQL should define the syntax of the types it supports and a normalization for the "unknown" type. We could also consider a tight definition of units, e.g. Otel requires the ugly and complex but flexible UCUM standard for units.

Signed-off-by: bwplotka <[email protected]>
Co-authored-by: Joe Adams <[email protected]> Signed-off-by: Bartlomiej Plotka <[email protected]>
One could try to define standard translations or required subset of supported types in PromQL e.g. [the lowercase OpenMetrics types](https://github.com/prometheus/prometheus/blob/2aaafae36fc0ba53b3a56643f6d6784c3d67002a/model/textparse/openmetricsparse.go#L464). This is essential if we want to have robust type or unit aware functions and operations in PromQL one day. Also, it's critical for the proposal of noticing mixed types being passed through the PromQL engine or auto-converting units.

One alternative is to say OpenMetrics types and units and everything else, before going to PromQL should be translated to OpenMetrics defined types and units. This is a bit limited, because Prometheus does not have native support to stateset and info metrics. We also plan to add delta type, which only exists in OTLP. For units, generally is not strictly defined in OpenMetrics, and [OTLP UCUM](https://unitsofmeasure.org/ucum) generally offers more functionality (e.g. standard way of representing concrete amount of batches of units e.g. 100 seconds).
The first sentence of this paragraph is a bit confusing.
Some very small comments and questions. Overall LGTM
Thanks Bartek, David and everyone else who worked on this ❤️
Generally clients should never expose type and unit labels as it's a special label starting with `__`. However, it can totally happen by accident or for custom SDKs and exporters. That's why any existing user provided labels for `__unit__` and `__type__` should be overridden by the existing metadata mechanisms in current exposition and ingestion formats. Typeless (including unknown type), nameless and unitless entries will NOT produce any labels.

For PRW 1.0, this logic is omitted because metadata is sent separately from timeseries, making it infeasible to add the labels at ingestion time.
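One possible reading of the override rule in the quoted paragraph, as a short Python sketch. The function `apply_metadata` is hypothetical and not taken from the implementation PR. It takes the stricter interpretation in which user-supplied `__type__` and `__unit__` values are always dropped, even when the metadata offers no replacement; whether that is what the proposal intends is exactly what the next comment asks.

```python
from typing import Dict, Optional

RESERVED = ("__type__", "__unit__")


def apply_metadata(
    labels: Dict[str, str],
    metric_type: Optional[str] = None,
    unit: Optional[str] = None,
) -> Dict[str, str]:
    """Replace user-provided __type__/__unit__ with values from scrape metadata."""
    # Assumption: user-provided reserved labels never survive ingestion.
    out = {k: v for k, v in labels.items() if k not in RESERVED}
    # Typeless entries, including the "unknown" type, produce no label.
    if metric_type and metric_type != "unknown":
        out["__type__"] = metric_type
    # Unitless entries produce no label.
    if unit:
        out["__unit__"] = unit
    return out
```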
I'm not sure if I understand the sentence here... You're saying that PRW v1 will never be able to produce the `__unit__` and `__type__` labels, and we'll drop them if metrics are sent with those labels even without a proper replacement. Is that correct?
This solution solves all goals mentioned in [Goals](#goals). It also comes with certain disadvantages:

* As [@pracucci mentioned](https://github.com/prometheus/proposals/pull/39/files#r19428174750), this change will technically allow users to query for "all" counters or "all" metrics with units which will likely pose DoS/cost for operators, long term storage systems and vendors. Given existing TSDB indexing, `__type__` and `__unit__` postings will have extreme amount of series referenced. More work **has to be done to detect, handle or even forbid such selectors, on their own.** On top of that TSDB posting index size will increase too. This is however similar to any popular labels like `env=prod`.
Not sure if I'm missing something here, isn't --query.max-samples an alternative for protection?
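For illustration, "detecting such selectors on their own" could start with a check like the Python sketch below. Nothing like this is specified in the proposal; the function `is_metadata_only_selector` and the `(label, operator, value)` tuple shape are assumptions. The check is also easy to evade (for example by adding a catch-all regex matcher on another label), so it would complement rather than replace a limit such as `--query.max-samples`.

```python
from typing import Iterable, Tuple

METADATA_LABELS = {"__type__", "__unit__"}

# A matcher is modelled as (label name, operator, value), e.g. ("__type__", "=", "counter").
Matcher = Tuple[str, str, str]


def is_metadata_only_selector(matchers: Iterable[Matcher]) -> bool:
    """True if every matcher in a non-empty selector targets __type__ or __unit__."""
    names = {name for name, _op, _value in matchers}
    return bool(names) and names <= METADATA_LABELS
```

A selector such as `{__type__="counter"}` would be flagged, while `{__name__="up", __type__="gauge"}` would pass.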
## Open Questions

1. Should we actually clearly [define PromQL supported types and units in this proposal](#more-strict-unit-and-type-value-definition)? Is there a way to delegate it to another one given a scope?
I think all parsers included in Prometheus already do this validation, no? I'm not 100% sure, but I think we can't expose metric types that Prometheus doesn't support.. it will fail the scrape with errors
TODO: Explain the problem more.
LGTM with one comment on clarifying what remote write client does (or point me to where it's already defined :) )
### Prometheus Server Ingestion

When receiving OTLP or PRW 2.0, or when scraping the text, OM, or proto formats, the type and unit of the metric are interpreted and added as the `__type__` and `__unit__` labels. The Prometheus interpretation may change in the future and it depends on the parser/ingestion.
What about the Remote Write clients?
PRW 2.0 options:
- send both `__type__` and `__unit__` labels and metadata (krajo: this seems wasteful on account of duplicate data),
- do not send `__type__`, `__unit__` labels and rely on receiver to add them back based on the metadata (krajo: my choice, backwards compatible),
- send `__type__` and `__unit__` labels and no metadata (krajo: not possible because of missing help text).

PRW 1.0:
- send both `__type__` and `__unit__` labels and metadata (krajo: seems wasteful on account of no symbol table for the labels),
- do not send `__type__`, `__unit__` labels and rely on receiver to add them back based on the metadata (krajo: not possible because metadata is sent out of band),
- send `__type__` and `__unit__` labels and no metadata (krajo: not possible because of missing help text).
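The second PRW 2.0 option (do not send the labels, let the receiver add them back) can be sketched as a round trip. This is a toy model in Python, not the Remote-Write 2.0 wire format: the `WireSeries` class and the functions `to_wire` and `from_wire` are hypothetical, and the only property taken from the discussion is that PRW 2.0 metadata travels together with the series.

```python
from dataclasses import dataclass
from typing import Dict

RESERVED = ("__type__", "__unit__")


@dataclass
class WireSeries:
    """Toy stand-in for a series whose metadata is sent alongside its labels."""
    labels: Dict[str, str]
    metric_type: str = ""
    unit: str = ""
    help: str = ""


def to_wire(labels: Dict[str, str], help_text: str = "") -> WireSeries:
    """Sender side: move __type__/__unit__ out of the labels into metadata fields."""
    rest = {k: v for k, v in labels.items() if k not in RESERVED}
    return WireSeries(rest, labels.get("__type__", ""), labels.get("__unit__", ""), help_text)


def from_wire(series: WireSeries) -> Dict[str, str]:
    """Receiver side: rebuild the labels from the metadata fields."""
    labels = dict(series.labels)
    if series.metric_type:
        labels["__type__"] = series.metric_type
    if series.unit:
        labels["__unit__"] = series.unit
    return labels
```

In this model the type and unit are transmitted once (as metadata) and the stored label set is unchanged after the round trip, which matches the "backwards compatible" rationale given for that option.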
Co-authored-by: Carrie Edwards <[email protected]> Signed-off-by: Arthur Silva Sens <[email protected]>
Overall LGTM, but I had the same question as Arthur around whether user-provided type and unit labels are always removed like in the case of RW 1.0 (#39 (comment)).
Some screenshots from the Prometheus UI after this PoC: prometheus/prometheus#15683
The PoC updates all PromQL tests to demonstrate that adding type and unit labels doesn't break existing queries.
Querying for
__unit__