Proposal: OTEL delta temporality support #48
base: main
Conversation
Force-pushed from 1c3370c to 3c9ea52
Signed-off-by: Fiona Liao <[email protected]>
Force-pushed from 3c9ea52 to 8094034
I'm planning to make some updates to the proposal after the dev summit (the consensus was that we want to support push-based use cases and continue to explore delta support) and some additional discussion with @beorn7. Based on these discussions, I also have a PR up for basic native delta support (behind a feature flag) in Prometheus - this just stores the sample value at

Additional updates to make:
I'm out for the next week, but will apply the updates after that
* Simplified proposal - moved CT-per-sample to a possible future extension instead of embedding it within the proposal
* Changed the proposal to use a new `__temporality__` label instead of extending `__type__` - probably better to keep the metric type concept distinct from metric temporality. This also aligns with how OTEL models it.
* Updated the remote-write section - delta ingestion will actually be fully supported via remote write (since CT-per-sample is moved out of the main proposal for now)
* Moved the temporary `delta_rate()` and `delta_increase()` function suggestion to discarded alternatives - not sure this is actually necessary if we have a feature flag for temporality-aware functions anyway
* Fleshed out the implementation plan
Updates:
#### rate() calculation
In general: `(sum of sample values from the second to the last) / (last sample ts - first sample ts) * range`. We skip the value of the first sample as we do not know its interval.
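A rough sketch of this calculation in Go (the `sample` type and `deltaRate` helper are hypothetical illustrations of the formula, not the actual PromQL engine code):

```go
package main

import "fmt"

// sample is a hypothetical (timestamp, value) pair; timestamps in seconds.
type sample struct {
	ts float64
	v  float64
}

// deltaRate sketches the proposed calculation: sum every sample value
// except the first (whose covered interval is unknown), extrapolate the
// summed increase across the full range, and divide by the range to get
// a per-second rate.
func deltaRate(samples []sample, rangeSeconds float64) (float64, bool) {
	if len(samples) < 2 {
		return 0, false // a single sample gives no interval to work with
	}
	var sum float64
	for _, s := range samples[1:] { // skip the first sample
		sum += s.v
	}
	spanned := samples[len(samples)-1].ts - samples[0].ts
	increase := sum / spanned * rangeSeconds // extrapolate to the range
	return increase / rangeSeconds, true    // rate is increase per second
}

func main() {
	s := []sample{{ts: 0, v: 5}, {ts: 30, v: 2}, {ts: 60, v: 4}}
	if r, ok := deltaRate(s, 90); ok {
		fmt.Printf("rate: %g/s\n", r) // (2+4)/60 = 0.1
	}
}
```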
> We skip the value of the first sample as we do not know its interval.
Perhaps we could get utility out of the first sample's value by guessing that each sample's interval is equal to the average time between samples in the window. One motivation for this is a case that we see often with non-sparse deltas produced by an intermediate processor.
Suppose the actual instrumented app is sending delta samples at a regular 60s interval. We'll assume for simplicity that these deltas are always greater than zero. Then there is an intermediate processor that's configured to accumulate data and flush every 30s. To avoid sparseness, it's configured to flush a 0 value if nothing was seen.
The data stream will look like this, with a sample every 30s:
`5 0 2 0 10 0 8 0`
Note that every other value is 0 because of the mismatch between the flush intervals of the instrumented app and the intermediate processor.
If we then do a `rate(...[1m])` on this timeseries, with the current proposal, we might end up with the 1m windows broken up like this:

`5 0 | 2 0 | 10 0 | 8 0`
If we can't make use of the value from the first sample in each window, we will end up computing a rate of 0 for all of these windows. That feels like it fails to make use of all the available information, since as humans we can clearly see that the rate was not zero.
If instead we guess that each sample represents the delta for a 30s interval, because that's the average distance between the two datapoints in the window, then we will compute the correct rates. Of course it's only a guess, and you could contrive a scenario that would fool it, but the idea would be to assume that the kind of scenario described here is more likely than those weird ones.
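A sketch of that guess in Go (again with a hypothetical `sample` type and helper name, not a concrete implementation proposal):

```go
package main

import "fmt"

type sample struct {
	ts float64 // seconds
	v  float64
}

// guessedDeltaRate implements the guess described above: assume each
// sample, including the first, covers an interval equal to the average
// spacing between samples in the window, so the first value counts too.
func guessedDeltaRate(samples []sample) (float64, bool) {
	if len(samples) < 2 {
		return 0, false
	}
	spanned := samples[len(samples)-1].ts - samples[0].ts
	avgInterval := spanned / float64(len(samples)-1)
	var sum float64
	for _, s := range samples { // include the first sample this time
		sum += s.v
	}
	covered := avgInterval * float64(len(samples)) // guessed covered time
	return sum / covered, true
}

func main() {
	// One of the 1m windows from the example: samples 30s apart.
	w := []sample{{ts: 0, v: 5}, {ts: 30, v: 0}}
	if r, ok := guessedDeltaRate(w); ok {
		fmt.Printf("rate: %g/s\n", r) // 5/60 ≈ 0.083 rather than 0
	}
}
```

For the `5 0` window above this returns 5/60 rather than 0, which matches the true underlying rate in the intermediate-processor scenario.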
A cumulative series could generate the same values (with the zeroes being where the series resets). And in that case rate() would return no results. So though this doesn't accurately capture the rate, the behaviour would be consistent for cumulative and delta metrics.
However, the first point in a window is more meaningful in the delta case - you know it's a delta from the preceding sample, while in the cumulative case you have to look outside the window to get the same information, so maybe we should do better because of that. That example leans me more towards "just do sum_over_time() / range for delta rate()" - in this case that would probably give more useful information. Or at least do that before CT-per-sample is available, at which point we'd have more accurate interval data.
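A sketch of that simpler alternative (hypothetical `sample` type and helper name as before):

```go
package main

import "fmt"

type sample struct {
	ts float64 // seconds
	v  float64
}

// deltaRateSimple treats delta rate() as sum_over_time / range: add up
// every sample in the window, first included, and divide by the window
// size, with no extrapolation.
func deltaRateSimple(samples []sample, rangeSeconds float64) float64 {
	var sum float64
	for _, s := range samples {
		sum += s.v
	}
	return sum / rangeSeconds
}

func main() {
	w := []sample{{ts: 0, v: 5}, {ts: 30, v: 0}}
	fmt.Printf("rate: %g/s\n", deltaRateSimple(w, 60)) // 5/60, not 0
}
```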
Downsides:
* This will not work if there is only a single sample in the range, which is more likely with delta metrics (due to sparseness, or being used in short-lived jobs).
For cumulative samples, it makes sense that with a single sample in the window you can't guess anything at all about how much increase happened in that window. With a single delta sample, even if we don't know the start time, we should be able to make a better guess than "no increase happened".
For example, we could guess that the interval is equal to the window size -- in other words return the single delta value as is with no extrapolation. The argument would be that you picked an arbitrary window of size T and found 1 sample, so the best guess for the frequency of the samples is 1/T. This seems like it would be more useful on average than returning no value in the case of a single sample.
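As a minimal sketch of that guess (`singleDeltaRate` is a hypothetical name, not proposed API):

```go
package main

import "fmt"

// singleDeltaRate sketches the single-sample guess above: with one delta
// sample in a window of size T, guess its interval is T, so the increase
// is the value itself and the rate is value/T, with no extrapolation.
func singleDeltaRate(value, rangeSeconds float64) (increase, rate float64) {
	return value, value / rangeSeconds
}

func main() {
	inc, rate := singleDeltaRate(5, 60) // one delta of 5 in a 1m window
	fmt.Printf("increase=%g rate=%g/s\n", inc, rate)
}
```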
My concern is with mixing extrapolation and non-extrapolation logic because that might end up surprising users.
If we do decide to generally extrapolate to fill the whole window, but have this special case for a single datapoint, someone might rely on the non-extrapolation behaviour and get surprised when there are two points and the behaviour changes.
…on type and metadata
Force-pushed from d0474da to f2433c8
Next steps for this proposal:
A proposal for prometheus/prometheus#12763