-
Notifications
You must be signed in to change notification settings - Fork 14.3k
KAFKA-17503: Fix incorrect calculation of poll-idle-ratio-avg for con… #17139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
long pollTimeMs = pollEndMs - pollStartMs; | ||
double pollIdleRatio = pollTimeMs * 1.0 / (pollTimeMs + timeSinceLastPollMs); | ||
this.pollIdleSensor.record(pollIdleRatio); | ||
this.pollEndMs = pollEndMs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
think this wouldn't record a metric when there's only one invocation of poll(). This is also evident from the failing KafkaConsumerTest::testPollIdleRatio
test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. This is on purpose. If you only have 1 invocation of poll
, then it doesn't make sense to ask "What fraction of the polling interval time was spent inside poll". The interval is not defined yet. The existing implementation just decides that the fraction should be 1.0, but that's not really true, and slightly skews the metric if there are few calls to poll. Until you have a full polling interval (i.e. two calls to poll
), I think the metric should stay undefined.
Thanks for letting me know about the test, I've adjusted the test to account for the changes. I also added a bit to demonstrate the issue: Repeatedly running poll while spending 10ms inside and outside the poll causes the metric to approach 0.33. With these changes, the metric instead approaches 0.5.
Please let me know if the changes look reasonable. I note that there's a similar metric in KafkaShareConsumerMetricsTest
. Should that be updated too? If so, let me know and I'll apply the same changes to that one.
ce75605
to
1bdc192
Compare
…sumers The calculation of the metric was slightly off because the divisor ends up covering more than one full poll cycle. It instead covers a full poll cycle, plus the duration of the latest call to poll. As an example, say a thread consistently spends 10 ms in poll and 10 ms outside poll. We would expect a fraction of 0.5. The metric as it was implemented would return 0.33. At time 0, we enter poll. At time 10, we exit poll, registering a meaningless metric value. At time 20, we enter poll, and update timeSinceLastPollMs to 20. At time 30, we exit poll, and compute the metric value to 10 / (10 + 20) = 0.33 This commit changes the implementation to track the timestamp when we enter and exit poll, and updates the metric when we enter a new poll cycle.
Apologies for the late reply. I stumbled upon a similar metric in KafkaRaftMetrics which was modified in #13207. Perhaps we should consider a similar logic for parity? One concern I have is unlike the KRaft metric, which is limited to the KRaft Controller, a change in the consumer's metric would have a wider impact. WDYT @hachikuji @jsancio ? |
While this is true, I think it's also worth keeping in mind that the metric that exists today is so far off correct that it's not very helpful. In the example of an even split between time inside and outside poll, the current metric will report 33%. That's very far from accurate. |
if (this.pollEndMs != 0) { | ||
long pollIntervalMs = pollStartMs - this.pollStartMs; | ||
long pollTimeMs = this.pollEndMs - this.pollStartMs; | ||
double pollIdleRatio = pollTimeMs * 1.0 / pollIntervalMs; | ||
this.pollIdleSensor.record(pollIdleRatio); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method would benefit from some comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add some, though I'll likely move these updates into recordPollEnd along with changing the polling interval to be based on the end timestamps instead of the start timestamps.
The poll(0ms) case you're thinking of I assume is regarding division by 0 here?
// This also means the metric will always exclude the latest poll, since we don't know how much time is spent outside poll for that interval | ||
// until poll is called again |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't this not be an issue if we instead calculated
(pollEnd - pollStart) / (pollEnd - lastPollEnd)
this would mean our metric includes the latest poll, because it updates at the end of the poll
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would probably work. I'll try changing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change this, but the consequence is that we will exclude the first poll rather than the last poll.
The issue is essentially that this metric is measuring a property of the interval between two polls, which means you're going to have to either exclude either the first or the last poll depending on whether you choose to start intervals at the start or end of the poll.
If you define the poll interval as being from start to start (as I did here), that means the last poll is excluded because until the poll after that begins, the interval for the last poll has no endpoint.
If you define the poll interval as being from end to end (as you suggest), the first poll is excluded, because there is no starting point for that interval.
Here's a small illustration:
Picking the start time of the poll call as the boundary of the interval, the poll intervals for 3 sequential completed polls look like this:
[start1, start2], [start2, start3], [start3, ?].
By using these boundaries, each poll interval starts when poll is called, and ends when the subsequent poll is called. The time you spent outside of poll in an interval is the time after poll returns.
The third interval has an unknown boundary because that boundary point will be the start time of the fourth poll, and that poll hasn't started yet. This means we have to wait to record that last interval until then, which means the last poll is not included in the metric.
If we instead pick the end time as the boundary, we get these intervals:
[?, end1], [end1, end2], [end2, end3]
By using these boundaries, each poll interval starts when the previous poll returns, and ends when the current poll returns. The time you spent outside of poll in a specific interval is the time before you called poll.
The first interval has an unknown boundary, because there is no "previous poll" to get an end point from, and so that interval will be excluded from the metric.
I don't know which is preferable, they seem pretty equal to me. Do you have an opinion @ahuang98?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair point, I do see you decided to try implementing TimeRatio so I'll hold off on continuing this thread
@@ -3098,31 +3098,53 @@ public void testPollIdleRatio(GroupProtocol groupProtocol) { | |||
assertEquals(Double.NaN, consumer.metrics().get(pollIdleRatio).metricValue()); | |||
|
|||
// 1st poll | |||
// Spend 50ms in poll so value = 1.0 | |||
// Spend 50ms in poll. value=NaN because "the fraction of time spent inside poll" is undefined until the polling interval has an end point, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before the first poll, maybe worth adding an additional sleep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can add that. Is it just because the starting time is never 0 in production, or is there something else you want to cover by doing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added this increment, but I don't understand what it's supposed to demonstrate. If you let me know what your reasoning is, I can put that in as a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize there isn't a benefit to adding the additional sleep, recordPollStart (at least at that commit) doesn't factor in time before the first pollStart
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I'll remove the sleep again then
Interesting, I had a read of #13207 and I would agree that it would make sense to have parity here (given that this PR has been open for some time already, seems worth it to take the additional time to get the ratio right). |
Turn timeSinceLastPollMs into a local Add some comments to recordPollStart Make recordPollStart handle polling intervals of 0 ms by ignoring them, rather than letting the metric value become NaN Try to improve the explanation in KafkaConsumerTest of how this metric is intended to work, and why it excludes the last poll. Move time increments in the test around a bit. A polling interval runs from poll start to poll start, which means that the time in an interval we consider to be "outside of poll" is the time after the poll completes, until the next poll starts. This means the test is easier to read if we increment the time after each poll, rather than before each poll. Add bit to the test that verifies the 0ms interval handling correctly ignores such intervals.
@ahuang98 I have addressed your comments in the latest commit. Please take another look when you get a chance. As outlined in #17139 (comment), I don't think there is any benefit to choosing to measure the intervals between poll endpoints instead of poll start points. We will have to exclude a poll from the metric either way, the only difference is whether it's the first or last poll. So I've stuck to measuring intervals from poll start to poll start for now. |
Regarding #13207, I think that is likely a better solution than the changes made here, since this suffers from exactly the problem described in that PR. Happy to update this to use TimeRatio instead, if people agree that this is the way to go. I have some minor concerns about that class though:
I'm unsure if these matter in practice. The issue TimeRatio is trying to solve is that we're collapsing a bunch of not-equally-large time intervals into ratios and then averaging those ratios, which is wrong. Would we be better off with a metric that makes the windows fixed-width? If polling intervals are defined as running from the start of a poll until the start of the next poll, we know that the non-polling time in an interval "comes after" the polling time in that interval. I think that would let us define fixed-width samples that can be averaged without losing precision. When we hit the first pollStart that exceeds the target time window, we can "backfill" time in the previous window because we know what kind of time is missing from it. Example: Say the window size is 10s Poll 2s Say that we end up calling pollStart at t=13s When we call pollStart, the counters which track all but the most recent poll would say we've spent 2s in poll and 2s outside poll. We'd know that we've accounted for 4/10 seconds. We can then look at the latest poll and "fill in" the interval, drawing first from polling seconds (since we know poll "comes before" non-poll), and next from non-polling seconds to hit the exact window size. In this case we're missing 6 seconds, so we pull that from the poll bucket, and the final numbers for the interval are 2+6 seconds in poll, and 2 seconds outside poll, meaning we'd log a sample ratio of 0.8. The final 1 polling second and 2 non-poll seconds roll over to the next window. We'd start the next window at the end of the previous window, and add the remaning seconds to it. So the next interval in this case starts at 10s, and starts out with 1s spent polling and 2s spent not polling. If the polling interval is much larger than the window size, we can apply the rollover method repeatedly. For example, if the window size is 10s and a poll interval takes 62s, a pollStart call can just submit 6 samples for the 6 intervals covered since the last pollStart, and carry over the last few seconds into the next interval. I think this would let us accurately track the average ratio, without too many edge cases. The metrics reporters would just use |
This is to demonstrate the idea, needs polish before merging.
I've implemented the above idea in the latest commit, as input to the discussion. It's not quite mergeable, but it should be good enough to show that the idea can work. Rather than doing approximation like TimeRatio, it instead divides time into fixed-size windows, and submits one sample to the metric per window. I believe this fixes the issue TimeRatio is also trying to solve, which is that it's bad to do averages over time intervals that are variable size. TimeRatio fixes it by not doing an average and instead ending each sampling window when a reporter asks for a value. This commit fixes it by making the intervals all equal in size. I think the benefits over TimeRatio is that this metric is exact and doesn't have any edge cases related to 0-wide polling intervals, nor does it have any potential threading issues or impact from having multiple reporters looking at the metric. The drawback is that the metric will be slightly behind, since the measured time within a sampling window is only recorded in the metric once the window closes. For example, if you have a sampling window size of 10 seconds, the metric will be showing values for polls between 0 and 10 seconds in the past, depending on how far through the window you are. If we decide that we'd like to go with this type of metric over the TimeRatio approximation, I'd be happy to replace TimeRatio with the new fixed-window version, and adapt the tests as necessary. Otherwise, let me know and I can make this metric use the current version of TimeRatio instead. |
@@ -68,6 +67,7 @@ public KafkaConsumerMetrics(Metrics metrics, String metricGrpPrefix) { | |||
metricGroupName, | |||
"The average fraction of time the consumer's poll() is idle as opposed to waiting for the user code to process records."), | |||
new Avg()); | |||
this.pollIdleRatio = new TimeRatio2(metrics.config().timeWindowMs(), pollIdleSensor); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit unsure whether it would be better to select a smaller time window here. The defaults for the metrics system is a window of 30 seconds, which is a bit large for this purpose, and it also only keeps 2 samples. It might make sense to aim for e.g. a sample every few seconds, or 1/5th the window size the metric system uses, or something like that.
I see, this was hard to visualize, but is the issue essentially - if poll() was called at time=10, TimeRatio.measure() called at time=20, and poll had not completed yet, the 10 ongoing seconds of polling time will not accounted for in the calculated ratio. It is actually treated as non-polling time as well? |
I don't have enough knowledge with how folks generally configure metric reporters to comment on this, but these concerns seem reasonable. I'll see if I can get a second opinion on this I am having a hard time following along with your example - specifically starting at this point
What I've gleaned is that with a window size of 10s, we are able to calculate the ratio for window 1 (which ends during the 6th second of the third poll) and window 2 is partial and includes everything after that. |
Thanks for reviewing. Yes. Polling time can essentially "leak" as long as a poll starts before the measurement window, and finishes during the window. Example: We now have 20 ms of polling time inside an 11 ms measurement window. This causes the metric to report 1.0 and reset the counters to 0. Those remaining 9 ms of polling time have "leaked" at this point, because they're simply lost from the tracking. The next time we call A way to visualize this is that if we have any incomplete polls when we began a measurement window, then when the measurement window ends, we will try to "squeeze in" the pre-window-start polling time by pretending it actually occurred within the measurement window. If there isn't room for that time inside the window, we'll truncate it and report 1.0 for that window. But either way, moving the time to the measurement window is a bit incorrect. If it doesn't fit, we lose time. If it does fit, we report a lower idle time for the window than we really should, because we included polling time in the window that actually didn't belong in that window. That being said, the inaccuracy this introduces is probably minor as long as the measurement windows are large in comparison to the time it takes to run |
Yes, exactly. The basic idea is to let each window be flushed to the metric as soon as we know we've accounted for all seconds from that window. Since windows are fixed-size, we can know when a window is ready to flush. So in the example, once we finish the third poll, we have enough seconds to complete window 1, because we have 2 seconds from the first poll, 2 non-polling seconds, and then 6 polling seconds from the third poll. The last second from the third poll then gets "saved up" for the next window which isn't complete yet. The same thing happens to the two non-polling seconds from the third poll. At some future point, we'll complete polls enough to get the remaining 7 seconds, and then window 2 can be flushed. In the implementation the point where we account for the latest poll's seconds and consider whether to flush the open window is in |
Hm, I take back at least this portion of my statement
It looks like measure ignores currentTimestamp, so it will exclude the current partial poll time from both the divisor and dividend.
|
[edited] I see, I wrote the following test for understanding if partial polls would be accounted for correctly (and it seems to be the case) but I just duplicated your scenario and it doesn't seem to behave nicely (just as you mentioned)
|
Hm, okay your edge case only happens on the call to measure after the first call to record. All following measurements shouldn't have this issue of 'leaking' poll time (because e.g.
Only on startup would we see at most two measurement windows with an inflated value (first window will always report 1.0, and the window where the poll ended due to the edge case you found) if poll time greatly surpasses the measurement window. |
You are completely correct. My mistake. TimeRatio can't "lose time" the way I thought. The scenario I described was wrong, I misunderstood the code. So if the two other concerns mentioned above (thread safety, the effect of multiple metrics reporters) turn out to not be relevant, I guess TimeRatio is fine as is? Not sure if this matters in practice, so maybe just for fun: Here's another example demonstrating funny metric outputs:
In this test, the user will observe the following emitted metric values: 1.0, 0.66, 1.0, 0.5, even though two of those values are meaningless. This happens if |
(haven't read your last example yet, but will today) I think your concern about multithreading behavior is still valid though, if we do decide to change how we calculate poll ratios, perhaps it may make sense to introduce a KIP to get more eyes on the implementation (since this may impact the values of both the consumer and raft poll metric) |
Basically the metric can oscillate if the reporting window is consistently skewed with when polls end. I think the same issue could occur even with pre-determined window sizes right? |
The example I posted above isn't showing a real oscillation, it's showing that we might report the metric default value (in the example, this is 1.0) if |
@srdo what did you think about the KIP? Your point about the multithreading issue is valid, and since fixing TimeRatio would impact the metric reporting for the existing idle ratio metric it would be good to get more eyes on your proposed solution |
…sumers
The calculation of the metric was slightly off because the divisor ends up covering more than one full poll cycle. It instead covers a full poll cycle, plus the duration of the latest call to poll.
As an example, say a thread consistently spends 10 ms in poll and 10 ms outside poll. We would expect a fraction of 0.5.
The metric as it was implemented would return 0.33.
At time 0, we enter poll.
At time 10, we exit poll, registering a meaningless metric value.
At time 20, we enter poll, and update timeSinceLastPollMs to 20.
At time 30, we exit poll, and compute the metric value to 10 / (10 + 20) = 0.33
This commit changes the implementation to track the timestamp when we enter and exit poll, and updates the metric when we enter a new poll cycle.
Committer Checklist (excluded from commit message)