Skip to content

KAFKA-17503: Fix incorrect calculation of poll-idle-ratio-avg for con… #17139

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: trunk
Choose a base branch
from

Conversation

srdo
Copy link
Contributor

@srdo srdo commented Sep 9, 2024

…sumers

The calculation of the metric was slightly off because the divisor ends up covering more than one full poll cycle. It instead covers a full poll cycle, plus the duration of the latest call to poll.

As an example, say a thread consistently spends 10 ms in poll and 10 ms outside poll. We would expect a fraction of 0.5.

The metric as it was implemented would return 0.33.

At time 0, we enter poll.
At time 10, we exit poll, registering a meaningless metric value.
At time 20, we enter poll, and update timeSinceLastPollMs to 20.
At time 30, we exit poll, and compute the metric value to 10 / (10 + 20) = 0.33

This commit changes the implementation to track the timestamp when we enter and exit poll, and updates the metric when we enter a new poll cycle.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

long pollTimeMs = pollEndMs - pollStartMs;
double pollIdleRatio = pollTimeMs * 1.0 / (pollTimeMs + timeSinceLastPollMs);
this.pollIdleSensor.record(pollIdleRatio);
this.pollEndMs = pollEndMs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think this wouldn't record a metric when there's only one invocation of poll(). This is also evident from the failing KafkaConsumerTest::testPollIdleRatio test.

Copy link
Contributor Author

@srdo srdo Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. This is on purpose. If you only have 1 invocation of poll, then it doesn't make sense to ask "What fraction of the polling interval time was spent inside poll". The interval is not defined yet. The existing implementation just decides that the fraction should be 1.0, but that's not really true, and slightly skews the metric if there are few calls to poll. Until you have a full polling interval (i.e. two calls to poll), I think the metric should stay undefined.

Thanks for letting me know about the test, I've adjusted the test to account for the changes. I also added a bit to demonstrate the issue: Repeatedly running poll while spending 10ms inside and outside the poll causes the metric to approach 0.33. With these changes, the metric instead approaches 0.5.

Please let me know if the changes look reasonable. I note that there's a similar metric in KafkaShareConsumerMetricsTest. Should that be updated too? If so, let me know and I'll apply the same changes to that one.

@srdo srdo force-pushed the idle-ratio-fix branch 3 times, most recently from ce75605 to 1bdc192 Compare September 13, 2024 15:54
srdo added 2 commits October 24, 2024 22:18
…sumers

The calculation of the metric was slightly off because the divisor
ends up covering more than one full poll cycle. It instead covers a full
poll cycle, plus the duration of the latest call to poll.

As an example, say a thread consistently spends 10 ms in poll and 10 ms
outside poll. We would expect a fraction of 0.5.

The metric as it was implemented would return 0.33.

At time 0, we enter poll.
At time 10, we exit poll, registering a meaningless metric value.
At time 20, we enter poll, and update timeSinceLastPollMs to 20.
At time 30, we exit poll, and compute the metric value to
10 / (10 + 20) = 0.33

This commit changes the implementation to track the timestamp when we enter
and exit poll, and updates the metric when we enter a new poll cycle.
@gaurav-narula
Copy link
Contributor

Apologies for the late reply. I stumbled upon a similar metric in KafkaRaftMetrics which was modified in #13207. Perhaps we should consider a similar logic for parity? One concern I have is unlike the KRaft metric, which is limited to the KRaft Controller, a change in the consumer's metric would have a wider impact. WDYT @hachikuji @jsancio ?

@srdo
Copy link
Contributor Author

srdo commented Nov 11, 2024

One concern I have is unlike the KRaft metric, which is limited to the KRaft Controller, a change in the consumer's metric would have a wider impact

While this is true, I think it's also worth keeping in mind that the metric that exists today is so far off correct that it's not very helpful.

In the example of an even split between time inside and outside poll, the current metric will report 33%. That's very far from accurate.

Comment on lines 95 to 100
if (this.pollEndMs != 0) {
long pollIntervalMs = pollStartMs - this.pollStartMs;
long pollTimeMs = this.pollEndMs - this.pollStartMs;
double pollIdleRatio = pollTimeMs * 1.0 / pollIntervalMs;
this.pollIdleSensor.record(pollIdleRatio);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method would benefit from some comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add some, though I'll likely move these updates into recordPollEnd along with changing the polling interval to be based on the end timestamps instead of the start timestamps.

The poll(0ms) case you're thinking of I assume is regarding division by 0 here?

Comment on lines 3103 to 3104
// This also means the metric will always exclude the latest poll, since we don't know how much time is spent outside poll for that interval
// until poll is called again
Copy link
Contributor

@ahuang98 ahuang98 Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't this not be an issue if we instead calculated
(pollEnd - pollStart) / (pollEnd - lastPollEnd)

this would mean our metric includes the latest poll, because it updates at the end of the poll

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would probably work. I'll try changing it.

Copy link
Contributor Author

@srdo srdo Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change this, but the consequence is that we will exclude the first poll rather than the last poll.

The issue is essentially that this metric is measuring a property of the interval between two polls, which means you're going to have to either exclude either the first or the last poll depending on whether you choose to start intervals at the start or end of the poll.

If you define the poll interval as being from start to start (as I did here), that means the last poll is excluded because until the poll after that begins, the interval for the last poll has no endpoint.

If you define the poll interval as being from end to end (as you suggest), the first poll is excluded, because there is no starting point for that interval.

Here's a small illustration:

Picking the start time of the poll call as the boundary of the interval, the poll intervals for 3 sequential completed polls look like this:

[start1, start2], [start2, start3], [start3, ?].

By using these boundaries, each poll interval starts when poll is called, and ends when the subsequent poll is called. The time you spent outside of poll in an interval is the time after poll returns.

The third interval has an unknown boundary because that boundary point will be the start time of the fourth poll, and that poll hasn't started yet. This means we have to wait to record that last interval until then, which means the last poll is not included in the metric.

If we instead pick the end time as the boundary, we get these intervals:

[?, end1], [end1, end2], [end2, end3]

By using these boundaries, each poll interval starts when the previous poll returns, and ends when the current poll returns. The time you spent outside of poll in a specific interval is the time before you called poll.

The first interval has an unknown boundary, because there is no "previous poll" to get an end point from, and so that interval will be excluded from the metric.

I don't know which is preferable, they seem pretty equal to me. Do you have an opinion @ahuang98?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point, I do see you decided to try implementing TimeRatio so I'll hold off on continuing this thread

@@ -3098,31 +3098,53 @@ public void testPollIdleRatio(GroupProtocol groupProtocol) {
assertEquals(Double.NaN, consumer.metrics().get(pollIdleRatio).metricValue());

// 1st poll
// Spend 50ms in poll so value = 1.0
// Spend 50ms in poll. value=NaN because "the fraction of time spent inside poll" is undefined until the polling interval has an end point,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before the first poll, maybe worth adding an additional sleep

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can add that. Is it just because the starting time is never 0 in production, or is there something else you want to cover by doing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added this increment, but I don't understand what it's supposed to demonstrate. If you let me know what your reasoning is, I can put that in as a comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize there isn't a benefit to adding the additional sleep, recordPollStart (at least at that commit) doesn't factor in time before the first pollStart

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I'll remove the sleep again then

@ahuang98
Copy link
Contributor

Apologies for the late reply. I stumbled upon a similar metric in KafkaRaftMetrics which was modified in #13207. Perhaps we should consider a similar logic for parity? One concern I have is unlike the KRaft metric, which is limited to the KRaft Controller, a change in the consumer's metric would have a wider impact. WDYT @hachikuji @jsancio ?

Interesting, I had a read of #13207 and I would agree that it would make sense to have parity here (given that this PR has been open for some time already, seems worth it to take the additional time to get the ratio right).
Having the poll(0ms) case tested would be great

Turn timeSinceLastPollMs into a local

Add some comments to recordPollStart

Make recordPollStart handle polling intervals of 0 ms by ignoring them,
rather than letting the metric value become NaN

Try to improve the explanation in KafkaConsumerTest of how this metric is
intended to work, and why it excludes the last poll.

Move time increments in the test around a bit. A polling interval
runs from poll start to poll start, which means that the time
in an interval we consider to be "outside of poll" is the time after
the poll completes, until the next poll starts. This means the test
is easier to read if we increment the time after each poll,
rather than before each poll.

Add bit to the test that verifies the 0ms interval handling correctly ignores
such intervals.
@srdo
Copy link
Contributor Author

srdo commented Dec 22, 2024

@ahuang98 I have addressed your comments in the latest commit. Please take another look when you get a chance.

As outlined in #17139 (comment), I don't think there is any benefit to choosing to measure the intervals between poll endpoints instead of poll start points. We will have to exclude a poll from the metric either way, the only difference is whether it's the first or last poll.

So I've stuck to measuring intervals from poll start to poll start for now.

@srdo
Copy link
Contributor Author

srdo commented Dec 22, 2024

Regarding #13207, I think that is likely a better solution than the changes made here, since this suffers from exactly the problem described in that PR.

Happy to update this to use TimeRatio instead, if people agree that this is the way to go.

I have some minor concerns about that class though:

  • It's trying to give a more accurate metric by not collapsing each poll interval into a percentage. Instead, it determines the measurement interval by resetting when measure is called. If I understand the metrics code correctly, that means metrics reporters can "interfere" in the sense that the width of these windows will depend on how many metrics reporters you have, and how often they run.
  • I think measure is called by the metrics reporters, and measure mutates the state of TimeRatio. Doesn't that mean the internals of TimeRatio need to be made thread safe?
  • Since the calls to measure determine when TimeRatio resets the interval it's tracking, that means poll intervals can intersect with the measurement interval boundary, which means TimeRatio has to throw some values away (e.g. because a poll started before the current measurement interval but completed within it).
  • It could be slightly misleading to insert "fake" data points with a defaultRatio, rather than more clearly indicating that the value is not valid with e.g. NaN or -1 or something like that.

I'm unsure if these matter in practice.

The issue TimeRatio is trying to solve is that we're collapsing a bunch of not-equally-large time intervals into ratios and then averaging those ratios, which is wrong.

Would we be better off with a metric that makes the windows fixed-width?

If polling intervals are defined as running from the start of a poll until the start of the next poll, we know that the non-polling time in an interval "comes after" the polling time in that interval.

I think that would let us define fixed-width samples that can be averaged without losing precision. When we hit the first pollStart that exceeds the target time window, we can "backfill" time in the previous window because we know what kind of time is missing from it.

Example:

Say the window size is 10s

Poll 2s
Non-poll 2s
Poll 0s
Non-poll 0s
Poll 7s
Non-poll 2s

Say that we end up calling pollStart at t=13s

When we call pollStart, the counters which track all but the most recent poll would say we've spent 2s in poll and 2s outside poll. We'd know that we've accounted for 4/10 seconds.

We can then look at the latest poll and "fill in" the interval, drawing first from polling seconds (since we know poll "comes before" non-poll), and next from non-polling seconds to hit the exact window size.

In this case we're missing 6 seconds, so we pull that from the poll bucket, and the final numbers for the interval are 2+6 seconds in poll, and 2 seconds outside poll, meaning we'd log a sample ratio of 0.8.

The final 1 polling second and 2 non-poll seconds roll over to the next window. We'd start the next window at the end of the previous window, and add the remaning seconds to it. So the next interval in this case starts at 10s, and starts out with 1s spent polling and 2s spent not polling.

If the polling interval is much larger than the window size, we can apply the rollover method repeatedly. For example, if the window size is 10s and a poll interval takes 62s, a pollStart call can just submit 6 samples for the 6 intervals covered since the last pollStart, and carry over the last few seconds into the next interval.

I think this would let us accurately track the average ratio, without too many edge cases. The metrics reporters would just use Avg and SampledStat to get their value, like they do now. This means reporters can't interfere with each other. The samples for the fixed-size intervals would "age out" the same way samples do today.

This is to demonstrate the idea, needs polish before merging.
@github-actions github-actions bot removed the small Small PRs label Dec 26, 2024
@srdo
Copy link
Contributor Author

srdo commented Dec 26, 2024

I've implemented the above idea in the latest commit, as input to the discussion. It's not quite mergeable, but it should be good enough to show that the idea can work. Rather than doing approximation like TimeRatio, it instead divides time into fixed-size windows, and submits one sample to the metric per window.

I believe this fixes the issue TimeRatio is also trying to solve, which is that it's bad to do averages over time intervals that are variable size. TimeRatio fixes it by not doing an average and instead ending each sampling window when a reporter asks for a value. This commit fixes it by making the intervals all equal in size.

I think the benefits over TimeRatio is that this metric is exact and doesn't have any edge cases related to 0-wide polling intervals, nor does it have any potential threading issues or impact from having multiple reporters looking at the metric.

The drawback is that the metric will be slightly behind, since the measured time within a sampling window is only recorded in the metric once the window closes. For example, if you have a sampling window size of 10 seconds, the metric will be showing values for polls between 0 and 10 seconds in the past, depending on how far through the window you are.

If we decide that we'd like to go with this type of metric over the TimeRatio approximation, I'd be happy to replace TimeRatio with the new fixed-window version, and adapt the tests as necessary.

Otherwise, let me know and I can make this metric use the current version of TimeRatio instead.

@@ -68,6 +67,7 @@ public KafkaConsumerMetrics(Metrics metrics, String metricGrpPrefix) {
metricGroupName,
"The average fraction of time the consumer's poll() is idle as opposed to waiting for the user code to process records."),
new Avg());
this.pollIdleRatio = new TimeRatio2(metrics.config().timeWindowMs(), pollIdleSensor);
Copy link
Contributor Author

@srdo srdo Dec 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit unsure whether it would be better to select a smaller time window here. The defaults for the metrics system is a window of 30 seconds, which is a bit large for this purpose, and it also only keeps 2 samples. It might make sense to aim for e.g. a sample every few seconds, or 1/5th the window size the metric system uses, or something like that.

@ahuang98
Copy link
Contributor

  • Since the calls to measure determine when TimeRatio resets the interval it's tracking, that means poll intervals can intersect with the measurement interval boundary, which means TimeRatio has to throw some values away (e.g. because a poll started before the current measurement interval but completed within it).

I see, this was hard to visualize, but is the issue essentially - if poll() was called at time=10, TimeRatio.measure() called at time=20, and poll had not completed yet, the 10 ongoing seconds of polling time will not accounted for in the calculated ratio. It is actually treated as non-polling time as well?

@ahuang98
Copy link
Contributor

  • It's trying to give a more accurate metric by not collapsing each poll interval into a percentage. Instead, it determines the measurement interval by resetting when measure is called. If I understand the metrics code correctly, that means metrics reporters can "interfere" in the sense that the width of these windows will depend on how many metrics reporters you have, and how often they run.
  • I think measure is called by the metrics reporters, and measure mutates the state of TimeRatio. Doesn't that mean the internals of TimeRatio need to be made thread safe?

I don't have enough knowledge with how folks generally configure metric reporters to comment on this, but these concerns seem reasonable. I'll see if I can get a second opinion on this

I am having a hard time following along with your example - specifically starting at this point

We can then look at the latest poll and "fill in" the interval, drawing first from polling seconds (since we know poll "comes before" non-poll), and next from non-polling seconds to hit the exact window size.

What I've gleaned is that with a window size of 10s, we are able to calculate the ratio for window 1 (which ends during the 6th second of the third poll) and window 2 is partial and includes everything after that.

@srdo
Copy link
Contributor Author

srdo commented Jan 14, 2025

I see, this was hard to visualize, but is the issue essentially - if poll() was called at time=10, TimeRatio.measure() called at time=20, and poll had not completed yet, the 10 ongoing seconds of polling time will not accounted for in the calculated ratio. It is actually treated as non-polling time as well?

Thanks for reviewing.

Yes. Polling time can essentially "leak" as long as a poll starts before the measurement window, and finishes during the window.

Example:
Assume we've called measure at t=0, which initialized the measurement window.
Call poll at t=10
Call measure at t=20. This reports 0 in the metric as the poll isn't done yet.
Complete the poll at t=30
Call measure at t=31.

We now have 20 ms of polling time inside an 11 ms measurement window. This causes the metric to report 1.0 and reset the counters to 0.

Those remaining 9 ms of polling time have "leaked" at this point, because they're simply lost from the tracking. The next time we call measure, it will not include those 9 ms, because we discarded them when resetting the counters.

A way to visualize this is that if we have any incomplete polls when we began a measurement window, then when the measurement window ends, we will try to "squeeze in" the pre-window-start polling time by pretending it actually occurred within the measurement window. If there isn't room for that time inside the window, we'll truncate it and report 1.0 for that window.

But either way, moving the time to the measurement window is a bit incorrect. If it doesn't fit, we lose time. If it does fit, we report a lower idle time for the window than we really should, because we included polling time in the window that actually didn't belong in that window.

That being said, the inaccuracy this introduces is probably minor as long as the measurement windows are large in comparison to the time it takes to run poll, since if you have very short polls and very long windows, only a small fraction of the window's tracked polling time can be coming from before the window began.

@srdo
Copy link
Contributor Author

srdo commented Jan 14, 2025

I am having a hard time following along with your example - specifically starting at this point

We can then look at the latest poll and "fill in" the interval, drawing first from polling seconds (since we know poll "comes before" non-poll), and next from non-polling seconds to hit the exact window size.

What I've gleaned is that with a window size of 10s, we are able to calculate the ratio for window 1 (which ends during the 6th second of the third poll) and window 2 is partial and includes everything after that.

Yes, exactly. The basic idea is to let each window be flushed to the metric as soon as we know we've accounted for all seconds from that window. Since windows are fixed-size, we can know when a window is ready to flush.

So in the example, once we finish the third poll, we have enough seconds to complete window 1, because we have 2 seconds from the first poll, 2 non-polling seconds, and then 6 polling seconds from the third poll.

The last second from the third poll then gets "saved up" for the next window which isn't complete yet. The same thing happens to the two non-polling seconds from the third poll. At some future point, we'll complete polls enough to get the remaining 7 seconds, and then window 2 can be flushed.

In the implementation the point where we account for the latest poll's seconds and consider whether to flush the open window is in pollStart.

@ahuang98
Copy link
Contributor

Hm, I take back at least this portion of my statement

It is actually treated as non-polling time as well?

It looks like measure ignores currentTimestamp, so it will exclude the current partial poll time from both the divisor and dividend.

double intervalDurationMs = Math.max(lastRecordedTimestampMs - intervalStartTimestampMs, 0);

@ahuang98
Copy link
Contributor

ahuang98 commented Jan 14, 2025

Complete the poll at t=30
Call measure at t=31.
We now have 20 ms of polling time inside an 11 ms measurement window. This causes the metric to report 1.0 and reset the counters to 0.

[edited] I see, I wrote the following test for understanding if partial polls would be accounted for correctly (and it seems to be the case) but I just duplicated your scenario and it doesn't seem to behave nicely (just as you mentioned)

    @Test
    public void testPartialPoll() {
        MetricConfig config = new MetricConfig();
        MockTime time = new MockTime();
        TimeRatio ratio = new TimeRatio(1.0);

        ratio.record(config, 0.0, time.milliseconds());
        time.sleep(20);
        ratio.record(config, 10, time.milliseconds());

        time.sleep(10);
        // pretend a poll() starts at this point
        time.sleep(10);

        // We ignore the last 20ms that lapsed after the last record() was called. Cumulative value captured by calls
        // to record since last call to measure is 10, lastRecordedTimestampMs - intervalStartTimestampMs = 20s - 0s.
        assertEquals(0.5, ratio.measure(config, time.milliseconds()));

        // Now the current poll() finishes.
        ratio.record(config, 10, time.milliseconds());
        // Cumulative value captured by calls to record since last call to measure is 10,
        // lastRecordedTimestampMs - intervalStartTimestampMs = 40s - 20s.
        assertEquals(0.5, ratio.measure(config, time.milliseconds()));

        // If we had calculated the ratio over the entire window of the test (40s), the ratio would have been
        // (10 + 10) / 40 = 0.5.
    }

@ahuang98
Copy link
Contributor

ahuang98 commented Jan 14, 2025

Hm, okay your edge case only happens on the call to measure after the first call to record. All following measurements shouldn't have this issue of 'leaking' poll time (because lastRecordedTimestampMs will not be -1)

e.g.

That being said, the inaccuracy this introduces is probably minor as long as the measurement windows are large in comparison to the time it takes to run poll, since if you have very short polls and very long windows, only a small fraction of the window's tracked polling time can be coming from before the window began.

Only on startup would we see at most two measurement windows with an inflated value (first window will always report 1.0, and the window where the poll ended due to the edge case you found) if poll time greatly surpasses the measurement window.

@srdo
Copy link
Contributor Author

srdo commented Jan 14, 2025

You are completely correct. My mistake. TimeRatio can't "lose time" the way I thought. The scenario I described was wrong, I misunderstood the code.

So if the two other concerns mentioned above (thread safety, the effect of multiple metrics reporters) turn out to not be relevant, I guess TimeRatio is fine as is?

Not sure if this matters in practice, so maybe just for fun: Here's another example demonstrating funny metric outputs:

    @Test
    public void testPartialPoll() {
        MetricConfig config = new MetricConfig();
        MockTime time = new MockTime();
        TimeRatio ratio = new TimeRatio(1.0);

        ratio.record(config, 0.0, time.milliseconds());
        time.sleep(10);
        // Pretend poll starts here
        time.sleep(10);

        //This emits the default value because the poll isn't done, so the current window has size 0
        assertEquals(1.0, ratio.measure(config, time.milliseconds()));

        time.sleep(10);
        ratio.record(config, 20, time.milliseconds());

        // This is technically the correct value, but this measurement ends up covering the entire time interval
        // and we're still left with a meaningless 1.0 metric value above.
        // When a user looks at this in their metrics, they'll see 1.0 followed by 0.66, which is pretty hard to interpret.
        assertEquals(2 / 3.0, ratio.measure(config, time.milliseconds()));
        
        // To demonstrate that this can be repeated
        // Pretend another poll starts here
        time.sleep(10);
        // We again get a meaningless 1.0 metric since the window is 0 in size
        assertEquals(1.0, ratio.measure(config, time.milliseconds()));
        time.sleep(10);
        ratio.record(config, 10, time.milliseconds());
        // followed by the "real" value in this call
        assertEquals(0.5, ratio.measure(config, time.milliseconds()));
    }

In this test, the user will observe the following emitted metric values: 1.0, 0.66, 1.0, 0.5, even though two of those values are meaningless. This happens if measure is called twice in a row without a completed poll between them.

@ahuang98
Copy link
Contributor

(haven't read your last example yet, but will today) I think your concern about multithreading behavior is still valid though, if we do decide to change how we calculate poll ratios, perhaps it may make sense to introduce a KIP to get more eyes on the implementation (since this may impact the values of both the consumer and raft poll metric)

@ahuang98
Copy link
Contributor

Basically the metric can oscillate if the reporting window is consistently skewed with when polls end. I think the same issue could occur even with pre-determined window sizes right?

@srdo
Copy link
Contributor Author

srdo commented Jan 14, 2025

The example I posted above isn't showing a real oscillation, it's showing that we might report the metric default value (in the example, this is 1.0) if measure is called twice without a poll completing between the two calls. The 1.0 values aren't "real", they're placeholder values inserted because measure must return some value when called.

@ahuang98
Copy link
Contributor

@srdo what did you think about the KIP? Your point about the multithreading issue is valid, and since fixing TimeRatio would impact the metric reporting for the existing idle ratio metric it would be good to get more eyes on your proposed solution

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants