KAFKA-20131: ClassicKafkaConsumer does not clear endOffsetRequested flag on failed LIST_OFFSETS calls by kirktrue · Pull Request #21457 · apache/kafka

kirktrue · 2026-02-11T19:03:16Z

Updates the ClassicKafkaConsumer to clear out the SubscriptionState
endOffsetRequested flag if the LIST_OFFSETS call fails.

Reviewers: Viktor Somogyi-Vass <viktorsomogyi@gmail.com>, Lianet Magrans
<lmagrans@confluent.io>, Andrew Schofield <aschofield@confluent.io>
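The sketch below shows the failure mode in minimal Java. It tracks the lifecycle of a per-partition "end offset requested" flag. `EndOffsetState` and all of its methods are hypothetical stand-ins, not Kafka's `SubscriptionState` API. The flag that is set when a LIST_OFFSETS request goes out has to be cleared on the failure path as well as the success path. If it is not, the partition presumably keeps looking as if a request were still outstanding.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified per-partition state; not Kafka's SubscriptionState API.
class EndOffsetState {
    private final Set<String> endOffsetRequested = new HashSet<>();
    private final Map<String, Long> endOffsets = new HashMap<>();

    boolean isEndOffsetRequested(String partition) {
        return endOffsetRequested.contains(partition);
    }

    // Called when a LIST_OFFSETS request for the partition's end offset is sent.
    void requestEndOffset(String partition) {
        endOffsetRequested.add(partition);
    }

    // Success path: record the end offset and clear the flag.
    void onListOffsetsSuccess(String partition, long endOffset) {
        endOffsets.put(partition, endOffset);
        endOffsetRequested.remove(partition);
    }

    // Failure path: clear the flag too; skipping this step leaves it set indefinitely.
    void onListOffsetsFailure(String partition) {
        endOffsetRequested.remove(partition);
    }

    Long endOffset(String partition) {
        return endOffsets.get(partition);
    }
}
```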


viktorsomogyi

@kirktrue I went through the code, debugged it locally, and ran the added tests; the PR looks fine to me. Since I'm not up to date with the consumer code, though, I'd suggest getting a second opinion too.


lianetm

Thanks for the fix, @kirktrue! An initial high-level comment regarding the changes related to the retry logic.

clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetsRequestManager.java

.../main/java/org/apache/kafka/clients/consumer/internals/events/ApplicationEventProcessor.java

lianetm

Some more comments. Thanks!

clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcherUtils.java

lianetm · 2026-02-12T20:32:39Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcher.java

                        remainingToSearch.keySet().retainAll(value.partitionsToRetry);

                        offsetFetcherUtils.updateSubscriptionState(value.fetchedOffsets, isolationLevel);
+                        offsetFetcherUtils.clearPartitionEndOffsetRequests(remainingToSearch.keySet());


Here we're clearing the flag for the partitions that didn't get offsets yet. I agree we need this if we don't have any time left to retry. But if there's still time, the do-while will try again. In that case, do we want to clear the flag here?

I would imagine we don't, because we'll continue retrying while there is time. It could be the case of missing leader info, for instance: we want to keep the flag on for those partitions, hit the client.awaitMetadataUpdate(timer) below, and try again in the next iteration of the do-while, right?

If so, could we take the timer into consideration here (i.e., clear the flag for the failed partitions only if the timer has expired)? Thoughts?
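The sketch below is a rough illustration of the timer-based idea in this comment. While time remains, flags stay set for partitions that still need a retry. They are cleared only once the budget is exhausted. Everything in it is hypothetical:

- `RetryingEndOffsetFetch` is an invented class.
- The `attempt` function simulates one LIST_OFFSETS pass and returns the partitions to retry.
- A fixed per-pass `backoffMs` stands in for a real timer.

The thread later settles on a different mechanism, so treat this only as an illustration of the suggestion.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch, not OffsetFetcher's real code: keep the "end offset requested"
// flag while the loop may still retry, and clear it for leftover partitions only
// once the time budget is used up.
class RetryingEndOffsetFetch {
    final Set<String> flagged = new HashSet<>();

    /**
     * @param timeoutMs total budget; 0 means exactly one pass
     * @param backoffMs simulated time consumed by each pass (must be positive)
     * @param attempt   one simulated LIST_OFFSETS pass; returns the partitions to retry
     * @return the partitions that were never resolved
     */
    Set<String> fetch(Set<String> partitions, long timeoutMs, long backoffMs,
                      Function<Set<String>, Set<String>> attempt) {
        if (backoffMs <= 0) {
            throw new IllegalArgumentException("backoffMs must be positive");
        }
        Set<String> remaining = new HashSet<>(partitions);
        flagged.addAll(remaining);
        long remainingMs = timeoutMs;
        while (true) {
            Set<String> retry = attempt.apply(remaining);
            // Partitions resolved in this pass no longer need the flag.
            for (String p : remaining) {
                if (!retry.contains(p)) {
                    flagged.remove(p);
                }
            }
            remaining.retainAll(retry);
            if (remaining.isEmpty()) {
                return remaining;
            }
            remainingMs -= backoffMs;
            if (remainingMs <= 0) {
                // Timer expired: no further pass will run, so clear the leftover flags.
                flagged.removeAll(remaining);
                return remaining;
            }
            // Time remains (e.g. a leader is unknown): keep the flags and try again.
        }
    }
}
```

With `timeoutMs = 0` the loop makes exactly one pass, which corresponds to the `currentLag()` case raised in the reply.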

> I agree we need this if we don't have any time left to retry. But if there's still time, the do-while will try again. In that case, do we want to clear the flag here?

That's precisely what happens in the currentLag() case, though. It's always using a timeout of 0, so there's never a second pass in that loop.

OK, we both agree we need it for currentLag/timer-expired. But the way it's called now, it applies to all cases, and that's my concern. Isn't this going to clear the flag also in the case where there is time left to retry and a partition has no known leader?

I've added an explicit parameter to 'clear end offsets requests' that only the ClassicKafkaConsumer.currentLag() sets to true. This should prevent other callers from clearing the flag, regardless of the timeout setting.

lianetm · 2026-02-12T20:33:04Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcher.java


                @Override
                public void onFailure(RuntimeException e) {
+                    offsetFetcherUtils.clearPartitionEndOffsetRequests(remainingToSearch.keySet());


same as above

clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java

clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java


lianetm · 2026-02-13T21:21:43Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcher.java


                        offsetFetcherUtils.updateSubscriptionState(value.fetchedOffsets, isolationLevel);
+
+                        if (isZeroTimestamp && shouldClearPartitionEndOffsets)


shouldn't we clear the flag if (isZeroTimestamp)?

I see that shouldClearPartitionEndOffsets is passed as true only from currentLag, but what about a call to consumer.endOffsets/beginningOffsets/offsetsForTimes with Duration.ZERO? How would the flag get cleared?

The flag is only set and only checked on the currentLag() path, so the other paths shouldn't need to worry about it.

It's possible for a user to pass in a zero timeout to offsetsForTimes(), et al., but we don't need to clear the flag in those cases.

ack, makes sense that we cannot consolidate on the time check because it depends on the caller.

But then, shouldn't we consolidate on shouldClearPartitionEndOffsets? Why do we need to check isZeroTimestamp here, instead of simply clearing if shouldClearPartitionEndOffsets is set?

In that method, if isZeroTimestamp is set to true, it's a synonym for 'only execute a single pass of the loop.' In the case where isZeroTimestamp is false, it's 'possibly execute the pass multiple times.'

We don't want to clear the partitionEndOffsetsRequested until we've finished all the passes of the loop we're going to make. So if shouldClearPartitionEndOffsets is true but isZeroTimestamp is false, clearing the partitionEndOffsetsRequested flag could be premature because there could be enough time on the timer for a second pass of the loop that finds the offsets for a partition that the first pass didn't.

I agree that it's confusing. I've tried a couple of different approaches, but they weren't much clearer 😦

> if shouldClearPartitionEndOffsets is true but isZeroTimestamp is false

This is never true, right? (shouldClear is only used for lag)

This is too twisty I'm afraid. A few points.

I wonder whether it might be better to call subscriptions.requestPartitionEndOffset in this method too. Then you are setting it and clearing it all together.

The flag isZeroTimestamp seems misnamed to me. isZeroTimeout surely.

I think that isZeroTimestamp (and the equivalent timer.timeoutMs() == 0L) has a couple of effects. First, it clears the partition end offset requests flag when the future completes. Second, it exits the loop midway through the first iteration without polling the client.

I would change the check if (timer.timeoutMs() == 0L) to use isZeroTimeout too.

I've refactored the currentLag() method in ClassicKafkaConsumer and OffsetFetcher so that the logic resides in the latter. Now OffsetFetcher.fetchOffsetsByTimes() can a) set and clear the partition end offset request flag with much closer locality, and b) revert to the original logic related to the timeout.

PTAL.
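The refactoring described above keeps the code that sets the flag next to the code that clears it. Below is a minimal sketch of that shape using `java.util.concurrent.CompletableFuture`. It assumes a hypothetical `send` function that issues the request and returns a future for the end offset. `LocalizedEndOffsetLookup` is illustrative only and is not the actual `OffsetFetcher.fetchOffsetsByTimes()`.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: the method that sets the "end offset requested" flag also
// registers the callback that clears it, so both transitions sit side by side.
class LocalizedEndOffsetLookup {
    final Set<String> requested = ConcurrentHashMap.newKeySet();
    final Map<String, Long> endOffsets = new ConcurrentHashMap<>();

    CompletableFuture<Long> requestEndOffset(String partition,
                                             Function<String, CompletableFuture<Long>> send) {
        requested.add(partition); // set the flag ...
        CompletableFuture<Long> response;
        try {
            response = send.apply(partition);
        } catch (RuntimeException e) {
            requested.remove(partition); // ... clear it if the request never went out
            throw e;
        }
        return response.whenComplete((offset, error) -> {
            if (error == null && offset != null) {
                endOffsets.put(partition, offset);
            }
            requested.remove(partition); // ... and clear it on success or failure
        });
    }
}
```

Clearing inside `whenComplete` covers the success and failure paths in one place. The `try`/`catch` handles a `send` that throws before any future exists.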

lianetm · 2026-02-16T16:48:52Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcher.java

            }
        } while (timer.notExpired());

+        if (shouldClearPartitionEndOffsets) {


Isn't this always going to be false here (and therefore unneeded)?
shouldClearPartitionEndOffsets is true only for currentLag, so the timeout is 0, which always returns early above, right?

You're right. I was trying to keep the logic generalized, but maybe that's only more confusing.

lianetm · 2026-02-16T16:54:14Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcher.java



AndrewJSchofield

Thanks for the continued effort on this one. A few comments.

AndrewJSchofield · 2026-02-18T11:07:13Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcher.java




lianetm

Thanks @kirktrue, nice refactoring! There's one important, easy-to-fix gap (plus test coverage for it), but other than that it seems almost there.

lianetm · 2026-02-19T11:23:01Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcher.java

                                                 Timer timer,
-                                                 boolean requireTimestamps) {
+                                                 boolean requireTimestamps,
+                                                 boolean shouldUpdatePartitionEndOffsets) {


This name is very confusing because it says it will update end offsets (the first thing that comes to mind is an actual change to positions, not a flag).

Would it help to rename it to mention that it updates a flag (maybe updatePartitionEndOffsetsFlag), or at least to add a description of the param?

I changed the variable name to updatePartitionEndOffsetsFlag.

lianetm · 2026-02-19T11:36:57Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcher.java

+        // we may get the answer; we do not need to wait for the return value
+        // since we would not try to poll the network client synchronously
+        if (lag == null) {
+            if (subscriptions.partitionEndOffset(topicPartition, isolationLevel) == null) {


aren't we missing the check here to ensure there is no request in-flight?

We should also make sure we have a test for this: two consecutive calls to currentLag at this level. The first should generate a request that gets no response, and the second should not generate another request. I would expect such a test to be failing now.

Added a unit test to catch multiple in-flight LIST_OFFSETS requests.
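The sketch below models the in-flight guard and the two-call test described above. It assumes lag lookups never block. If the end offset is unknown, the first call marks the partition as in flight and sends one request. Later calls send nothing until a response or a failure clears that mark. All of the following are hypothetical:

- the `NonBlockingLag` class;
- its `requestsSent` counter, which stands in for requests queued on the network client;
- the convention of returning `null` when the lag is unknown.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of a lag lookup that never blocks: it returns the lag if the
// end offset is known, otherwise it sends at most one request per partition.
class NonBlockingLag {
    private final Map<String, Long> endOffsets = new HashMap<>();
    private final Set<String> inFlight = new HashSet<>();
    int requestsSent = 0; // stands in for requests queued on the network client

    /** Returns the lag, or null if the end offset is not yet known. */
    Long currentLag(String partition, long position) {
        Long end = endOffsets.get(partition);
        if (end != null) {
            return end - position;
        }
        // Send only if no request for this partition is already outstanding.
        if (inFlight.add(partition)) {
            requestsSent++;
        }
        return null;
    }

    void onResponse(String partition, long endOffset) {
        endOffsets.put(partition, endOffset);
        inFlight.remove(partition);
    }

    // Clearing on failure lets a later call retry instead of waiting forever.
    void onFailure(String partition) {
        inFlight.remove(partition);
    }
}
```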

clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java


clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcher.java

lianetm

Thanks @kirktrue! LGTM.
I'll let @AndrewJSchofield take a look in case there are more comments before merging.

AndrewJSchofield

lgtm

kirktrue added 4 commits February 5, 2026 15:31

KAFKA-20131: SubscriptionState endOffsetRequested is left set when LIST_OFFSETS call fails

First pass at catching the case of failures. Still work to do to handle the fact that multiple responses are possible and thus we don't want to clear the flag prematurely.

8b20d7d

WIP

af7def2

Updates

e91784c

Clearing partition end offset request flag for new consumer

39bb381

github-actions bot added the consumer, clients, and triage (PRs from the community) labels Feb 11, 2026

kirktrue added 2 commits February 11, 2026 11:31

Comments and minor formatting

49b51b3

More comments

5cda4f8

frankvicky added the ci-approved label Feb 11, 2026

Catching more cases

047d3cf

kirktrue marked this pull request as ready for review February 12, 2026 00:29

viktorsomogyi self-requested a review February 12, 2026 15:26

viktorsomogyi approved these changes Feb 12, 2026


Merge branch 'trunk' into KAFKA-20131-clear-endOffsetRequested-on-LIST_OFFSETS-failures

d56e17e

lianetm reviewed Feb 12, 2026


clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetsRequestManager.java (outdated, resolved)

.../main/java/org/apache/kafka/clients/consumer/internals/events/ApplicationEventProcessor.java (outdated, resolved)

lianetm reviewed Feb 12, 2026


clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java (outdated, resolved)

lianetm reviewed Feb 12, 2026


clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java (outdated, resolved)

lianetm removed the triage (PRs from the community) label Feb 12, 2026

kirktrue added 2 commits February 12, 2026 21:13

Merge branch 'trunk' into KAFKA-20131-clear-endOffsetRequested-on-LIST_OFFSETS-failures

4d7073c

Fixed typo in log message and verifying requestPartitionEndOffset invocations in KafkaConsumerTest

15dbff5

kirktrue changed the title ~~KAFKA-20131: SubscriptionState endOffsetRequested remains permanently set if LIST_OFFSETS call fails~~ KAFKA-20131: ClassicKafkaConsumer does not clear endOffsetRequested flag on failed LIST_OFFSETS calls Feb 13, 2026

kirktrue added 5 commits February 13, 2026 12:30

Updates to remove changes to AsyncKafkaConsumer

a1a71f5

Updates to remove changes to AsyncKafkaConsumer

d393a59

More clean up for PR

820f91e

Removed unnecessary changes to OffsetFetcher

29e8810

Another minor refactor

aedb1b4

lianetm reviewed Feb 13, 2026


lianetm reviewed Feb 16, 2026


mjsax mentioned this pull request Feb 16, 2026

KAFKA-20131: Clear end-offset request flag after failed list-offsets #21489 (closed)

Removed checking for log message output

9ba4c96

AndrewJSchofield requested changes Feb 18, 2026


kirktrue added 7 commits February 18, 2026 11:44

Merge branch 'trunk' into KAFKA-20131-clear-endOffsetRequested-on-LIST_OFFSETS-failures

90aa258

Revising the logic around currentLag() in ClassicKafkaConsumer and OffsetFetcher

c5d5317

Reverting changes in ApplicationEventProcessor

6c72eb8

Reverted several changes to SubscriptionState

4d0727f

Moved both setting and clearing of partitionEndOffsets flag to fetchOffsetsByTimes

ee3399c

Minor clean up to reduce diff noise

ba79cf1

Added setPartitionEndOffsetRequests

9fd51f8

lianetm reviewed Feb 19, 2026


kirktrue added 3 commits February 19, 2026 10:45

Renamed shouldUpdatePartitionEndOffsets → updatePartitionEndOffsetsFlag and adding maybeSetPartitionEndOffsetRequest

5b13372

Adding testCurrentLagPreventsMultipleInFlightRequests test and minor refactoring

f3ead35

Minor refactoring of request count check to reformat and include message

6a0613c

lianetm reviewed Feb 19, 2026


clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcher.java (resolved)

lianetm approved these changes Feb 20, 2026


AndrewJSchofield approved these changes Feb 20, 2026


AndrewJSchofield merged commit abcbef6 into apache:trunk Feb 20, 2026
29 checks passed

