
KAFKA-14830: Illegal state error in transactional producer #17022


Open · kirktrue wants to merge 20 commits into trunk

Conversation

@kirktrue (Contributor) commented Aug 27, 2024

When the producer's transaction manager receives a notification that an
error has occurred during a transaction, it takes steps to abort the
transaction and reset its internal state.

Users have reported the following case where a producer experiences
timeouts while in a transaction:

  1. The TransactionManager (TM) starts with state READY and epoch
    set to 0
  2. A transaction (T1) begins and TM sets its internal state to
    IN_TRANSACTION
  3. Batches are created and sent off to their respective brokers
  4. A timeout threshold is hit
  5. T1 starts the abort process
    1. TM state is set to ABORTING_TRANSACTION
    2. The batches involved with T1 are marked as expired
    3. TM is reinitialized, bumping the epoch from 0 to 1 and
      setting its state to READY
  6. A moment later, in the Sender thread, one of the failed batches
    calls handleFailedBatch()
  7. handleFailedBatch() sets the TM state to ABORTABLE_ERROR, which
    is an invalid state transition from READY, hence the exception
    (a simplified model of this check is sketched just after this list)
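
For illustration, here is a minimal, self-contained model of the transition check that throws in step 7. This is a simplification for the write-up, not the real TransactionManager code; the enum and method below are hypothetical stand-ins:

```java
public class TransitionDemo {
    enum State { READY, IN_TRANSACTION, ABORTING_TRANSACTION, ABORTABLE_ERROR }

    // Simplified rule, per the description above: ABORTABLE_ERROR is not a
    // legal target once the manager has been reset to READY.
    static State transitionTo(State current, State target) {
        if (current == State.READY && target == State.ABORTABLE_ERROR)
            throw new IllegalStateException(
                "Invalid transition attempted from state READY to state ABORTABLE_ERROR");
        return target;
    }

    public static void main(String[] args) {
        State state = State.READY;                  // after the re-initialization in step 5.3
        transitionTo(state, State.ABORTABLE_ERROR); // the late call in step 7 throws here
    }
}
```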

This change compares the transaction manager's current epoch (1)
with the batch's epoch (0). If they're different, the batch is
considered "stale" and can be ignored (though a DEBUG message is
logged).
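
A minimal sketch of that guard, borrowing the names used in the diff excerpts later in this thread (`producerIdAndEpoch`, `maybeTransitionToErrorState`); the surrounding TransactionManager context is assumed, and the actual patch differs in ordering and detail:

```java
// Sketch only -- not the exact patch.
void handleFailedBatch(ProducerBatch batch, RuntimeException exception) {
    // The batch carries the producer id/epoch it was created under. If the
    // manager was re-initialized after the abort (epoch bumped from 0 to 1),
    // this late failure belongs to the old transaction and can be ignored.
    boolean isStaleBatch = batch.producerId() == producerIdAndEpoch.producerId
            && batch.producerEpoch() != producerIdAndEpoch.epoch;

    if (isStaleBatch) {
        log.debug("Ignoring stale batch {} created under epoch {}; current epoch is {}",
                batch, batch.producerEpoch(), producerIdAndEpoch.epoch);
        return;
    }

    // Otherwise the failure is current and the usual error transition applies.
    maybeTransitionToErrorState(exception);
}
```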

@jolshan (Member) commented Aug 28, 2024

I think the overall approach makes sense, but I would like to see some tests to confirm the issue is improved. If so, the logging could also give us more insight.

@github-actions github-actions bot added clients small Small PRs labels Oct 17, 2024
@kirktrue kirktrue added transactions Transactions and EOS ci-approved labels Oct 17, 2024
github-actions (bot) commented

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions github-actions bot added the stale Stale PRs label Jan 16, 2025
@kirktrue kirktrue removed the stale Stale PRs label Jan 16, 2025
@kirktrue (Contributor, Author)

Still needed, just lower priority 😞

@kirktrue (Contributor, Author)

cc @k-raina

@kirktrue kirktrue marked this pull request as ready for review March 31, 2025 20:47
@kirktrue (Contributor, Author) commented Mar 31, 2025

> I think the overall approach makes sense, but I would like to see some tests to confirm the issue is improved. If so, the logging could also give us more insight.

@jolshan—The unit tests mimic the use cases that were seen in the wild. What other test cases should we consider? Thanks!

@kirktrue kirktrue changed the title [WIP] KAFKA-14830: Illegal state error in transactional producer KAFKA-14830: Illegal state error in transactional producer Apr 7, 2025
@kirktrue kirktrue requested a review from jolshan April 7, 2025 18:48
maybeTransitionToErrorState(exception);
// Compare the batch with the current ProducerIdAndEpoch. If the producer IDs are the *same* but the epochs
// are *different*, consider the batch as stale.
boolean isStaleBatch = batch.producerId() == producerIdAndEpoch.producerId && batch.producerEpoch() != producerIdAndEpoch.epoch;
Member:

One minor thing here: if we have an epoch overflow, this check won't catch it. Maybe if the producer ID doesn't match, we should also ignore the batch? Not sure if that has unintended consequences. We could also tackle it as a follow-up, since without this fix we're in the same state as today.

kirktrue (Contributor, Author):

I think I'm missing something conceptually: when an epoch overflows, does it restart at 0, roll over to negative, or something else?

In the case of an overflow, wouldn't the batch's producer epoch value still differ from the producerIdAndEpoch’s value?

Member:

When we have epoch overflow, we get a new producer id and start epoch at 0.

Member:

So the producer ID will not be the same.

kirktrue (Contributor, Author):

So if either the producer ID or the epoch of the batch differs from that of the transaction manager, it's "stale"?

Suggested change
boolean isStaleBatch = batch.producerId() == producerIdAndEpoch.producerId && batch.producerEpoch() != producerIdAndEpoch.epoch;
boolean isStaleBatch = batch.producerId() != producerIdAndEpoch.producerId || batch.producerEpoch() != producerIdAndEpoch.epoch;
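
To see what the suggested `||` form buys, here is a small self-contained worked example; the producer id and epoch values are hypothetical, chosen to mimic an overflow (they are not from the PR):

```java
public class StaleBatchCheckDemo {
    public static void main(String[] args) {
        // A batch created just before an epoch overflow, after which a new
        // producer id is issued and the epoch restarts at 0 (per the thread above).
        long currentPid = 43L;  short currentEpoch = 0;
        long batchPid   = 42L;  short batchEpoch   = 32767;

        // Original form: requires matching producer ids, so it misses overflow.
        boolean oldForm = batchPid == currentPid && batchEpoch != currentEpoch;
        // Suggested form: a mismatch in either field marks the batch as stale.
        boolean newForm = batchPid != currentPid || batchEpoch != currentEpoch;

        System.out.println("old form flags stale: " + oldForm); // false
        System.out.println("new form flags stale: " + newForm); // true
    }
}
```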

Member:

Ok, so I suppose this was being checked somewhere else 😅 We can revert that.
Perhaps it is sufficient to still fail in the edge case where we have an epoch overflow for now, as it doesn't seem like we will make much progress on finding an alternative.

Member:

Is `batch.producerEpoch() != producerIdAndEpoch.epoch` ok, or should it be `batch.producerEpoch() < producerIdAndEpoch.epoch`?

On the other hand, we don't have a check here for `batch.producerEpoch() > producerIdAndEpoch.epoch`, which should never happen. Should we add such a check (or does it maybe exist somewhere else)? Or do we not care?

Member:

Receiving a future epoch would seem to imply something was really wrong.

kirktrue (Contributor, Author):

So it sounds like `batch.producerEpoch() < producerIdAndEpoch.epoch` is really more appropriate as a "staleness" check, and we should let the TransactionManager handle the other cases?

Member:

I think it is possible that these stale overflow cases may unfortunately require closing the producer, but I would prefer to look into that as a potential follow-up. It seems like it would be very rare.

@jolshan (Member) left a review:

thanks

@kirktrue (Contributor, Author)

@jolshan @mjsax: Anything else we need to check before we can merge? Thanks!

removeInFlightBatch(batch);

if (hasFatalError()) {
    log.debug("Ignoring batch {} with producer id {}, epoch {}, and sequence number {} " +
        "since the producer is already in fatal error state", batch, batch.producerId(),
        batch.producerEpoch(), batch.baseSequence(), exception);
    return;
} else if (isStaleBatch) {
    log.debug("Ignoring stale batch {} with producer id {}, epoch {}, and sequence number {} " +
@mjsax (Member):

Just wondering if this should be TRACE? It seems not quite important enough for DEBUG (I'm not worried about volume, as it should only happen rarely; just wondering about importance).

kirktrue (Contributor, Author):

That's a good question, @mjsax.

IMO, TRACE is only for developers. No organization will set logging to TRACE in production, at least not long enough to hit this issue and see the log. But honestly, the same is true for DEBUG 🤔

Can we leave it at DEBUG and let end users tell us to turn it down?

@mjsax (Member):

Works for me to leave it at DEBUG; I was just wondering. If it turns out to be a volume problem (which I don't expect), people can disable DEBUG for this class until we fix it.
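
For reference, suppressing this class's DEBUG output is a one-line logger override in most logging backends. A sketch for a log4j2-backed application (the logger name assumes the message is emitted from the TransactionManager class; adjust if it lives elsewhere):

```properties
# Hypothetical log4j2.properties override: keep DEBUG globally, but raise this
# one logger to INFO so the stale-batch message is suppressed.
logger.txn.name = org.apache.kafka.clients.producer.internals.TransactionManager
logger.txn.level = INFO
```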

@mjsax (Member) left a review:

Just two small follow-up questions for my own education.

@kirktrue (Contributor, Author)

@jolshan @mjsax: Another ping to check if we can merge... Thanks!

@jolshan (Member) commented Jun 17, 2025

Hey, thanks. I took a step back and I'm wondering if this is the right approach. Let me try to get back to you on this; I need to get a better idea of the whole flow. The discussion on the check made me realize I was missing something.

@jolshan (Member) commented Jun 17, 2025

Ok, I'm back. I think the thing that wasn't sitting right for me, which I realized from our discussion of the producer ID overflow, is whether this is the right place to make the change and whether it reflects the right mental model.

Specifically, we don't have a great way to distinguish benign stale requests from those that could indicate a divergence of state or some other real problem.

I'm wondering if we can get to the heart of this:

> 1. A timeout threshold is hit
> 2. T1 starts the abort process
>    1. TM state is set to ABORTING_TRANSACTION
>    2. The batches involved with T1 are marked as expired
>    3. TM is reinitialized, bumping the epoch from 0 to 1 and setting its state to READY
> 3. A moment later, in the Sender thread, one of the failed batches calls handleFailedBatch()

Do we know if there is any way to ensure that the exact in-flight requests at the time of the timeout are marked as stale, rather than just inferring this from the epoch in the batch? I think for some errors we close the connection; we may not want to do that here, but I'm thinking about addressing the problem at the root rather than adding small changes whose side effects we may not yet realize. (As we saw in the test failures from making the change, this part of the code may cause other issues.)
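
One possible shape of the alternative floated here, purely as an illustrative sketch (nothing below is from the PR; the class and method names are hypothetical): have the abort path record exactly which batches it expires, and have the failure path consult that set instead of inferring staleness from the epoch.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: only the exact in-flight batches expired by the
// timed-out transaction's abort are ignored, rather than anything that
// merely carries an older epoch.
class AbortExpiryTracker<B> {
    private final Set<B> expiredDuringAbort = ConcurrentHashMap.newKeySet();

    // Called from the abort path for each batch it expires (step 2.2 in the quote above).
    void markExpiredDuringAbort(B batch) {
        expiredDuringAbort.add(batch);
    }

    // Called from handleFailedBatch(); true means the failure stems from the
    // prior abort and can be dropped without any state transition.
    boolean shouldIgnoreFailure(B batch) {
        return expiredDuringAbort.remove(batch);
    }
}
```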
