KAFKA-14830: Illegal state error in transactional producer #17022

Open
wants to merge 20 commits into
base: trunk
Changes from 9 commits
Commits
7522421
WIP
kirktrue Aug 16, 2024
26c6eae
Updates to tests and a first attempt at "fixing" the problem
kirktrue Aug 28, 2024
4be2627
Clean up
kirktrue Aug 28, 2024
9cc5688
Removed temporary logging
kirktrue Aug 28, 2024
11d98a3
Pulled out isStaleBatch for clarity
kirktrue Aug 31, 2024
8f73d09
Merge branch 'trunk' into KAFKA-14830-prevent-illegal-state-transitions
kirktrue Oct 17, 2024
729f5dc
Merge branch 'trunk' into KAFKA-14830-prevent-illegal-state-transitions
kirktrue Jan 16, 2025
7d82a63
Merge branch 'trunk' into KAFKA-14830-prevent-illegal-state-transitions
kirktrue Mar 31, 2025
31e3a82
Merge branch 'apache:trunk' into KAFKA-14830-prevent-illegal-state-tr…
kirktrue Apr 2, 2025
569467f
Inlining isStaleBatch
kirktrue Apr 19, 2025
0567155
Update TransactionManager.java
kirktrue Apr 19, 2025
e2871f8
Merge branch 'trunk' into KAFKA-14830-prevent-illegal-state-transitions
kirktrue Apr 22, 2025
54097fa
Merge branch 'trunk' into KAFKA-14830-prevent-illegal-state-transitions
kirktrue May 16, 2025
ba1cc86
Merge branch 'trunk' into KAFKA-14830-prevent-illegal-state-transitions
kirktrue Jun 5, 2025
53ffde5
Compare producer IDs in batch and ProducerIdAndEpoch to determine sta…
kirktrue Jun 9, 2025
998787c
Update clients/src/main/java/org/apache/kafka/clients/producer/intern…
kirktrue Jun 9, 2025
a8763ac
Reflow of the comment
kirktrue Jun 9, 2025
1428d80
Reverted previous change; now only considering as stale if epochs are…
kirktrue Jun 10, 2025
22d04d6
Merge branch 'trunk' into KAFKA-14830-prevent-illegal-state-transitions
kirktrue Jun 12, 2025
da4da13
Merge branch 'trunk' into KAFKA-14830-prevent-illegal-state-transitions
kirktrue Jun 13, 2025
@@ -737,14 +737,21 @@ public synchronized void maybeTransitionToErrorState(RuntimeException exception)
     }

     synchronized void handleFailedBatch(ProducerBatch batch, RuntimeException exception, boolean adjustSequenceNumbers) {
-        maybeTransitionToErrorState(exception);
+        if (!isStaleBatch(batch) && !hasFatalError())
+            maybeTransitionToErrorState(exception);
+
         removeInFlightBatch(batch);

         if (hasFatalError()) {
             log.debug("Ignoring batch {} with producer id {}, epoch {}, and sequence number {} " +
                     "since the producer is already in fatal error state", batch, batch.producerId(),
                     batch.producerEpoch(), batch.baseSequence(), exception);
             return;
+        } else if (isStaleBatch(batch)) {
+            log.debug("Ignoring stale batch {} with producer id {}, epoch {}, and sequence number {} " +
Member

Just wondering whether this should be TRACE. It doesn't seem important enough for DEBUG (I'm not worried about volume, since it should happen only rarely; I'm just wondering about importance).

Contributor Author

That's a good question, @mjsax.

IMO, TRACE is only for developers. No organization will set logging to TRACE in production, at least not long enough to hit this issue and see the log. But honestly, the same is true for DEBUG 🤔

Can we leave it at DEBUG and let end users tell us to turn it down?

Member

Works for me to leave it at DEBUG. I was just wondering. If it turns out to be a volume problem (which I don't expect), people could disable DEBUG for this class until we fix it.

+                    "since the producer has been re-initialized with producer id {} and epoch {}", batch, batch.producerId(),
+                    batch.producerEpoch(), batch.baseSequence(), producerIdAndEpoch.producerId, producerIdAndEpoch.epoch, exception);
+            return;
         }

         if (exception instanceof OutOfOrderSequenceException && !isTransactional()) {
@@ -772,6 +779,14 @@ synchronized void handleFailedBatch(ProducerBatch batch, RuntimeException except
         }
     }

+    /**
+     * Returns {@code true} if the given {@link ProducerBatch} has the same producer ID but a different epoch than the
+     * {@link #producerIdAndEpoch cached producer ID and epoch}.
+     */
+    synchronized boolean isStaleBatch(ProducerBatch batch) {
+        return batch.producerId() == producerIdAndEpoch.producerId && batch.producerEpoch() != producerIdAndEpoch.epoch;
+    }
+
     synchronized boolean hasInflightBatches(TopicPartition topicPartition) {
         return txnPartitionMap.getOrCreate(topicPartition).hasInflightBatches();
     }
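
For readers who want to try the staleness rule outside the Kafka codebase, here is a minimal standalone sketch. It assumes nothing beyond the predicate and guard order visible in the diff above: a batch counts as stale only when its producer ID matches the cached one and its epoch differs, and a fatal error takes precedence over the stale check. The class `StaleBatchSketch`, the `Outcome` enum, and the `classify` method are hypothetical names invented for illustration; they are not part of `TransactionManager`.

```java
public class StaleBatchSketch {

    /** Hypothetical outcomes of handling a failed batch in this simplified model. */
    public enum Outcome { IGNORED_FATAL, IGNORED_STALE, HANDLED }

    // Same predicate as isStaleBatch in the diff: same producer ID, different epoch.
    public static boolean isStale(long batchProducerId, short batchEpoch,
                                  long currentProducerId, short currentEpoch) {
        return batchProducerId == currentProducerId && batchEpoch != currentEpoch;
    }

    // Mirrors the order of the early returns in handleFailedBatch:
    // a fatal error is checked first, then staleness, then normal handling.
    public static Outcome classify(long batchProducerId, short batchEpoch,
                                   long currentProducerId, short currentEpoch,
                                   boolean hasFatalError) {
        if (hasFatalError)
            return Outcome.IGNORED_FATAL;
        if (isStale(batchProducerId, batchEpoch, currentProducerId, currentEpoch))
            return Outcome.IGNORED_STALE;
        return Outcome.HANDLED;
    }

    public static void main(String[] args) {
        long producerId = 1000L;   // illustrative values only
        short oldEpoch = 5;
        short bumpedEpoch = 6;

        // Batch sent before the epoch bump, failing after it: stale.
        System.out.println(classify(producerId, oldEpoch, producerId, bumpedEpoch, false));
        // Batch from the current epoch: handled normally.
        System.out.println(classify(producerId, bumpedEpoch, producerId, bumpedEpoch, false));
        // Different producer ID: not treated as stale by this predicate.
        System.out.println(classify(999L, oldEpoch, producerId, bumpedEpoch, false));
    }
}
```

Note that a batch carrying a different producer ID is deliberately not classified as stale here, which matches the predicate shown in the diff.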
@@ -28,6 +28,7 @@
 import org.apache.kafka.common.Node;
 import org.apache.kafka.common.TopicPartition;
 import org.apache.kafka.common.compress.Compression;
+import org.apache.kafka.common.errors.DisconnectException;
 import org.apache.kafka.common.errors.FencedInstanceIdException;
 import org.apache.kafka.common.errors.GroupAuthorizationException;
 import org.apache.kafka.common.errors.InvalidTxnStateException;
@@ -3948,6 +3949,44 @@ public void testTransactionAbortableExceptionInAddOffsetsToTxn() {
         assertAbortableError(TransactionAbortableException.class);
     }

+    @Test
+    public void testBatchesReceivedAfterAbortableError() {
+        doInitTransactions();
+        transactionManager.beginTransaction();
+
+        ProducerBatch batch = writeIdempotentBatchWithValue(transactionManager, tp1, "first");
+
+        // The producer's connection to the broker is tenuous, so this mimics the catch block for ApiException in
+        // KafkaProducer.doSend().
+        transactionManager.maybeTransitionToErrorState(new DisconnectException("test"));
+
+        // The above error is bubbled up to the user who then aborts the transaction...
+        TransactionalRequestResult result = transactionManager.beginAbort();
+
+        // The transaction manager handles the abort internally and re-initializes the epoch
+        short bumpedEpoch = epoch + 1;
+        prepareInitPidResponse(Errors.NONE, false, producerId, bumpedEpoch);
+        runUntil(result::isCompleted);
+
+        // This mimics a slower produce response that receives the timeout on the client after the above rollback
+        // has completed. The failed batch should not attempt to change the state since it's stale.
+        transactionManager.handleFailedBatch(batch, new TimeoutException(), false);
+    }
+
+    @Test
+    public void testBatchesReceivedAfterFatalError() {
+        doInitTransactions();
+        transactionManager.beginTransaction();
+
+        ProducerBatch batch = writeIdempotentBatchWithValue(transactionManager, tp1, "first");
+
+        // This mimics something that causes the transaction manager to enter its FATAL_ERROR state.
+        transactionManager.transitionToFatalError(Errors.PRODUCER_FENCED.exception());
+
+        // However, even with this failure, the failed batch should not attempt to update to ABORTABLE_ERROR.
+        transactionManager.handleFailedBatch(batch, new TimeoutException(), false);
+    }
+
     @Test
     public void testTransactionAbortableExceptionInTxnOffsetCommit() {
         final TopicPartition tp = new TopicPartition("foo", 0);