KAFKA-14830: Illegal state error in transactional producer #17022
Conversation
I think the overall approach makes sense, but I would like to see some tests showing that the issue is improved. If so, the logging could also give us some more insight.
This PR is being marked as stale since it has not had any activity in 90 days. If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
Still needed, just lower priority 😞
cc @k-raina |
@jolshan: The unit tests mimic the use cases that were seen in the wild. What other test cases should we consider? Thanks!
```java
maybeTransitionToErrorState(exception);
// Compare the batch with the current ProducerIdAndEpoch. If the producer IDs are the *same* but the epochs
// are *different*, consider the batch as stale.
boolean isStaleBatch = batch.producerId() == producerIdAndEpoch.producerId && batch.producerEpoch() != producerIdAndEpoch.epoch;
```
One minor thing here: if we have an epoch overflow, this check won't cover it. Maybe if the producer ID doesn't match, we also ignore the batch? Not sure if that has unintended consequences. We could also tackle it as a follow-up, since without this fix, it's the same state as today.
I think I'm missing something conceptually: when an epoch overflows, does it restart at 0, roll over to negative, or something else?

In the case of an overflow, wouldn't the batch's producer epoch value still differ from the `producerIdAndEpoch`'s value?
When we have epoch overflow, we get a new producer id and start epoch at 0.
So the producer ID will not be the same.
So if either the producer ID or the epoch of the batch is different than that of the transaction manager, it's "stale"?
```diff
- boolean isStaleBatch = batch.producerId() == producerIdAndEpoch.producerId && batch.producerEpoch() != producerIdAndEpoch.epoch;
+ boolean isStaleBatch = batch.producerId() != producerIdAndEpoch.producerId || batch.producerEpoch() != producerIdAndEpoch.epoch;
```
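For illustration only, here is a minimal standalone sketch of how the suggested check also covers the epoch-overflow case, where the producer gets a new ID and the epoch restarts at 0. The class, the `isStaleBatch` helper, and the hard-coded IDs/epochs are hypothetical, not the actual `TransactionManager` code:

```java
// Hypothetical standalone sketch; not the real TransactionManager API.
public class StaleBatchCheckSketch {
    // Suggested check: a batch is stale if EITHER the producer ID
    // or the epoch differs from the transaction manager's current values.
    static boolean isStaleBatch(long batchProducerId, short batchEpoch,
                                long currentProducerId, short currentEpoch) {
        return batchProducerId != currentProducerId || batchEpoch != currentEpoch;
    }

    public static void main(String[] args) {
        // Normal epoch bump: same producer ID, epoch 0 -> 1.
        System.out.println(isStaleBatch(42L, (short) 0, 42L, (short) 1)); // true

        // Epoch overflow: the producer gets a NEW ID and the epoch restarts
        // at 0. The original (ID-equality) check would miss this batch;
        // the suggested one catches it.
        System.out.println(isStaleBatch(42L, Short.MAX_VALUE, 43L, (short) 0)); // true

        // In-flight batch from the current producer ID and epoch: not stale.
        System.out.println(isStaleBatch(43L, (short) 0, 43L, (short) 0)); // false
    }
}
```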
Ok. So I suppose this was being checked somewhere else 😅 I suppose we can revert that.

Perhaps it is sufficient to still fail in the edge case where we have epoch overflow for now, as it doesn't seem like we will make much progress on finding an alternative.
Is `batch.producerEpoch() != producerIdAndEpoch.epoch` ok? Or should it be `batch.producerEpoch() < producerIdAndEpoch.epoch`?

On the other hand, we don't have a check here for `batch.producerEpoch() > producerIdAndEpoch.epoch`, which should never happen. Should we add such a check (or does it maybe exist somewhere else)? Or do we not care?
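To make the question concrete, a small hypothetical sketch (not Kafka code; the class and method names are invented) comparing the two candidate staleness conditions on the batch epoch. The two only disagree when the batch carries a *future* epoch:

```java
// Hypothetical sketch comparing the two candidate staleness conditions.
public class EpochComparisonSketch {
    // Check as written in the PR: any epoch mismatch counts as stale.
    static boolean staleByNotEqual(short batchEpoch, short currentEpoch) {
        return batchEpoch != currentEpoch;
    }

    // Stricter reading: only an OLDER batch epoch counts as stale.
    static boolean staleByLessThan(short batchEpoch, short currentEpoch) {
        return batchEpoch < currentEpoch;
    }

    public static void main(String[] args) {
        short currentEpoch = 5; // hypothetical current epoch
        for (short batchEpoch : new short[] {4, 5, 6}) {
            System.out.printf("batch epoch %d: != -> %b, < -> %b%n",
                    batchEpoch,
                    staleByNotEqual(batchEpoch, currentEpoch),
                    staleByLessThan(batchEpoch, currentEpoch));
        }
        // The two checks differ only for batchEpoch > currentEpoch (a
        // "future" epoch), which should never happen in practice.
    }
}
```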
Receiving a future epoch would seem to imply something was really wrong.
So it sounds like `batch.producerEpoch() < producerIdAndEpoch.epoch` is really more appropriate as a "staleness" check, and we let the `TransactionManager` handle the other cases?
I think it is possible that these stale overflow cases unfortunately may lead to having to close the producer, but I would prefer to look into that as a potential follow-up. It seems like it would be very rare.
…als/TransactionManager.java
thanks
```java
removeInFlightBatch(batch);

if (hasFatalError()) {
    log.debug("Ignoring batch {} with producer id {}, epoch {}, and sequence number {} " +
        "since the producer is already in fatal error state", batch, batch.producerId(),
        batch.producerEpoch(), batch.baseSequence(), exception);
    return;
} else if (isStaleBatch) {
    log.debug("Ignoring stale batch {} with producer id {}, epoch {}, and sequence number {} " +
```
Just wondering if this should be TRACE? It seems not important enough for DEBUG (not worried about volume, as it should only happen rarely -- just wondering about importance).
That's a good question, @mjsax.

IMO, `TRACE` is only for developers. No organization will set logging to `TRACE` in production, at least not long enough to hit this issue and see the log. But honestly, the same is true for `DEBUG` 🤔

Can we leave it at `DEBUG` and let end users tell us to turn it down?
Works for me to leave it at DEBUG -- I was just wondering. If it's a volume problem (which I don't expect), people could disable DEBUG for this class until we fix it.
Just two small follow-up questions for my own education.
Hey, thanks. I took a step back and I'm wondering if this is the right approach. Let me try to get back to you on this -- I need to get a better idea of the whole flow. The discussions on the check made me realize I was missing something.
Ok -- I'm back. I think the thing that wasn't sitting right for me, and what I realized from our discussion of the producer ID overflow, is whether this is the right place to make the change and the right mental model. Specifically, we don't have a great way to distinguish benign stale requests from those that could indicate a divergence of state or another real problem. I'm wondering if we can get to the heart of this:

Do we know if there is any way we can ensure the exact in-flight requests during the timeout are marked as stale, versus just assuming this from the epoch in the batch? I think for some errors we close the connection -- we may not want to do that here, but I'm thinking about addressing the problem at the root and not adding small changes whose side effects we may not realize yet. (As we saw in the test failures from making the change, the current part of the code may cause other issues.)
When the producer's transaction manager receives a notification that an
error has occurred during a transaction, it takes steps to abort the
transaction and reset its internal state.
Users have reported the following case where a producer experiences
timeouts while in a transaction:
1. `TransactionManager` (TM) starts with state `READY` and epoch set to `0`
2. TM transitions to `IN_TRANSACTION`
3. TM transitions to `ABORTING_TRANSACTION`, bumping the epoch from `0` to `1` and setting its state to `READY`
4. On the `Sender` thread, one of the failed batches calls `handleFailedBatch()`
5. `handleFailedBatch()` sets the TM state to `ABORTABLE_ERROR`, which is an invalid state transition from `READY`, hence the exception

This change compares the transaction manager's current epoch (`1`) with the batch's epoch (`0`). If they're different, the batch is considered "stale" and can be ignored (though a `DEBUG` message is logged).