KAFKA-20114: Fix producer ID retry backoff race #22204

Open
chickenchickenlove wants to merge 1 commit into apache:trunk from chickenchickenlove:KAFKA-20114

@chickenchickenlove
Contributor

Description

In RPCProducerIDManager, there was a race condition between maybeRequestNextBlock() and handleUnsuccessfulResponse(), which may be called by different threads. This race condition could lead to a premature retry. To fix it, this patch reorders the operations in maybeRequestNextBlock().
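The description does not spell out the exact reordering, so the following is only a minimal sketch of one interleaving that would produce a premature retry. It assumes the pre-patch code cleared the retry backoff after sending the request and that the patch clears it before sending. The names backoffDeadlineMs, NO_RETRY, RETRY_BACKOFF_MS, nowMs, and both request* methods are hypothetical; only requestInFlight and handleUnsuccessfulResponse() come from the description. The failure response is simulated as arriving before the send call returns, so the sketch stays deterministic.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

class BackoffRaceSketch {
    static final long NO_RETRY = -1L;          // hypothetical sentinel: no backoff pending
    static final long RETRY_BACKOFF_MS = 50L;  // hypothetical backoff length

    final AtomicBoolean requestInFlight = new AtomicBoolean(false);
    final AtomicLong backoffDeadlineMs = new AtomicLong(NO_RETRY); // hypothetical field
    final long nowMs = 1_000L;                 // fake clock so the sketch is deterministic

    // Response-handler side: record the backoff, then release the in-flight flag.
    void handleUnsuccessfulResponse() {
        backoffDeadlineMs.set(nowMs + RETRY_BACKOFF_MS);
        requestInFlight.set(false);
    }

    // Assumed pre-patch ordering: send first, clear the backoff afterwards.
    void requestSendThenReset(Runnable send) {
        if (requestInFlight.compareAndSet(false, true)) {
            send.run();
            // May overwrite a deadline the response handler has just set.
            backoffDeadlineMs.set(NO_RETRY);
        }
    }

    // Assumed post-patch ordering: clear the backoff before the request can complete.
    void requestResetThenSend(Runnable send) {
        if (requestInFlight.compareAndSet(false, true)) {
            backoffDeadlineMs.set(NO_RETRY);
            send.run();
        }
    }

    // Runs one request whose failure response is handled before send returns.
    static long deadlineAfter(boolean resetBeforeSend) {
        BackoffRaceSketch m = new BackoffRaceSketch();
        Runnable failingSend = m::handleUnsuccessfulResponse;
        if (resetBeforeSend) {
            m.requestResetThenSend(failingSend);
        } else {
            m.requestSendThenReset(failingSend);
        }
        return m.backoffDeadlineMs.get();
    }

    public static void main(String[] args) {
        System.out.println("send-then-reset deadline: " + deadlineAfter(false));
        System.out.println("reset-then-send deadline: " + deadlineAfter(true));
    }
}
```

In this sketch the send-then-reset ordering ends with the deadline back at NO_RETRY, so the next caller would retry immediately, while the reset-then-send ordering keeps the deadline the handler recorded.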

Considerations:

  • It is difficult to add a unit test for this diff because the race condition cannot be controlled deterministically without relying on scheduling. Instead of adding a unit test, I added comments to the paired methods to clarify the intended ordering and concurrency assumptions.
  • Race condition between if (nextProducerIdBlock.get() != null) and if (!requestInFlight.compareAndSet(false, true))
    • After the if (nextProducerIdBlock.get() != null) check, another thread may set nextProducerIdBlock. However, this does not cause a premature retry: in sanityCheckResponse(...), that thread only sets nextProducerIdBlock and does not update requestInFlight. Therefore, the CAS in maybeRequestNextBlock() will fail, and a premature retry will not occur.
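The second point can be illustrated with a small sketch, assuming a request is already outstanding (requestInFlight is true) when the caller passes the null check. The class CasGuardSketch, the method wouldSend, and the String stand-in for the block type are hypothetical; the Runnable stands in for another thread running in the gap between the two checks.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

class CasGuardSketch {
    // String is a stand-in for the real block type.
    final AtomicReference<String> nextProducerIdBlock = new AtomicReference<>(null);
    // Assumption: a request is already outstanding when the caller arrives.
    final AtomicBoolean requestInFlight = new AtomicBoolean(true);

    // Returns true only if this caller would go on to send a new request.
    boolean wouldSend(Runnable betweenCheckAndCas) {
        if (nextProducerIdBlock.get() != null) {
            return false; // a block is already prefetched
        }
        betweenCheckAndCas.run(); // another thread acts after the null check
        return requestInFlight.compareAndSet(false, true);
    }

    static boolean demo() {
        CasGuardSketch m = new CasGuardSketch();
        // The response thread stores the block but leaves requestInFlight untouched.
        return m.wouldSend(() -> m.nextProducerIdBlock.set("block[1000..1999]"));
    }

    public static void main(String[] args) {
        System.out.println("caller sends a second request: " + demo());
    }
}
```

Because requestInFlight stays true in this scenario, compareAndSet(false, true) fails and the caller does not send a second request.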

@github-actions (bot) added the triage (PRs from the community), transactions (Transactions and EOS), and small (Small PRs) labels on May 4, 2026
@chickenchickenlove
Contributor Author

Hi, @squah-confluent !
I've opened a PR to fix the issue we discussed earlier.
When you get a chance, could you please take a look? 🙇‍♂️

Labels

small (Small PRs), transactions (Transactions and EOS), triage (PRs from the community)
