Skip to content

KAFKA-17871: avoid blocking the herder thread when producer flushing hangs #18142

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: trunk
Choose a base branch
from

Conversation

davide-armand
Copy link

The call to backingStore.get() (called by connector task threads through OffsetStorageReaderImpl.offsets()) can block for long time waiting for data flush to complete (KafkaProducer.flush()).

This change moves that call outside the synchronized clause that holds offsetReadFutures, so that if backingStore.get() hangs then it does not keep offsetReadFutures locked. The access to closed flag (closed.get()) is kept inside the synchronize clause to avoid race condition with close().

This is important because OffsetStorageReaderImpl.close() needs to lock offsetReadFutures as well in order to cancel the futures.
Since the herder thread calls OffsetStorageReaderImpl.close() when attempting to stops a task, before this change this was resulting in the herder thread hanging indefinetely waiting for backingStore.get() to complete.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@jlprat
Copy link
Contributor

jlprat commented Dec 12, 2024

Hi @davide-armand
CI is complaining about check style violations. Could you take a look at it?
Thanks

@davide-armand davide-armand force-pushed the KAFKA-17871-avoid-blocking-herder-thread branch from 1563874 to fd1bddc Compare December 13, 2024 08:19
@davide-armand
Copy link
Author

The CI is now green.
The additional commit that fixed the CI should be squashed before merging.

Copy link

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

Copy link

This PR is being marked as stale since it has not had any activity in 90 days. If you
would like to keep this PR alive, please leave a comment asking for a review. If the PR has
merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions github-actions bot added stale Stale PRs and removed needs-attention stale Stale PRs labels Mar 19, 2025
Copy link

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

Copy link
Contributor

@gharris1727 gharris1727 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @davide-armand Thanks for the fix, and thanks for the test!

Are you able to remove the Thread.sleep calls from the test? These could introduce some flakiness later if run on slow machines. Additionally I think this test leaks threads through the executors. Could you ensure that the executors are closed, and the threads terminate?

@github-actions github-actions bot removed needs-attention triage PRs from the community labels Mar 28, 2025
…hangs

The call to `backingStore.get()` (called by connector task threads through
`OffsetStorageReaderImpl.offsets()`) can block for long time waiting for data flush to complete
(`KafkaProducer.flush()`).

This change moves that call outside the synchronized clause that holds `offsetReadFutures`,
so that if `backingStore.get()` hangs then it does not keep `offsetReadFutures` locked.
The access to `closed` flag (`closed.get()`) is kept inside the synchronize clause to avoid race
condition with `close()`.

This is important because `OffsetStorageReaderImpl.close()` needs to lock `offsetReadFutures` as
well in order to cancel the futures.
Since the herder thread calls `OffsetStorageReaderImpl.close()` when attempting to stops a task,
before this change this was resulting in the herder thread hanging indefinetely waiting for
`backingStore.get()` to complete.
@davide-armand davide-armand force-pushed the KAFKA-17871-avoid-blocking-herder-thread branch from 6d22bc4 to 07f465e Compare April 1, 2025 05:42
@davide-armand davide-armand force-pushed the KAFKA-17871-avoid-blocking-herder-thread branch from 07f465e to b90bf7a Compare April 1, 2025 21:07
@davide-armand
Copy link
Author

Hi @davide-armand Thanks for the fix, and thanks for the test!

Are you able to remove the Thread.sleep calls from the test? These could introduce some flakiness later if run on slow machines. Additionally I think this test leaks threads through the executors. Could you ensure that the executors are closed, and the threads terminate?

@gharris1727 Pushed a fix in a separate commit, will squash before merging.

I cannot get a green build, if I'm not mistaken it fails on unrelated tests.
Should I keep restarting the CI hoping for a green build?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants