Fix producers reconnection deadlock #394
- Added a new test to validate reconnections on partition close events and to catch the deadlock
- Updated `maybeCleanProducers` to remove unnecessary mutex locks to prevent deadlock
…f direct access to struct field
Thank you @yurahaid |
I am doing some testing, and the patch makes sense to me. At the moment, I don't have time. Hope to do it in the future. I will continue with some tests. |
@yurahaid @hiimjako I am still convinced about the PR, but: executed via
@hiimjako @Gsantomaggio |
I see two possible solutions for this
Considering the current code in the library, it seems that the simpler and more reliable solution is to use the first option with copying. @hiimjako @Gsantomaggio What do you think? |
Thank you @yurahaid I would vote for |
I changed it to Also, I changed the signatures of the |
Thanks @yurahaid
yes, Coordinator should be only internal.
Not necessarily. I doubt that someone uses the |
It's mergeable for me now that it runs locally without issues :)
A potential deadlock was identified in the function when handling changes in stream metadata. The issue occurs when the stream members are updated (e.g., when replicas are disabled or modified). The deadlock caused execution to hang indefinitely, leaving the publisher disconnected.
How the Issue Was Discovered:
To reproduce the issue:
The test `should reconnect to the same partition after a close event` was introduced.
Impact of the Issue:
When did it happen?
How did the application behave?
Severity Level:
Fix Implemented:
maybeCleanProducers
Step-by-step description of the problem
- `maybeCleanConsumers` locks the client mutex (pkg/stream/environment.go:513)
- `c.coordinator.RemoveConsumerById` triggers a signal to the producer's `closeHandler` function (pkg/stream/producer.go:641)
- `closeHandler` triggers a `partitionCloseEvent` that the client can handle and use to try to reconnect (as in the test)
- `event.Context.ConnectPartition` triggers the environmentCoordinator `mutex.Lock`