[fix][client] Fix client redeliver epoch bigger than broker consumer epoch #20032
The PR had no activity for 30 days; marked with the Stale label.
…r-epoch-consume-stuck-problem
pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java
LGTM
Closing and reopening to trigger CI.
…r-epoch-consume-stuck-problem
Please check the question about updating the consumer epoch
...va/org/apache/pulsar/broker/service/persistent/PersistentDispatcherSingleActiveConsumer.java
pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java
Thanks for the explanations. Now I see the problem. It seems that there's a broader possibility for race conditions in all redeliverUnacknowledgedMessages methods. I think that this needs a different approach for fixing this and testing this issue.
There are multiple chances for the race to happen. I'll think of possible solutions.
The problem in the test case added in this PR is that it's a synthetic test where the internal state is modified. Instead of doing that, it would be better to have a way to introduce the race condition by having a way to inject a delay in the client side connection logic so that the race condition actually happens. I'll follow up with more details, possibly after experimenting on this.
@congbobo184 To me it seems that the problem could be prevented by changing this line: pulsar/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java, line 898 in 8eeb0e2.
I'm assuming that passing DEFAULT_CONSUMER_EPOCH instead of CONSUMER_EPOCH.get(this) would solve the issue. It might also require changes on the broker side so that DEFAULT_CONSUMER_EPOCH always triggers redelivery. The client side already seems to always accept DEFAULT_CONSUMER_EPOCH.
Do you agree that this would solve the problem?
DEFAULT_CONSUMER_EPOCH behaves the same as CONSUMER_EPOCH.get(this) here; we still need to solve the race condition between reconnect and redeliver. Using DEFAULT_CONSUMER_EPOCH seems to make the problem more complicated.
Ok, I can now see that the synchronized block is there to prevent the race in increasing the epoch. I hope that we'd have a proper test for the race by injecting a delay to have a real race in the test. There's also another issue with the permits.
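To illustrate the role of the synchronized block, here is a minimal sketch of the idea as I understand it from this thread. It is not the actual ConsumerImpl code: `EpochSketch`, the `sent` list, and the string-encoded commands are hypothetical stand-ins, and `consumerEpoch` only plays the part of CONSUMER_EPOCH. The point is that the reconnect path (read epoch, send subscribe) and the redeliver path (increase epoch, send redeliver) share one lock, so a command can never go out carrying an epoch older than one that was already sent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names are taken from the Pulsar code base.
class EpochSketch {
    private long consumerEpoch = 0;                       // stands in for CONSUMER_EPOCH
    private final List<String> sent = new ArrayList<>();  // commands in the order they were "sent"

    // Reconnect path: read the epoch and send subscribe under the same lock.
    void reconnect() {
        synchronized (this) {
            sent.add("SUBSCRIBE epoch=" + consumerEpoch);
        }
    }

    // Redeliver path: increase the epoch and send redeliver under the same lock.
    void redeliverUnacknowledgedMessages() {
        synchronized (this) {
            consumerEpoch++;
            sent.add("REDELIVER epoch=" + consumerEpoch);
        }
    }

    // Snapshot of the commands sent so far.
    synchronized List<String> sentCommands() {
        return new ArrayList<>(sent);
    }

    synchronized long epoch() {
        return consumerEpoch;
    }
}
```

Because both paths hold the same monitor, the epochs recorded in `sent` never decrease, and the last command sent always carries the client's current epoch, regardless of how the two paths interleave.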
…r-epoch-consume-stuck-problem
Could you please find a similar test for me? I'm not very good at writing it.
Yes, you are right, but this is another problem. I don't think this PR needs to handle this situation.
I agree that it's partially a different problem. However, since this PR changes the behavior around it, I think that it would make sense to address the permit issue in this PR. Solving the issue will require a few lines of code.
I don't have a good example in mind. It might require a broader effort in introducing good ways to test race conditions in the client code. Optimally, the implementation would contain injection points for adding such delays.
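As a rough idea of what such an injection point could look like, here is a minimal sketch. Everything in it is hypothetical: `RacyClientSketch`, the `afterEpochRead` hook, and the string-encoded commands do not exist in the Pulsar client. The class deliberately omits any locking, so that a test can use the hook to run a redeliver between "read epoch" and "send subscribe" and reproduce the stale-epoch ordering without modifying internal state directly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client with a test-only injection point; not Pulsar code.
class RacyClientSketch {
    long consumerEpoch = 0;                       // stands in for CONSUMER_EPOCH
    final List<String> sent = new ArrayList<>();  // commands in the order they were "sent"

    // Injection point: a no-op by default; a test can install a delay
    // or any other action that should run at this exact step.
    Runnable afterEpochRead = () -> { };

    // Unsynchronized reconnect path, split into steps so the hook can interleave.
    void reconnect() {
        long epoch = consumerEpoch;               // 1. read the epoch
        afterEpochRead.run();                     // 2. injection point
        sent.add("SUBSCRIBE epoch=" + epoch);     // 3. send subscribe, possibly with a stale epoch
    }

    void redeliverUnacknowledgedMessages() {
        consumerEpoch++;
        sent.add("REDELIVER epoch=" + consumerEpoch);
    }

    // Usage: force a redeliver to happen in the middle of a reconnect.
    static RacyClientSketch reproduceRace() {
        RacyClientSketch client = new RacyClientSketch();
        client.afterEpochRead = client::redeliverUnacknowledgedMessages;
        client.reconnect();
        return client;
    }
}
```

After `reproduceRace()`, the redeliver command (epoch 1) has been sent before a subscribe that still carries epoch 0, while the client itself is at epoch 1. A real test would more likely install a latch or a sleep in the hook and run the two paths on separate threads, but the hook is the part that makes the interleaving controllable.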
I looked at the code again: "increaseAvailablePermits" is only relevant to "incomingMessages", so there is no need to care whether the redeliver command succeeds or fails.
…r-epoch-consume-stuck-problem
Master Issue:
Fixes the case where the client's redeliver epoch is bigger than the broker's consumer epoch.
The redeliver method currently has the race condition described above:
Motivation
Fix this issue.
Modifications
Verifying this change
Add a test for it.
Documentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
PR in forked repository:
congbobo184#16