[fix][client] Fix client redeliver epoch bigger than broker consumer epoch #20032

congbobo184 · 2023-04-06T15:07:30Z

Master Issue:
Fixes the case where the client's redeliver epoch is greater than the broker's consumer epoch.
The redeliver method currently has the following race condition:

The consumer reconnects to the broker with epoch (1).
The client consumer invokes the redeliver command with epoch (2), finds that the consumer is not connected, and ignores the redeliver command.
As a result, the broker sends messages with epoch (1), and the client filters these messages out.
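
The sequence above can be replayed in a small simulation. This is only a sketch and not Pulsar code: `EpochRaceSketch`, `clientAccepts`, and `replayRace` are hypothetical names, and the filter rule (the client drops any message stamped with an epoch lower than its own) is an assumption taken from the description.

```java
final class EpochRaceSketch {
    /** Assumed client-side rule: drop messages stamped with an older epoch. */
    static boolean clientAccepts(long messageEpoch, long clientEpoch) {
        return messageEpoch >= clientEpoch;
    }

    /**
     * Replays the three steps of the race and returns how many of the
     * dispatched messages the client accepts.
     *
     * @param sendWhileReconnecting true models the proposed behavior, where the
     *                              redeliver command still reaches the broker
     */
    static int replayRace(boolean sendWhileReconnecting, int dispatched) {
        long clientEpoch = 1;
        long brokerEpoch = 1;              // step 1: reconnect carried epoch 1
        boolean consumerConnected = false; // client state is not "connected" yet

        // Step 2: redeliver bumps the client epoch to 2.
        clientEpoch++;
        if (consumerConnected || sendWhileReconnecting) {
            // The broker learns the new epoch only if the command is sent.
            brokerEpoch = Math.max(brokerEpoch, clientEpoch);
        }

        // Step 3: the broker dispatches messages stamped with its own epoch.
        int accepted = 0;
        for (int i = 0; i < dispatched; i++) {
            if (clientAccepts(brokerEpoch, clientEpoch)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Command ignored: broker stays at epoch 1, every message is filtered.
        System.out.println("ignored: accepted " + replayRace(false, 5) + " of 5");
        // Command sent: broker moves to epoch 2, every message is accepted.
        System.out.println("sent: accepted " + replayRace(true, 5) + " of 5");
    }
}
```

In this model the consumer appears stuck in the first case because none of the dispatched messages pass the filter.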

Motivation

Fix the race condition described above.

Modifications

When the client sends the redeliver command, it no longer checks the consumer state; it only checks whether the cnx has been set on the client.
The broker processes the redeliver command once the consumer future completes. If the future completes exceptionally, the command does not need to be handled, because the consumer will reconnect with the epoch.
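
The broker-side modification can be sketched with a plain `CompletableFuture`. This is a hedged illustration, not the code in `ServerCnx`: `BrokerConsumer`, `handleRedeliver`, and the rule that the broker keeps the larger of the two epochs are all assumptions made for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

final class RedeliverAfterSubscribeSketch {
    /** Hypothetical stand-in for the broker-side consumer object. */
    static final class BrokerConsumer {
        final AtomicLong epoch;
        int redeliverCount;

        BrokerConsumer(long initialEpoch) {
            this.epoch = new AtomicLong(initialEpoch);
        }

        void redeliver(long commandEpoch) {
            // Assumption: the broker adopts the newer epoch before redelivering.
            epoch.accumulateAndGet(commandEpoch, Math::max);
            redeliverCount++;
        }
    }

    /** Defers the redeliver command until the consumer future has completed. */
    static void handleRedeliver(CompletableFuture<BrokerConsumer> consumerFuture,
                                long commandEpoch) {
        consumerFuture.whenComplete((consumer, error) -> {
            if (error != null) {
                // Subscribe failed: nothing to do, the client will reconnect
                // and carry its current epoch in the new subscribe request.
                return;
            }
            consumer.redeliver(commandEpoch);
        });
    }

    public static void main(String[] args) {
        CompletableFuture<BrokerConsumer> pending = new CompletableFuture<>();
        handleRedeliver(pending, 2);          // arrives before subscribe finishes
        BrokerConsumer consumer = new BrokerConsumer(1);
        pending.complete(consumer);           // subscribe finishes with epoch 1
        System.out.println("broker epoch = " + consumer.epoch.get());
    }
}
```

Because the callback is registered on the future, a command that arrives before the subscribe finishes is not lost; it simply runs once the consumer exists.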

Verifying this change

Added a test for it.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

Documentation

doc
doc-required
doc-not-needed
doc-complete

Matching PR in forked repository

PR in forked repository:
congbobo184#16

github-actions · 2023-05-07T01:58:22Z

The PR had no activity for 30 days; marked with the Stale label.

pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java

lhotari

LGTM

lhotari · 2024-10-14T06:49:29Z

closing and reopening to trigger CI

lhotari

Please check the question about updating the consumer epoch

...va/org/apache/pulsar/broker/service/persistent/PersistentDispatcherSingleActiveConsumer.java

pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java

lhotari

Thanks for the explanations. Now I see the problem. It seems that there's a broader possibility for race conditions in all redeliverUnacknowledgedMessages methods. I think that fixing and testing this issue needs a different approach.
There are multiple chances for the race to happen. I'll think of possible solutions.
The problem in the test case added in this PR is that it's a synthetic test where the internal state is modified. Instead of doing that, it would be better to have a way to introduce the race condition by having a way to inject a delay in the client side connection logic so that the race condition actually happens. I'll follow up with more details, possibly after experimenting on this.

lhotari · 2024-11-06T10:23:56Z

@congbobo184 To me it seems that the problem could be prevented by changing this line

pulsar/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java

Line 898 in 8eeb0e2

conf.getSubscriptionProperties(), CONSUMER_EPOCH.get(this));

I'm assuming that instead of passing CONSUMER_EPOCH.get(this), using DEFAULT_CONSUMER_EPOCH would solve the issue. It might also require changes so that DEFAULT_CONSUMER_EPOCH would always redeliver messages on the broker side. The client side seems to already always accept DEFAULT_CONSUMER_EPOCH.

Do you agree that this would solve the problem?

congbobo184 · 2024-11-06T11:28:48Z

> @congbobo184 To me it seems that the problem could be prevented by changing this line
>
> pulsar/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java
>
> Line 898 in 8eeb0e2
>
> conf.getSubscriptionProperties(), CONSUMER_EPOCH.get(this));
>
> I'm assuming that instead of passing CONSUMER_EPOCH.get(this), using DEFAULT_CONSUMER_EPOCH would solve the issue. It might also require changes so that DEFAULT_CONSUMER_EPOCH would always redeliver messages on the broker side. The client side seems to already always accept DEFAULT_CONSUMER_EPOCH.
> Do you agree that this would solve the problem?

Using DEFAULT_CONSUMER_EPOCH is the same as using CONSUMER_EPOCH.get(this): we still need to solve the race condition between reconnect and redeliver. Using DEFAULT_CONSUMER_EPOCH seems to make the problem more complicated.

lhotari · 2024-11-06T11:52:10Z

> Using DEFAULT_CONSUMER_EPOCH is the same as using CONSUMER_EPOCH.get(this): we still need to solve the race condition between reconnect and redeliver. Using DEFAULT_CONSUMER_EPOCH seems to make the problem more complicated.

Ok, I can now see that the synchronized block is there to prevent the race in increasing the epoch. I hope that we'd have a proper test for the race by injecting a delay to have a real race in the test.

There's also another issue with the permits. increaseAvailablePermits should only be called if writing of the redeliverUnacknowledgedMessages command succeeds. That should be done in the promise callback of writeAndFlush to ensure that permits aren't increased in the case where the connection is not available.
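
A minimal sketch of that suggestion follows. It uses a `CompletableFuture<Void>` as a stand-in for the result of the write, since the real client would attach a listener to the Netty future returned by `writeAndFlush`; the class `PermitsOnWriteSuccessSketch` and its fields are hypothetical and exist only for this example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

final class PermitsOnWriteSuccessSketch {
    /** Hypothetical permit counter for the example. */
    final AtomicInteger availablePermits = new AtomicInteger();

    /**
     * Increases permits only after the write of the redeliver command has
     * completed successfully.
     *
     * @param writeResult     stand-in for the outcome of writeAndFlush
     * @param clearedMessages number of locally cleared incoming messages
     */
    void redeliver(CompletableFuture<Void> writeResult, int clearedMessages) {
        writeResult.whenComplete((ignored, error) -> {
            if (error == null && clearedMessages > 0) {
                availablePermits.addAndGet(clearedMessages);
            }
            // On failure the permits stay untouched; the reconnect logic is
            // assumed to request permits again.
        });
    }

    public static void main(String[] args) {
        PermitsOnWriteSuccessSketch consumer = new PermitsOnWriteSuccessSketch();

        CompletableFuture<Void> failedWrite = new CompletableFuture<>();
        failedWrite.completeExceptionally(new IllegalStateException("connection closed"));
        consumer.redeliver(failedWrite, 10);
        System.out.println("after failed write: " + consumer.availablePermits.get());

        consumer.redeliver(CompletableFuture.<Void>completedFuture(null), 10);
        System.out.println("after successful write: " + consumer.availablePermits.get());
    }
}
```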

congbobo184 · 2024-11-11T03:55:53Z

> > Using DEFAULT_CONSUMER_EPOCH is the same as using CONSUMER_EPOCH.get(this): we still need to solve the race condition between reconnect and redeliver. Using DEFAULT_CONSUMER_EPOCH seems to make the problem more complicated.
>
> Ok, I can now see that the synchronized block is there to prevent the race in increasing the epoch. I hope that we'd have a proper test for the race by injecting a delay to have a real race in the test.

Could you please find a similar test for me? I'm not very good at writing it.

> There's also another issue with the permits. increaseAvailablePermits should only be called if writing of the redeliverUnacknowledgedMessages command succeeds. That should be done in the promise callback of writeAndFlush to ensure that permits aren't increased in the case where the connection is not available.

Yes, you are right, but this is a separate problem. I don't think this PR needs to handle it.

lhotari · 2024-11-11T18:52:58Z

> > There's also another issue with the permits. increaseAvailablePermits should only be called if writing of the redeliverUnacknowledgedMessages command succeeds. That should be done in the promise callback of writeAndFlush to ensure that permits aren't increased in the case where the connection is not available.
>
> Yes, you are right, but this is a separate problem. I don't think this PR needs to handle it.

I agree that it's partially a different problem. However since this PR changes the behavior around it, I think that it would make sense to address the permit issue in this PR. Solving the issue will require a few lines of code.

lhotari · 2024-11-11T18:56:34Z

> > Ok, I can now see that the synchronized block is there to prevent the race in increasing the epoch. I hope that we'd have a proper test for the race by injecting a delay to have a real race in the test.
>
> Could you please find a similar test for me? I'm not very good at writing it.

I don't have a good example in mind. It might require a broader effort in introducing good ways to test race conditions in the client code. Optimally, the implementation would contain injection points for adding such delays.
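
One possible shape for such an injection point is sketched below. Nothing here exists in the Pulsar client: `DelayInjectionSketch`, its `Hook` interface, and the `demo` method are hypothetical, and the "client" is reduced to a single connected flag so that the example stays self-contained.

```java
import java.util.concurrent.CountDownLatch;

final class DelayInjectionSketch {
    /** Hypothetical hook; a production default would do nothing. */
    interface Hook {
        void run() throws InterruptedException;
    }

    private final Hook beforeConnected;
    private volatile boolean connected;

    DelayInjectionSketch(Hook beforeConnected) {
        this.beforeConnected = beforeConnected;
    }

    /** Simplified connection logic with an injection point before the state change. */
    void connect() throws InterruptedException {
        beforeConnected.run();
        connected = true;
    }

    /** Returns true if the redeliver command would be sent in this simplified model. */
    boolean redeliver() {
        return connected;
    }

    /** Forces redeliver to run while connect is paused inside the hook. */
    static boolean[] demo() {
        CountDownLatch reachedHook = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        DelayInjectionSketch client = new DelayInjectionSketch(() -> {
            reachedHook.countDown(); // signal that connect is now paused
            release.await();         // hold the connection open until released
        });
        Thread connector = new Thread(() -> {
            try {
                client.connect();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            connector.start();
            reachedHook.await();
            boolean duringReconnect = client.redeliver(); // race window is open
            release.countDown();
            connector.join();
            boolean afterReconnect = client.redeliver();
            return new boolean[] {duringReconnect, afterReconnect};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("sent during reconnect: " + result[0]);
        System.out.println("sent after reconnect: " + result[1]);
    }
}
```

The latches make the interleaving deterministic, so a test built this way would exercise the real race window without modifying internal state by hand.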

congbobo184 · 2024-11-20T09:27:56Z

> > > There's also another issue with the permits. increaseAvailablePermits should only be called if writing of the redeliverUnacknowledgedMessages command succeeds. That should be done in the promise callback of writeAndFlush to ensure that permits aren't increased in the case where the connection is not available.
>
> > Yes, you are right, but this is a separate problem. I don't think this PR needs to handle it.
>
> I agree that it's partially a different problem. However since this PR changes the behavior around it, I think that it would make sense to address the permit issue in this PR. Solving the issue will require a few lines of code.

I looked at the code again: increaseAvailablePermits is only relevant to incomingMessages, so there is no need to care whether the redeliver command succeeds or fails.

[fix][client] Fix client redeliver epoch bigger than broker consumer …

c24bb3b

…epoch

congbobo184 requested review from Technoboy-, shibd and gaoran10 April 6, 2023 15:07

github-actions bot added the doc-not-needed label Apr 6, 2023

congbobo184 requested review from poorbarcode and liangyepianzhou April 6, 2023 15:08

fix log

5673942

congbobo184 self-assigned this Apr 7, 2023

github-actions bot added the Stale label May 7, 2023

Merge remote-tracking branch 'apache/master' into congbo/fix/redelive…

ec4be0d

…r-epoch-consume-stuck-problem

lhotari requested review from codelipenghui, BewareMyPower and michaeljmarshall May 10, 2023 10:19

lhotari reviewed May 10, 2023

pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java

Technoboy- added this to the 3.2.0 milestone Jul 31, 2023

Technoboy- modified the milestones: 3.2.0, 3.3.0 Dec 22, 2023

coderzc modified the milestones: 3.3.0, 3.4.0 May 8, 2024

Merge branch 'master' into congbo/fix/redeliver-epoch-consume-stuck-p…

9e7e1af

…roblem

lhotari added the category/reliability and release/blocker labels Oct 9, 2024

lhotari approved these changes Oct 14, 2024

lhotari closed this Oct 14, 2024

lhotari reopened this Oct 14, 2024

lhotari added 2 commits October 14, 2024 09:50

Merge remote-tracking branch 'origin/master' into congbo/fix/redelive…

fe8677a

…r-epoch-consume-stuck-problem

Fix import

ef862ae

Improve log message

6fb95e4

lhotari requested changes Oct 14, 2024

...va/org/apache/pulsar/broker/service/persistent/PersistentDispatcherSingleActiveConsumer.java

lhotari reviewed Oct 14, 2024

pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java

lhotari modified the milestones: 4.0.0, 4.1.0 Oct 14, 2024

lhotari added the triage/lhotari/important label and removed the release/blocker label Oct 14, 2024

lhotari requested changes Nov 6, 2024

Merge remote-tracking branch 'apache/master' into congbo/fix/redelive…

e6f4e9a

…r-epoch-consume-stuck-problem

congbobo184 added 2 commits November 20, 2024 17:28

fix some test

ddc7def

Merge remote-tracking branch 'apache/master' into congbo/fix/redelive…

3ee01db

…r-epoch-consume-stuck-problem

lhotari added the ready-to-test label Nov 29, 2024

lhotari closed this Nov 29, 2024

lhotari reopened this Nov 29, 2024
