Skip to content

[fix][client] Fix client redeliver epoch bigger than broker consumer epoch #20032

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

congbobo184
Copy link
Contributor

@congbobo184 congbobo184 commented Apr 6, 2023

Master Issue:
Fixes client redeliver epoch bigger than broker consumer epoch.
Now redeliver method exists above race condition:

  1. consumer reconnects to the broker with the epoch (1)
  2. client consumer invokes redeliver command with epoch (2) and find the consumer doesn't connect, so ignore this redeliver command
  3. the result is broker send message with epoch(1), client will filter these message

Motivation

fix this issue

Modifications

  1. client sends redeliver command don't check the consumer state only check the cnx to see whether been set to the client
  2. broker consumer future complete, then process the redeliver command, if complete exception don't need to handle, because conusmer will reconnect with the epoch

Verifying this change

add test for it

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:
congbobo184#16

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Apr 6, 2023
@congbobo184 congbobo184 self-assigned this Apr 7, 2023
@github-actions
Copy link

github-actions bot commented May 7, 2023

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label May 7, 2023
@Technoboy- Technoboy- added this to the 3.2.0 milestone Jul 31, 2023
@Technoboy- Technoboy- modified the milestones: 3.2.0, 3.3.0 Dec 22, 2023
@coderzc coderzc modified the milestones: 3.3.0, 3.4.0 May 8, 2024
@lhotari lhotari added category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost release/blocker Indicate the PR or issue that should block the release until it gets resolved labels Oct 9, 2024
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lhotari
Copy link
Member

lhotari commented Oct 14, 2024

closing and reopening to trigger CI

@lhotari lhotari closed this Oct 14, 2024
@lhotari lhotari reopened this Oct 14, 2024
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the question about updating the consumer epoch

@lhotari lhotari modified the milestones: 4.0.0, 4.1.0 Oct 14, 2024
@lhotari lhotari added triage/lhotari/important lhotari's triaging label for important issues or PRs and removed release/blocker Indicate the PR or issue that should block the release until it gets resolved labels Oct 14, 2024
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanations. Now I see the problem. It seems that there's a broader possibility for race conditions in all redeliverUnacknowledgedMessages methods. I think that this needs a different approach for fixing this and testing this issue.
There are multiple chances for the race to happen. I'll think of possible solutions.
The problem in the test case added in this PR is that it's a synthetic test where the internal state is modified. Instead of doing that, it would be better to have a way to introduce the race condition by having a way to inject a delay in the client side connection logic so that the race condition actually happens. I'll follow up with more details, possibly after experimenting on this.

@lhotari
Copy link
Member

lhotari commented Nov 6, 2024

@congbobo184 To me it seems that the problem could be prevented by changing the this line

conf.getSubscriptionProperties(), CONSUMER_EPOCH.get(this));

I'm assuming that instead of passing CONSUMER_EPOCH.get(this), using DEFAULT_CONSUMER_EPOCH would solve the issue. It might also require changes so that DEFAULT_CONSUMER_EPOCH would always redeliver messages on the broker side. The client side seems to already always accept DEFAULT_CONSUMER_EPOCH.

Do you agree that this would solve the problem?

@congbobo184
Copy link
Contributor Author

congbobo184 commented Nov 6, 2024

@congbobo184 To me it seems that the problem could be prevented by changing the this line

conf.getSubscriptionProperties(), CONSUMER_EPOCH.get(this));

I'm assuming that instead of passing CONSUMER_EPOCH.get(this), using DEFAULT_CONSUMER_EPOCH would solve the issue. It might also require changes so that DEFAULT_CONSUMER_EPOCH would always redeliver messages on the broker side. The client side seems to already always accept DEFAULT_CONSUMER_EPOCH.
Do you agree that this would solve the problem?

DEFAULT_CONSUMER_EPOCH as seem as CONSUMER_EPOCH.get(this), we need to solves the reconnect and redeliver race condition. Using DEFAULT_CONSUMER_EPOCH seems to make the problem more complicated

@lhotari
Copy link
Member

lhotari commented Nov 6, 2024

DEFAULT_CONSUMER_EPOCH as seem as CONSUMER_EPOCH.get(this), we need to solves the reconnect and redeliver race condition. Using DEFAULT_CONSUMER_EPOCH seems to make the problem more complicated

Ok, I can now see that the synchronized block is there to prevent the race in increasing the epoch. I hope that we'd have a proper test for the race by injecting a delay to have a real race in the test.

There's also another issue with the permits. increaseAvailablePermits should only be called if writing of the redeliverUnacknowledgedMessages command succeeds. That should be done in the promise callback of writeAndFlush to ensure that permits aren't increased in the case where the connection is not available.

@congbobo184
Copy link
Contributor Author

congbobo184 commented Nov 11, 2024

DEFAULT_CONSUMER_EPOCH as seem as CONSUMER_EPOCH.get(this), we need to solves the reconnect and redeliver race condition. Using DEFAULT_CONSUMER_EPOCH seems to make the problem more complicated

Ok, I can now see that the synchronized block is there to prevent the race in increasing the epoch. I hope that we'd have a proper test for the race by injecting a delay to have a real race in the test.

Could you please find a similar test for me? I'm not very good at writing it.

There's also another issue with the permits. increaseAvailablePermits should only be called if writing of the redeliverUnacknowledgedMessages command succeeds. That should be done in the promise callback of writeAndFlush to ensure that permits aren't increased in the case where the connection is not available.

yes, you are right. but this is a another problem, I think this pr don't need to handle this situation

@lhotari
Copy link
Member

lhotari commented Nov 11, 2024

There's also another issue with the permits. increaseAvailablePermits should only be called if writing of the redeliverUnacknowledgedMessages command succeeds. That should be done in the promise callback of writeAndFlush to ensure that permits aren't increased in the case where the connection is not available.

yes, you are right. but this is a another problem, I think this pr don't need to handle this situation

I agree that it's partially a different problem. However since this PR changes the behavior around it, I think that it would make sense to address the permit issue in this PR. Solving the issue will require a few lines of code.

@lhotari
Copy link
Member

lhotari commented Nov 11, 2024

Ok, I can now see that the synchronized block is there to prevent the race in increasing the epoch. I hope that we'd have a proper test for the race by injecting a delay to have a real race in the test.

Could you please find a similar test for me? I'm not very good at writing it.

I don't have a good example in mind. It might require a broader effort in introducing good ways to test race conditions in the client code. Optimally, the implementation would contain injection points for adding such delays.

@congbobo184
Copy link
Contributor Author

There's also another issue with the permits. increaseAvailablePermits should only be called if writing of the redeliverUnacknowledgedMessages command succeeds. That should be done in the promise callback of writeAndFlush to ensure that permits aren't increased in the case where the connection is not available.

yes, you are right. but this is a another problem, I think this pr don't need to handle this situation

I agree that it's partially a different problem. However since this PR changes the behavior around it, I think that it would make sense to address the permit issue in this PR. Solving the issue will require a few lines of code.

again I see the code, "increaseAvailablePermits" is only relevant to "incomingMessages", no need to care about redeliverCommand success or failure.

@lhotari lhotari closed this Nov 29, 2024
@lhotari lhotari reopened this Nov 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost doc-not-needed Your PR changes do not impact docs ready-to-test Stale triage/lhotari/important lhotari's triaging label for important issues or PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants