[fix][client] Fix clearIncomingMessages so that it doesn't leak memory while new entries are added #21767
base: master
Conversation
…y while new entries are added - when using Shared or Key_Shared, this method is called in redeliverUnacknowledgedMessages while new entries are flowing to the client. This would leak memory, and the memory limit counters would get skewed.
8e27b6c to 756a8fb
Codecov Report

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #21767      +/-   ##
============================================
- Coverage     73.43%   73.40%     -0.03%
+ Complexity    32798    32772        -26
============================================
  Files          1897     1897
  Lines        140647   140646         -1
  Branches      15489    15491         +2
============================================
- Hits         103290   103248        -42
- Misses        29283    29316        +33
- Partials      8074     8082          +8

Flags with carried forward coverage won't be shown.
Hi @lhotari, is it possible to reproduce the issue? Or is it possible to inject some delay to reproduce the issue with a test? That way we can avoid regressions and easily understand what problem is being fixed.
@codelipenghui I'll try to do that. However, it's pretty clear from the code that there are multiple race conditions that this PR would address.
One issue that I was able to reproduce; I haven't yet checked whether this PR fixes it, and the repro app might be about a different issue. It's a very messy repro app that does nasty things: https://github.com/lhotari/pulsar-playground/blob/lh-PR21767-investigation/src/main/java/com/github/lhotari/pulsar/playground/TestScenarioIssueRedeliveries.java
Repro: all messages should eventually be received. In the test case, about 5 to 15 out of 10000 are usually lost. The test app prints how many are remaining (lost). (A simplified sketch of this kind of receive loop is shown after this comment.)
UPDATE: This issue reproduces even with the changes in this PR, so it's not fixed by PR #21767. I made changes to the test app build so that a shadow jar can be built with the locally built snapshot version of the Pulsar client.
UPDATE 2: The test case was bad. I had forgotten a DLQ config with max redeliveries of 5, and that caused the problem.
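For reference, here is a minimal, self-contained sketch of the kind of receive loop described above (not the linked repro app): a Shared consumer that keeps calling redeliverUnacknowledgedMessages while messages are still flowing in, then reports how many produced messages were never received. The service URL, topic, subscription name, and the every-500-messages redelivery trigger are assumptions for illustration only.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class RedeliveryReproSketch {
    public static void main(String[] args) throws Exception {
        int total = 10_000;
        String topic = "redelivery-repro";              // hypothetical topic name
        String serviceUrl = "pulsar://localhost:6650";  // hypothetical broker URL

        try (PulsarClient client = PulsarClient.builder().serviceUrl(serviceUrl).build();
             Consumer<byte[]> consumer = client.newConsumer()
                     .topic(topic)
                     .subscriptionName("repro-sub")     // hypothetical subscription name
                     .subscriptionType(SubscriptionType.Shared)
                     .subscribe();
             Producer<byte[]> producer = client.newProducer().topic(topic).create()) {

            // Produce the test messages after the subscription exists so none are skipped.
            for (int i = 0; i < total; i++) {
                producer.send(String.valueOf(i).getBytes());
            }

            Set<String> remaining = new HashSet<>();
            for (int i = 0; i < total; i++) {
                remaining.add(String.valueOf(i));
            }

            int received = 0;
            while (!remaining.isEmpty()) {
                Message<byte[]> msg = consumer.receive(5, TimeUnit.SECONDS);
                if (msg == null) {
                    break; // nothing arrived within the timeout; whatever is left is "lost"
                }
                remaining.remove(new String(msg.getValue()));
                consumer.acknowledge(msg);
                // Trigger redelivery while new entries are still flowing to the client,
                // which is the situation described in the Motivation below.
                if (++received % 500 == 0) {
                    consumer.redeliverUnacknowledgedMessages();
                }
            }
            System.out.println("Remaining (not received): " + remaining.size());
        }
    }
}
```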
UPDATE 3 about the repro app: when I increase the number of messages to 1M, the processing gets stuck in a loop where only the redelivered messages keep rotating. That is expected in many ways when there's a frequent call to consumer.redeliverUnacknowledgedMessages(); calling that method is causing the issues.
I wonder how many Pulsar applications in the wild contain this type of mistake? Could we improve the documentation for redeliverUnacknowledgedMessages to reduce confusion? What is a valid use case for this method? Is it really needed in the user-level API, since failover/shared subscription types are essentially calling …
In most cases the situation would get resolved after a while, but only after burning a lot of CPU cycles and bytes transferred across the network. I wonder if there's a better way to deal with high error rates in Pulsar?
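On the question of a better way to handle high error rates, one option (a sketch of an alternative pattern, not something this PR changes) is to negatively acknowledge only the messages that actually failed instead of calling redeliverUnacknowledgedMessages, which asks for redelivery of everything currently unacknowledged. The method name, subscription name, delay value, and the process step below are illustrative assumptions.

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;
import org.apache.pulsar.client.api.SubscriptionType;

public class NegativeAckSketch {

    // Retry individual failed messages with negativeAcknowledge instead of bulk redelivery.
    static void consumeWithNegativeAcks(PulsarClient client, String topic) throws PulsarClientException {
        try (Consumer<byte[]> consumer = client.newConsumer()
                .topic(topic)
                .subscriptionName("error-handling-sub")            // hypothetical name
                .subscriptionType(SubscriptionType.Shared)
                .negativeAckRedeliveryDelay(1, TimeUnit.SECONDS)   // back off before redelivery
                .subscribe()) {
            while (true) {
                Message<byte[]> msg = consumer.receive();
                try {
                    process(msg);                                  // hypothetical processing step
                    consumer.acknowledge(msg);
                } catch (Exception e) {
                    // Only this message is scheduled for redelivery, after the configured delay.
                    consumer.negativeAcknowledge(msg);
                }
            }
        }
    }

    static void process(Message<byte[]> msg) {
        // placeholder for application logic that may fail
    }
}
```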
Motivation
The current ConsumerBase.clearIncomingMessages has a race condition when using the Shared or Key_Shared subscription types.
pulsar/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerBase.java
Lines 1224 to 1229 in 69a45a1
When using Shared or Key_Shared, this method is called in redeliverUnacknowledgedMessages while new entries are flowing to the client. This would leak memory and the memory limit counters would get skewed.
Modifications
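To illustrate the race described in the Motivation (a simplified, self-contained sketch with hypothetical names, not the actual ConsumerBase code or the exact change in this PR): clearing by reading an aggregate size and then emptying the queue races with concurrent additions, while draining entry by entry keeps the memory counter consistent.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for an incoming-message queue plus a memory-limit counter.
public class DrainVsClearSketch {
    private final ConcurrentLinkedQueue<byte[]> incoming = new ConcurrentLinkedQueue<>();
    private final AtomicLong incomingBytes = new AtomicLong();

    void add(byte[] entry) {
        incoming.add(entry);
        incomingBytes.addAndGet(entry.length);
    }

    // Race-prone: reads the accumulated size, then clears the queue. Entries added between
    // the read and clear() are removed from the queue without their bytes ever being
    // subtracted, so the counter drifts upward (the "skewed counters" symptom).
    void clearRacy() {
        long bytes = incomingBytes.get();
        incoming.clear();
        incomingBytes.addAndGet(-bytes);
    }

    // Drain-style: removes entries one by one and subtracts exactly what was removed.
    // Entries added concurrently either stay queued or are accounted for when polled.
    void clearByDraining() {
        byte[] entry;
        while ((entry = incoming.poll()) != null) {
            incomingBytes.addAndGet(-entry.length);
        }
    }
}
```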
Documentation
doc
doc-required
doc-not-needed
doc-complete