
[fix][client] Fix clearIncomingMessages so that it doesn't leak memory while new entries are added #21767


Draft · lhotari wants to merge 4 commits into master from lh-fix-memory-leak-clearIncomingMessages

Conversation

@lhotari (Member) commented Dec 20, 2023

Motivation

The current ConsumerBase.clearIncomingMessages method has a race condition when using the Shared or Key_Shared subscription type.

protected void clearIncomingMessages() {
    // release messages if they are pooled messages
    incomingMessages.forEach(Message::release);
    incomingMessages.clear();
    resetIncomingMessageSize();
}

When using Shared or Key_Shared, this method is called from redeliverUnacknowledgedMessages while new entries are still flowing to the client. Messages that arrive between the release pass and the clear() call can be removed without being released, which leaks memory, and resetting the size counter while new messages are arriving skews the memory limit accounting.

Modifications

  • Pull messages from the queue and release them one by one, as in the sketch below.
  • Return the number of messages that were removed.
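
A minimal sketch of that approach, assuming incomingMessages is a concurrent queue (e.g. a BlockingQueue) and that a per-message counter decrement is available (decreaseIncomingMessageSize below is an assumed helper, not necessarily what this PR uses):

// Sketch only, not necessarily the exact code in this PR.
protected int clearIncomingMessages() {
    int messagesFromQueue = 0;
    Message<?> message;
    // drain the queue entry by entry so that concurrently added messages are
    // either drained here or left in the queue, but never dropped unreleased
    while ((message = incomingMessages.poll()) != null) {
        messagesFromQueue++;
        // decrement the memory-limit accounting for this message only, instead
        // of resetting the whole counter while new messages may be arriving
        decreaseIncomingMessageSize(message);   // assumed helper, see lead-in
        message.release();
    }
    return messagesFromQueue;
}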

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@lhotari added the type/bug and ready-to-test labels on Dec 20, 2023
@lhotari added this to the 3.2.0 milestone on Dec 20, 2023
@lhotari self-assigned this on Dec 20, 2023
@github-actions bot added the doc-not-needed label on Dec 20, 2023
@lhotari requested a review from poorbarcode on December 20, 2023 13:54
@lhotari force-pushed the lh-fix-memory-leak-clearIncomingMessages branch from 8e27b6c to 756a8fb on December 20, 2023 14:17 with the commit:

[fix][client] Fix clearIncomingMessages so that it doesn't leak memory while new entries are added

- when using Shared or Key_Shared, this method is called in redeliverUnacknowledgedMessages
  while new entries are flowing to the client. This would leak memory and memory limit
  counters would get skewed

@lhotari requested a review from merlimat on December 20, 2023 18:39
@codecov-commenter commented

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (69a45a1) 73.43% compared to head (7f4bd25) 73.40%.

Additional details and impacted files


@@             Coverage Diff              @@
##             master   #21767      +/-   ##
============================================
- Coverage     73.43%   73.40%   -0.03%     
+ Complexity    32798    32772      -26     
============================================
  Files          1897     1897              
  Lines        140647   140646       -1     
  Branches      15489    15491       +2     
============================================
- Hits         103290   103248      -42     
- Misses        29283    29316      +33     
- Partials       8074     8082       +8     
Flag Coverage Δ
inttests 24.19% <34.37%> (+0.02%) ⬆️
systests 24.79% <34.37%> (+0.03%) ⬆️
unittests 72.69% <84.37%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
...a/org/apache/pulsar/client/impl/RawReaderImpl.java 83.65% <100.00%> (ø)
...va/org/apache/pulsar/client/impl/ConsumerBase.java 73.83% <100.00%> (-0.30%) ⬇️
...he/pulsar/client/impl/MultiTopicsConsumerImpl.java 77.87% <100.00%> (-0.71%) ⬇️
...ache/pulsar/client/impl/ZeroQueueConsumerImpl.java 66.25% <50.00%> (+0.81%) ⬆️
...va/org/apache/pulsar/client/impl/ConsumerImpl.java 78.00% <78.94%> (-0.02%) ⬇️

... and 53 files with indirect coverage changes

@codelipenghui (Contributor) commented

Hi @lhotari, is it possible to reproduce the issue, or to inject some delay to reproduce it with a test? That way we can avoid regressions and easily understand what problem is fixed.

@lhotari (Member, Author) commented Dec 22, 2023

Hi @lhotari, is it possible to reproduce the issue, or to inject some delay to reproduce it with a test? That way we can avoid regressions and easily understand what problem is fixed.

@codelipenghui I'll try to do that. However, it's pretty clear from the code that there are multiple race conditions that this PR would address.
One of the races is caused by the NoOpLock used for shared subscription.
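
For context, a no-op lock along these lines (illustrative only, not claimed to be the exact Pulsar class) satisfies the java.util.concurrent.locks.Lock interface but provides no mutual exclusion, so code "guarded" by it still interleaves freely with concurrent message delivery:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;

// Illustrative sketch: every method is a no-op, so two threads that both
// "hold" this lock can still run their critical sections at the same time.
final class NoOpLockSketch implements Lock {
    @Override public void lock() { }
    @Override public void lockInterruptibly() { }
    @Override public boolean tryLock() { return true; }
    @Override public boolean tryLock(long time, TimeUnit unit) { return true; }
    @Override public void unlock() { }
    @Override public Condition newCondition() { throw new UnsupportedOperationException(); }
}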

@lhotari marked this pull request as draft on December 22, 2023 06:04
@Technoboy- modified the milestones: 3.2.0, 3.3.0 on Dec 22, 2023
@lhotari (Member, Author) commented Dec 22, 2023

I was able to reproduce one issue, but I haven't yet checked whether this PR fixes it; the repro app might be about a different issue.

A very messy repro app that does nasty things: https://github.com/lhotari/pulsar-playground/blob/lh-PR21767-investigation/src/main/java/com/github/lhotari/pulsar/playground/TestScenarioIssueRedeliveries.java

repro:

git clone -b lh-PR21767-investigation https://github.com/lhotari/pulsar-playground
cd pulsar-playground
./gradlew shadowJar
# start Pulsar in docker in the background
docker run --rm --name pulsar-standalone -d -p 8080:8080 -p 6650:6650 apachepulsar/pulsar:3.1.1 /pulsar/bin/pulsar standalone -nss -nfw
# wait a few seconds for Pulsar to start
sleep 5
# run the test app using the shadow jar with all dependencies
java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioIssueRedeliveries
# stop and remove Pulsar in docker
docker stop pulsar-standalone

All messages should eventually be received, but in the test case about 5 to 15 out of 10000 are usually lost.

The test app prints how many remain (i.e. were lost):

[main] INFO com.github.lhotari.pulsar.playground.TestScenarioIssueRedeliveries - Done receiving. Remaining: 10 duplicates: 8704 reconsumed: 985
[main] ERROR com.github.lhotari.pulsar.playground.TestScenarioIssueRedeliveries - Not all messages received. Remaining: 10

UPDATE: This issue reproduces even with the changes in this PR, so it is not fixed by PR #21767.

I made changes to the test app build so that a shadow jar can be built with the locally built snapshot version of the Pulsar client.

# build the local version of Pulsar with this command before building the shadow jar:
# mvn -Pcore-modules,-main -T 1C clean install -DskipTests -Dspotbugs.skip=true
# then build the shadow jar (see the previous instructions for how to build it)
PULSAR_VERSION=3.2.0-SNAPSHOT ./gradlew shadowJar

UPDATE 2: The test case was bad. I had forgotten a DLQ config with max redeliveries set to 5, and that caused the problem.
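
For reference, this is the kind of consumer configuration that was left enabled (a sketch with hypothetical names, not the exact test app code). With a dead letter policy like this, a message is routed to a DLQ topic after 5 redeliveries instead of being redelivered forever, which explains the "lost" messages:

import org.apache.pulsar.client.api.*;

final class DlqConfigSketch {
    // 'client' is an already-built PulsarClient; topic and subscription names are hypothetical.
    static Consumer<byte[]> subscribeWithDlq(PulsarClient client) throws PulsarClientException {
        return client.newConsumer()
                .topic("persistent://public/default/demo-topic")
                .subscriptionName("demo-sub")
                .subscriptionType(SubscriptionType.Shared)
                .deadLetterPolicy(DeadLetterPolicy.builder()
                        .maxRedeliverCount(5)  // after 5 redeliveries the message goes to the DLQ
                        .build())
                .subscribe();
    }
}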

@lhotari (Member, Author) commented Dec 22, 2023

UPDATE 3 about the repro app:

When I increase the number of messages to 1M, processing gets stuck in a loop where only the redelivered messages keep rotating. That is expected in many ways when consumer.redeliverUnacknowledgedMessages() is called frequently; calling that method is what causes the trouble. I wonder how many Pulsar applications in the wild contain this type of mistake?
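
A minimal sketch of the kind of consumer loop I mean (the service URL, topic and subscription names are hypothetical, and process() stands in for application logic): on every processing failure it asks the broker to redeliver all unacknowledged messages, which keeps re-pumping the whole set of failed messages between client and broker:

import java.util.concurrent.ThreadLocalRandom;
import org.apache.pulsar.client.api.*;

public class RedeliveryLoopSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/demo-topic")
                .subscriptionName("demo-sub")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();
        while (true) {
            Message<byte[]> msg = consumer.receive();
            try {
                process(msg);
                consumer.acknowledge(msg);
            } catch (Exception e) {
                // The pattern discussed above: requesting redelivery of *all*
                // unacknowledged messages on every single failure keeps the same
                // failed messages rotating between the client and the broker.
                consumer.redeliverUnacknowledgedMessages();
                // consumer.negativeAcknowledge(msg) would target only this message.
            }
        }
    }

    private static void process(Message<byte[]> msg) {
        // placeholder for application logic; assume it fails for many messages
        if (ThreadLocalRandom.current().nextInt(10) < 5) {
            throw new RuntimeException("simulated processing failure");
        }
    }
}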

Could we improve the documentation of redeliverUnacknowledgedMessages to reduce confusion? What is a valid use case for this method? Is it really needed in the user-level API?

Since the Failover/Shared subscription types essentially call consumer.redeliverUnacknowledgedMessages() under the covers, a Pulsar application can get into a vicious cycle where the number of failed messages is very high and very little progress, or in the worst case no progress, is made between the redelivery calls. The consumer epoch solution helps with this problem, but I can see how it doesn't prevent the situation where a large number of messages are in redelivery and the loop between the client and the broker stays busy.
It seems that such a problem could be detected from the redelivery counters in topic stats, but IIRC that isn't directly visible in Grafana / Prometheus metrics.

In most cases the situation would resolve itself after a while, but only after burning a lot of CPU cycles and network bandwidth. I wonder if there's a better way to deal with high error rates in Pulsar?

@lhotari added the triage/lhotari/important label on Oct 14, 2024
Labels: doc-not-needed, ready-to-test, release/4.0.5, triage/lhotari/important, type/bug
5 participants