
[fix][client] Fix clearIncomingMessages so that it doesn't leak memory while new entries are added #21767


Draft · lhotari wants to merge 4 commits into master from lh-fix-memory-leak-clearIncomingMessages

Conversation

@lhotari (Member) commented Dec 20, 2023

Motivation

The current ConsumerBase.clearIncomingMessages method has a race condition when using the Shared or Key_Shared subscription type.

protected void clearIncomingMessages() {
    // release messages if they are pooled messages
    incomingMessages.forEach(Message::release);
    incomingMessages.clear();
    resetIncomingMessageSize();
}

When using Shared or Key_Shared, this method is called from redeliverUnacknowledgedMessages while new entries are still flowing to the client. Messages that arrive between the release pass and the clear() call can be removed without being released, which leaks memory, and resetting the size counter while new messages are arriving skews the memory limit accounting.

Modifications

  • Pull messages from the queue and release them one by one, as in the sketch below.
  • Return the number of messages that were removed.
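
A minimal sketch of that approach, assuming incomingMessages is a concurrent queue (e.g. a BlockingQueue) and that a per-message counter decrement is available (decreaseIncomingMessageSize below is an assumed helper, not necessarily what this PR uses):

// Sketch only, not necessarily the exact code in this PR.
protected int clearIncomingMessages() {
    int messagesFromQueue = 0;
    Message<?> message;
    // drain the queue entry by entry so that concurrently added messages are
    // either drained here or left in the queue, but never dropped unreleased
    while ((message = incomingMessages.poll()) != null) {
        messagesFromQueue++;
        // decrement the memory-limit accounting for this message only, instead
        // of resetting the whole counter while new messages may be arriving
        decreaseIncomingMessageSize(message);   // assumed helper, see lead-in
        message.release();
    }
    return messagesFromQueue;
}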

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@lhotari added the type/bug and ready-to-test labels on Dec 20, 2023
@lhotari added this to the 3.2.0 milestone on Dec 20, 2023
@lhotari self-assigned this on Dec 20, 2023
@github-actions bot added the doc-not-needed label on Dec 20, 2023
@lhotari requested a review from poorbarcode on December 20, 2023 13:54
@lhotari force-pushed the lh-fix-memory-leak-clearIncomingMessages branch from 8e27b6c to 756a8fb on December 20, 2023 14:17 with the commit:

[fix][client] Fix clearIncomingMessages so that it doesn't leak memory while new entries are added

- when using Shared or Key_Shared, this method is called in redeliverUnacknowledgedMessages
  while new entries are flowing to the client. This would leak memory and memory limit
  counters would get skewed

@lhotari requested a review from merlimat on December 20, 2023 18:39
@codecov-commenter commented

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (69a45a1) 73.43% compared to head (7f4bd25) 73.40%.

Additional details and impacted files


@@             Coverage Diff              @@
##             master   #21767      +/-   ##
============================================
- Coverage     73.43%   73.40%   -0.03%     
+ Complexity    32798    32772      -26     
============================================
  Files          1897     1897              
  Lines        140647   140646       -1     
  Branches      15489    15491       +2     
============================================
- Hits         103290   103248      -42     
- Misses        29283    29316      +33     
- Partials       8074     8082       +8     
Flag Coverage Δ
inttests 24.19% <34.37%> (+0.02%) ⬆️
systests 24.79% <34.37%> (+0.03%) ⬆️
unittests 72.69% <84.37%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
...a/org/apache/pulsar/client/impl/RawReaderImpl.java 83.65% <100.00%> (ø)
...va/org/apache/pulsar/client/impl/ConsumerBase.java 73.83% <100.00%> (-0.30%) ⬇️
...he/pulsar/client/impl/MultiTopicsConsumerImpl.java 77.87% <100.00%> (-0.71%) ⬇️
...ache/pulsar/client/impl/ZeroQueueConsumerImpl.java 66.25% <50.00%> (+0.81%) ⬆️
...va/org/apache/pulsar/client/impl/ConsumerImpl.java 78.00% <78.94%> (-0.02%) ⬇️

... and 53 files with indirect coverage changes

@codelipenghui (Contributor) commented

Hi @lhotari, is it possible to reproduce the issue, or to inject some delay to reproduce it with a test? That way we can avoid regressions and easily understand what problem is fixed.

@lhotari (Member, Author) commented Dec 22, 2023

Hi @lhotari, is it possible to reproduce the issue, or to inject some delay to reproduce it with a test? That way we can avoid regressions and easily understand what problem is fixed.

@codelipenghui I'll try to do that. However, it's pretty clear from the code that there are multiple race conditions that this PR would address.
One of the races is caused by the NoOpLock used for shared subscription.
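
For context, a no-op lock along these lines (illustrative only, not claimed to be the exact Pulsar class) satisfies the java.util.concurrent.locks.Lock interface but provides no mutual exclusion, so code "guarded" by it still interleaves freely with concurrent message delivery:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;

// Illustrative sketch: every method is a no-op, so two threads that both
// "hold" this lock can still run their critical sections at the same time.
final class NoOpLockSketch implements Lock {
    @Override public void lock() { }
    @Override public void lockInterruptibly() { }
    @Override public boolean tryLock() { return true; }
    @Override public boolean tryLock(long time, TimeUnit unit) { return true; }
    @Override public void unlock() { }
    @Override public Condition newCondition() { throw new UnsupportedOperationException(); }
}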

@lhotari marked this pull request as draft on December 22, 2023 06:04
@Technoboy- modified the milestones: 3.2.0, 3.3.0 on Dec 22, 2023
@lhotari (Member, Author) commented Dec 22, 2023

I was able to reproduce one issue, but I haven't yet checked whether this PR fixes it; the repro app might be about a different issue.

A very messy repro app that does nasty things: https://github.com/lhotari/pulsar-playground/blob/lh-PR21767-investigation/src/main/java/com/github/lhotari/pulsar/playground/TestScenarioIssueRedeliveries.java

repro:

git clone -b lh-PR21767-investigation https://github.com/lhotari/pulsar-playground
cd pulsar-playground
./gradlew shadowJar
# start Pulsar in docker in the background
docker run --rm --name pulsar-standalone -d -p 8080:8080 -p 6650:6650 apachepulsar/pulsar:3.1.1 /pulsar/bin/pulsar standalone -nss -nfw
# wait a few seconds for Pulsar to start
sleep 5
# run the test app using the shadow jar with all dependencies
java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioIssueRedeliveries
# stop and remove Pulsar in docker
docker stop pulsar-standalone

All messages should eventually be received, but in the test case about 5 to 15 out of 10000 are usually lost.

The test app prints how many remain (i.e. were lost):

[main] INFO com.github.lhotari.pulsar.playground.TestScenarioIssueRedeliveries - Done receiving. Remaining: 10 duplicates: 8704 reconsumed: 985
[main] ERROR com.github.lhotari.pulsar.playground.TestScenarioIssueRedeliveries - Not all messages received. Remaining: 10

UPDATE: This issue reproduces even with the changes in this PR, so it is not fixed by PR #21767.

I made changes to the test app build so that a shadow jar can be built with the locally built snapshot version of the Pulsar client.

# build the local version of Pulsar with this command before building the shadow jar:
# mvn -Pcore-modules,-main -T 1C clean install -DskipTests -Dspotbugs.skip=true
# then build the shadow jar (see the previous instructions for how to build it)
PULSAR_VERSION=3.2.0-SNAPSHOT ./gradlew shadowJar

UPDATE 2: The test case was bad. I had forgotten a DLQ config with max redeliveries set to 5, and that caused the problem.
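
For reference, this is the kind of consumer configuration that was left enabled (a sketch with hypothetical names, not the exact test app code). With a dead letter policy like this, a message is routed to a DLQ topic after 5 redeliveries instead of being redelivered forever, which explains the "lost" messages:

import org.apache.pulsar.client.api.*;

final class DlqConfigSketch {
    // 'client' is an already-built PulsarClient; topic and subscription names are hypothetical.
    static Consumer<byte[]> subscribeWithDlq(PulsarClient client) throws PulsarClientException {
        return client.newConsumer()
                .topic("persistent://public/default/demo-topic")
                .subscriptionName("demo-sub")
                .subscriptionType(SubscriptionType.Shared)
                .deadLetterPolicy(DeadLetterPolicy.builder()
                        .maxRedeliverCount(5)  // after 5 redeliveries the message goes to the DLQ
                        .build())
                .subscribe();
    }
}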

@lhotari (Member, Author) commented Dec 22, 2023

UPDATE 3 about the repro app:

When I increase the number of messages to 1M, processing gets stuck in a loop where only the redelivered messages keep rotating. That is expected in many ways when consumer.redeliverUnacknowledgedMessages() is called frequently; calling that method is what causes the trouble. I wonder how many Pulsar applications in the wild contain this type of mistake?
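
A minimal sketch of the kind of consumer loop I mean (the service URL, topic and subscription names are hypothetical, and process() stands in for application logic): on every processing failure it asks the broker to redeliver all unacknowledged messages, which keeps re-pumping the whole set of failed messages between client and broker:

import java.util.concurrent.ThreadLocalRandom;
import org.apache.pulsar.client.api.*;

public class RedeliveryLoopSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/demo-topic")
                .subscriptionName("demo-sub")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();
        while (true) {
            Message<byte[]> msg = consumer.receive();
            try {
                process(msg);
                consumer.acknowledge(msg);
            } catch (Exception e) {
                // The pattern discussed above: requesting redelivery of *all*
                // unacknowledged messages on every single failure keeps the same
                // failed messages rotating between the client and the broker.
                consumer.redeliverUnacknowledgedMessages();
                // consumer.negativeAcknowledge(msg) would target only this message.
            }
        }
    }

    private static void process(Message<byte[]> msg) {
        // placeholder for application logic; assume it fails for many messages
        if (ThreadLocalRandom.current().nextInt(10) < 5) {
            throw new RuntimeException("simulated processing failure");
        }
    }
}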

Could we improve the documentation of redeliverUnacknowledgedMessages to reduce confusion? What is a valid use case for this method? Is it really needed in the user-level API?

Since the Failover/Shared subscription types essentially call consumer.redeliverUnacknowledgedMessages() under the covers, a Pulsar application can get into a vicious cycle where the number of failed messages is very high and very little progress, or in the worst case no progress, is made between the redelivery calls. The consumer epoch solution helps with this problem, but I can see how it doesn't prevent the situation where a large number of messages are in redelivery and the loop between the client and the broker stays busy.
It seems that such a problem could be detected from the redelivery counters in topic stats, but IIRC that isn't directly visible in Grafana / Prometheus metrics.

In most cases the situation would resolve itself after a while, but only after burning a lot of CPU cycles and network bandwidth. I wonder if there's a better way to deal with high error rates in Pulsar?

@lhotari added the triage/lhotari/important label on Oct 14, 2024
Labels: doc-not-needed, ready-to-test, release/4.0.5, triage/lhotari/important, type/bug
5 participants