[fix][broker] Fix ack hole in cursor for geo-replication #20931
@poorbarcode Thank you for the information. This issue seems to apply to scenario-1 instead of scenario-3.
The pr had no activity for 30 days, mark with Stale label.
@massakam Is this PR still needed? If so, please rebase. Otherwise, please close.
1566e01 to 5ac00d8
5ac00d8 to 56f5fd9
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #20931 +/- ##
============================================
+ Coverage 73.57% 74.35% +0.78%
- Complexity 32624 34892 +2268
============================================
Files 1877 1949 +72
Lines 139502 146880 +7378
Branches 15299 16171 +872
============================================
+ Hits 102638 109215 +6577
- Misses 28908 29250 +342
- Partials 7956 8415 +459
@poorbarcode said that PIP-269 could solve this issue, but unfortunately PIP-269 doesn't seem to be progressing. So this issue is not yet resolved and this PR is still needed. I have resolved the conflict, so please review if possible.
LGTM, great work @massakam
Co-authored-by: Masahiro Sakamoto <[email protected]>
Motivation
Occasionally there is an ack hole in the cursor for geo-replication. The following is the internal stats for the topic where the problem occurred:
Also, the following log was printed on the broker server. The ack hole is included in the range where the cursor was rewound.
This problem occurred in the following situations:
In the above case, the producer for geo-replication on the cluster-a side will be closed after a certain period of time by GC.
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java
Line 2626 in ca01447
However, at this time, an already triggered operation to read new entries will not be cancelled. This operation will remain pending until new entries are available.
Then 24 hours later the user's producer connects again and publishes messages. This triggers the pending operation and causes the replicator to start reading new entries.
However, since the producer for geo-replication has not yet been restarted, these read entries will be dropped without being acknowledged.
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/GeoPersistentReplicator.java
Lines 138 to 149 in ca01447
On the other hand, since the user's producer is connected, the producer for geo-replication is also restarted and the cursor is rewound. After that, the state of the replicator is changed to `Started`.
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentReplicator.java
Lines 138 to 145 in ca01447
At this time, one of the operations triggered before the cursor is rewound succeeds, causing `readPosition` to move to "the position next to the successfully read entry". Entries before this position will not be read again. As a result, entries that have been read once but not acknowledged will be left as an ack hole.
In short, this issue is caused by a race condition between "cursor rewinding when the producer for geo-replication is restarted" and "a read operation that was triggered the last time geo-replication occurred".
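The sequence described above can be replayed in a small single-threaded toy model. This is only a sketch: `ToyCursor`, `completeRead`, and the entry numbering are hypothetical stand-ins and not Pulsar's `ManagedCursor` API. It also assumes that the stale read is replicated and acknowledged once it completes, which the description above does not state explicitly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model only: these types are hypothetical and are not Pulsar classes.
class AckHoleDemo {
    static class ToyCursor {
        long readPosition = 0;                        // next entry to be read
        final TreeSet<Long> acked = new TreeSet<>();  // acknowledged entry ids

        // Move readPosition back to the first unacknowledged entry.
        void rewind() {
            long p = 0;
            while (acked.contains(p)) {
                p++;
            }
            readPosition = p;
        }

        // A read of entryId completes: readPosition moves just past it.
        void completeRead(long entryId) {
            readPosition = entryId + 1;
        }
    }

    // Replays the race and returns the entries that are never acknowledged.
    static List<Long> simulate() {
        ToyCursor cursor = new ToyCursor();
        final long lastEntry = 4; // the toy topic holds entries 0..4

        // Step 1: the replication producer is still closed, so entry 0 is
        // read and then dropped without being acknowledged.
        cursor.completeRead(0);
        // A follow-up read for the next entry is issued and stays in flight.
        long inFlightEntry = cursor.readPosition;

        // Step 2: the replication producer restarts and the cursor is rewound.
        cursor.rewind();

        // Step 3: the stale read completes after the rewind and pushes
        // readPosition forward again (assumed here to be acknowledged).
        cursor.completeRead(inFlightEntry);
        cursor.acked.add(inFlightEntry);

        // Step 4: replication continues normally from readPosition.
        while (cursor.readPosition <= lastEntry) {
            long id = cursor.readPosition;
            cursor.completeRead(id);
            cursor.acked.add(id);
        }

        List<Long> holes = new ArrayList<>();
        for (long id = 0; id <= lastEntry; id++) {
            if (!cursor.acked.contains(id)) {
                holes.add(id);
            }
        }
        return holes;
    }

    public static void main(String[] args) {
        System.out.println("ack holes: " + simulate());
    }
}
```

In this model, entry 0 is skipped because the stale read moves `readPosition` past it after the rewind, so it stays unacknowledged while every later entry is acknowledged.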
Modifications
Add a flag named `waitForCursorRewinding` to the `PersistentReplicator` class. Normally this value is false. If this value becomes true, the replicator will no longer call `cursor.asyncReadEntriesOrWait`.
On the other hand, set `waitForCursorRewinding` to true at the beginning of the `readEntries` method that is executed when restarting the producer for geo-replication. Then wait until at least one of the following conditions is met:
- `state` is no longer `Starting`
- `havePendingRead` becomes `FALSE`
- `cursor.cancelPendingReadRequest` returns true
Then change `state` to `Started`, rewind the cursor, and set `waitForCursorRewinding` back to false. This prevents a read triggered before the cursor has been rewound from advancing the cursor again and leaving an ack hole.
Verifying this change
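As a rough way to check the idea in isolation, the guard from the Modifications section can be modelled in a single-threaded toy. This is a sketch under stated assumptions, not the code in this PR: `ToyReplicator`, `tryIssueRead`, `cancelPendingRead`, and `restart` are hypothetical names, and only `waitForCursorRewinding`, `havePendingRead`, `state`, `Starting`, and `Started` are taken from the description. The real implementation waits asynchronously, whereas in this toy the cancellation always succeeds immediately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model only: none of these types are Pulsar's real classes.
class RewindGuardDemo {
    enum State { Starting, Started }

    static class ToyReplicator {
        State state = State.Starting;
        boolean waitForCursorRewinding = false; // flag name from the PR text
        boolean havePendingRead = false;
        long readPosition = 0;
        final TreeSet<Long> acked = new TreeSet<>();

        // Stand-in for the path that would call cursor.asyncReadEntriesOrWait.
        boolean tryIssueRead() {
            if (waitForCursorRewinding || havePendingRead) {
                return false; // no new read while a rewind is in progress
            }
            havePendingRead = true;
            return true;
        }

        // Stand-in for cursor.cancelPendingReadRequest:
        // returns true if a pending read was cancelled.
        boolean cancelPendingRead() {
            boolean cancelled = havePendingRead;
            havePendingRead = false;
            return cancelled;
        }

        // Completes the pending read of the entry at readPosition.
        void completeRead(boolean acknowledge) {
            long id = readPosition++;
            havePendingRead = false;
            if (acknowledge) {
                acked.add(id);
            }
        }

        // Move readPosition back to the first unacknowledged entry.
        void rewind() {
            long p = 0;
            while (acked.contains(p)) {
                p++;
            }
            readPosition = p;
        }

        // Hypothetical restart path following the three conditions above.
        void restart() {
            waitForCursorRewinding = true;
            boolean ready = state != State.Starting
                    || !havePendingRead
                    || cancelPendingRead();
            if (!ready) {
                return; // a real caller would retry later; the flag stays set
            }
            state = State.Started;
            rewind();
            waitForCursorRewinding = false;
        }
    }

    // Replays the problematic sequence with the guard in place.
    static List<Long> simulate() {
        ToyReplicator r = new ToyReplicator();
        final long lastEntry = 4; // the toy topic holds entries 0..4

        r.tryIssueRead();
        r.completeRead(false); // entry 0 read while the producer is closed: dropped
        r.tryIssueRead();      // follow-up read is now pending
        r.restart();           // cancels the pending read, then rewinds to entry 0

        while (r.readPosition <= lastEntry && r.tryIssueRead()) {
            r.completeRead(true);
        }

        List<Long> holes = new ArrayList<>();
        for (long id = 0; id <= lastEntry; id++) {
            if (!r.acked.contains(id)) {
                holes.add(id);
            }
        }
        return holes;
    }

    public static void main(String[] args) {
        System.out.println("ack holes: " + simulate());
    }
}
```

Because the pending read is cancelled before the rewind and no new read can be issued while `waitForCursorRewinding` is true, the toy run ends with no unacknowledged entries.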
Documentation
doc
doc-required
doc-not-needed
doc-complete