
[fix][broker] Fix ack hole in cursor for geo-replication #20931


Merged: 1 commit merged into apache:master on Oct 10, 2024

Conversation

@massakam (Contributor) commented Aug 4, 2023

Motivation

Occasionally, an ack hole appears in the cursor used for geo-replication. The following are the internal stats for the topic where the problem occurred:

{
  "entriesAddedCounter" : 11000,
  "numberOfEntries" : 6999,
  "totalSize" : 362285,
  "currentLedgerEntries" : 6999,
  "currentLedgerSize" : 362285,
  "lastLedgerCreatedTimestamp" : "2023-07-30T16:17:10.137+09:00",
  "waitingCursorsCount" : 1,
  "pendingAddEntriesCount" : 0,
  "lastConfirmedEntry" : "1687807:6998",
  "state" : "LedgerOpened",
  "ledgers" : [ {
    "ledgerId" : 1687807,
    "entries" : 0,
    "size" : 0,
    "offloaded" : false,
    "underReplicated" : false
  } ],
  "cursors" : {
    "pulsar.repl.cluster-a" : {
      "markDeletePosition" : "1687807:6002",
      "readPosition" : "1687807:6999",
      "waitingReadOp" : true,
      "pendingReadOps" : 0,
      "messagesConsumedCounter" : 10942,
      "cursorLedger" : 1687808,
      "cursorLedgerLastEntry" : 13,
      "individuallyDeletedMessages" : "[(1687807:6060..1687807:6998]]",
      "lastLedgerSwitchTimestamp" : "2023-07-30T16:17:10.143+09:00",
      "state" : "Open",
      "numberOfEntriesSinceFirstNotAckedMessage" : 997,
      "totalNonContiguousDeletedMessagesRange" : 1,
      "properties" : { }
    }
  },
  "schemaLedgers" : [ ],
  "compactedLedger" : {
    "ledgerId" : -1,
    "entries" : -1,
    "size" : -1,
    "offloaded" : false,
    "underReplicated" : false
  }
}

Also, the following log message was printed on the broker server. The ack hole is included in the range over which the cursor was rewound.

16:50:11.787 [pulsar-io-24-1] INFO  o.a.b.mledger.impl.ManagedCursorImpl - [massakam/test/persistent/t1-pulsar.repl.cluster-a] Rewind from 1687807:6061 to 1687807:5999

This problem occurred in the following situations:

  • There are two replication clusters, cluster-a and cluster-b
  • cluster-a usually has no producers or consumers, but once a day a producer connects and publishes messages
  • Only consumers are connected to cluster-b
  • Retention time is 0

In the above case, the producer for geo-replication on the cluster-a side will be closed by GC after a certain period of time.

However, at this time, an already triggered operation to read new entries will not be cancelled. This operation will remain pending until new entries are available.

Then 24 hours later the user's producer connects again and publishes messages. This triggers the pending operation and causes the replicator to start reading new entries.

However, since the producer for geo-replication has not yet been restarted, these read entries will be dropped without being acknowledged.

if (STATE_UPDATER.get(this) != State.Started || isLocalMessageSkippedOnce) {
    // The producer is not ready yet after having stopped/restarted. Drop the message because it will
    // recovered when the producer is ready
    if (log.isDebugEnabled()) {
        log.debug("[{}] Dropping read message at {} because producer is not ready",
                replicatorId, entry.getPosition());
    }
    isLocalMessageSkippedOnce = true;
    entry.release();
    msg.recycle();
    continue;
}

Meanwhile, since the user's producer is connected, the producer for geo-replication is also restarted and the cursor is rewound. After that, the state of the replicator is changed to Started.

// Rewind the cursor to be sure to read again all non-acked messages sent while restarting
cursor.rewind();
cursor.cancelPendingReadRequest();
HAVE_PENDING_READ_UPDATER.set(this, FALSE);
this.producer = (ProducerImpl) producer;
if (STATE_UPDATER.compareAndSet(this, State.Starting, State.Started)) {

At this point, one of the read operations triggered before the cursor was rewound completes, moving readPosition to the position just after the successfully read entry. Entries before this position will not be read again. As a result, entries that have been read once but not acknowledged are left as an ack hole.

In short, this issue is caused by a race condition between the cursor rewind performed when the producer for geo-replication is restarted and a read operation that was triggered the last time geo-replication ran.

Modifications

Add a flag named waitForCursorRewinding to the PersistentReplicator class. It is normally false; while it is true, the replicator does not call cursor.asyncReadEntriesOrWait.
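
As a rough sketch of this guard (the method body below is an approximation, not the merged patch; readMoreEntries is assumed here to be the method where the replicator issues its cursor reads):

// Sketch only: guard on the read path. While a producer restart is rewinding
// the cursor, no new read is issued, so a stale read cannot advance the read
// position past entries that were never replicated.
protected void readMoreEntries() {
    if (waitForCursorRewinding) {
        return;
    }
    // ... compute the read batch size and call cursor.asyncReadEntriesOrWait(...)
    //     as before
}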

In the readEntries method, which is executed when the producer for geo-replication is restarted, set waitForCursorRewinding to true at the beginning, then wait until at least one of the following conditions is met:

  • state is no longer Starting
    • This means the replicator has been stopped
  • havePendingRead becomes FALSE
    • This means there are no reads in progress
  • cursor.cancelPendingReadRequest returns true
    • This means that there was a read in progress, but it was successfully canceled

Then change the state to Started, rewind the cursor, and set waitForCursorRewinding back to false. This prevents a read triggered before the rewind from advancing the read position again and leaving an ack hole.
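
The following is a minimal sketch of that restart path, reusing the field and updater names quoted above (TRUE/FALSE are assumed to be the int flags used with HAVE_PENDING_READ_UPDATER; the waiting strategy and exact ordering in the merged commit may differ):

// Sketch only: restart path for the geo-replication producer, following the
// description above rather than the exact merged code.
private volatile boolean waitForCursorRewinding = false;

protected void readEntries(org.apache.pulsar.client.api.Producer<byte[]> producer) {
    // Stop the read path from issuing new cursor reads while we rewind.
    waitForCursorRewinding = true;

    // Wait until the replicator is no longer Starting (it was stopped), there is
    // no read in flight, or the in-flight read could be cancelled.
    while (STATE_UPDATER.get(this) == State.Starting
            && HAVE_PENDING_READ_UPDATER.get(this) == TRUE
            && !cursor.cancelPendingReadRequest()) {
        try {
            Thread.sleep(1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            break;
        }
    }

    HAVE_PENDING_READ_UPDATER.set(this, FALSE);
    this.producer = (ProducerImpl) producer;

    if (STATE_UPDATER.compareAndSet(this, State.Starting, State.Started)) {
        // Rewind only after no stale read can complete, then allow reads again.
        cursor.rewind();
        waitForCursorRewinding = false;
        readMoreEntries();
    } else {
        // The replicator was stopped while waiting; just release the flag.
        waitForCursorRewinding = false;
    }
}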

Verifying this change

  • Make sure that the change passes the CI checks.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@massakam added the type/bug, area/broker, doc-not-needed, and ready-to-test labels Aug 4, 2023
@massakam added this to the 3.2.0 milestone Aug 4, 2023
@massakam self-assigned this Aug 4, 2023
@poorbarcode self-requested a review August 4, 2023 12:10
@poorbarcode (Contributor) commented Aug 4, 2023

@massakam

Nice catch.

I think this issue is the same as scenario-3 in the Motivation of PR #20128.

I think PIP-269 can solve this issue.

@massakam (Contributor, Author) commented Aug 7, 2023

@poorbarcode Thank you for the information. This issue seems to apply to scenario-1 instead of scenario-3.

github-actions bot commented Sep 7, 2023

The PR had no activity for 30 days; marking it with the Stale label.

@github-actions bot added the Stale label Sep 7, 2023
@Technoboy- modified the milestones: 3.2.0, 3.3.0 Dec 22, 2023
@coderzc modified the milestones: 3.3.0, 3.4.0 May 8, 2024
@lhotari (Member) commented Oct 9, 2024

@massakam Is this PR still needed? If so, please rebase. Otherwise, please close.

@massakam force-pushed the fix-replicator-ack-hole branch from 1566e01 to 5ac00d8 on October 10, 2024 06:10
@massakam removed the Stale label Oct 10, 2024
@massakam force-pushed the fix-replicator-ack-hole branch from 5ac00d8 to 56f5fd9 on October 10, 2024 07:41
codecov-commenter commented Oct 10, 2024

Codecov Report

Attention: Patch coverage is 46.66667% with 8 lines in your changes missing coverage. Please review.

Project coverage is 74.35%. Comparing base (bbc6224) to head (56f5fd9).
Report is 661 commits behind head on master.

Files with missing lines | Patch % | Lines
...roker/service/persistent/PersistentReplicator.java | 46.66% | 6 Missing and 2 partials ⚠️
Additional details and impacted files


@@             Coverage Diff              @@
##             master   #20931      +/-   ##
============================================
+ Coverage     73.57%   74.35%   +0.78%     
- Complexity    32624    34892    +2268     
============================================
  Files          1877     1949      +72     
  Lines        139502   146880    +7378     
  Branches      15299    16171     +872     
============================================
+ Hits         102638   109215    +6577     
- Misses        28908    29250     +342     
- Partials       7956     8415     +459     
Flag | Coverage Δ
inttests | 27.34% <26.66%> (+2.76%) ⬆️
systests | 24.42% <0.00%> (+0.09%) ⬆️
unittests | 73.70% <46.66%> (+0.86%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
...roker/service/persistent/PersistentReplicator.java | 67.72% <46.66%> (-1.16%) ⬇️

... and 623 files with indirect coverage changes

@massakam (Contributor, Author)

@lhotari

@poorbarcode said that PIP-269 could solve this issue, but unfortunately PIP-269 doesn't seem to be progressing. So this issue is not yet resolved and this PR is still needed. I have resolved the conflict, so please review if possible.

@lhotari (Member) left a comment

LGTM, great work @massakam

@lhotari merged commit 9f8b4a6 into apache:master Oct 10, 2024
52 checks passed
@massakam deleted the fix-replicator-ack-hole branch October 10, 2024 10:08
hanmz pushed a commit to hanmz/pulsar that referenced this pull request Feb 12, 2025