
maintainer: replay deferred WAITING barrier statuses after dispatcher enters replicating #4808

Open
zier-one wants to merge 1 commit into pingcap:master from zier-one:20260413-speedup-create-like


@zier-one zier-one commented Apr 13, 2026

What problem does this PR solve?

Issue Number: close #4810

This PR fixes a timing window in the maintainer barrier flow. Before this change, when a non-DDL dispatcher reported a WAITING block status before the maintainer had moved it from scheduling to replicating, the status was ignored immediately. As a result, the barrier could not advance on the first report and had to wait for the dispatcher's local fixed 5s resend task.

This change stores such deferred WAITING statuses inside the maintainer and replays them through the existing barrier state machine after the dispatcher actually becomes replicating, so barrier progress no longer depends on the dispatcher's local 5s resend as the primary recovery path.

What is changed and how it works?

This PR applies to barrier scenarios where a non-DDL dispatcher can observe a barrier before it is officially moved into the replicating set, especially:

  • newly created or recreated dispatchers participating in a barrier for the first time;
  • DDLs such as CREATE TABLE ... LIKE ... that bring referenced-table dispatchers into the same barrier;
  • dispatcher recreation after migration, split, or merge;
  • any DDL / syncpoint path where the first WAITING report can land in the scheduling -> replicating transition window.

Before:

  • the first WAITING report could be ignored if it arrived during the non-replicating window;
  • barrier progress then depended on the dispatcher's local 5s resend task;
  • DDL / syncpoint tail latency could be amplified by that fixed resend interval.

After:

  • the maintainer defers and caches such WAITING statuses instead of dropping them;
  • once the dispatcher becomes replicating, the maintainer actively replays the cached status in periodic Barrier.Resend();
  • barrier progress resumes through the existing ACK / write-action path without treating the dispatcher's local 5s resend as the primary compensation path.
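
The before/after behavior can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `barrier`, `waitingStatus`, `handleStatus`, and `resend` names are hypothetical stand-ins, and the real maintainer tracks far more state than a replicating flag per dispatcher.

```go
package main

import "fmt"

// All names here are hypothetical and only illustrate the defer-and-replay flow.
type dispatcherID string

type waitingStatus struct {
	dispatcher dispatcherID
	commitTs   uint64
}

type barrier struct {
	replicating map[dispatcherID]bool          // dispatchers already in the replicating set
	deferred    map[dispatcherID]waitingStatus // WAITING reports that arrived too early
	handled     []waitingStatus                // reports that reached the state machine
}

func newBarrier() *barrier {
	return &barrier{
		replicating: make(map[dispatcherID]bool),
		deferred:    make(map[dispatcherID]waitingStatus),
	}
}

// handleStatus caches a WAITING report when its dispatcher is not yet
// replicating, instead of dropping it as the old behavior did.
func (b *barrier) handleStatus(s waitingStatus) {
	if !b.replicating[s.dispatcher] {
		b.deferred[s.dispatcher] = s
		return
	}
	b.handled = append(b.handled, s)
}

// resend replays every cached report whose dispatcher has become replicating.
// Deleting map entries while ranging over the map is safe in Go.
func (b *barrier) resend() {
	for id, s := range b.deferred {
		if b.replicating[id] {
			delete(b.deferred, id)
			b.handled = append(b.handled, s)
		}
	}
}

func main() {
	b := newBarrier()
	b.handleStatus(waitingStatus{dispatcher: "d1", commitTs: 100})
	fmt.Println(len(b.handled), len(b.deferred)) // 0 1

	b.replicating["d1"] = true // scheduling -> replicating
	b.resend()
	fmt.Println(len(b.handled), len(b.deferred)) // 1 0
}
```

In this sketch the first report is processed on the next `resend()` after the dispatcher becomes replicating, rather than waiting for the dispatcher to send the report again.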

Check List

Tests

[x] Unit test

Added coverage for:

  • TestDeferUnreplicatingWaitingStatus
  • TestResendReplaysDeferredWaitingStatusAfterDispatcherReplicating
  • TestResendDropsDeferredWaitingStatusWhenDispatcherMissing
  • TestResendDropsDeferredWaitingStatusWhenDispatcherAlreadyPassed
  • TestResendDropsDeferredWaitingStatusWhenDispatcherMoved

Also re-ran the nearby barrier scheduling regression:

  • TestDeferAllDBBlockEventFromDDLDispatcherWhilePendingSchedule

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling of status reports during dispatcher state transitions to prevent premature processing and ensure reliable replication coordination.
  • Tests

    • Added comprehensive test coverage for dispatcher status handling during initialization and state changes.

Release note

Improved the expected replication efficiency of `CREATE TABLE ... LIKE ...` by optimizing barrier coordination and the blocking DDL progression path to reduce extra waits introduced by referenced-table dispatchers.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Apr 13, 2026
@ti-chi-bot

ti-chi-bot bot commented Apr 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign wk989898 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

📝 Walkthrough


This PR introduces a deferred status mechanism for dispatchers that are not yet replicating. Block status reports in the WAITING stage that arrive while a dispatcher is still in a non-replicating state are buffered and replayed once the dispatcher starts replicating, improving consistency in barrier event handling.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core Barrier Logic<br>`maintainer/barrier.go` | Added pendingUnreplicatingStatuses field and deferral logic in HandleStatus. New methods tryDeferUnreplicatingWaitingStatus, drainPendingUnreplicatingStatuses, and dispatcherAlreadyPassedPendingState manage deferred statuses. Updated handleOneStatus signature to accept common.ChangeFeedID directly. Deferred statuses are replayed on Resend() with proper ACK/WRITE action generation. |
| Event Forwarding Helper<br>`maintainer/barrier_event.go` | Extracted forwarding decision logic into new replicationPassedBarrier helper method to consolidate checkpoint and block state comparison logic and improve code reusability. |
| Pending Status Tracking<br>`maintainer/barrier_helper.go` | Introduced new pendingUnreplicatingStatusMap data structure with associated types (pendingUnreplicatingStatusKey, pendingUnreplicatingStatus, pendingUnreplicatingStatusEntry) to track deferred statuses per dispatcher with concurrency protection. Implemented upsert, delete, snapshot, and len operations. |
| Test Coverage<br>`maintainer/barrier_test.go` | Added comprehensive tests validating deferred status behavior: queuing for unreplicating dispatchers, replay on replication start, proper ACK/WRITE action generation, and cleanup/drop conditions (dispatcher removal, state advancement, node movement). |
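
For readers who have not opened the diff, a per-dispatcher pending map with upsert, delete, snapshot, and len operations might look roughly like the sketch below. It is a simplified guess, not the code in `maintainer/barrier_helper.go`: the type names are shortened, and the key fields (`commitTs`, `isSyncPoint`) and the stored value (the reporting node) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical, simplified stand-ins for the types named in the summary.
type dispatcherID string

type statusKey struct {
	commitTs    uint64 // assumed: barrier event commit timestamp
	isSyncPoint bool   // assumed: distinguishes syncpoint from DDL barriers
}

type statusEntry struct {
	dispatcher dispatcherID
	key        statusKey
	from       string // node that reported the status
}

type pendingStatusMap struct {
	mu           sync.Mutex
	byDispatcher map[dispatcherID]map[statusKey]string
}

func newPendingStatusMap() *pendingStatusMap {
	return &pendingStatusMap{byDispatcher: make(map[dispatcherID]map[statusKey]string)}
}

// upsert stores or overwrites the status for (dispatcher, key), so a repeated
// report for the same barrier does not grow the map.
func (m *pendingStatusMap) upsert(id dispatcherID, key statusKey, from string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	statuses, ok := m.byDispatcher[id]
	if !ok {
		statuses = make(map[statusKey]string)
		m.byDispatcher[id] = statuses
	}
	statuses[key] = from
}

// delete removes one entry and drops the inner map once it is empty.
func (m *pendingStatusMap) delete(id dispatcherID, key statusKey) {
	m.mu.Lock()
	defer m.mu.Unlock()
	statuses, ok := m.byDispatcher[id]
	if !ok {
		return
	}
	delete(statuses, key)
	if len(statuses) == 0 {
		delete(m.byDispatcher, id)
	}
}

// len returns the total number of deferred statuses across all dispatchers.
func (m *pendingStatusMap) len() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	total := 0
	for _, statuses := range m.byDispatcher {
		total += len(statuses)
	}
	return total
}

// snapshot copies all entries so callers can iterate without holding the lock.
func (m *pendingStatusMap) snapshot() []statusEntry {
	m.mu.Lock()
	defer m.mu.Unlock()
	total := 0
	for _, statuses := range m.byDispatcher {
		total += len(statuses)
	}
	entries := make([]statusEntry, 0, total)
	for id, statuses := range m.byDispatcher {
		for key, from := range statuses {
			entries = append(entries, statusEntry{dispatcher: id, key: key, from: from})
		}
	}
	return entries
}

func main() {
	m := newPendingStatusMap()
	m.upsert("d1", statusKey{commitTs: 100}, "node-a")
	m.upsert("d1", statusKey{commitTs: 100}, "node-a") // same key: overwritten
	m.upsert("d2", statusKey{commitTs: 100}, "node-b")
	fmt.Println(m.len(), len(m.snapshot()))
}
```

Keying by dispatcher and barrier means the number of entries in this sketch is bounded by the number of distinct (dispatcher, barrier) pairs rather than by the number of reports received.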

Sequence Diagram

sequenceDiagram
    participant Disp as Dispatcher<br/>(Not Replicating)
    participant Barrier as Barrier
    participant PendingMap as Pending Status<br/>Map
    participant Handler as Status Handler

    Disp->>Barrier: TableSpanBlockStatus<br/>(WAITING stage)
    Barrier->>Barrier: Check: dispatcher<br/>not replicating?
    alt Dispatcher Not Replicating
        Barrier->>PendingMap: upsert(status)
        PendingMap-->>Barrier: stored
        Barrier->>Barrier: Skip normal<br/>handling
    else Dispatcher Replicating
        Barrier->>Handler: handleOneStatus()
        Handler-->>Barrier: ACK + WRITE actions
    end
    
    Disp->>Barrier: Replication starts
    Barrier->>Barrier: Resend()
    Barrier->>PendingMap: snapshot()
    PendingMap-->>Barrier: deferred statuses
    loop For each deferred status
        Barrier->>Barrier: dispatcherAlreadyPassedPendingState?
        alt Not passed
            Barrier->>Handler: handleOneStatus()
            Handler-->>Barrier: ACK + WRITE actions
            Barrier->>PendingMap: delete(status)
        else Already passed
            Barrier->>PendingMap: delete(status)
        end
    end
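
The per-entry decision inside the replay loop, combined with the drop conditions named in the test list (dispatcher missing, already passed, moved), could be sketched as a single function. Everything here is hypothetical: the `decide` helper, the field names, the order of the checks, and in particular the `checkpointTs >= commitTs` comparison used for "already passed" are assumptions, not the logic of dispatcherAlreadyPassedPendingState.

```go
package main

import "fmt"

type replayDecision int

const (
	keepDeferred replayDecision = iota // dispatcher still not replicating: try again later
	dropStatus                         // dispatcher missing, moved, or already past the barrier
	replayStatus                       // hand the status to the barrier state machine
)

// dispatcherState is a hypothetical view of what the maintainer knows.
type dispatcherState struct {
	exists       bool
	replicating  bool
	node         string
	checkpointTs uint64
}

// deferredStatus is a hypothetical cached WAITING report.
type deferredStatus struct {
	from     string // node that sent the report
	commitTs uint64 // barrier event the dispatcher is waiting on
}

// decide classifies one deferred status during a resend pass.
func decide(d dispatcherState, s deferredStatus) replayDecision {
	switch {
	case !d.exists:
		return dropStatus // dispatcher was removed
	case !d.replicating:
		return keepDeferred // still in the scheduling window
	case d.node != s.from:
		return dropStatus // dispatcher moved to another node
	case d.checkpointTs >= s.commitTs:
		return dropStatus // assumed meaning of "already passed"
	default:
		return replayStatus
	}
}

func main() {
	s := deferredStatus{from: "node-a", commitTs: 100}
	d := dispatcherState{exists: true, replicating: true, node: "node-a", checkpointTs: 90}
	fmt.Println(decide(d, s) == replayStatus) // true
}
```

Only the `replayStatus` outcome would go through handleOneStatus() and produce ACK and WRITE actions; both drop outcomes just delete the cached entry.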

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested labels

lgtm, approved, release-note

Suggested reviewers

  • wk989898
  • lidezhu
  • 3AceShowHand

Poem

🐰 A barrier once stood, statuses would wait,
Till dispatchers were ready to replicate!
Deferred and buffered with bunny-like care,
They replay and ACK through the TiCDC air! 🎯

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title directly and clearly summarizes the main change: deferring WAITING barrier statuses and replaying them after dispatcher enters replicating state. |
| Description check | ✅ Passed | The pull request description comprehensively addresses all required sections with clear problem statement, detailed explanation of changes, and appropriate test coverage. |


@ti-chi-bot ti-chi-bot bot added contribution This PR is from a community contributor. first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Apr 13, 2026
@ti-chi-bot

ti-chi-bot bot commented Apr 13, 2026

Hi @zier-one. Thanks for your PR.

I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot

ti-chi-bot bot commented Apr 13, 2026

Welcome @zier-one!

It looks like this is your first PR to pingcap/ticdc 🎉.

I'm the bot that helps you request reviewers, add labels, and more. See available commands.

We want to make sure your contribution gets all the attention it needs!



Thank you, and welcome to pingcap/ticdc. 😃

@pingcap-cla-assistant

pingcap-cla-assistant bot commented Apr 13, 2026

CLA assistant check
All committers have signed the CLA.

@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 13, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a mechanism to defer block statuses from dispatchers that are not yet in the replicating state. It adds a pendingUnreplicatingStatusMap to the Barrier struct to track these statuses and replays them during the Resend cycle once the dispatcher enters the replicating state. The changes also include refactoring handleOneStatus to use common types and extracting barrier check logic into a reusable replicationPassedBarrier function. Feedback was provided regarding a potential misleading warning log that may trigger when statuses are deferred instead of processed immediately.

Comment on lines +94 to 96
	if b.tryDeferUnreplicatingWaitingStatus(from, cfID, dispatcherID, status) {
		continue
	}

medium

The deferral logic introduced here will cause the warning no dispatcher status to send (located around line 129 in the full file) to trigger even when statuses are correctly deferred. This could lead to misleading logs and noise in production. Consider tracking the number of deferred statuses in this loop and suppressing that warning if any statuses were deferred.
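
One way to act on this suggestion is to count deferrals inside the loop and only warn when nothing was handled and nothing was deferred. The sketch below is illustrative only: `processBatch`, the `status` type, and the exact warning condition are hypothetical and do not reproduce the HandleStatus code.

```go
package main

import "fmt"

// status is a hypothetical, reduced view of one reported block status.
type status struct {
	dispatcher  string
	replicating bool // whether the maintainer already treats the dispatcher as replicating
}

// processBatch counts handled and deferred statuses and reports whether the
// "no dispatcher status to send" warning should be logged for this batch.
func processBatch(statuses []status) (handled, deferred int, warn bool) {
	for _, s := range statuses {
		if !s.replicating {
			deferred++ // cached for later replay, intentionally not answered now
			continue
		}
		handled++
	}
	// Warn only when the batch produced nothing at all; a batch that was
	// entirely deferred is expected and should stay quiet.
	warn = handled == 0 && deferred == 0
	return handled, deferred, warn
}

func main() {
	handled, deferred, warn := processBatch([]status{{dispatcher: "d1", replicating: false}})
	fmt.Println(handled, deferred, warn) // 0 1 false
}
```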

@zier-one zier-one changed the title from "update" to "maintainer: replay deferred WAITING barrier statuses after dispatcher enters replicating" Apr 13, 2026
@ti-chi-bot ti-chi-bot bot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Apr 13, 2026
@zier-one zier-one marked this pull request as ready for review April 13, 2026 08:17
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-triage-completed and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-linked-issue labels Apr 13, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
maintainer/barrier_helper.go (1)

157-172: Consider pre-allocating slice capacity in snapshot().

The slice is created with zero capacity, but the total count is known after iterating. This is a minor optimization opportunity.

♻️ Optional: Pre-allocate with estimated capacity
 func (m *pendingUnreplicatingStatusMap) snapshot() []pendingUnreplicatingStatusEntry {
 	m.mutex.Lock()
 	defer m.mutex.Unlock()

-	entries := make([]pendingUnreplicatingStatusEntry, 0)
+	// Estimate total entries for pre-allocation
+	total := 0
+	for _, statuses := range m.byDispatcher {
+		total += len(statuses)
+	}
+	entries := make([]pendingUnreplicatingStatusEntry, 0, total)
 	for dispatcherID, statuses := range m.byDispatcher {
 		for key, value := range statuses {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maintainer/barrier_helper.go` around lines 157 - 172, In
pendingUnreplicatingStatusMap.snapshot(), avoid starting entries with zero
capacity; first compute total := sum of len(statuses) for each statuses in
m.byDispatcher, then allocate entries := make([]pendingUnreplicatingStatusEntry,
0, total) before the nested loops; keep the rest of the loop appending to
entries and return entries—this preserves behavior but reduces reallocations
when building the slice.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b30f1629-d57b-4da7-bd1d-169dc2193eea

📥 Commits

Reviewing files that changed from the base of the PR and between 0a418b4 and 7cd78f3.

📒 Files selected for processing (4)
  • maintainer/barrier.go
  • maintainer/barrier_event.go
  • maintainer/barrier_helper.go
  • maintainer/barrier_test.go

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 13, 2026
@wk989898
Collaborator

/ok-to-test

@ti-chi-bot ti-chi-bot bot added ok-to-test Indicates a PR is ready to be tested. and removed needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Apr 13, 2026
@wk989898
Collaborator

If there are a lot of barrier events, could this change cause an OOM?


Labels

contribution This PR is from a community contributor. do-not-merge/needs-triage-completed first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. ok-to-test Indicates a PR is ready to be tested. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

CREATE TABLE ... LIKE ... may incur an extra ~5s wait before the barrier advances

2 participants