maintainer: replay deferred WAITING barrier statuses after dispatcher enters replicating #4808
zier-one wants to merge 1 commit into pingcap:master from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here. Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
📝 Walkthrough

This PR introduces a deferred status mechanism for dispatchers not yet replicating. Block status reports arriving while a dispatcher is in a non-replicating state with WAITING stage status are buffered and replayed once replication begins, improving consistency in barrier event handling.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Disp as Dispatcher<br/>(Not Replicating)
    participant Barrier as Barrier
    participant PendingMap as Pending Status<br/>Map
    participant Handler as Status Handler
    Disp->>Barrier: TableSpanBlockStatus<br/>(WAITING stage)
    Barrier->>Barrier: Check: dispatcher<br/>not replicating?
    alt Dispatcher Not Replicating
        Barrier->>PendingMap: upsert(status)
        PendingMap-->>Barrier: stored
        Barrier->>Barrier: Skip normal<br/>handling
    else Dispatcher Replicating
        Barrier->>Handler: handleOneStatus()
        Handler-->>Barrier: ACK + WRITE actions
    end
    Disp->>Barrier: Replication starts
    Barrier->>Barrier: Resend()
    Barrier->>PendingMap: snapshot()
    PendingMap-->>Barrier: deferred statuses
    loop For each deferred status
        Barrier->>Barrier: dispatcherAlreadyPassedPendingState?
        alt Not passed
            Barrier->>Handler: handleOneStatus()
            Handler-->>Barrier: ACK + WRITE actions
            Barrier->>PendingMap: delete(status)
        else Already passed
            Barrier->>PendingMap: delete(status)
        end
    end
```
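The pending-status map at the center of this diagram (upsert on arrival, snapshot during Resend, delete after replay) can be sketched as a small stand-alone model. Field names and types below are simplified stand-ins; the real `pendingUnreplicatingStatusMap` lives in `maintainer/barrier_helper.go`.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical simplified types; the real status type is far richer.
type dispatcherID string

type blockStatus struct {
	commitTs uint64
	stage    string // e.g. "WAITING"
}

type pendingUnreplicatingStatusEntry struct {
	dispatcher dispatcherID
	status     blockStatus
}

// pendingUnreplicatingStatusMap buffers WAITING statuses reported by
// dispatchers that are not yet replicating, keyed by dispatcher and commitTs.
type pendingUnreplicatingStatusMap struct {
	mutex        sync.Mutex
	byDispatcher map[dispatcherID]map[uint64]blockStatus
}

func newPendingMap() *pendingUnreplicatingStatusMap {
	return &pendingUnreplicatingStatusMap{
		byDispatcher: make(map[dispatcherID]map[uint64]blockStatus),
	}
}

// upsert stores or overwrites the buffered status for (dispatcher, commitTs),
// so repeated reports of the same barrier do not accumulate.
func (m *pendingUnreplicatingStatusMap) upsert(id dispatcherID, s blockStatus) {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	if m.byDispatcher[id] == nil {
		m.byDispatcher[id] = make(map[uint64]blockStatus)
	}
	m.byDispatcher[id][s.commitTs] = s
}

// snapshot copies all buffered entries out, so Resend can replay them without
// holding the lock while calling back into the barrier state machine.
func (m *pendingUnreplicatingStatusMap) snapshot() []pendingUnreplicatingStatusEntry {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	entries := make([]pendingUnreplicatingStatusEntry, 0)
	for id, statuses := range m.byDispatcher {
		for _, s := range statuses {
			entries = append(entries, pendingUnreplicatingStatusEntry{dispatcher: id, status: s})
		}
	}
	return entries
}

// delete drops one buffered status after it is replayed or discarded.
func (m *pendingUnreplicatingStatusMap) delete(id dispatcherID, commitTs uint64) {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	delete(m.byDispatcher[id], commitTs)
	if len(m.byDispatcher[id]) == 0 {
		delete(m.byDispatcher, id)
	}
}

func main() {
	m := newPendingMap()
	m.upsert("d1", blockStatus{commitTs: 100, stage: "WAITING"})
	m.upsert("d1", blockStatus{commitTs: 100, stage: "WAITING"}) // same barrier reported twice: one entry
	fmt.Println(len(m.snapshot()))                               // 1
	m.delete("d1", 100)
	fmt.Println(len(m.snapshot())) // 0
}
```

Taking a snapshot rather than iterating the live map is what lets the replay loop in the diagram call `handleOneStatus()` and `delete(status)` without deadlocking on the map's own lock.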
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Hi @zier-one. Thanks for your PR. I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Welcome @zier-one!
Code Review
This pull request introduces a mechanism to defer block statuses from dispatchers that are not yet in the replicating state. It adds a pendingUnreplicatingStatusMap to the Barrier struct to track these statuses and replays them during the Resend cycle once the dispatcher enters the replicating state. The changes also include refactoring handleOneStatus to use common types and extracting barrier check logic into a reusable replicationPassedBarrier function. Feedback was provided regarding a potential misleading warning log that may trigger when statuses are deferred instead of processed immediately.
```go
if b.tryDeferUnreplicatingWaitingStatus(from, cfID, dispatcherID, status) {
	continue
}
```
The deferral logic introduced here will cause the warning no dispatcher status to send (located around line 129 in the full file) to trigger even when statuses are correctly deferred. This could lead to misleading logs and noise in production. Consider tracking the number of deferred statuses in this loop and suppressing that warning if any statuses were deferred.
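A minimal sketch of that suggestion, with hypothetical types and a stubbed-out defer check standing in for `tryDeferUnreplicatingWaitingStatus`: count deferred statuses in the loop and only emit the warning when nothing was handled and nothing was deferred.

```go
package main

import "fmt"

// Hypothetical stand-in for heartbeatpb.TableSpanBlockStatus.
type status struct {
	replicating bool
	stage       string
}

// tryDefer stands in for tryDeferUnreplicatingWaitingStatus: a WAITING report
// from a not-yet-replicating dispatcher is buffered instead of handled.
func tryDefer(s status) bool {
	return !s.replicating && s.stage == "WAITING"
}

// dispatch processes a batch and decides whether the "no dispatcher status to
// send" warning should fire: only when nothing was handled AND nothing was
// deferred, so correctly deferred batches stay quiet.
func dispatch(batch []status) (handled, deferred int, warn bool) {
	for _, s := range batch {
		if tryDefer(s) {
			deferred++
			continue
		}
		handled++ // normal handleOneStatus path would run here
	}
	return handled, deferred, handled == 0 && deferred == 0
}

func main() {
	// Every status in the batch is deferred: no warning should be logged.
	_, d, warn := dispatch([]status{{replicating: false, stage: "WAITING"}})
	fmt.Println(d, warn) // 1 false
}
```

The design point is simply that "deferred" is a third outcome distinct from both "sent" and "nothing to do", and the warning should only cover the last case.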
🧹 Nitpick comments (1)
maintainer/barrier_helper.go (1)
157-172: Consider pre-allocating slice capacity in `snapshot()`. The slice is created with zero capacity, but the total count is known after iterating. This is a minor optimization opportunity.
♻️ Optional: Pre-allocate with estimated capacity
```diff
 func (m *pendingUnreplicatingStatusMap) snapshot() []pendingUnreplicatingStatusEntry {
 	m.mutex.Lock()
 	defer m.mutex.Unlock()
-	entries := make([]pendingUnreplicatingStatusEntry, 0)
+	// Estimate total entries for pre-allocation
+	total := 0
+	for _, statuses := range m.byDispatcher {
+		total += len(statuses)
+	}
+	entries := make([]pendingUnreplicatingStatusEntry, 0, total)
 	for dispatcherID, statuses := range m.byDispatcher {
 		for key, value := range statuses {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@maintainer/barrier_helper.go` around lines 157 - 172, In pendingUnreplicatingStatusMap.snapshot(), avoid starting entries with zero capacity; first compute total := sum of len(statuses) for each statuses in m.byDispatcher, then allocate entries := make([]pendingUnreplicatingStatusEntry, 0, total) before the nested loops; keep the rest of the loop appending to entries and return entries—this preserves behavior but reduces reallocations when building the slice.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b30f1629-d57b-4da7-bd1d-169dc2193eea
📒 Files selected for processing (4)
- maintainer/barrier.go
- maintainer/barrier_event.go
- maintainer/barrier_helper.go
- maintainer/barrier_test.go
/ok-to-test
If there are lots of barrier events, does this change cause OOM?
What problem does this PR solve?
Issue Number: close #4810
This PR fixes a timing window in the maintainer barrier flow. Before this change, when a non-DDL dispatcher reported a `WAITING` block status before the maintainer had moved it from `scheduling` to `replicating`, the status was ignored immediately. As a result, the barrier could not advance on the first report and had to wait for the dispatcher's local fixed 5s resend task.

This change stores such deferred `WAITING` statuses inside the maintainer and replays them through the existing barrier state machine after the dispatcher actually becomes `replicating`, so barrier progress no longer depends on the dispatcher's local 5s resend as the primary recovery path.

What is changed and how it works?
This PR applies to barrier scenarios where a non-DDL dispatcher can observe a barrier before it is officially moved into the `replicating` set, especially:

- `CREATE TABLE ... LIKE ...` that bring referenced-table dispatchers into the same barrier;
- cases where a `WAITING` report can land in the `scheduling -> replicating` transition window.

Before:

- a `WAITING` report could be ignored if it arrived during the non-replicating window;
- recovery relied on the dispatcher's local 5s resend task.

After:

- the maintainer buffers such `WAITING` statuses instead of dropping them;
- once the dispatcher becomes `replicating`, the maintainer actively replays the cached status in periodic `Barrier.Resend()`;
- barrier progress no longer depends on the dispatcher's local 5s resend as the primary compensation path.

Check List
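The replay pass described above can be modeled as a small loop over a snapshot of the pending set. All names here are illustrative; the real logic runs inside `Barrier.Resend()` and uses `replicationPassedBarrier` for the already-passed check.

```go
package main

import "fmt"

// Simplified dispatcher states; the real maintainer tracks more.
type state int

const (
	scheduling state = iota
	replicating
)

// entry is a hypothetical stand-in for a deferred WAITING status.
type entry struct {
	dispatcher string
	commitTs   uint64
}

// replayDeferred walks the deferred WAITING statuses and replays the ones
// whose dispatcher is now replicating and has not already passed the barrier.
// Entries for missing or already-passed dispatchers are dropped; entries for
// dispatchers still not replicating stay buffered for the next Resend tick.
func replayDeferred(
	pending map[entry]bool,
	dispState map[string]state,
	passedBarrier func(entry) bool,
	handle func(entry),
) {
	for e := range pending {
		st, ok := dispState[e.dispatcher]
		if !ok {
			delete(pending, e) // dispatcher removed or moved: drop stale status
			continue
		}
		if st != replicating {
			continue // keep buffering until replication starts
		}
		if !passedBarrier(e) {
			handle(e) // replay through the normal barrier state machine
		}
		delete(pending, e)
	}
}

func main() {
	pending := map[entry]bool{
		{dispatcher: "d1", commitTs: 100}: true, // now replicating: replay
		{dispatcher: "d2", commitTs: 100}: true, // still scheduling: keep buffered
		{dispatcher: "d3", commitTs: 100}: true, // already past the barrier: drop
	}
	dispState := map[string]state{"d1": replicating, "d2": scheduling, "d3": replicating}
	passed := func(e entry) bool { return e.dispatcher == "d3" }
	var replayed []string
	replayDeferred(pending, dispState, passed, func(e entry) {
		replayed = append(replayed, e.dispatcher)
	})
	fmt.Println(replayed, len(pending)) // [d1] 1
}
```

This shape also answers the OOM question raised earlier in a bounded way: every replayed or stale entry is deleted on each Resend pass, so only statuses for dispatchers still mid-transition remain buffered.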
Tests
[x] Unit test
Added coverage for:
- TestDeferUnreplicatingWaitingStatus
- TestResendReplaysDeferredWaitingStatusAfterDispatcherReplicating
- TestResendDropsDeferredWaitingStatusWhenDispatcherMissing
- TestResendDropsDeferredWaitingStatusWhenDispatcherAlreadyPassed
- TestResendDropsDeferredWaitingStatusWhenDispatcherMoved

Also re-ran the nearby barrier scheduling regression:
- TestDeferAllDBBlockEventFromDDLDispatcherWhilePendingSchedule

Summary by CodeRabbit
Bug Fixes
Tests
Release note