maintainer: guard invalid global checkpoint by hongyunyan · Pull Request #4709 · pingcap/ticdc

hongyunyan · 2026-04-03T19:01:07Z

Background

The maintainer can aggregate a global checkpoint of math.MaxUint64 when every available watermark in a round is still effectively uninitialized. Once that value is promoted into the committed checkpoint, later redo dispatchers may be clamped to an invalid start ts and remain stuck during initialization.

Issue Number: close #4703

Motivation

A capture-local MaxUint64 watermark is not always invalid by itself, so filtering it at heartbeat ingestion would change the meaning of node-level reports. The safer minimal fix is to reject only the final aggregated result when the computed global checkpoint is still math.MaxUint64.

Summary of changes

- Guard calculateNewCheckpointTs() so a global checkpoint of math.MaxUint64 is skipped instead of being committed.
- Add a maintainer unit test covering both a normal reported checkpoint and the poisoned MaxUint64 case.

Testing

make fmt
go test ./maintainer

Summary by CodeRabbit

Bug Fixes
- Prevent propagation of invalid global checkpoints (treated as non-updatable) and advance redo checkpoints only when valid redo watermarks are present.
Refactor
- Simplified and isolated checkpoint vs. redo advancement paths to avoid incorrect updates.
Logging
- Restored and enhanced dispatcher-status debug logging; added debug visibility for redo-update eligibility and computed redo checkpoints.
Chores
- Adjusted initial checkpoint baselines used before aggregation.
Tests
- Added tests covering valid/invalid global and redo checkpoint propagation scenarios.

Release note

None

coderabbitai · 2026-04-03T19:01:23Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4b018194-151a-4f14-97f6-b699524c4ef8

📥 Commits

Reviewing files that changed from the base of the PR and between 9df0613 and dd48835.

📒 Files selected for processing (1)

maintainer/maintainer_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

maintainer/maintainer_test.go

📝 Walkthrough

Walkthrough

Refactors maintainer checkpoint/redo advancement: introduces calculateNewRedoCheckpointTs() with a canUpdate gate, treats math.MaxUint64 as invalid to skip updates, moves redo advancement out of calCheckpointTs(), adds unit tests, re-enables dispatcher debug logging, and initializes dispatcher manager watermarks from startTs.

Changes

Cohort / File(s)	Summary
Maintainer logic `maintainer/maintainer.go`	Extracted redo checkpoint computation into `calculateNewRedoCheckpointTs()`; gate redo scheduling and `redoMetaTs.ResolvedTs` updates on `canUpdate`; treat `math.MaxUint64` as invalid in checkpoint calculations; removed redo advancement from `calCheckpointTs()`; added debug fields for `canUpdateRedoCheckpointTs` and `redoCheckpointTs`.
Maintainer tests & helpers `maintainer/maintainer_test.go`	Added tests for checkpoint/redo behaviors (`calculateNewCheckpointTs`, `calCheckpointTs`, `handleRedoMetaTsMessage`) including MaxUint64 cases; added test constructors (`newMaintainerForCheckpointCalculationTest`, `newMaintainerForRedoCheckpointCalculationTest`).
Dispatcher logging `downstreamadapter/dispatcher/basic_dispatcher.go`	Re-enabled dispatcher-status debug logging to emit formatted `dispatcherStatus`, `dispatcher` id, `action`, and `ack` for incoming status messages (no control-flow change).
Dispatcher manager initialization `downstreamadapter/dispatchermanager/dispatcher_manager.go`	Initialize `DispatcherManager.latestWatermark` and `latestRedoWatermark` with `startTs` instead of zero to set baseline watermarks at construction.

Sequence Diagram(s)

sequenceDiagram
    participant Maintainer
    participant Scheduler
    participant Barrier
    participant RedoController as RedoSpanController

    Maintainer->>Scheduler: query redo scheduler minima
    Maintainer->>Barrier: query redo barrier minima
    Maintainer->>RedoController: read per-capture redo heartbeats
    Maintainer->>Maintainer: calculateNewRedoCheckpointTs()
    alt canUpdate == true
        Maintainer->>RedoController: AdvanceMaintainerCommittedCheckpointTs(newRedoCheckpoint)
        Maintainer->>Maintainer: update redoMetaTs.ResolvedTs if higher
    else canUpdate == false
        Maintainer->>Maintainer: skip redo advancement and log (redoCheckpointTs = 0)
    end
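
The gated redo path in the diagram can be sketched like this. The type and method names (`redoState`, `applyRedoCheckpoint`) are hypothetical, and the assumption that the committed redo checkpoint only moves forward is mine, not something stated in the PR.

```go
package main

// redoState is a hypothetical container for the two values the diagram
// updates: the committed redo checkpoint and redoMetaTs.ResolvedTs.
type redoState struct {
	committedCheckpointTs uint64
	resolvedTs            uint64
}

// applyRedoCheckpoint applies a newly computed redo checkpoint only when
// canUpdate is true. It reports whether any advancement was attempted.
func (s *redoState) applyRedoCheckpoint(newTs uint64, canUpdate bool) bool {
	if !canUpdate {
		// Invalid round: leave both values untouched.
		return false
	}
	// Assumption: advancement is monotonic.
	if newTs > s.committedCheckpointTs {
		s.committedCheckpointTs = newTs
	}
	// Mirrors "update redoMetaTs.ResolvedTs if higher" in the diagram.
	if newTs > s.resolvedTs {
		s.resolvedTs = newTs
	}
	return true
}
```

Keeping the gate in one place means an invalid round cannot touch either value, which is the property the refactor is after.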

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

maintainer,tests: clamp dispatcher StartTs to committed checkpoint #4548: Modifies maintainer checkpoint-calculation and advancement paths; closely related to these refactor and guard changes.

Suggested labels

size/M

Suggested reviewers

wk989898
lidezhu
asddongmen

Poem

🐰 I hopped through timestamps with nimble paws,
Pulled redo math out, rewired the laws,
I gated the checks and logged every beat,
Now watermarks march and heartbeats meet. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'maintainer: guard invalid global checkpoint' clearly summarizes the main change: preventing invalid math.MaxUint64 global checkpoints from being committed.
Description check	✅ Passed	The description provides background, motivation, summary of changes, testing instructions, and release note. It clearly references issue `#4703` and explains why the fix is needed.
Linked Issues check	✅ Passed	The PR addresses all objectives from issue `#4703`: guards calculateNewCheckpointTs() to reject math.MaxUint64 global checkpoints, adds unit tests for both normal and poisoned MaxUint64 cases, and prevents redo dispatchers from being created with invalid startTs.
Out of Scope Changes check	✅ Passed	All changes are directly related to the objective of preventing math.MaxUint64 global checkpoint propagation. Changes to maintainer, maintainer_test, dispatcher, and dispatcher manager are all aligned with fixing the checkpoint guard issue.


gemini-code-assist

Code Review

This pull request introduces a check in the calculateNewCheckpointTs function to prevent advancing the checkpoint when the global checkpoint is invalid (represented by math.MaxUint64). This change ensures that the committed checkpoint is not corrupted by uninitialized values. Corresponding unit tests were added to verify this behavior. Feedback suggests enhancing the warning log by including the current committed checkpoint timestamp to improve debuggability.

gemini-code-assist · 2026-04-03T19:02:49Z

maintainer/maintainer.go

+		log.Warn("checkpointTs can not be advanced, since global checkpoint is invalid",
+			zap.Stringer("changefeedID", m.changefeedID),
+			zap.Uint64("resolvedTs", newWatermark.ResolvedTs),
+			zap.Uint64("minCheckpointTsForScheduler", minCheckpointTsForScheduler),
+			zap.Uint64("minCheckpointTsForBarrier", minCheckpointTsForBarrier))


The warning log message would be more helpful for debugging if it included the current committed checkpoint timestamp. This allows operators to see the state of the changefeed when the calculated global checkpoint is invalid (e.g., during initialization or when all nodes report uninitialized watermarks), consistent with the logging pattern used earlier in this function.

Suggested change

-		log.Warn("checkpointTs can not be advanced, since global checkpoint is invalid",
-			zap.Stringer("changefeedID", m.changefeedID),
-			zap.Uint64("resolvedTs", newWatermark.ResolvedTs),
-			zap.Uint64("minCheckpointTsForScheduler", minCheckpointTsForScheduler),
-			zap.Uint64("minCheckpointTsForBarrier", minCheckpointTsForBarrier))
+		log.Warn("checkpointTs can not be advanced, since global checkpoint is invalid",
+			zap.Stringer("changefeedID", m.changefeedID),
+			zap.Uint64("checkpointTs", m.getWatermark().CheckpointTs),
+			zap.Uint64("resolvedTs", newWatermark.ResolvedTs),
+			zap.Uint64("minCheckpointTsForScheduler", minCheckpointTsForScheduler),
+			zap.Uint64("minCheckpointTsForBarrier", minCheckpointTsForBarrier))

coderabbitai

🧹 Nitpick comments (1)

maintainer/maintainer_test.go (1)
402-430: Add a mixed sentinel regression case.

The PR rationale depends on rejecting only the aggregated math.MaxUint64 result. These subtests cover the finite path and the all-max path, but not the mixed case where one reported watermark is math.MaxUint64 and another constraint still makes the global minimum finite. Locking that in here would catch a future regression that filters the sentinel too early.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maintainer/maintainer_test.go` around lines 402 - 430, Add a new subtest
"mixed sentinel" inside TestMaintainerCalculateNewCheckpointTs that inserts two
reported watermarks into m.checkpointTsByCapture: one with
CheckpointTs/ResolvedTs = math.MaxUint64 and another with a finite value (e.g.,
200), then call m.calculateNewCheckpointTs() and assert that canUpdate is true
and the returned watermark equals the finite minimum (not the sentinel). This
exercise targets calculateNewCheckpointTs and checkpointTsByCapture to ensure
the MaxUint64 sentinel is ignored unless all reports are MaxUint64.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8561c846-e370-4378-a8b2-be057bf65922

📥 Commits

Reviewing files that changed from the base of the PR and between f675332 and 02acbbe.

📒 Files selected for processing (2)

maintainer/maintainer.go
maintainer/maintainer_test.go

hongyunyan · 2026-04-03T22:41:07Z

/test all

hongyunyan · 2026-04-04T04:27:10Z

/test all

hongyunyan · 2026-04-04T05:02:42Z

/retest

hongyunyan · 2026-04-06T22:26:29Z

/test all

hongyunyan · 2026-04-06T23:28:31Z

/test pull-cdc-mysql-integration-heavy

hongyunyan · 2026-04-07T00:20:55Z

/test all

hongyunyan · 2026-04-07T02:37:30Z

/retest

hongyunyan · 2026-04-07T03:42:06Z

/retest

3AceShowHand · 2026-04-08T02:26:16Z

downstreamadapter/dispatcher/basic_dispatcher.go

-	// 	zap.Stringer("dispatcher", d.id),
-	// 	zap.Any("action", dispatcherStatus.GetAction()),
-	// 	zap.Any("ack", dispatcherStatus.GetAck()))
+	log.Debug("dispatcher handle dispatcher status",


Can we remove this log? It looks useless.

It's a debug-level log, so it will not be printed in production, and it may be very important when there are DDL issues in tests. So I think we should keep this log.

3AceShowHand · 2026-04-08T02:27:33Z

maintainer/maintainer.go

-				}
-				newWatermark.UpdateMin(watermark)
+			redoWatermark, canUpdate := m.calculateNewRedoCheckpointTs()
+			if canUpdate && m.controller.redoSpanController != nil {


Can redoSpanController be nil in this case? If it can, it looks like calculateNewRedoCheckpointTs should not be called at all.

3AceShowHand · 2026-04-08T02:28:19Z

maintainer/maintainer.go

 				zap.Any("checkpointTs", m.getWatermark().CheckpointTs),
-				zap.Any("resolvedTs", newWatermark.CheckpointTs),
+				zap.Bool("canUpdateRedoCheckpointTs", canUpdate),
+				zap.Uint64("redoCheckpointTs", func() uint64 {


This looks weird, a closure in the log field.
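
One way to address this comment is to move the inline closure into a small named helper and pass its result to the log field. The closure body is not shown in this excerpt, so the behaviour below (reporting 0 when the update is skipped) is an assumption based on the sequence diagram in the walkthrough, and `redoCheckpointForLog` is a hypothetical name.

```go
package main

// redoCheckpointForLog returns the value to attach to the debug log:
// the computed redo checkpoint when the update is allowed, and 0 when
// the round was skipped (assumed behaviour, see lead-in).
func redoCheckpointForLog(canUpdate bool, redoCheckpointTs uint64) uint64 {
	if canUpdate {
		return redoCheckpointTs
	}
	return 0
}
```

The call site can then pass `redoCheckpointForLog(canUpdate, ts)` directly as the field value, which keeps the logging statement free of control flow.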

3AceShowHand · 2026-04-08T02:28:56Z

maintainer/maintainer.go

+	// MaxUint64 means this round still has no effective global checkpoint.
+	// Skipping the update keeps the committed checkpoint from being poisoned.
+	if newWatermark.CheckpointTs == math.MaxUint64 {
+		log.Debug("checkpointTs can not be advanced, since global checkpoint is invalid",


Maybe this needs a definition of "invalid". Otherwise, remove this log.

Add the invalid reason here

ti-chi-bot · 2026-04-08T02:29:34Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand, lidezhu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [3AceShowHand,lidezhu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-04-08T02:29:36Z

[LGTM Timeline notifier]

Timeline:

2026-04-07 00:25:35.488116278 +0000 UTC m=+829540.693476335: ☑️ agreed by lidezhu.
2026-04-08 02:29:35.180841307 +0000 UTC m=+923380.386201364: ☑️ agreed by 3AceShowHand.

hongyunyan · 2026-04-09T01:10:18Z

/retest

ti-chi-bot · 2026-04-09T03:20:51Z

@hongyunyan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-cdc-kafka-integration-heavy	`dd48835`	link	unknown	`/test pull-cdc-kafka-integration-heavy`
pull-cdc-mysql-integration-light	`dd48835`	link	unknown	`/test pull-cdc-mysql-integration-light`
pull-cdc-mysql-integration-heavy	`dd48835`	link	unknown	`/test pull-cdc-mysql-integration-heavy`


maintainer: guard invalid global checkpoint

02acbbe

ti-chi-bot bot added do-not-merge/needs-linked-issue do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 3, 2026

ti-chi-bot bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 3, 2026

gemini-code-assist bot reviewed Apr 3, 2026


maintainer: polish checkpoint guard test

288bf10

ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 3, 2026

coderabbitai bot reviewed Apr 3, 2026


hongyunyan added 3 commits April 4, 2026 08:15

downstreamadapter: ignore initializing watermark in heartbeat

bfdd8c2

downstreamadapter: initialize manager watermark from startTs

627518d

update

b76dfb2

update

ca7d9cf

ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 6, 2026

update

29cc680

update

929d70c

ti-chi-bot bot added do-not-merge/needs-triage-completed and removed do-not-merge/needs-linked-issue do-not-merge/needs-triage-completed labels Apr 7, 2026

lidezhu approved these changes Apr 7, 2026


ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Apr 7, 2026

3AceShowHand reviewed Apr 8, 2026


3AceShowHand approved these changes Apr 8, 2026


ti-chi-bot bot added the lgtm label Apr 8, 2026

ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 8, 2026

update

9df0613

ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 8, 2026

hongyunyan added 2 commits April 9, 2026 07:32

Merge remote-tracking branch 'upstream/master' into codex/push/pr4709

60ed145

maintainer: fix checkpoint guard tests after helper update

dd48835
