maintainer: guard invalid global checkpoint #4709

Open
hongyunyan wants to merge 11 commits into pingcap:master from hongyunyan:fix/checkpoint-maxuint-guard

Conversation

@hongyunyan
Collaborator

@hongyunyan hongyunyan commented Apr 3, 2026

Background

The maintainer can aggregate a global checkpoint of math.MaxUint64 when every available watermark in a round is still effectively uninitialized. Once that value is promoted into the committed checkpoint, later redo dispatchers may be clamped to an invalid start ts and remain stuck during initialization.

Issue Number: close #4703

Motivation

A capture-local MaxUint64 watermark is not always invalid by itself, so filtering it at heartbeat ingestion would change the meaning of node-level reports. The safer minimal fix is to reject only the final aggregated result when the computed global checkpoint is still math.MaxUint64.
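
A minimal sketch of the idea in this paragraph, assuming a simplified model in which each capture reports one checkpoint and the global value is their minimum. The names `invalidTs` and `aggregateCheckpoint` and the map layout are hypothetical and are not taken from this PR; only the decision to reject the final aggregated `math.MaxUint64` comes from the description.

```go
package main

import (
	"fmt"
	"math"
)

// invalidTs is the sentinel a watermark holds before any real value is known.
// (Hypothetical name; the PR refers to math.MaxUint64 directly.)
const invalidTs uint64 = math.MaxUint64

// aggregateCheckpoint returns the minimum reported checkpoint and whether it
// may be committed. Per-capture sentinel values are deliberately NOT filtered;
// only the final aggregate is rejected when it is still the sentinel.
func aggregateCheckpoint(reports map[string]uint64) (uint64, bool) {
	global := invalidTs
	for _, ts := range reports {
		if ts < global {
			global = ts
		}
	}
	if global == invalidTs {
		// No effective checkpoint this round: skip the update.
		return 0, false
	}
	return global, true
}

func main() {
	// Every report is still uninitialized: the round is skipped.
	fmt.Println(aggregateCheckpoint(map[string]uint64{"a": invalidTs, "b": invalidTs}))
	// One finite report is enough to produce a valid global minimum.
	fmt.Println(aggregateCheckpoint(map[string]uint64{"a": invalidTs, "b": 200}))
}
```

Filtering at the end rather than at ingestion keeps a single capture's `MaxUint64` report meaningful while still preventing the committed checkpoint from being poisoned.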

Summary of changes

  • guard calculateNewCheckpointTs() so a global checkpoint of math.MaxUint64 is skipped instead of being committed
  • add a maintainer unit test covering both a normal reported checkpoint and the poisoned MaxUint64 case

Testing

  • make fmt
  • go test ./maintainer

Summary by CodeRabbit

  • Bug Fixes

    • Prevent propagation of invalid global checkpoints (treated as non-updatable) and advance redo checkpoints only when valid redo watermarks are present.
  • Refactor

    • Simplified and isolated checkpoint vs. redo advancement paths to avoid incorrect updates.
  • Logging

    • Restored and enhanced dispatcher-status debug logging; added debug visibility for redo-update eligibility and computed redo checkpoints.
  • Chores

    • Adjusted initial checkpoint baselines used before aggregation.
  • Tests

    • Added tests covering valid/invalid global and redo checkpoint propagation scenarios.

Release note

None

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 3, 2026
@coderabbitai
Contributor

coderabbitai bot commented Apr 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4b018194-151a-4f14-97f6-b699524c4ef8

📥 Commits

Reviewing files that changed from the base of the PR and between 9df0613 and dd48835.

📒 Files selected for processing (1)
  • maintainer/maintainer_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • maintainer/maintainer_test.go

📝 Walkthrough

Walkthrough

Refactors maintainer checkpoint/redo advancement: introduces calculateNewRedoCheckpointTs() with a canUpdate gate, treats math.MaxUint64 as invalid to skip updates, moves redo advancement out of calCheckpointTs(), adds unit tests, re-enables dispatcher debug logging, and initializes dispatcher manager watermarks from startTs.
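
The `canUpdate` gate described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `redoMeta` and `advanceRedo` are hypothetical names, and the "only if higher" rule is taken from the sequence diagram in this comment.

```go
package main

import "fmt"

// redoMeta is a hypothetical stand-in for the maintainer's redo metadata.
type redoMeta struct {
	ResolvedTs uint64
}

// advanceRedo applies candidate only when the round produced a usable redo
// checkpoint (canUpdate) and the candidate is higher than the stored value.
// It reports whether meta was changed.
func advanceRedo(meta *redoMeta, candidate uint64, canUpdate bool) bool {
	if !canUpdate || candidate <= meta.ResolvedTs {
		return false
	}
	meta.ResolvedTs = candidate
	return true
}

func main() {
	meta := &redoMeta{ResolvedTs: 100}

	changed := advanceRedo(meta, 150, false) // gated off: nothing changes
	fmt.Println(changed, meta.ResolvedTs)

	changed = advanceRedo(meta, 150, true) // valid and higher: advances
	fmt.Println(changed, meta.ResolvedTs)

	changed = advanceRedo(meta, 120, true) // valid but lower: ignored
	fmt.Println(changed, meta.ResolvedTs)
}
```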

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Maintainer logic: maintainer/maintainer.go | Extracted redo checkpoint computation into calculateNewRedoCheckpointTs(); gate redo scheduling and redoMetaTs.ResolvedTs updates on canUpdate; treat math.MaxUint64 as invalid in checkpoint calculations; removed redo advancement from calCheckpointTs(); added debug fields for canUpdateRedoCheckpointTs and redoCheckpointTs. |
| Maintainer tests & helpers: maintainer/maintainer_test.go | Added tests for checkpoint/redo behaviors (calculateNewCheckpointTs, calCheckpointTs, handleRedoMetaTsMessage) including MaxUint64 cases; added test constructors (newMaintainerForCheckpointCalculationTest, newMaintainerForRedoCheckpointCalculationTest). |
| Dispatcher logging: downstreamadapter/dispatcher/basic_dispatcher.go | Re-enabled dispatcher-status debug logging to emit formatted dispatcherStatus, dispatcher id, action, and ack for incoming status messages (no control-flow change). |
| Dispatcher manager initialization: downstreamadapter/dispatchermanager/dispatcher_manager.go | Initialize DispatcherManager.latestWatermark and latestRedoWatermark with startTs instead of zero to set baseline watermarks at construction. |

Sequence Diagram(s)

sequenceDiagram
    participant Maintainer
    participant Scheduler
    participant Barrier
    participant RedoController as RedoSpanController

    Maintainer->>Scheduler: query redo scheduler minima
    Maintainer->>Barrier: query redo barrier minima
    Maintainer->>RedoController: read per-capture redo heartbeats
    Maintainer->>Maintainer: calculateNewRedoCheckpointTs()
    alt canUpdate == true
        Maintainer->>RedoController: AdvanceMaintainerCommittedCheckpointTs(newRedoCheckpoint)
        Maintainer->>Maintainer: update redoMetaTs.ResolvedTs if higher
    else canUpdate == false
        Maintainer->>Maintainer: skip redo advancement and log (redoCheckpointTs = 0)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

size/M

Suggested reviewers

  • wk989898
  • lidezhu
  • asddongmen

Poem

🐰 I hopped through timestamps with nimble paws,
Pulled redo math out, rewired the laws,
I gated the checks and logged every beat,
Now watermarks march and heartbeats meet. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'maintainer: guard invalid global checkpoint' clearly summarizes the main change: preventing invalid math.MaxUint64 global checkpoints from being committed. |
| Description check | ✅ Passed | The description provides background, motivation, summary of changes, testing instructions, and release note. It clearly references issue #4703 and explains why the fix is needed. |
| Linked Issues check | ✅ Passed | The PR addresses all objectives from issue #4703: guards calculateNewCheckpointTs() to reject math.MaxUint64 global checkpoints, adds unit tests for both normal and poisoned MaxUint64 cases, and prevents redo dispatchers from being created with invalid startTs. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to the objective of preventing math.MaxUint64 global checkpoint propagation. Changes to maintainer, maintainer_test, dispatcher, and dispatcher manager are all aligned with fixing the checkpoint guard issue. |


@ti-chi-bot ti-chi-bot bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 3, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a check in the calculateNewCheckpointTs function to prevent advancing the checkpoint when the global checkpoint is invalid (represented by math.MaxUint64). This change ensures that the committed checkpoint is not corrupted by uninitialized values. Corresponding unit tests were added to verify this behavior. Feedback suggests enhancing the warning log by including the current committed checkpoint timestamp to improve debuggability.

Comment on lines +731 to +735
log.Warn("checkpointTs can not be advanced, since global checkpoint is invalid",
	zap.Stringer("changefeedID", m.changefeedID),
	zap.Uint64("resolvedTs", newWatermark.ResolvedTs),
	zap.Uint64("minCheckpointTsForScheduler", minCheckpointTsForScheduler),
	zap.Uint64("minCheckpointTsForBarrier", minCheckpointTsForBarrier))

medium

The warning log message would be more helpful for debugging if it included the current committed checkpoint timestamp. This allows operators to see the state of the changefeed when the calculated global checkpoint is invalid (e.g., during initialization or when all nodes report uninitialized watermarks), consistent with the logging pattern used earlier in this function.

Suggested change
- log.Warn("checkpointTs can not be advanced, since global checkpoint is invalid",
- 	zap.Stringer("changefeedID", m.changefeedID),
- 	zap.Uint64("resolvedTs", newWatermark.ResolvedTs),
- 	zap.Uint64("minCheckpointTsForScheduler", minCheckpointTsForScheduler),
- 	zap.Uint64("minCheckpointTsForBarrier", minCheckpointTsForBarrier))
+ log.Warn("checkpointTs can not be advanced, since global checkpoint is invalid",
+ 	zap.Stringer("changefeedID", m.changefeedID),
+ 	zap.Uint64("checkpointTs", m.getWatermark().CheckpointTs),
+ 	zap.Uint64("resolvedTs", newWatermark.ResolvedTs),
+ 	zap.Uint64("minCheckpointTsForScheduler", minCheckpointTsForScheduler),
+ 	zap.Uint64("minCheckpointTsForBarrier", minCheckpointTsForBarrier))

@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 3, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
maintainer/maintainer_test.go (1)

402-430: Add a mixed sentinel regression case.

The PR rationale depends on rejecting only the aggregated math.MaxUint64 result. These subtests cover the finite path and the all-max path, but not the mixed case where one reported watermark is math.MaxUint64 and another constraint still makes the global minimum finite. Locking that in here would catch a future regression that filters the sentinel too early.
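
The three cases named in this comment (finite, all-max, mixed) can be laid out as a small table. The sketch below is hypothetical: it runs against a stand-in `minCheckpoint` function rather than the real `calculateNewCheckpointTs()`, and the `checkpointCase` type, `sentinel` constant, and helper names are invented for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// sentinel marks a watermark that has not been initialized yet.
const sentinel uint64 = math.MaxUint64

// minCheckpoint is a stand-in for the real aggregation: it takes the minimum
// of all reports and rejects the result only if it is still the sentinel.
func minCheckpoint(reported []uint64) (uint64, bool) {
	lowest := sentinel
	for _, ts := range reported {
		if ts < lowest {
			lowest = ts
		}
	}
	return lowest, lowest != sentinel
}

// checkpointCase is one row of the regression table.
type checkpointCase struct {
	name     string
	reported []uint64
	wantTs   uint64
	wantOK   bool
}

func cases() []checkpointCase {
	return []checkpointCase{
		{name: "finite", reported: []uint64{300, 200}, wantTs: 200, wantOK: true},
		{name: "all max", reported: []uint64{sentinel, sentinel}, wantOK: false},
		// The mixed case guards against filtering the sentinel too early
		// or rejecting a round that still has a finite minimum.
		{name: "mixed sentinel", reported: []uint64{sentinel, 200}, wantTs: 200, wantOK: true},
	}
}

// failures returns the names of the cases whose result differs from the table.
func failures() []string {
	var failed []string
	for _, c := range cases() {
		ts, ok := minCheckpoint(c.reported)
		if ok != c.wantOK || (ok && ts != c.wantTs) {
			failed = append(failed, c.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", failures())
}
```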

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maintainer/maintainer_test.go` around lines 402 - 430, Add a new subtest
"mixed sentinel" inside TestMaintainerCalculateNewCheckpointTs that inserts two
reported watermarks into m.checkpointTsByCapture: one with
CheckpointTs/ResolvedTs = math.MaxUint64 and another with a finite value (e.g.,
200), then call m.calculateNewCheckpointTs() and assert that canUpdate is true
and the returned watermark equals the finite minimum (not the sentinel). This
exercise targets calculateNewCheckpointTs and checkpointTsByCapture to ensure
the MaxUint64 sentinel is ignored unless all reports are MaxUint64.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@maintainer/maintainer_test.go`:
- Around line 402-430: Add a new subtest "mixed sentinel" inside
TestMaintainerCalculateNewCheckpointTs that inserts two reported watermarks into
m.checkpointTsByCapture: one with CheckpointTs/ResolvedTs = math.MaxUint64 and
another with a finite value (e.g., 200), then call m.calculateNewCheckpointTs()
and assert that canUpdate is true and the returned watermark equals the finite
minimum (not the sentinel). This exercise targets calculateNewCheckpointTs and
checkpointTsByCapture to ensure the MaxUint64 sentinel is ignored unless all
reports are MaxUint64.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8561c846-e370-4378-a8b2-be057bf65922

📥 Commits

Reviewing files that changed from the base of the PR and between f675332 and 02acbbe.

📒 Files selected for processing (2)
  • maintainer/maintainer.go
  • maintainer/maintainer_test.go

@hongyunyan
Collaborator Author

/test all

@hongyunyan
Collaborator Author

/test all

@hongyunyan
Collaborator Author

/retest

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 6, 2026
@hongyunyan
Collaborator Author

/test all

@hongyunyan
Collaborator Author

/test pull-cdc-mysql-integration-heavy

@hongyunyan
Collaborator Author

/test all

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Apr 7, 2026
@hongyunyan
Collaborator Author

/retest

@hongyunyan
Collaborator Author

/retest

// zap.Stringer("dispatcher", d.id),
// zap.Any("action", dispatcherStatus.GetAction()),
// zap.Any("ack", dispatcherStatus.GetAck()))
log.Debug("dispatcher handle dispatcher status",
Collaborator


Can we remove these logs? They look useless.

Collaborator Author


It's a debug-level log, so it will not print in production, and it may be very important when there are DDL issues in tests. So I think we should keep this log.

}
newWatermark.UpdateMin(watermark)
redoWatermark, canUpdate := m.calculateNewRedoCheckpointTs()
if canUpdate && m.controller.redoSpanController != nil {
Collaborator


Is redoSpanController nil in this case? If it can be nil, it looks like calculateNewRedoCheckpointTs() should not be called at all.
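
One way to read this suggestion is to check the controller before computing anything. The sketch below is hypothetical and only illustrates the ordering: the `maintainer` and `redoController` types, the `calls` counter, and the assumption that a nil controller means redo is disabled are not taken from the PR.

```go
package main

import "fmt"

// redoController is a hypothetical stand-in for the redo span controller.
type redoController struct {
	committed uint64
}

// maintainer holds an optional redo controller.
// Assumption: redo is nil when redo is disabled for the changefeed.
type maintainer struct {
	redo  *redoController
	calls int // counts how often the redo checkpoint was computed
}

// calculateRedo stands in for the real redo checkpoint calculation.
func (m *maintainer) calculateRedo() (uint64, bool) {
	m.calls++
	return 120, true
}

// advanceRedoIfEnabled returns early when there is no controller, so the
// calculation is never performed for a changefeed without redo.
func (m *maintainer) advanceRedoIfEnabled() bool {
	if m.redo == nil {
		return false
	}
	ts, canUpdate := m.calculateRedo()
	if !canUpdate {
		return false
	}
	m.redo.committed = ts
	return true
}

func main() {
	disabled := &maintainer{}
	advanced := disabled.advanceRedoIfEnabled()
	fmt.Println(advanced, disabled.calls)

	enabled := &maintainer{redo: &redoController{}}
	advanced = enabled.advanceRedoIfEnabled()
	fmt.Println(advanced, enabled.calls, enabled.redo.committed)
}
```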

zap.Any("checkpointTs", m.getWatermark().CheckpointTs),
zap.Any("resolvedTs", newWatermark.CheckpointTs),
zap.Bool("canUpdateRedoCheckpointTs", canUpdate),
zap.Uint64("redoCheckpointTs", func() uint64 {
Collaborator


This looks weird, a closure in the log field.
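
A possible alternative to an inline closure is to resolve the value in a small helper (or a local variable) before building the log fields. The sketch below is hypothetical and uses `fmt` instead of the project's logger; `redoTsForLog` and `formatStatus` are invented names, and logging 0 when the redo checkpoint cannot be updated is an assumption based on the sequence diagram earlier in this thread.

```go
package main

import "fmt"

// redoTsForLog returns the value to log for the redo checkpoint.
// Assumption: 0 is logged when this round has no valid redo checkpoint.
func redoTsForLog(redoTs uint64, canUpdate bool) uint64 {
	if !canUpdate {
		return 0
	}
	return redoTs
}

// formatStatus builds the log line from values that are already resolved,
// so no closure is needed inside the field list.
func formatStatus(redoTs uint64, canUpdate bool) string {
	logged := redoTsForLog(redoTs, canUpdate)
	return fmt.Sprintf("canUpdateRedoCheckpointTs=%t redoCheckpointTs=%d", canUpdate, logged)
}

func main() {
	fmt.Println(formatStatus(150, true))
	fmt.Println(formatStatus(150, false))
}
```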

// MaxUint64 means this round still has no effective global checkpoint.
// Skipping the update keeps the committed checkpoint from being poisoned.
if newWatermark.CheckpointTs == math.MaxUint64 {
log.Debug("checkpointTs can not be advanced, since global checkpoint is invalid",
Collaborator


This may need a definition of "invalid"; otherwise, remove this log.

Collaborator Author


Added the reason it is invalid here.

@ti-chi-bot ti-chi-bot bot added the lgtm label Apr 8, 2026
@ti-chi-bot

ti-chi-bot bot commented Apr 8, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand, lidezhu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [3AceShowHand,lidezhu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 8, 2026
@ti-chi-bot

ti-chi-bot bot commented Apr 8, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-07 00:25:35.488116278 +0000 UTC m=+829540.693476335: ☑️ agreed by lidezhu.
  • 2026-04-08 02:29:35.180841307 +0000 UTC m=+923380.386201364: ☑️ agreed by 3AceShowHand.

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 8, 2026
@hongyunyan
Collaborator Author

/retest

@ti-chi-bot

ti-chi-bot bot commented Apr 9, 2026

@hongyunyan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cdc-kafka-integration-heavy | dd48835 | link | unknown | /test pull-cdc-kafka-integration-heavy |
| pull-cdc-mysql-integration-light | dd48835 | link | unknown | /test pull-cdc-mysql-integration-light |
| pull-cdc-mysql-integration-heavy | dd48835 | link | unknown | /test pull-cdc-mysql-integration-heavy |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

redo dispatcher can be recreated with startTs = MaxUint64 and get stuck in Initializing in fail_over_ddl_mix

3 participants