syncer(dm): prevent checkpoint flush after BeginTx error #12644

Open
GMHDBJD wants to merge 1 commit into pingcap:master from GMHDBJD:tiflow1/fix-12626-checkpoint-begin-error

@GMHDBJD GMHDBJD commented May 19, 2026

What problem does this PR solve?

Issue Number: close #12626

What is changed and how it works?

This PR fixes a DM correctness issue in which the checkpoint could still be flushed, and therefore advance, after a downstream BeginTx failure. On resume, DML that was never applied downstream could then be skipped.

Changes:

  • Introduce a shared downstream execution error predicate that includes ErrDBExecuteFailedBegin together with existing ErrDBExecuteFailed and ErrDBUnExpect.
  • Use the predicate in sync, async, and checkpoint flush worker guards so checkpoint flush is skipped after downstream begin failures.
  • Treat begin failures as downstream execution errors when flushing the safe-mode exit point on task exit.
  • Avoid running key-not-found diagnostics after ExecuteSQL already returned an execution error.
  • Add a regression test for ErrDBExecuteFailedBegin(sql.ErrConnDone) to ensure checkpoint flush is skipped.
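
The classification described in the list above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: the three sentinel errors are hypothetical stand-ins for the terror classes named above, `shouldFlushCheckpoint` is an invented helper name, and matching is done with the standard `errors.Is` rather than DM's own error comparison.

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the DM error classes named in the PR
// description. The real code uses the terror package instead.
var (
	errDBExecuteFailed      = errors.New("ErrDBExecuteFailed")
	errDBExecuteFailedBegin = errors.New("ErrDBExecuteFailedBegin")
	errDBUnExpect           = errors.New("ErrDBUnExpect")
)

// isDownstreamExecutionError reports whether err belongs to one of the
// classes that should block a checkpoint flush. Begin failures are
// included alongside the two previously handled classes.
func isDownstreamExecutionError(err error) bool {
	return errors.Is(err, errDBExecuteFailed) ||
		errors.Is(err, errDBExecuteFailedBegin) ||
		errors.Is(err, errDBUnExpect)
}

// shouldFlushCheckpoint is an assumed guard: the flush is skipped
// whenever the last downstream execution error is in a blocking class.
func shouldFlushCheckpoint(execErr error) bool {
	return !isDownstreamExecutionError(execErr)
}

func main() {
	// Mirrors the regression case: a begin failure wrapping sql.ErrConnDone.
	beginErr := fmt.Errorf("%w: %v", errDBExecuteFailedBegin, sql.ErrConnDone)

	fmt.Println("flush after begin failure:", shouldFlushCheckpoint(beginErr))
	fmt.Println("flush with no error:", shouldFlushCheckpoint(nil))
}
```

Routing the sync, async, and flush-worker guards through one predicate means a newly added error class only has to be registered in one place.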

Check List

Tests

  • Unit test
go test ./dm/syncer -run TestCheckpointFlushWorkerSkipsCheckpointOnBeginError -count=1
go test ./dm/syncer -run 'Test(CheckpointFlushWorkerSkipsCheckpointOnBeginError|JudgeKeyNotFound|EnableSafeModeInitializationPhase)$' -count=1

Also ran:

make fmt
git diff --check

Questions

Will it cause performance regression or break compatibility?

No. The change only broadens existing checkpoint-blocking error classification to include downstream transaction begin failures and suppresses a misleading diagnostic after failed execution.
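
A rough sketch of the suppressed diagnostic follows. It assumes the key-not-found check compares the affected-row count with the expected one, which this description does not spell out. `shouldJudgeKeyNotFound`, its parameters, and `errExec` are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errExec is a hypothetical placeholder for any error returned by ExecuteSQL.
var errExec = errors.New("downstream execution failed")

// shouldJudgeKeyNotFound decides whether the key-not-found diagnostic
// should run. After a failed execution the affected-row count carries no
// information, so the diagnostic is skipped. Comparing affected with
// expected rows is an assumption of this sketch.
func shouldJudgeKeyNotFound(affected, expected int, execErr error) bool {
	if execErr != nil {
		return false
	}
	return affected < expected
}

func main() {
	fmt.Println(shouldJudgeKeyNotFound(0, 2, errExec)) // execution failed: diagnostic skipped
	fmt.Println(shouldJudgeKeyNotFound(1, 2, nil))     // fewer rows than expected: diagnostic runs
	fmt.Println(shouldJudgeKeyNotFound(2, 2, nil))     // all rows affected: nothing to report
}
```

Checking the error first avoids reporting a missing key when the statements simply never ran.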

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

Fix a DM correctness issue that could flush checkpoints after downstream transaction begin failures.

ti-chi-bot Bot commented May 19, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gmhdbjd for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. area/dm Issues or PRs related to DM. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 19, 2026
@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request centralizes downstream execution error handling by introducing the isDownstreamExecutionError helper function, which now includes terror.ErrDBExecuteFailedBegin in its checks. This ensures that checkpoint flushes are correctly skipped when a downstream transaction fails to begin, preventing potential data inconsistency. The changes also include a new unit test to verify this behavior and a safety check in dml_worker.go to ensure key-not-found logic only executes when no other errors are present. I have no feedback to provide as there were no review comments to assess.

ti-chi-bot Bot commented May 19, 2026

@GMHDBJD: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-verify | ad6182a | link | true | /test pull-verify |
| pull-dm-integration-test | ad6182a | link | true | /test pull-dm-integration-test |
| pull-dm-integration-test-next-gen | ad6182a | link | false | /test pull-dm-integration-test-next-gen |


Development

Successfully merging this pull request may close these issues.

[DM] checkpoint may be flushed after downstream BeginTx failure and skip unapplied DML on resume