eventBroker: remove two-stage syncpoint #4807

Draft
asddongmen wants to merge 3 commits into pingcap:master from asddongmen:0413-remove-two-sgate-syncpoint
Conversation

@asddongmen
Collaborator

What problem does this PR solve?

Issue Number: close #xxx

What is changed and how it works?

  • Remove the two-stage syncpoint, which is too complicated to maintain.
  • Add a checkpoint-based cap for scan end when syncpoint is enabled:
    scanEnd = min(scanEnd, checkpointTs + multiplier*syncPointInterval).
    Default multiplier is 2.
  • Apply this cap in:
    1. normal scan range calculation,
    2. pending-DDL local-advance fallback,
    3. table-trigger DDL/resolved-ts path.
  • Add lag-based syncpoint suppression in emitSyncPointEventIfNeeded:
    • suppress when lag(sentResolvedTs, checkpointTs) > 20m,
    • resume when lag <= 15m (hysteresis),
    • always advance nextSyncPoint even when emission is suppressed.
  • Add debug config knobs:
    • sync-point-checkpoint-cap-multiplier (default 2)
    • sync-point-lag-suppress-threshold (default 20m)
    • sync-point-lag-resume-threshold (default 15m)
  • Add metrics:
    • syncpoint_lag_seconds
    • syncpoint_suppressed_count
    • scan_capped_by_checkpoint_count
  • Add focused unit tests for scan capping and syncpoint suppress/resume behavior.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Signed-off-by: dongmen <414110582@qq.com>
Signed-off-by: dongmen <414110582@qq.com>
- cap scan upper bound by checkpointTs + 2*syncPointInterval when syncpoint is enabled
- suppress syncpoint emission when dispatcher lag exceeds threshold, while still advancing nextSyncPoint
- resume syncpoint emission with hysteresis to avoid flapping
- apply checkpoint cap to normal scan path, pending-DDL local advance, and table-trigger DDL path
- add metrics for syncpoint lag, suppression count, and checkpoint-cap hits
- add unit tests for checkpoint cap and suppress/resume behavior

Signed-off-by: dongmen <414110582@qq.com>
@ti-chi-bot

ti-chi-bot bot commented Apr 13, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot

ti-chi-bot bot commented Apr 13, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Apr 13, 2026
@ti-chi-bot

ti-chi-bot bot commented Apr 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign flowbehappy for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 13, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the syncpoint handling logic by removing the two-stage prepare/commit state machine and introducing lag-based suppression and checkpoint-based scan capping. New configuration options and metrics are added to support these features. Review feedback identifies critical issues including a race condition in syncpoint emission, an incorrect timestamp comparison that delays syncpoints, the loss of event type validation in action matching, and a reversal of the required DDL-to-syncpoint emission order.


- pendingIsSyncPoint := b.blockPendingEvent.GetType() == commonEvent.TypeSyncPointEvent
- return b.blockCommitTs == action.CommitTs && pendingIsSyncPoint == action.IsSyncPoint
+ return b.blockCommitTs == action.CommitTs

high

The check for action.IsSyncPoint was removed. If a DDL event and a SyncPoint event share the same CommitTs, the dispatcher might incorrectly match an action intended for one event to the other. This can lead to incorrect processing, such as passing a DDL that should have been written to the downstream.

	pendingIsSyncPoint := b.blockPendingEvent.GetType() == commonEvent.TypeSyncPointEvent
	return b.blockCommitTs == action.CommitTs && pendingIsSyncPoint == action.IsSyncPoint
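
The ambiguity described in this comment can be shown with a reduced model. The types below are hypothetical stand-ins for the dispatcher's pending event and the incoming action; only the comparison logic mirrors the lines quoted above.

```go
package main

import "fmt"

// eventType is a hypothetical stand-in for the real event type enum.
type eventType int

const (
	typeDDLEvent eventType = iota
	typeSyncPointEvent
)

// pendingBlockEvent models the event the dispatcher is currently blocked on.
type pendingBlockEvent struct {
	commitTs uint64
	typ      eventType
}

// dispatcherAction models an action addressed to one blocked event.
type dispatcherAction struct {
	CommitTs    uint64
	IsSyncPoint bool
}

// matchByCommitTsOnly is the simplified check: it cannot tell a DDL from a
// syncpoint when both carry the same commit timestamp.
func matchByCommitTsOnly(p pendingBlockEvent, a dispatcherAction) bool {
	return p.commitTs == a.CommitTs
}

// matchByCommitTsAndType also requires the event kind to agree.
func matchByCommitTsAndType(p pendingBlockEvent, a dispatcherAction) bool {
	pendingIsSyncPoint := p.typ == typeSyncPointEvent
	return p.commitTs == a.CommitTs && pendingIsSyncPoint == a.IsSyncPoint
}

func main() {
	pendingDDL := pendingBlockEvent{commitTs: 100, typ: typeDDLEvent}
	syncPointAction := dispatcherAction{CommitTs: 100, IsSyncPoint: true}

	fmt.Println("commitTs only:", matchByCommitTsOnly(pendingDDL, syncPointAction))
	fmt.Println("commitTs and type:", matchByCommitTsAndType(pendingDDL, syncPointAction))
}
```

With the timestamp-only check, an action meant for the syncpoint is accepted by the pending DDL; the type-aware check rejects it.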

Comment on lines +706 to +708
for d.enableSyncPoint && ts > d.nextSyncPoint.Load() {
commitTs := d.nextSyncPoint.Load()
if !d.changefeedStat.isSyncPointInCommitStage(commitTs) {
if ts <= commitTs {
return
}
d.changefeedStat.tryEnterSyncPointPrepare(commitTs)
if !d.changefeedStat.isSyncPointInCommitStage(commitTs) {
return
}
} else if ts < commitTs {
return
}

nextSyncPoint := oracle.GoTimeToTS(oracle.GetTimeFromTS(commitTs).Add(d.syncPointInterval))
// Advance nextSyncPoint with CAS so concurrent send paths cannot emit the same
// syncpoint twice or move nextSyncPoint backward.
if !d.nextSyncPoint.CompareAndSwap(commitTs, nextSyncPoint) {
d.nextSyncPoint.Store(oracle.GoTimeToTS(oracle.GetTimeFromTS(commitTs).Add(d.syncPointInterval)))

high

There are two critical issues in this loop logic:

  1. The condition ts > d.nextSyncPoint.Load() prevents a syncpoint from being emitted when the current timestamp ts is exactly equal to the syncpoint timestamp. This causes syncpoints to be delayed until the next event with a strictly greater timestamp arrives, or missed entirely if the stream stops at the syncpoint timestamp. It should be ts >= d.nextSyncPoint.Load().
  2. Using d.nextSyncPoint.Store instead of CompareAndSwap introduces a race condition. Since emitSyncPointEventIfNeeded can be called concurrently (e.g., from a scan worker and the table trigger ticker), multiple goroutines might advance nextSyncPoint simultaneously, leading to duplicate syncpoint emissions.
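
Both points can be exercised with a reduced model of the loop. The sketch below is an assumption-laden illustration: `dispatcher`, `emitSyncPointsUpTo`, and `runConcurrent` are hypothetical, timestamps are plain integers rather than TSOs, and "emitting" is just a counter. It uses the inclusive comparison and claims each syncpoint with `CompareAndSwap`, so concurrent callers never emit the same one twice.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// dispatcher is a reduced, hypothetical model of the per-dispatcher state.
type dispatcher struct {
	nextSyncPoint atomic.Uint64
	interval      uint64 // must be > 0, otherwise the loop never terminates
	emitted       atomic.Int64
}

// emitSyncPointsUpTo emits every syncpoint whose timestamp is <= ts.
// Only the goroutine whose CompareAndSwap succeeds emits a given syncpoint.
func (d *dispatcher) emitSyncPointsUpTo(ts uint64) {
	for {
		commitTs := d.nextSyncPoint.Load()
		if ts < commitTs {
			return // inclusive bound: ts == commitTs still emits
		}
		if !d.nextSyncPoint.CompareAndSwap(commitTs, commitTs+d.interval) {
			continue // another goroutine claimed this syncpoint; reload
		}
		d.emitted.Add(1)
	}
}

// runConcurrent calls emitSyncPointsUpTo from several goroutines at once and
// returns how many syncpoints were emitted in total.
func runConcurrent(workers int, start, interval, ts uint64) int64 {
	d := &dispatcher{interval: interval}
	d.nextSyncPoint.Store(start)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			d.emitSyncPointsUpTo(ts)
		}()
	}
	wg.Wait()
	return d.emitted.Load()
}

func main() {
	// Syncpoints at 100, 110, 120, 130 are all <= 130.
	fmt.Println("emitted:", runConcurrent(8, 100, 10, 130))
}
```

Replacing the `CompareAndSwap` with a plain `Store` in this model lets two goroutines observe the same `commitTs` and both count an emission, which is the duplicate described in point 2.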
Suggested change
for d.enableSyncPoint && ts > d.nextSyncPoint.Load() {
commitTs := d.nextSyncPoint.Load()
if !d.changefeedStat.isSyncPointInCommitStage(commitTs) {
if ts <= commitTs {
return
}
d.changefeedStat.tryEnterSyncPointPrepare(commitTs)
if !d.changefeedStat.isSyncPointInCommitStage(commitTs) {
return
}
} else if ts < commitTs {
return
}
nextSyncPoint := oracle.GoTimeToTS(oracle.GetTimeFromTS(commitTs).Add(d.syncPointInterval))
// Advance nextSyncPoint with CAS so concurrent send paths cannot emit the same
// syncpoint twice or move nextSyncPoint backward.
if !d.nextSyncPoint.CompareAndSwap(commitTs, nextSyncPoint) {
d.nextSyncPoint.Store(oracle.GoTimeToTS(oracle.GetTimeFromTS(commitTs).Add(d.syncPointInterval)))
for d.enableSyncPoint {
	commitTs := d.nextSyncPoint.Load()
	if ts < commitTs {
		return
	}
	nextSyncPoint := oracle.GoTimeToTS(oracle.GetTimeFromTS(commitTs).Add(d.syncPointInterval))
	if !d.nextSyncPoint.CompareAndSwap(commitTs, nextSyncPoint) {
		continue
	}

return
}
c.emitSyncPointEventIfNeeded(e.FinishedTs, d, remoteID)
c.emitSyncPointEventIfNeeded(e.FinishedTs, d, remoteID)

high

The order of DDL and SyncPoint emission has been reversed. Previously, DDL events were sent before SyncPoint events if they shared the same timestamp, which is the expected order for the maintainer and dispatcher. Now, emitSyncPointEventIfNeeded is called before sending the DDL event. If e.FinishedTs matches the next syncpoint, the syncpoint will be emitted first (assuming the loop condition is fixed to >=).
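
One way to preserve the DDL-before-syncpoint order at an equal timestamp is sketched below. It is a hypothetical model, not the broker's code: `stream`, `sendDDL`, and the integer timestamps are assumptions made for illustration. Syncpoints strictly older than the DDL go out first, then the DDL, then any syncpoint that shares its timestamp.

```go
package main

import "fmt"

// outEvent is a hypothetical record of what was sent downstream.
type outEvent struct {
	kind     string
	commitTs uint64
}

// stream is a reduced model of one dispatcher's output.
type stream struct {
	nextSyncPoint uint64
	interval      uint64 // must be > 0
	out           []outEvent
}

// emitSyncPoints emits pending syncpoints below ts, or up to and including ts
// when inclusive is true.
func (s *stream) emitSyncPoints(ts uint64, inclusive bool) {
	for s.nextSyncPoint < ts || (inclusive && s.nextSyncPoint == ts) {
		s.out = append(s.out, outEvent{kind: "syncpoint", commitTs: s.nextSyncPoint})
		s.nextSyncPoint += s.interval
	}
}

// sendDDL keeps the DDL ahead of a syncpoint that has the same timestamp.
func (s *stream) sendDDL(finishedTs uint64) {
	s.emitSyncPoints(finishedTs, false) // strictly older syncpoints first
	s.out = append(s.out, outEvent{kind: "ddl", commitTs: finishedTs})
	s.emitSyncPoints(finishedTs, true) // then the one at the DDL's timestamp
}

func main() {
	s := &stream{nextSyncPoint: 100, interval: 10}
	s.sendDDL(110)
	for _, e := range s.out {
		fmt.Println(e.kind, e.commitTs)
	}
}
```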

@asddongmen
Collaborator Author

/test all

@ti-chi-bot

ti-chi-bot bot commented Apr 13, 2026

@asddongmen: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-error-log-review | e38bc9d | link | true | `/test pull-error-log-review` |
| pull-cdc-mysql-integration-light | e38bc9d | link | true | `/test pull-cdc-mysql-integration-light` |
| pull-cdc-storage-integration-heavy | e38bc9d | link | true | `/test pull-cdc-storage-integration-heavy` |
| pull-cdc-mysql-integration-heavy | e38bc9d | link | true | `/test pull-cdc-mysql-integration-heavy` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

  • do-not-merge/needs-linked-issue
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
