fix(*): merge operator inconsistent after maintainer move #3769

wlwilliamx wants to merge 20 commits into pingcap:master
Conversation
…e, split after maintainer restart
…st-due-to-maintainer-move-operator-lost
Summary of Changes (Gemini Code Assist)

This pull request significantly enhances the robustness and consistency of dispatcher operations, particularly in distributed environments where maintainer nodes might restart or fail over. By introducing explicit tracking of different operator types and implementing comprehensive recovery logic for in-flight operations during bootstrap, the system can now maintain a more accurate state and seamlessly resume complex tasks like merging dispatchers, even after disruptions. This change addresses potential inconsistencies that could arise from untracked operations, leading to a more reliable and fault-tolerant system.

Highlights
…r-inconsistent-after-maintainer-move
/test pull-cdc-mysql-integration-heavy |
/test pull-cdc-mysql-integration-light |
Code Review
This pull request introduces a mechanism to track and restore in-flight dispatcher operations (add, remove, move, split, merge) during maintainer failover and bootstrap. Key changes include adding currentOperatorMap, redoCurrentOperatorMap, and mergeOperatorMap to DispatcherManager to store ongoing operations, and updating the protobuf definitions to include OperatorType and lists of in-flight operators in the MaintainerBootstrapResponse. The HeartBeatCollector now tracks merge operators, and the SchedulerDispatcherRequestHandler prevents concurrent operations on the same span by checking these new operator maps. During bootstrap, the maintainer now restores these in-flight operators. Review comments highlight that the OperatorType should be correctly propagated and not hardcoded, especially for move and split operations, and suggest simplifying the concurrent operator check logic by potentially unifying the currentOperatorMap and redoCurrentOperatorMap.
case heartbeatpb.ScheduleAction_Create:
	switch req.OperatorType {
	case heartbeatpb.OperatorType_O_Add, heartbeatpb.OperatorType_O_Move, heartbeatpb.OperatorType_O_Split:
		op := operator.NewAddDispatcherOperator(spanController, replicaSet, node, heartbeatpb.OperatorType_O_Add)
When restoring an add operator, the original operator type from the request (req.OperatorType) should be preserved. Hardcoding OperatorType_O_Add here will cause move and split operators to be incorrectly restored as simple add operators, breaking the operator restoration logic.
-	op := operator.NewAddDispatcherOperator(spanController, replicaSet, node, heartbeatpb.OperatorType_O_Add)
+	op := operator.NewAddDispatcherOperator(spanController, replicaSet, node, req.OperatorType)
maintainer/operator/operator_move.go
Outdated
		return m.replicaSet.NewAddDispatcherMessage(m.dest, heartbeatpb.OperatorType_O_Add)
	case moveStateRemoveOrigin, moveStateAbortRemoveOrigin:
		return m.replicaSet.NewRemoveDispatcherMessage(m.origin)
		return m.replicaSet.NewRemoveDispatcherMessage(m.origin, heartbeatpb.OperatorType_O_Remove)
The add and remove parts of a move operation should both be typed as O_Move. Using O_Add and O_Remove is incorrect and will break operator restoration logic on maintainer failover, as the new maintainer will not recognize these as parts of a single move operation.
-	return m.replicaSet.NewAddDispatcherMessage(m.dest, heartbeatpb.OperatorType_O_Add)
+	return m.replicaSet.NewAddDispatcherMessage(m.dest, heartbeatpb.OperatorType_O_Move)
 case moveStateRemoveOrigin, moveStateAbortRemoveOrigin:
-	return m.replicaSet.NewRemoveDispatcherMessage(m.origin, heartbeatpb.OperatorType_O_Remove)
+	return m.replicaSet.NewRemoveDispatcherMessage(m.origin, heartbeatpb.OperatorType_O_Move)
_, exists := dispatcherManager.currentOperatorMap.Load(operatorKey)
if exists {
	log.Warn("operator key exists, skip this request",
		zap.String("changefeedID", req.ChangefeedID.String()),
		zap.String("dispatcherID", common.NewDispatcherIDFromPB(req.Config.DispatcherID).String()),
		zap.String("operatorKey", operatorKey),
		zap.Any("operator", req),
	)
	continue
}
_, redoExists := dispatcherManager.redoCurrentOperatorMap.Load(operatorKey)
if redoExists {
	log.Warn("redo operator key exists, skip this request",
		zap.String("changefeedID", req.ChangefeedID.String()),
		zap.String("dispatcherID", common.NewDispatcherIDFromPB(req.Config.DispatcherID).String()),
		zap.String("operatorKey", operatorKey),
		zap.Any("operator", req),
	)
	continue
}
The logic to prevent concurrent operators on the same span or dispatcher is split between checking currentOperatorMap and redoCurrentOperatorMap. This could be simplified by using a single map for both, with a composite key or value to distinguish between redo and normal modes. This would reduce code duplication and make the logic easier to follow.
/test pull-cdc-mysql-integration-heavy |
/test pull-cdc-mysql-integration-light |
…r-inconsistent-after-maintainer-move

# Conflicts:
#	downstreamadapter/dispatchermanager/dispatcher_manager.go
#	downstreamadapter/dispatchermanager/dispatcher_manager_info.go
#	downstreamadapter/dispatchermanager/dispatcher_manager_redo.go
#	downstreamadapter/dispatchermanager/helper.go
#	downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
#	heartbeatpb/heartbeat.pb.go
#	heartbeatpb/heartbeat.proto
#	maintainer/maintainer_controller.go
#	maintainer/maintainer_controller_bootstrap.go
#	maintainer/maintainer_controller_helper.go
#	maintainer/maintainer_manager_test.go
#	maintainer/maintainer_test.go
#	maintainer/operator/operator_add.go
#	maintainer/operator/operator_move.go
#	maintainer/operator/operator_remove.go
#	maintainer/replica/replication_span.go
#	maintainer/replica/replication_span_test.go
📝 Walkthrough

Adds tracking, persistence, and restoration of in-flight merge dispatcher requests: a new mergeOperatorMap on DispatcherManager, APIs to track/cleanup/get merge requests, heartbeat wiring to include merge operators in bootstrap responses, maintainer-side restoration of merges after failover, and tests to validate restore behavior.
Sequence Diagram(s)

sequenceDiagram
participant Client as Helper/Receiver
participant DM as DispatcherManager
participant HC as HeartBeatCollector
participant DO as DispatcherOrchestrator
participant Maintainer as Maintainer Controller
Client->>DM: TrackMergeOperator(req)
activate DM
DM->>DM: store cloned req in mergeOperatorMap
deactivate DM
HC->>DM: GetMergeOperators()
activate DM
DM-->>HC: tracked merge requests
deactivate DM
DO->>DM: GetMergeOperators() during bootstrap
activate DM
DM-->>DO: tracked merge requests
deactivate DM
DO->>Maintainer: BootstrapResponse(with merge_operators)
activate Maintainer
Maintainer->>Maintainer: restoreCurrentMergeOperators()
Maintainer->>Maintainer: AddRestoredMergeOperator(...) & start
deactivate Maintainer
sequenceDiagram
participant Task as Merge Task
participant DM as DispatcherManager
Task->>Task: doMerge() or abortMerge() completes
activate Task
Task->>DM: RemoveMergeOperator(mergedDispatcherID)
activate DM
DM->>DM: delete mergeOperatorMap entry
deactivate DM
Task->>Task: finalize/cleanup
deactivate Task
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ❌ 1 failed (warning) | ✅ 4 passed
Actionable comments posted: 3
🧹 Nitpick comments (2)
downstreamadapter/dispatchermanager/heartbeat_collector.go (1)
294-296: Duplicate tracking: consider consolidating.

This TrackMergeOperator call is redundant with the one in MergeDispatcherRequestHandler.Handle (helper.go, line 788). Both track the same request. While idempotent and harmless, consolidating to a single call would reduce confusion. Consider removing this early tracking and relying solely on the handler, or vice versa.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@downstreamadapter/dispatchermanager/heartbeat_collector.go` around lines 294-296: the call to DispatcherManager.TrackMergeOperator in the heartbeat collector duplicates the tracking performed in MergeDispatcherRequestHandler.Handle, which already tracks the same merge request; remove the early TrackMergeOperator invocation from the heartbeat_collector code path (the manager.(*DispatcherManager).TrackMergeOperator(mergeDispatcherRequest) call) so the single canonical tracking remains in MergeDispatcherRequestHandler.Handle, keeping idempotency while avoiding confusing redundant calls.

downstreamadapter/dispatchermanager/helper.go (1)
788-800: Redundant TrackMergeOperator call.

TrackMergeOperator is already invoked in RecvMessages (heartbeat_collector.go, line 295) before pushing to the dynamic stream. Calling it again here is idempotent but unnecessary. Consider removing one of the calls to avoid confusion; keeping it in the handler (here) is preferable since it's closer to the actual merge logic and ensures tracking even if the request arrives through a different path.

The nil-check and MaybeCleanupMergeOperator call for failed merges is correct and ensures no stale entries remain in mergeOperatorMap.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@downstreamadapter/dispatchermanager/helper.go` around lines 788-800: remove the earlier invocation in RecvMessages (heartbeat_collector.go) so that tracking happens only in this handler before calling MergeDispatcher; keep the dispatcherManager.TrackMergeOperator(mergeDispatcherRequest.MergeDispatcherRequest) call here, ensure MergeDispatcher(...) and the nil-check that calls MaybeCleanupMergeOperator(...) remain unchanged, and run tests to confirm no behavior change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0a0daa28-8fc8-4a2c-b916-05a14a5723d0
⛔ Files ignored due to path filters (1)
heartbeatpb/heartbeat.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (10)
downstreamadapter/dispatchermanager/dispatcher_manager.go
downstreamadapter/dispatchermanager/dispatcher_manager_merge.go
downstreamadapter/dispatchermanager/heartbeat_collector.go
downstreamadapter/dispatchermanager/helper.go
downstreamadapter/dispatchermanager/task.go
downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
heartbeatpb/heartbeat.proto
maintainer/maintainer_controller_bootstrap.go
maintainer/operator/operator_controller.go
maintainer/operator/operator_merge.go
func (e *DispatcherManager) TrackMergeOperator(req *heartbeatpb.MergeDispatcherRequest) {
	if req == nil || req.MergedDispatcherID == nil {
		return
	}
	mergedID := common.NewDispatcherIDFromPB(req.MergedDispatcherID)
	e.mergeOperatorMap.Store(mergedID.String(), cloneMergeDispatcherRequest(req))
Reject zero-valued merged dispatcher IDs before touching mergeOperatorMap.
A non-nil protobuf ID can still decode to DispatcherID{}. Right now those requests all collapse to the same key, and the bootstrap side later treats that zero ID as a real merged dispatcher, so one malformed request can poison recovery state.
🔒 Suggested guard
func (e *DispatcherManager) TrackMergeOperator(req *heartbeatpb.MergeDispatcherRequest) {
if req == nil || req.MergedDispatcherID == nil {
return
}
mergedID := common.NewDispatcherIDFromPB(req.MergedDispatcherID)
+ if mergedID.IsZero() {
+ log.Warn("merge operator has invalid merged dispatcher ID",
+ zap.Stringer("changefeedID", e.changefeedID))
+ return
+ }
e.mergeOperatorMap.Store(mergedID.String(), cloneMergeDispatcherRequest(req))
}

Apply the same mergedID.IsZero() guard in MaybeCleanupMergeOperator.
Also applies to: 38-43
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@downstreamadapter/dispatchermanager/dispatcher_manager_merge.go` around lines
24 - 29, The TrackMergeOperator method currently stores requests whose protobuf
ID decodes to a zero-valued DispatcherID, causing all such requests to collide;
before storing into e.mergeOperatorMap call mergedID.IsZero() and return if true
(mirror the guard used in MaybeCleanupMergeOperator), and do the same check
wherever mergeOperatorMap is written (e.g., in the second occurrence around
MaybeCleanupMergeOperator handling) so that cloneMergeDispatcherRequest(req) and
the mergeOperatorMap.Store only run for non-zero mergedID values.
// Restore merge operators after task state is rebuilt from bootstrap spans/operators.
// Merge restoration needs the per-dispatcher task map from buildTaskInfo, but must run
// before we discard any leftover working tasks as dropped-table artifacts.
if err := c.restoreCurrentMergeOperators(allNodesResp, buildTableSplitMap(tables)); err != nil {
Pass a mode-specific split map into merge restoration.
Line 136 always calls restoreCurrentMergeOperators with the normal tables view, but the restore path rebuilds both default and redo merges via mergeReq.Mode. If redoStartTs and startTs diverge, redo-only tables get recreated with the wrong splitEnabled flag after failover.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@maintainer/maintainer_controller_bootstrap.go` around lines 133 - 136, The
restoreCurrentMergeOperators call always uses buildTableSplitMap(tables) but
must use a mode-specific split map because restoreCurrentMergeOperators rebuilds
merges based on mergeReq.Mode; when redoStartTs != startTs redo-only tables need
the redo-mode splitEnabled values. Modify the bootstrap path that calls
restoreCurrentMergeOperators so it computes and passes a split map appropriate
for the merge mode (e.g., choose buildTableSplitMap(tables) for default mode and
buildTableSplitMap(redoTables) or a map derived for redo mode when mergeReq.Mode
indicates redo), ensuring restoreCurrentMergeOperators receives the
mode-specific split map rather than the always-normal tables view.
spanInfo := spanInfoByID[dispatcherID]
if spanInfo == nil {
	sourceComplete = false
	break
}
splitEnabled := spanController.ShouldEnableSplit(tableSplitMap[spanInfo.Span.TableID])
replicaSet = c.createSpanReplication(spanInfo, nodeID, splitEnabled)
spanController.AddReplicatingSpan(replicaSet)
Guard nil Span before dereferencing bootstrap entries.
indexBootstrapSpans only filters nil IDs, not nil Spans. The new recovery path then reads spanInfo.Span.TableID / mergedSpanInfo.Span.TableID in several places, so one malformed bootstrap snapshot can panic the maintainer during bootstrap.
🛡️ Suggested guard
replicaSet := spanController.GetTaskByID(dispatcherID)
if replicaSet == nil {
spanInfo := spanInfoByID[dispatcherID]
- if spanInfo == nil {
+ if spanInfo == nil || spanInfo.Span == nil {
sourceComplete = false
break
}
splitEnabled := spanController.ShouldEnableSplit(tableSplitMap[spanInfo.Span.TableID])
replicaSet = c.createSpanReplication(spanInfo, nodeID, splitEnabled)
@@
- mergedSpanInfo := spanInfoByID[mergedDispatcherID]
+ mergedSpanInfo := spanInfoByID[mergedDispatcherID]
+ if mergedSpanInfo != nil && mergedSpanInfo.Span == nil {
+ log.Warn("merge operator missing merged span, skip restoring it",
+ zap.String("nodeID", nodeID.String()),
+ zap.String("changefeed", resp.ChangefeedID.String()),
+ zap.String("dispatcher", mergedDispatcherID.String()))
+ continue
+ }

Also applies to: 972-974, 1001-1004, 1025-1027
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@maintainer/maintainer_controller_bootstrap.go` around lines 956 - 963, The
code assumes spanInfo and mergedSpanInfo are non-nil and dereferences
spanInfo.Span.TableID (and mergedSpanInfo.Span.TableID) causing panics for
malformed bootstrap entries; update the bootstrap handling (where
indexBootstrapSpans / spanInfoByID are read) to nil-check the outer entry and
its inner Span before accessing fields: e.g., before using spanInfo.Span.TableID
or passing spanInfo to
spanController.ShouldEnableSplit/createSpanReplication/AddReplicatingSpan,
return or skip the entry (set sourceComplete=false or continue) when
spanInfo==nil || spanInfo.Span==nil (and similarly for mergedSpanInfo), and
ensure any subsequent logic that relies on a non-nil Span only runs after these
guards.
Actionable comments posted: 1
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6a00d0fe-7c5c-4db9-a2cb-38afa66de214
📒 Files selected for processing (3)
tests/integration_tests/_utils/get_table_id
tests/integration_tests/maintainer_failover_when_operator/conf/diff_config.toml
tests/integration_tests/maintainer_failover_when_operator/run.sh
id=$(curl "http://${UP_TIDB_HOST:-127.0.0.1}:${UP_TIDB_STATUS:-10080}/schema/${dbName}/${tableName}" | jq .id)
if [ -n "$id" ]; then
	echo $id
	exit 0
🧩 Analysis chain
🏁 Script executed:
# First, let's find and read the full get_table_id file
find . -name "get_table_id" -type f

Repository: pingcap/ticdc
Length of output: 102
🏁 Script executed:
# Read the file with line numbers
cat -n tests/integration_tests/_utils/get_table_id

Repository: pingcap/ticdc
Length of output: 849
🏁 Script executed:
# Check the jq behavior with -er flag when field is missing
# Create a test JSON without the 'id' field
echo '{"name": "test"}' | jq .id
echo "Exit code: $?"

Repository: pingcap/ticdc
Length of output: 74
🏁 Script executed:
# Test jq with -er flag (what the proposed fix suggests)
echo '{"name": "test"}' | jq -er '.id'
echo "Exit code for -er: $?"

Repository: pingcap/ticdc
Length of output: 82
🏁 Script executed:
# Test jq -er with valid id
echo '{"id": 123}' | jq -er '.id'
echo "Exit code for valid id: $?"

Repository: pingcap/ticdc
Length of output: 86
🏁 Script executed:
# Test the proposed fix logic to confirm it correctly rejects null
# Simulating missing .id
echo '{"name": "test"}' | { if id=$(jq -er '.id' 2>/dev/null); then echo "SUCCESS: $id"; else echo "FAILED as expected"; fi; }
# Simulating valid .id
echo '{"id": 123}' | { if id=$(jq -er '.id' 2>/dev/null); then echo "SUCCESS: $id"; else echo "FAILED"; fi; }

Repository: pingcap/ticdc
Length of output: 88
🏁 Script executed:
# Test curl -fsS behavior with invalid endpoint (simulated with a failing jq)
# to confirm it properly fails when piped
(echo "" | jq -er '.id') 2>&1
echo "Exit code: $?"
# Confirm the original code would incorrectly pass
(echo "" | jq .id)
echo "Original would pass check: [ -n \"null\" ] = true"
[ -n "null" ] && echo "CORRECT - string null passes the test"

Repository: pingcap/ticdc
Length of output: 155
Make the retry loop reject missing IDs and HTTP errors.
The current code uses plain curl | jq .id, which means a missing .id field outputs the string "null" and passes the non-empty check, causing the script to exit with an invalid table ID. Additionally, this bypasses the retry logic since the condition succeeds when it shouldn't. Use curl -fsS to fail on HTTP errors and jq -er to fail when the field is missing.
Proposed fix
- id=$(curl "http://${UP_TIDB_HOST:-127.0.0.1}:${UP_TIDB_STATUS:-10080}/schema/${dbName}/${tableName}" | jq .id)
- if [ -n "$id" ]; then
- echo $id
+ if id=$(curl -fsS "http://${UP_TIDB_HOST:-127.0.0.1}:${UP_TIDB_STATUS:-10080}/schema/${dbName}/${tableName}" | jq -er '.id'); then
+ echo "$id"
exit 0
fi

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-id=$(curl "http://${UP_TIDB_HOST:-127.0.0.1}:${UP_TIDB_STATUS:-10080}/schema/${dbName}/${tableName}" | jq .id)
-if [ -n "$id" ]; then
-	echo $id
-	exit 0
+if id=$(curl -fsS "http://${UP_TIDB_HOST:-127.0.0.1}:${UP_TIDB_STATUS:-10080}/schema/${dbName}/${tableName}" | jq -er '.id'); then
+	echo "$id"
+	exit 0
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/integration_tests/_utils/get_table_id` around lines 19 - 22, The retry
loop currently captures a literal "null" or HTTP error as a non-empty $id;
change the command substitution to fail on HTTP errors and missing fields by
using curl -fsS and jq -er, e.g. assign id with id=$(curl -fsS
"http://${UP_TIDB_HOST:-127.0.0.1}:${UP_TIDB_STATUS:-10080}/schema/${dbName}/${tableName}"
| jq -er .id) and keep the existing if check to only succeed when the pipeline
exits 0; this ensures HTTP failures and missing .id cause the command to fail
and trigger retries instead of returning invalid IDs.
…r-inconsistent-after-maintainer-move
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/integration_tests/maintainer_failover_when_operator/run.sh (1)
387-398: Consider quoting variables in run_sql calls.

Static analysis (SC2086) flags unquoted ${UP_TIDB_HOST} and ${UP_TIDB_PORT}. While unlikely to cause issues in this controlled test environment, quoting prevents unexpected word splitting: run_sql "..." "${UP_TIDB_HOST}" "${UP_TIDB_PORT}"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration_tests/maintainer_failover_when_operator/run.sh` around lines 387 - 398, The run_sql calls in the script pass unquoted variables (${UP_TIDB_HOST}, ${UP_TIDB_PORT}, ${DOWN_TIDB_HOST}, ${DOWN_TIDB_PORT}) which can trigger word-splitting; update each call that uses run_sql (e.g., the calls creating/splitting tables and inserting rows and the check_table_exists calls) to quote those parameters so they are passed as single arguments (use "..." for the SQL string and quote each host/port variable such as "${UP_TIDB_HOST}" "${UP_TIDB_PORT}" and similarly for the DOWN_* variables).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9ea6e38d-7d60-4f18-9fe9-bd161db3db1d
⛔ Files ignored due to path filters (1)
heartbeatpb/heartbeat.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (6)
downstreamadapter/dispatchermanager/dispatcher_manager.go
downstreamadapter/dispatchermanager/task.go
heartbeatpb/heartbeat.proto
maintainer/maintainer_controller_bootstrap.go
maintainer/operator/operator_merge.go
tests/integration_tests/maintainer_failover_when_operator/run.sh
🚧 Files skipped from review as they are similar to previous changes (3)
- downstreamadapter/dispatchermanager/dispatcher_manager.go
- downstreamadapter/dispatchermanager/task.go
- maintainer/operator/operator_merge.go
sort.Slice(sourceReplicaSets, func(i, j int) bool {
	return bytes.Compare(sourceReplicaSets[i].Span.StartKey, sourceReplicaSets[j].Span.StartKey) < 0
})
Potential nil dereference in sort if replica set has nil Span.
sourceReplicaSets may include replica sets from spanController.GetTaskByID() (line 964) without validating that Span is non-nil. The sort then accesses Span.StartKey, which would panic on a malformed replica set.
Consider adding a nil Span check when adding replica sets from GetTaskByID:
🛡️ Suggested guard
  replicaSet := spanController.GetTaskByID(dispatcherID)
- if replicaSet == nil {
+ if replicaSet == nil || replicaSet.Span == nil {
  spanInfo := spanInfoByID[dispatcherID]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@maintainer/maintainer_controller_bootstrap.go` around lines 1028 - 1030, The
sort comparator for sourceReplicaSets can panic if a ReplicaSet has a nil Span;
ensure you guard against nils either when collecting entries from
spanController.GetTaskByID or in the sort comparator: when adding to
sourceReplicaSets (the code path that calls spanController.GetTaskByID) skip or
normalize entries whose Span is nil, or change the sort.Slice comparator to
safely handle nil Spans by treating nil as less/greater and comparing StartKey
only when both Spans are non-nil; reference sourceReplicaSets, Span.StartKey,
spanController.GetTaskByID, and the sort.Slice comparator to locate where to add
the nil checks.
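A nil-safe comparator along the lines the prompt describes can be sketched as follows; the types are minimal stand-ins for illustration, not the maintainer package's real definitions:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Minimal stand-ins for the maintainer types; assumed shapes, not the real ones.
type Span struct{ StartKey []byte }
type ReplicaSet struct{ Span *Span }

// nilSafeLess orders nil Spans first instead of panicking on Span.StartKey.
func nilSafeLess(si, sj *Span) bool {
	if si == nil {
		return sj != nil // nil sorts before any non-nil Span
	}
	if sj == nil {
		return false
	}
	return bytes.Compare(si.StartKey, sj.StartKey) < 0
}

func main() {
	sets := []*ReplicaSet{
		{Span: &Span{StartKey: []byte("b")}},
		{Span: nil}, // malformed entry that would panic the original comparator
		{Span: &Span{StartKey: []byte("a")}},
	}
	sort.Slice(sets, func(i, j int) bool { return nilSafeLess(sets[i].Span, sets[j].Span) })
	for _, s := range sets {
		if s.Span == nil {
			fmt.Println("span=<nil>")
		} else {
			fmt.Println("span=" + string(s.Span.StartKey))
		}
	}
}
```

Filtering nil Spans out before sorting (as the review's first suggestion does) is usually cleaner; the comparator guard is the fallback when the slice cannot be pre-filtered.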
/test all

/retest

/test all
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
maintainer/maintainer_controller_bootstrap.go (2)
131-135: ⚠️ Potential issue | 🟡 Minor — Use mode-specific table metadata for merge recovery.
restoreCurrentMergeOperators restores both default-mode and redo-mode merges, but this call always passes the default tables snapshot. When redoStartTs != startTs, redo merges are rebuilt with the wrong table splitability and can even miss redo-only tables entirely. Please thread the per-mode table map into merge recovery instead of hard-coding buildTableSplitMap(tables). Also applies to: 925-928
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@maintainer/maintainer_controller_bootstrap.go` around lines 131 - 135, restoreCurrentMergeOperators currently always receives buildTableSplitMap(tables), causing redo-mode merges (when redoStartTs != startTs) to be rebuilt with incorrect or missing table splitability; change the call sites (including the one invoking restoreCurrentMergeOperators with allNodesResp) to pass a mode-specific table map instead of the default buildTableSplitMap(tables): compute and thread the per-mode table map (e.g., buildTableSplitMap(tablesForMode) or a new buildTableSplitMapForMode) based on redoStartTs vs startTs and use that map when calling restoreCurrentMergeOperators so redo-mode and default-mode merges use their correct table metadata (also apply the same change to the other occurrence referenced in the review).
964-972: ⚠️ Potential issue | 🔴 Critical — Guard nil bootstrap spans before restoring merge state.
indexBootstrapSpans only filters nil IDs, so spanInfo.Span and mergedSpanInfo.Span can still be nil here. This path then reads Span.TableID and sorts by replicaSet.Span.StartKey, which will panic bootstrap on a malformed snapshot instead of skipping the bad merge request. Also applies to: 978-983, 1010-1013, 1028-1029, 1034-1036
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@maintainer/maintainer_controller_bootstrap.go` around lines 964 - 972, Guard against nil Span objects before restoring merge state: in the bootstrap path where you fetch replicaSet := spanController.GetTaskByID(dispatcherID) and then read spanInfo.Span.TableID or use replicaSet.Span.StartKey (and similarly for mergedSpanInfo.Span), add nil checks that skip the merge/restore for that entry (set sourceComplete=false or continue) if spanInfo.Span or mergedSpanInfo.Span is nil; update the blocks around createSpanReplication calls and the sorting/merge logic (references: replicaSet, spanInfo.Span, mergedSpanInfo.Span, createSpanReplication, indexBootstrapSpans, spanController.ShouldEnableSplit) to return/skip gracefully instead of dereferencing nil so malformed snapshots are ignored.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@maintainer/maintainer_controller_bootstrap.go`:
- Around line 964-973: The code is recreating replication for spans by using
tableSplitMap[...] without checking existence, which can resurrect dropped
tables; update the span restore paths (the block using
spanController.GetTaskByID, spanInfoByID, ShouldEnableSplit, and
createSpanReplication) to perform a comma-ok lookup on tableSplitMap for
spanInfo.Span.TableID before calling ShouldEnableSplit or createSpanReplication,
and if the table is missing from the snapshot treat it as not-yet-complete (set
sourceComplete = false and break or return as the surrounding logic expects);
apply the same existence check to the other similar lookup sites noted (the
other blocks that call ShouldEnableSplit/createSpanReplication with
tableSplitMap[...] values) so stale merge journals do not recreate dropped
tables.
---
Duplicate comments:
In `@maintainer/maintainer_controller_bootstrap.go`:
- Around line 131-135: restoreCurrentMergeOperators currently always receives
buildTableSplitMap(tables), causing redo-mode merges (when redoStartTs !=
startTs) to be rebuilt with incorrect or missing table splitability; change the
call sites (including the one invoking restoreCurrentMergeOperators with
allNodesResp) to pass a mode-specific table map instead of the default
buildTableSplitMap(tables): compute and thread the per-mode table map (e.g.,
buildTableSplitMap(tablesForMode) or a new buildTableSplitMapForMode) based on
redoStartTs vs startTs and use that map when calling
restoreCurrentMergeOperators so redo-mode and default-mode merges use their
correct table metadata (also apply the same change to the other occurrence
referenced in the review).
- Around line 964-972: Guard against nil Span objects before restoring merge
state: in the bootstrap path where you fetch replicaSet :=
spanController.GetTaskByID(dispatcherID) and then read spanInfo.Span.TableID or
use replicaSet.Span.StartKey (and similarly for mergedSpanInfo.Span), add nil
checks that skip the merge/restore for that entry (set sourceComplete=false or
continue) if spanInfo.Span or mergedSpanInfo.Span is nil; update the blocks
around createSpanReplication calls and the sorting/merge logic (references:
replicaSet, spanInfo.Span, mergedSpanInfo.Span, createSpanReplication,
indexBootstrapSpans, spanController.ShouldEnableSplit) to return/skip gracefully
instead of dereferencing nil so malformed snapshots are ignored.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 922efb64-a529-4fb1-b3f0-8209a883f5cd
📒 Files selected for processing (2)
- maintainer/maintainer_controller_bootstrap.go
- maintainer/maintainer_controller_test.go
replicaSet := spanController.GetTaskByID(dispatcherID)
if replicaSet == nil {
	spanInfo := spanInfoByID[dispatcherID]
	if spanInfo == nil {
		sourceComplete = false
		break
	}
	splitEnabled := spanController.ShouldEnableSplit(tableSplitMap[spanInfo.Span.TableID])
	replicaSet = c.createSpanReplication(spanInfo, nodeID, splitEnabled)
	spanController.AddReplicatingSpan(replicaSet)
There was a problem hiding this comment.
Don't resurrect dropped tables from stale merge journals.
This path is missing the schema-snapshot existence check that restoreCurrentWorkingCreateAction already has. Every tableSplitMap[...] lookup here discards the ok bit, so “table missing from the snapshot” is treated the same as “existing but not splittable”. If a table was dropped before failover but a stale merge request/span snapshot is still present, bootstrap will recreate the source and merged replica sets here and start a merge operator anyway, bringing back ghost spans for a dropped table.
Also applies to: 981-984, 1010-1018, 1039-1063
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@maintainer/maintainer_controller_bootstrap.go` around lines 964 - 973, The
code is recreating replication for spans by using tableSplitMap[...] without
checking existence, which can resurrect dropped tables; update the span restore
paths (the block using spanController.GetTaskByID, spanInfoByID,
ShouldEnableSplit, and createSpanReplication) to perform a comma-ok lookup on
tableSplitMap for spanInfo.Span.TableID before calling ShouldEnableSplit or
createSpanReplication, and if the table is missing from the snapshot treat it as
not-yet-complete (set sourceComplete = false and break or return as the
surrounding logic expects); apply the same existence check to the other similar
lookup sites noted (the other blocks that call
ShouldEnableSplit/createSpanReplication with tableSplitMap[...] values) so stale
merge journals do not recreate dropped tables.
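The comma-ok pattern the prompt asks for can be sketched in isolation; `decideRestore` and the literal snapshot below are hypothetical stand-ins for the bootstrap code path, not the PR's actual signatures:

```go
package main

import "fmt"

// decideRestore sketches the comma-ok guard: a table missing from the schema
// snapshot must not be restored, so a stale merge journal cannot resurrect it.
// The map mirrors tableSplitMap (tableID -> splittable); names are assumptions.
func decideRestore(tableSplitMap map[int64]bool, tableID int64) string {
	splittable, ok := tableSplitMap[tableID]
	if !ok {
		// Absent from the snapshot: treat as not-yet-complete
		// (the review's sourceComplete = false path).
		return "skip"
	}
	if splittable {
		return "restore-split"
	}
	return "restore"
}

func main() {
	snapshot := map[int64]bool{100: true, 101: false} // table 200 was dropped
	for _, id := range []int64{100, 101, 200} {
		fmt.Printf("table %d -> %s\n", id, decideRestore(snapshot, id))
	}
}
```

The key difference from the current code is that `tableSplitMap[tableID]` alone returns the zero value (`false`) for a missing key, which is indistinguishable from "exists but not splittable"; the `, ok` form separates the two cases.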
/retest
@wlwilliamx: The following tests failed, say
Full PR test history. Your PR dashboard. Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What problem does this PR solve?
Issue Number: close #4763
Maintainer bootstrap rebuilds running spans from dispatcher-manager snapshots, but before this change it did not fully rebuild in-flight operator state after maintainer failover.
There were two gaps:
Merges are driven by MergeDispatcherRequest instead of ScheduleDispatcherRequest, but dispatcher managers did not persist in-flight merge requests in bootstrap responses, so a maintainer move could lose the merge intent entirely and leave the source dispatchers or merged dispatcher stuck in an intermediate state.

As a result, table scheduling might not converge automatically after maintainer failover, especially when failover happens while add/remove/move/split/merge is still in progress.
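One way to close this gap is to carry the in-flight merge intent in the bootstrap response itself. A hypothetical protobuf sketch — the field name and number here are assumptions for illustration, not the actual change made to heartbeatpb/heartbeat.proto:

```proto
message MaintainerBootstrapResponse {
  // Existing fields elided.

  // In-flight merge requests the dispatcher manager has accepted but not yet
  // finished, so a newly elected maintainer can rebuild the corresponding
  // merge operators after failover instead of losing the merge intent.
  repeated MergeDispatcherRequest ongoing_merge_requests = 100;
}
```

With the intent persisted this way, maintainer bootstrap can distinguish "merge was in progress" from "dispatchers happen to be in an intermediate state".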
What is changed and how it works?
- Dispatcher managers now report in-flight merge requests in MaintainerBootstrapResponse, and the maintainer restores the corresponding merge operators into spanController during bootstrap.
- Dispatchers are reported as removed only once they reach Stopped or Removed, not transient states such as WaitingMerge.
- Extended the maintainer_failover_when_operator integration coverage to cover merge recovery and made the test runnable against non-default local ports for isolated local reruns.

Check List
Tests
Questions
Will it cause performance regression or break compatibility?
No. This only changes bootstrap recovery for in-flight scheduling state after maintainer failover and does not change steady-state scheduling behavior or external compatibility.
Do you need to update user documentation, design documentation or monitoring documentation?
No.
Release note
Summary by CodeRabbit
New Features
Tests
Chores