maintainer,heartbeatpb: add drain target plumbing #4759
hongyunyan wants to merge 3 commits into master from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (no one yet). The full list of commands accepted by this bot can be found here.
Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing an approval comment.
📝 Walkthrough
Adds dispatcher drain-target protocol, manager-side coordination, maintainer and scheduler drain-state APIs, messaging support, and tests to propagate and acknowledge dispatcher drain targets and enforce monotonic epoch semantics.
Sequence Diagram(s)
sequenceDiagram
rect rgba(200,200,255,0.5)
participant Coordinator
end
rect rgba(200,255,200,0.5)
participant Manager
participant MaintainerSet
participant Maintainer
participant Scheduler
end
Coordinator->>Manager: TypeSetDispatcherDrainTargetRequest(target_node_id, target_epoch)
Manager->>Manager: tryUpdateDispatcherDrainTarget(target, epoch)
Manager->>MaintainerSet: applyDispatcherDrainTarget(target, epoch)
MaintainerSet->>Maintainer: SetDispatcherDrainTarget(target, epoch)
Maintainer->>Scheduler: controller.SetDispatcherDrainTarget(target, epoch)
Scheduler->>Scheduler: Update DrainState snapshot & scheduling rules
Manager->>Coordinator: TypeNodeHeartbeatRequest(dispatcher_drain_target_node_id, dispatcher_drain_target_epoch)
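The walkthrough and diagram above describe monotonic epoch semantics enforced by tryUpdateDispatcherDrainTarget before the target is applied and acknowledged. The sketch below is a minimal illustration of that idea, not the PR's actual implementation: the struct, its fields, and the string node-ID type are assumptions made for this example; only the tryUpdateDispatcherDrainTarget name comes from the diagram.

```go
package maintainer

import "sync"

// drainTarget is a hypothetical, simplified stand-in for the node-scoped drain
// state that the manager keeps for the coordinator's drain requests.
type drainTarget struct {
	mu     sync.Mutex
	nodeID string // assumed string node ID, for illustration only
	epoch  uint64
}

// tryUpdateDispatcherDrainTarget accepts a target only when its epoch is strictly
// newer than the one already recorded, so duplicate or stale coordinator requests
// are rejected and the recorded epoch never moves backwards.
func (d *drainTarget) tryUpdateDispatcherDrainTarget(nodeID string, epoch uint64) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if epoch <= d.epoch {
		return false
	}
	d.nodeID = nodeID
	d.epoch = epoch
	return true
}
```

Under this reading, a false return means the request was already applied or superseded, which is why the review below debates whether a heartbeat should still be sent in that case.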
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Code Review
This pull request introduces a dispatcher drain mechanism, allowing the coordinator to instruct maintainer managers to drain specific nodes. It adds new protobuf definitions for drain progress and requests, updates the maintainer manager to handle these requests, and implements a node-scoped drain state. I have provided feedback regarding the use of log.Panic for input validation, which could lead to process crashes, and suggested ensuring that node heartbeats are sent to acknowledge drain target updates even when the update is a duplicate or stale.
err := json.Unmarshal(req.Config, info)
if err != nil {
	log.Panic("decode changefeed fail", zap.Error(err))
}
Using log.Panic when unmarshaling the changefeed configuration from a network message is risky. If the coordinator (or a malicious actor) sends a malformed configuration, it will crash the entire maintainer manager process. It is better to log an error and return a failure status or ignore the request.
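As a rough illustration of that suggestion, the hypothetical helper below rejects a malformed config with an error instead of panicking. The function name and signature are invented for this sketch; only json.Unmarshal(req.Config, info) and the log message come from the quoted code.

```go
package maintainer

import (
	"encoding/json"

	"github.com/pingcap/log"
	"go.uber.org/zap"
)

// decodeChangefeedConfig is a hypothetical helper: on a malformed payload it
// logs and returns the error so the caller can drop the request, rather than
// crashing the whole maintainer manager process via log.Panic.
func decodeChangefeedConfig(raw []byte, info any) error {
	if err := json.Unmarshal(raw, info); err != nil {
		log.Error("decode changefeed fail, ignore request", zap.Error(err))
		return err
	}
	return nil
}
```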
if req.CheckpointTs == 0 {
	log.Panic("add maintainer with invalid checkpointTs",
		zap.Stringer("changefeedID", changefeedID),
		zap.Uint64("checkpointTs", req.CheckpointTs),
		zap.Any("info", info))
}
if !m.node.tryUpdateDispatcherDrainTarget(target, req.TargetEpoch) {
	return
}
If tryUpdateDispatcherDrainTarget returns false (e.g., because the update is a duplicate or stale), the manager currently returns immediately without sending a heartbeat. For better responsiveness and to ensure the coordinator receives an acknowledgement even for retries or stale messages, it is recommended to always send a node heartbeat if the request is from the authorized coordinator.
-if !m.node.tryUpdateDispatcherDrainTarget(target, req.TargetEpoch) {
-	return
-}
+if m.node.tryUpdateDispatcherDrainTarget(target, req.TargetEpoch) {
+	log.Info("dispatcher drain target updated",
+		zap.Stringer("targetNodeID", target),
+		zap.Uint64("targetEpoch", req.TargetEpoch))
+	m.maintainers.applyDispatcherDrainTarget(target, req.TargetEpoch)
+}
+// A manager-level heartbeat is the authoritative acknowledgement that this
+// node has applied the latest drain target, even when no maintainers exist.
+m.sendNodeHeartbeat(true)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@maintainer/maintainer_manager_maintainers.go`:
- Around line 88-92: The log currently emits the whole rejected control message
(zap.Any("request", msg)) which can leak serialized changefeed config and
secrets; change the warning in the coordinator ID check (m.coordinatorID !=
msg.From) to avoid logging msg itself and instead log only safe metadata fields
from the AddMaintainerRequest such as msg.Type, msg.From (use zap.Stringer if
appropriate) and msg.ChangefeedID so the message body/config is never written to
logs.
- Around line 74-75: Register the maintainer in the registry atomically before
seeding it from a snapshot: instead of calling getDispatcherDrainTarget() and
then m.maintainers.handleAddMaintainer(...) which seeds the maintainer before it
is visible, first perform an atomic register (e.g., use a per-changefeed lock or
m.maintainers.LoadOrStore equivalent) to ensure only one add wins and the
maintainer becomes visible, then apply the snapshot/drain-target returned by
getDispatcherDrainTarget() to that already-registered maintainer; update the
same pattern at the other occurrences referenced (the blocks around lines
156-185 and 258-262) to eliminate the separate Load/Store window and prevent
duplicate-add and stale-target races.
- Around line 168-177: The code currently panics on json.Unmarshal(req.Config,
info) errors and when req.CheckpointTs == 0 (using log.Panic), which can crash
the node; instead, detect these two conditions (the json.Unmarshal failure for
req.Config into info, and req.CheckpointTs == 0 for the incoming
req/changefeedID), log a structured error (use log.Error or log.Warn with
zap.Error(err), zap.Stringer("changefeedID", changefeedID),
zap.Uint64("checkpointTs", req.CheckpointTs), zap.Any("info", info)) and reject
the add-maintainer request by returning an error (or otherwise signaling failure
to the caller) rather than calling log.Panic so the process does not exit and
the bad request is dropped/reported.
- Around line 277-279: Replace the bare return of ctx.Err() in the select block
with a wrapped error using errors.Trace to preserve stack information; i.e.,
when handling <-ctx.Done() capture the error (e := ctx.Err() or err :=
ctx.Err()) and return errors.Trace(err) instead of returning ctx.Err() directly
(update the select case in maintainer_manager_maintainers.go where the code
currently returns ctx.Err()).
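The last item above asks for errors.Trace around ctx.Err(). Below is a minimal sketch of that pattern, assuming the pingcap/errors package already used in this repository; the surrounding function and channel are placeholders, not the PR's real code.

```go
package maintainer

import (
	"context"

	"github.com/pingcap/errors"
)

// waitNextMessage is a placeholder showing the suggested shape of the select:
// wrap ctx.Err() with errors.Trace so the returned error carries a stack trace.
func waitNextMessage(ctx context.Context, msgCh <-chan any) (any, error) {
	select {
	case <-ctx.Done():
		return nil, errors.Trace(ctx.Err())
	case msg := <-msgCh:
		return msg, nil
	}
}
```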
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 470fe784-acff-4162-88e8-31176a2ebaca
⛔ Files ignored due to path filters (1)
heartbeatpb/heartbeat.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (9)
- heartbeatpb/heartbeat.proto
- maintainer/maintainer.go
- maintainer/maintainer_controller.go
- maintainer/maintainer_manager.go
- maintainer/maintainer_manager_maintainers.go
- maintainer/maintainer_manager_node.go
- maintainer/node_liveness_test.go
- maintainer/scheduler/drain_common.go
- pkg/messaging/message.go
Actionable comments posted: 1
🧹 Nitpick comments (1)
maintainer/maintainer_manager_maintainers.go (1)
176-180: Wrap the decode error before logging it. This reject path logs the raw json.Unmarshal error, so it loses the stack context the repo expects on library failures. Using errors.Trace(err) here keeps the current behavior but preserves the stack trace in the log.
🩹 Small fix
 log.Error("ignore add maintainer request with invalid config",
 	zap.Stringer("changefeedID", changefeedID),
 	zap.Int("configBytes", len(req.Config)),
-	zap.Error(err))
+	zap.Error(errors.Trace(err)))
Based on learnings (applies to **/*.go): When an error comes from a third-party or library call in Go, wrap it immediately with errors.Trace(err) or errors.WrapError(...) to attach a stack trace; upstream callers should propagate wrapped errors without wrapping again.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@maintainer/maintainer_manager_maintainers.go` around lines 176 - 180, The json.Unmarshal error is logged raw; wrap it with errors.Trace before logging to preserve stack context — change the error argument in the reject path around the json.Unmarshal in the block that currently calls log.Error("ignore add maintainer request with invalid config", zap.Stringer("changefeedID", changefeedID), zap.Int("configBytes", len(req.Config)), zap.Error(err)) to use a traced error (e.g. traced := errors.Trace(err) and pass zap.Error(traced) or inline zap.Error(errors.Trace(err))) so the log records the wrapped error with stack info while keeping the existing log fields (changefeedID, configBytes).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@maintainer/maintainer_manager_maintainers.go`:
- Around line 88-100: The code is indexing msg.Message[0] and doing unchecked
type assertions in the coordinator-mismatch block (around the m.coordinatorID !=
msg.From handling and the cases for messaging.TypeAddMaintainerRequest and
messaging.TypeRemoveMaintainerRequest); guard against empty or unexpected
payloads by first checking len(msg.Message) > 0 and then performing type
assertions using the "ok" form (e.g., v, ok :=
msg.Message[0].(*heartbeatpb.AddMaintainerRequest)) and, on failure, append a
descriptive field (or log) indicating a missing/invalid payload and skip adding
the changefeedID instead of panicking; apply the same checks for the other cases
referenced (including the similar logic at lines ~105-109).
---
Nitpick comments:
In `@maintainer/maintainer_manager_maintainers.go`:
- Around line 176-180: The json.Unmarshal error is logged raw; wrap it with
errors.Trace before logging to preserve stack context — change the error
argument in the reject path around the json.Unmarshal in the block that
currently calls log.Error("ignore add maintainer request with invalid config",
zap.Stringer("changefeedID", changefeedID), zap.Int("configBytes",
len(req.Config)), zap.Error(err)) to use a traced error (e.g. traced :=
errors.Trace(err) and pass zap.Error(traced) or inline
zap.Error(errors.Trace(err))) so the log records the wrapped error with stack
info while keeping the existing log fields (changefeedID, configBytes).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 17428b83-c9d7-48e5-81d1-9797660fed85
📒 Files selected for processing (5)
- maintainer/maintainer_manager_maintainers.go
- maintainer/maintainer_manager_node.go
- maintainer/node_liveness_test.go
- maintainer/scheduler/drain_common.go
- maintainer/scheduler/drain_common_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- maintainer/maintainer_manager_node.go
- maintainer/scheduler/drain_common.go
- maintainer/node_liveness_test.go
if m.coordinatorID != msg.From {
	fields := []zap.Field{
		zap.String("type", msg.Type.String()),
		zap.Stringer("coordinatorID", m.coordinatorID),
		zap.Stringer("from", msg.From),
	}
	switch msg.Type {
	case messaging.TypeAddMaintainerRequest:
		changefeedID := common.NewChangefeedIDFromPB(msg.Message[0].(*heartbeatpb.AddMaintainerRequest).Id)
		fields = append(fields, zap.Stringer("changefeedID", changefeedID))
	case messaging.TypeRemoveMaintainerRequest:
		changefeedID := common.NewChangefeedIDFromPB(msg.Message[0].(*heartbeatpb.RemoveMaintainerRequest).Id)
		fields = append(fields, zap.Stringer("changefeedID", changefeedID))
Validate the control-message payload before indexing msg.Message[0].
Line 96, Line 99, and Line 107 all assume the first payload entry exists and has the expected protobuf type. An empty or mismatched internal message will still panic this manager goroutine before it can be ignored, including the coordinator-mismatch path.
🔧 Minimal hardening sketch
switch msg.Type {
case messaging.TypeAddMaintainerRequest:
- changefeedID := common.NewChangefeedIDFromPB(msg.Message[0].(*heartbeatpb.AddMaintainerRequest).Id)
+ if len(msg.Message) != 1 {
+ log.Warn("ignore malformed add maintainer request",
+ zap.Int("messageCount", len(msg.Message)))
+ return nil
+ }
+ req, ok := msg.Message[0].(*heartbeatpb.AddMaintainerRequest)
+ if !ok || req == nil {
+ log.Warn("ignore malformed add maintainer request")
+ return nil
+ }
+ changefeedID := common.NewChangefeedIDFromPB(req.Id)
fields = append(fields, zap.Stringer("changefeedID", changefeedID))
case messaging.TypeRemoveMaintainerRequest:
- changefeedID := common.NewChangefeedIDFromPB(msg.Message[0].(*heartbeatpb.RemoveMaintainerRequest).Id)
+ if len(msg.Message) != 1 {
+ log.Warn("ignore malformed remove maintainer request",
+ zap.Int("messageCount", len(msg.Message)))
+ return nil
+ }
+ req, ok := msg.Message[0].(*heartbeatpb.RemoveMaintainerRequest)
+ if !ok || req == nil {
+ log.Warn("ignore malformed remove maintainer request")
+ return nil
+ }
+ changefeedID := common.NewChangefeedIDFromPB(req.Id)
fields = append(fields, zap.Stringer("changefeedID", changefeedID))
}Also applies to: 105-109
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@maintainer/maintainer_manager_maintainers.go` around lines 88 - 100, The code
is indexing msg.Message[0] and doing unchecked type assertions in the
coordinator-mismatch block (around the m.coordinatorID != msg.From handling and
the cases for messaging.TypeAddMaintainerRequest and
messaging.TypeRemoveMaintainerRequest); guard against empty or unexpected
payloads by first checking len(msg.Message) > 0 and then performing type
assertions using the "ok" form (e.g., v, ok :=
msg.Message[0].(*heartbeatpb.AddMaintainerRequest)) and, on failure, append a
descriptive field (or log) indicating a missing/invalid payload and skip adding
the changefeedID instead of panicking; apply the same checks for the other cases
referenced (including the similar logic at lines ~105-109).
[FORMAT CHECKER NOTIFICATION] Notice: To remove the … 📖 For more info, you can check the "Contribute Code" section in the development guide.
Actionable comments posted: 2
♻️ Duplicate comments (1)
maintainer/maintainer_manager_maintainers.go (1)
99-115: ⚠️ Potential issue | 🟡 Minor: Validate control-message payload before indexing msg.Message[0].
Lines 107 and 110 perform unchecked type assertions that will panic if the message is malformed. This was flagged in a past review but the validation is still missing in the coordinator-mismatch logging path.
🛡️ Suggested hardening
 if m.coordinatorID != msg.From {
 	fields := []zap.Field{
 		zap.String("type", msg.Type.String()),
 		zap.Stringer("coordinatorID", m.coordinatorID),
 		zap.Stringer("from", msg.From),
 	}
+	if len(msg.Message) == 0 {
+		log.Warn("ignore invalid coordinator id with empty payload", fields...)
+		return nil
+	}
 	switch msg.Type {
 	case messaging.TypeAddMaintainerRequest:
-		changefeedID := common.NewChangefeedIDFromPB(msg.Message[0].(*heartbeatpb.AddMaintainerRequest).Id)
+		req, ok := msg.Message[0].(*heartbeatpb.AddMaintainerRequest)
+		if ok && req != nil {
+			changefeedID := common.NewChangefeedIDFromPB(req.Id)
+			fields = append(fields, zap.Stringer("changefeedID", changefeedID))
+		}
-		fields = append(fields, zap.Stringer("changefeedID", changefeedID))
 	case messaging.TypeRemoveMaintainerRequest:
-		changefeedID := common.NewChangefeedIDFromPB(msg.Message[0].(*heartbeatpb.RemoveMaintainerRequest).Id)
+		req, ok := msg.Message[0].(*heartbeatpb.RemoveMaintainerRequest)
+		if ok && req != nil {
+			changefeedID := common.NewChangefeedIDFromPB(req.Id)
+			fields = append(fields, zap.Stringer("changefeedID", changefeedID))
+		}
-		fields = append(fields, zap.Stringer("changefeedID", changefeedID))
 	}
 	log.Warn("ignore invalid coordinator id", fields...)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@maintainer/maintainer_manager_maintainers.go` around lines 99 - 115, The coordinator-mismatch logging path does unchecked indexing and type assertions on msg.Message which can panic; before accessing msg.Message[0] in the block that handles messaging.TypeAddMaintainerRequest and messaging.TypeRemoveMaintainerRequest, validate that msg.Message is non-nil and has at least one element and perform safe type assertions (e.g. using a type switch or the "v, ok := msg.Message[0].(*heartbeatpb.AddMaintainerRequest)" pattern) and only append changefeedID when the cast succeeds; if validation fails, log a safe fallback (e.g. include message length or raw type) instead of panicking — apply these checks around the branches that reference AddMaintainerRequest and RemoveMaintainerRequest and use m.coordinatorID, msg.Type, and msg.From in the unchanged part of the log.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@maintainer/maintainer_manager_maintainers.go`:
- Around line 234-238: Race can occur when calling
p.registry.Store(changefeedID, maintainer) causing an existing maintainer to be
overwritten and leaked; replace this with LoadOrStore: create the new maintainer
via NewMaintainerForRemove(changefeedID, p.conf, p.nodeInfo, p.taskScheduler,
req.KeyspaceId), call actual, loaded := p.registry.LoadOrStore(changefeedID,
newMaintainer), and if loaded is true then close/stop the newly created
maintainer to avoid leaking goroutines and use actual.(MaintainerType) as the
maintainer; if loaded is false use the new maintainer as stored.
In `@maintainer/maintainer_manager_node.go`:
- Around line 156-180: The handler onSetDispatcherDrainTargetRequest currently
does an unchecked type assertion into req :=
msg.Message[0].(*heartbeatpb.SetDispatcherDrainTargetRequest) which can panic if
msg.Message is empty or contains the wrong type; add guard logic to validate
msg.Message has at least one element and perform the type assertion with the
comma-ok idiom (req, ok :=
msg.Message[0].(*heartbeatpb.SetDispatcherDrainTargetRequest)), log a
warning/error if validation fails (including msg.From and coordinatorID context)
and return early instead of panicking; keep the existing flow (target,
tryUpdateDispatcherDrainTarget, applyDispatcherDrainTarget, sendNodeHeartbeat)
unchanged when the assertion succeeds; a minimal guard sketch for this handler is included after this prompt list.
---
Duplicate comments:
In `@maintainer/maintainer_manager_maintainers.go`:
- Around line 99-115: The coordinator-mismatch logging path does unchecked
indexing and type assertions on msg.Message which can panic; before accessing
msg.Message[0] in the block that handles messaging.TypeAddMaintainerRequest and
messaging.TypeRemoveMaintainerRequest, validate that msg.Message is non-nil and
has at least one element and perform safe type assertions (e.g. using a type
switch or the "v, ok := msg.Message[0].(*heartbeatpb.AddMaintainerRequest)"
pattern) and only append changefeedID when the cast succeeds; if validation
fails, log a safe fallback (e.g. include message length or raw type) instead of
panicking — apply these checks around the branches that reference
AddMaintainerRequest and RemoveMaintainerRequest and use m.coordinatorID,
msg.Type, and msg.From in the unchanged part of the log.
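As referenced in the maintainer_manager_node.go item above, here is a minimal guard sketch for the drain-target handler: check the payload length and use the comma-ok form of the type assertion before acting on the request. The message and request structs below are simplified placeholders, not the repo's real messaging or heartbeatpb types.

```go
package maintainer

import (
	"github.com/pingcap/log"
	"go.uber.org/zap"
)

// setDrainTargetRequest is a placeholder standing in for
// heartbeatpb.SetDispatcherDrainTargetRequest in this sketch.
type setDrainTargetRequest struct {
	TargetNodeID string
	TargetEpoch  uint64
}

// controlMessage is a placeholder for the incoming control message type.
type controlMessage struct {
	From    string
	Message []any
}

// onSetDrainTarget validates the payload before using it, so a malformed or
// empty message is logged and ignored instead of panicking the handler.
func onSetDrainTarget(msg *controlMessage, apply func(*setDrainTargetRequest)) {
	if len(msg.Message) == 0 {
		log.Warn("ignore drain target request with empty payload",
			zap.String("from", msg.From))
		return
	}
	req, ok := msg.Message[0].(*setDrainTargetRequest)
	if !ok || req == nil {
		log.Warn("ignore drain target request with unexpected payload",
			zap.String("from", msg.From))
		return
	}
	apply(req)
}
```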
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 26fd4fd4-f6af-4420-88e9-1e25be118028
⛔ Files ignored due to path filters (1)
heartbeatpb/heartbeat.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (4)
- heartbeatpb/heartbeat.proto
- maintainer/maintainer_manager_maintainers.go
- maintainer/maintainer_manager_node.go
- maintainer/scheduler/drain_common.go
// It's cascade remove, we should remove the dispatcher from all node.
// Here we create a maintainer to run the remove dispatcher logic.
maintainer = NewMaintainerForRemove(changefeedID, p.conf, p.nodeInfo, p.taskScheduler, req.KeyspaceId)
p.registry.Store(changefeedID, maintainer)
Cascade remove uses Store which can leak a maintainer on race.
If two cascade-remove requests race for the same changefeed, Store at line 238 can overwrite a previously stored maintainer without closing it, leaking its goroutines. Consider using LoadOrStore and handling the loaded case.
🛡️ Suggested fix
// It's cascade remove, we should remove the dispatcher from all node.
// Here we create a maintainer to run the remove dispatcher logic.
- maintainer = NewMaintainerForRemove(changefeedID, p.conf, p.nodeInfo, p.taskScheduler, req.KeyspaceId)
- p.registry.Store(changefeedID, maintainer)
+ newMaintainer := NewMaintainerForRemove(changefeedID, p.conf, p.nodeInfo, p.taskScheduler, req.KeyspaceId)
+ actual, loaded := p.registry.LoadOrStore(changefeedID, newMaintainer)
+ if loaded {
+ // Another request raced us; close our redundant instance and use the existing one.
+ newMaintainer.Close()
+ }
+ maintainer = actual
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@maintainer/maintainer_manager_maintainers.go` around lines 234 - 238, Race
can occur when calling p.registry.Store(changefeedID, maintainer) causing an
existing maintainer to be overwritten and leaked; replace this with LoadOrStore:
create the new maintainer via NewMaintainerForRemove(changefeedID, p.conf,
p.nodeInfo, p.taskScheduler, req.KeyspaceId), call actual, loaded :=
p.registry.LoadOrStore(changefeedID, newMaintainer), and if loaded is true then
close/stop the newly created maintainer to avoid leaking goroutines and use
actual.(MaintainerType) as the maintainer; if loaded is false use the new
maintainer as stored.
This write path is serialized today, so I do not think the Store here can race with another cascade-remove Store in the current design.
All add/remove control messages first go through recvMessages() into m.msgCh, and Manager.Run() processes handleMessage() from that channel on a single goroutine. That means handleRemoveMaintainer() is not executed concurrently for two control messages of the same changefeed. The other concurrent paths touching registry (dispatchMaintainerMessage, heartbeat building, cleanup iteration) only Load/Range; they do not create another remove maintainer for the same key.
So while LoadOrStore would be a reasonable defensive hardening if we ever make the write path concurrent, I do not see a real goroutine leak/overwrite race here with the current manager event-loop model.
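To make that argument concrete, this is the shape of the loop being described, reduced to a few lines. Only the msgCh and handleMessage names come from the discussion; everything else in this sketch is illustrative.

```go
package maintainer

import "context"

// runControlLoop mirrors the pattern described above: every coordinator control
// message goes through one channel and is handled by this single goroutine, so
// handleAddMaintainer/handleRemoveMaintainer never run concurrently and a plain
// registry Store cannot race with another Store for the same changefeed.
func runControlLoop(ctx context.Context, msgCh <-chan any, handleMessage func(any)) {
	for {
		select {
		case <-ctx.Done():
			return
		case msg := <-msgCh:
			handleMessage(msg) // one message at a time
		}
	}
}
```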
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check Manager.Run() and recvMessages() to verify serial processing
rg -n "func.*Manager.*Run\|func.*recvMessages\|m\.msgCh\|msgCh" maintainer/maintainer_manager.go -A5 | head -100
Length of output: 0
🏁 Script executed:
#!/bin/bash
# Broader search for the message channel and run loop
rg -n "msgCh\|recvMessages\|handleMessage" maintainer/ -A3 | head -150
echo "---"
# Look at Manager.Run implementation
ast-grep --pattern 'func (m *Manager) Run($$$) {
$$$
}'
Length of output: 3
🏁 Script executed:
#!/bin/bash
# List all files in maintainer directory
fd . maintainer/ --type f
echo "---"
# Try grep on all go files
grep -rn "msgCh\|recvMessages\|handleMessage\|func.*Run" maintainer/ 2>/dev/null | head -80
Length of output: 4799
@hongyunyan, you're right — the code confirms your explanation. Manager.Run() processes m.msgCh in a single select loop (lines 134–135 of maintainer_manager.go), and recvMessages only pushes incoming messages onto that channel. Since handleMessage (and therefore handleRemoveMaintainer) is always called from that one goroutine, two cascade-remove requests for the same changefeed cannot execute concurrently, and the Store at line 238 cannot race with another Store for the same key.
My original concern was valid only if the write path were concurrent. Since it's fully serialized in the current design, the LoadOrStore hardening is unnecessary today. Thanks for the clear explanation!
✏️ Learnings added
Learnt from: hongyunyan
Repo: pingcap/ticdc PR: 4759
File: maintainer/maintainer_manager_maintainers.go:234-238
Timestamp: 2026-04-09T01:20:43.816Z
Learning: In the TiCDC maintainer package (`maintainer/maintainer_manager.go`), the Manager's write path for add/remove maintainer requests is fully serialized: `recvMessages` pushes all coordinator control messages onto `m.msgCh` (buffered channel), and `Manager.Run()` processes them one-at-a-time via a single `select` loop calling `handleMessage`. Therefore, `handleRemoveMaintainer` (and `handleAddMaintainer`) are never executed concurrently, and `sync.Map.Store` in `handleRemoveMaintainer` cannot race with another concurrent `Store` for the same changefeed key in the current design. Race-safety concerns about `LoadOrStore` vs `Store` for the cascade-remove path are theoretical/future-proofing only.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
What problem does this PR solve?
The dispatcher-drain work split out of #4190 still mixes the drain target protocol, maintainer runtime behavior, coordinator scheduling, and public API orchestration in one review path. This PR extracts the drain target protocol and maintainer-manager plumbing into a dedicated review unit so reviewers can focus on how the target is propagated and acknowledged before looking at scheduler behavior.
Issue Number: ref #4190
What is changed and how it works?
Background:
- … master.
- #4523 still mixed protocol, runtime scheduling, and API orchestration.
- … master and keeps only the drain target plumbing.

Motivation:
Summary:
- SetDispatcherDrainTargetRequest …

How it works:
Check List
Tests
go test ./maintainer -run 'TestSetDispatcherDrainTarget|TestSetNodeLiveness'
go test ./pkg/messaging ./heartbeatpb
Will it cause performance regression or break compatibility?
This PR does not add a user-facing API by itself. It only adds the internal protocol and manager-level drain target propagation needed by the later runtime and API layers.
Do you need to update user documentation, design documentation or monitoring documentation?
No additional user-facing documentation is needed for this split. It is an internal decomposition of the drain-capture implementation.
Release note
Summary by CodeRabbit
New Features
Improvements
Tests