
*: reduce log print frequency #4701

Draft
wk989898 wants to merge 3 commits into pingcap:master from wk989898:log-0403

Conversation

@wk989898
Collaborator

@wk989898 wk989898 commented Apr 3, 2026

What problem does this PR solve?

Issue Number: close #4734, ref #4691

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

  • Bug Fixes
    • Rate-limited repeated warning logs across dispatcher, event collection, region request handling, schema store, and maintainer subsystems to reduce log spam while preserving error handling and control flow.
    • Changed schema store upper-bound persistence log from warning to debug level for clearer operational visibility.

Signed-off-by: wk989898 <nhsmwk@gmail.com>
@ti-chi-bot ti-chi-bot bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Apr 3, 2026
@ti-chi-bot

ti-chi-bot bot commented Apr 3, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign 3aceshowhand for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 3, 2026
@coderabbitai
Contributor

coderabbitai bot commented Apr 3, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f4a32d40-b085-4326-8dd1-540e3e11130f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

This PR implements comprehensive rate-limiting for repeated warning and error logs across 14 files using atomic timestamps and compare-and-swap operations. The changes suppress duplicate log emissions within configured intervals to prevent disk space exhaustion and system stalls.

Changes

  • Dispatcher Components (downstreamadapter/dispatcher/basic_dispatcher.go, downstreamadapter/dispatchermanager/dispatcher_manager.go, downstreamadapter/eventcollector/dispatcher_stat.go, downstreamadapter/eventcollector/event_collector.go): Added rate-limiting for warning logs related to error channel full, dispatcher state removal, stale events, handshake issues, and dispatcher request retry failures using atomic timestamp fields and throttling helpers with configurable intervals (10s).
  • Log Puller Components (logservice/logpuller/region_req_cache.go, logservice/logpuller/region_request_worker.go, logservice/logpuller/subscription_client.go): Introduced throttling mechanisms for region request warnings (overwrite, stale deletion), connection issues, unexpected region events, and resolve-lock failures. Refactored control flow in region_request_worker.go to use a switch statement for cleaner region state handling.
  • Schema Store Components (logservice/schemastore/disk_format.go, logservice/schemastore/persist_storage.go, logservice/schemastore/schema_store.go): Added rate-limited retry logging in disk_format.go and schema_store.go; changed log level from Warn to Debug in persist_storage.go for unchanged upper bound status to reduce noise.
  • System Components (maintainer/maintainer.go, pkg/messaging/message_center.go, pkg/messaging/remote_target.go, server/watcher/module_node_manager.go): Implemented rate-limiting for progress advance blocked warnings, connection issues, stream readiness warnings, and coordinator error warnings. Refactored missing node logging in maintainer.go to aggregate per-node warnings into single throttled warnings with counts and IDs.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • pingcap/ticdc#4461 — Modifies BasicDispatcher.handleEvents in the same file with DDL table span logging changes, creating potential merge conflicts or interaction concerns.

Suggested labels

lgtm, approved, size/XXL, release-note-none

Suggested reviewers

  • lidezhu
  • asddongmen
  • flowbehappy

Poem

🐰 Logs grow wild when things go wrong,
Filling disks with tales too long,
We rate-limit, suppress the roar—
Ten seconds quiet, then warn some more! 🎯

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 16.67%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check ❓ Inconclusive: The PR description follows the template structure and references issue #4691 (disk may full if print too many log), but the critical "What is changed and how it works?" section is completely empty, leaving the implementation approach unclear. Resolution: add a detailed explanation of how the changes reduce log frequency (e.g., rate-limiting mechanisms, intervals, specific logs affected) and clarify which modules implement this solution.
✅ Passed checks (3 passed)
  • Title check ✅ Passed: The title "reduce log print frequency" directly and clearly summarizes the main objective of the changeset: implementing rate-limiting mechanisms across multiple files to suppress repeated log emissions.
  • Linked Issues check ✅ Passed: The changes implement rate-limiting mechanisms across 12 files to reduce log frequency, directly addressing issue #4691's objective of preventing disk exhaustion from excessive log output.
  • Out of Scope Changes check ✅ Passed: All code changes are narrowly focused on adding log throttling/rate-limiting mechanisms. No unrelated changes or refactoring outside the scope of reducing log frequency were introduced.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.

@wk989898 wk989898 changed the title *: reduce the frequence of log print *: reduce log print frequency Apr 3, 2026
@wk989898 wk989898 marked this pull request as draft April 3, 2026 06:45
@ti-chi-bot ti-chi-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 3, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements log rate-limiting across several components (including the dispatcher, maintainer, and messaging modules) to reduce log noise on hot paths, typically using a 10-second interval. The review feedback highlights that the rate-limiting logic is duplicated and inconsistent across the codebase, and suggests refactoring it into a common utility. Additionally, there are concerns regarding the robustness of the duration checks against system clock adjustments and suggestions to optimize slice allocations in the maintainer's heartbeat monitoring loops.

Comment on lines +40 to +51
func shouldLogDispatcherWarning(lastLogTime *atomic.Int64, interval time.Duration) bool {
	now := time.Now().UnixNano()
	for {
		last := lastLogTime.Load()
		if last != 0 && now-last < interval.Nanoseconds() {
			return false
		}
		if lastLogTime.CAS(last, now) {
			return true
		}
	}
}


medium

The logic for rate-limiting log messages is duplicated across multiple files in this PR (e.g., basic_dispatcher.go, dispatcher_manager.go, dispatcher_stat.go, subscription_client.go, maintainer.go, remote_target.go, and module_node_manager.go).

Furthermore, the implementation is inconsistent: some files use go.uber.org/atomic (calling .CAS()), while others use sync/atomic (calling .CompareAndSwap()).

Please refactor this into a single utility function in a common package (e.g., pkg/util) to improve maintainability and ensure consistency across the codebase.

Comment on lines +44 to +46
if last != 0 && now-last < interval.Nanoseconds() {
return false
}


medium

The duration check now - last < interval.Nanoseconds() using Unix nanoseconds is not robust against system clock adjustments. If the system clock is set backwards, the result of the subtraction can be negative, which will be less than the interval, causing logging to be suppressed until the wall clock catches up to the previously stored timestamp.

Consider using time.Since() or explicitly handling negative deltas to avoid silencing logs for extended periods during clock jumps.

}
needUpdate := false
updateCheckpointTs := true
missingNodes := make([]node.ID, 0)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Allocating a slice with make([]node.ID, 0) on every iteration of the ticker loop (every second) is inefficient when no nodes are missing. Using a nil slice is more efficient as it avoids allocation in the common case.

Suggested change
missingNodes := make([]node.ID, 0)
var missingNodes []node.ID


// Step 2: Apply heartbeat constraints from all nodes
updateCheckpointTs := true
missingNodes := make([]node.ID, 0)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using var missingNodes []node.ID instead of make([]node.ID, 0) avoids an unnecessary allocation when all nodes are present.

Suggested change
missingNodes := make([]node.ID, 0)
var missingNodes []node.ID

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
pkg/messaging/remote_target.go (1)

519-536: Consider extracting the stream-ready wait loop.

The wait-for-stream-ready logic (polling with rate-limited logging) is duplicated between runSendMessages and runReceiveMessages. Consider extracting to a helper method like waitForStreamReady(ctx context.Context) error to reduce duplication.

♻️ Suggested helper extraction
// waitForStreamReady blocks until the stream is ready or the context is cancelled.
func (s *remoteMessageTarget) waitForStreamReady(ctx context.Context) error {
	for {
		if s.isReadyToSend() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(500 * time.Millisecond):
			if s.shouldLogConnectionIssue(time.Now(), connectionIssueLogInterval) {
				log.Warn("remote target stream is not ready, wait and check again",
					zap.Stringer("localID", s.messageCenterID),
					zap.String("localAddr", s.localAddr),
					zap.Stringer("remoteID", s.targetId),
					zap.String("remoteAddr", s.targetAddr))
			}
		}
	}
}

Then in both runSendMessages and runReceiveMessages:

-	// wait stream ready
-	for {
-		if s.isReadyToSend() {
-			break
-		}
-		select {
-		case <-ctx.Done():
-			return ctx.Err()
-		case <-time.After(500 * time.Millisecond):
-			if s.shouldLogConnectionIssue(time.Now(), connectionIssueLogInterval) {
-				log.Warn("remote target stream is not ready, wait and check again",
-					...
-			}
-			continue
-		}
-	}
+	if err := s.waitForStreamReady(ctx); err != nil {
+		return err
+	}

Also applies to: 597-614

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/messaging/remote_target.go` around lines 519 - 536, Extract the
duplicated wait loop into a method on remoteMessageTarget named
waitForStreamReady(ctx context.Context) error that returns nil when
s.isReadyToSend() becomes true and ctx.Err() if the context is canceled; move
the polling interval (500ms) and the rate-limited logging using
s.shouldLogConnectionIssue(time.Now(), connectionIssueLogInterval) and the
existing log.Warn call into that helper; then replace the loop in
runSendMessages and runReceiveMessages with a call to s.waitForStreamReady(ctx)
and propagate the returned error.
server/watcher/module_node_manager.go (1)

86-92: CAS without retry loop may cause missed log opportunities.

Unlike the loop-based CAS pattern used in other files (dispatcher_stat.go, dispatcher_manager.go, basic_dispatcher.go), this implementation doesn't retry on CAS failure. If two goroutines race and one fails the CAS, it returns false even though it might be the first one to actually check after the interval elapsed.

This is a minor inconsistency but not a critical issue since it only affects logging frequency slightly.

🔧 Suggested fix for consistency with other files
 func (c *NodeManager) shouldLogGetCoordinatorErr(now time.Time) bool {
-	last := c.lastGetCoordinatorErrLogTime.Load()
-	if last != 0 && now.UnixNano()-last < getCoordinatorErrLogInterval.Nanoseconds() {
-		return false
+	nowNano := now.UnixNano()
+	for {
+		last := c.lastGetCoordinatorErrLogTime.Load()
+		if last != 0 && nowNano-last < getCoordinatorErrLogInterval.Nanoseconds() {
+			return false
+		}
+		if c.lastGetCoordinatorErrLogTime.CompareAndSwap(last, nowNano) {
+			return true
+		}
 	}
-	return c.lastGetCoordinatorErrLogTime.CompareAndSwap(last, now.UnixNano())
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/watcher/module_node_manager.go` around lines 86 - 92, The
shouldLogGetCoordinatorErr method in NodeManager uses a one-shot CompareAndSwap
which can lose a logging opportunity under contention; update
shouldLogGetCoordinatorErr to use the same CAS-retry loop pattern found in
dispatcher_stat.go/dispatcher_manager.go/basic_dispatcher.go: repeatedly Load
lastGetCoordinatorErrLogTime, return false if now-last <
getCoordinatorErrLogInterval, otherwise attempt CompareAndSwap(last,
now.UnixNano()) in a loop until the CAS succeeds (return true) or another
goroutine advances the timestamp (in which case re-evaluate and return false if
interval not elapsed); reference NodeManager.shouldLogGetCoordinatorErr,
lastGetCoordinatorErrLogTime, and getCoordinatorErrLogInterval.
downstreamadapter/dispatcher/basic_dispatcher.go (1)

40-51: Use sync/atomic instead of go.uber.org/atomic for consistency.

This file uses go.uber.org/atomic while most other changed files use standard library sync/atomic (dispatcher_stat.go, dispatcher_manager.go, event_collector.go, and module_node_manager.go). Both approaches work correctly, but aligning on a single package across the codebase would improve maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@downstreamadapter/dispatcher/basic_dispatcher.go` around lines 40 - 51, The
function shouldLogDispatcherWarning currently accepts lastLogTime *atomic.Int64
and uses methods Load and CAS from go.uber.org/atomic; change it to use the
standard library by making lastLogTime a pointer to int64 (e.g., *int64) and
replace lastLogTime.Load()/lastLogTime.CAS(...) with sync/atomic calls:
atomic.LoadInt64(lastLogTime), atomic.CompareAndSwapInt64(lastLogTime, last,
now), and compute now as int64 using time.Now().UnixNano(); update any callers
that pass an *atomic.Int64 to instead pass the underlying *int64 (or adapt
initialization) so the function signature and usages consistently use
sync/atomic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 20a72332-0801-4e3c-b9f5-51f512ee72ad

📥 Commits

Reviewing files that changed from the base of the PR and between 19400c6 and ee088fc.

📒 Files selected for processing (14)
  • downstreamadapter/dispatcher/basic_dispatcher.go
  • downstreamadapter/dispatchermanager/dispatcher_manager.go
  • downstreamadapter/eventcollector/dispatcher_stat.go
  • downstreamadapter/eventcollector/event_collector.go
  • logservice/logpuller/region_req_cache.go
  • logservice/logpuller/region_request_worker.go
  • logservice/logpuller/subscription_client.go
  • logservice/schemastore/disk_format.go
  • logservice/schemastore/persist_storage.go
  • logservice/schemastore/schema_store.go
  • maintainer/maintainer.go
  • pkg/messaging/message_center.go
  • pkg/messaging/remote_target.go
  • server/watcher/module_node_manager.go

Comment on lines +114 to +119
if shouldLogLogPullerWarning(&worker.lastConnectionIssueLogTime, logPullerWarnLogInterval) {
	log.Error("event feed check store version fails",
		zap.Uint64("workerID", worker.workerID),
		zap.String("addr", worker.store.storeAddr),
		zap.Error(err))
}
Contributor


⚠️ Potential issue | 🟠 Major

Don’t leave Recv failures unthrottled on the reconnect loop.

Line 221 in receiveAndDispatchChangeEvents still logs every stream Recv failure, and it does so before the EOF fast-path. On a flapping store each worker can still emit one info log per reconnect cycle, so the disk-growth problem behind issue #4691 is not fully addressed. Please either return silently on EOF and throttle the remaining log, or guard that site with shouldLogLogPullerWarning(&s.lastConnectionIssueLogTime, logPullerWarnLogInterval).

Also applies to: 181-186, 354-361

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@logservice/logpuller/region_request_worker.go` around lines 114 - 119, The
Recv error logging in receiveAndDispatchChangeEvents is unthrottled and logs
before handling EOF; update the Recv error handling so that if err == io.EOF (or
the stream closed fast-path) you return silently, and for other errors wrap the
log call with shouldLogLogPullerWarning using the worker receiver's
lastConnectionIssueLogTime (e.g. worker.lastConnectionIssueLogTime) and
logPullerWarnLogInterval; apply the same throttling guard to the other Recv
logging sites mentioned (the blocks around the Recv call at/near
receiveAndDispatchChangeEvents and the other occurrences) so flapping stores
don’t emit a log every reconnect.

Comment on lines +40 to 44
reconnectInterval = 2 * time.Second
connectionIssueLogInterval = 10 * time.Second
streamTypeEvent = "event"
streamTypeCommand = "command"

Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for usages of reconnectInterval across the codebase
rg -n '\breconnectInterval\b' --type go

Repository: pingcap/ticdc

Length of output: 136


Remove the unused reconnectInterval constant at line 40.

It's defined but never referenced in the codebase. Either remove it or add a comment explaining its intended purpose.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/messaging/remote_target.go` around lines 40 - 44, Remove the unused
constant reconnectInterval from the top-level constants in remote_target.go
(currently defined as reconnectInterval = 2 * time.Second); locate the const
block that also declares connectionIssueLogInterval, streamTypeEvent, and
streamTypeCommand, and delete the reconnectInterval entry to avoid dead code, or
alternatively replace it with a brief comment explaining its intended purpose if
you expect to use it later.

Signed-off-by: wk989898 <nhsmwk@gmail.com>
Signed-off-by: wk989898 <nhsmwk@gmail.com>

Labels

  • do-not-merge/needs-triage-completed
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

reduce log frequency

1 participant