Avoid log spam about cluster node failure detection by each primary #2010

hpatro · 2025-04-26T06:47:46Z

Fixes: #2076

After a node failure is detected or recovered and each primary gossips about it, we log every failure detection/recovery report at NOTICE level. This can spam the server log, and the behavior is quite expensive on EC2 burstable instance types. I would prefer that we roll it back to VERBOSE level.

Change was introduced in #633

Signed-off-by: Harkrishn Patro <[email protected]>

codecov · 2025-04-26T07:02:59Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.01%. Comparing base (0b94ca6) to head (0772c2f).
Report is 98 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2010      +/-   ##
============================================
- Coverage     71.01%   71.01%   -0.01%     
============================================
  Files           123      123              
  Lines         66033    66113      +80     
============================================
+ Hits          46892    46948      +56     
- Misses        19141    19165      +24

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`86.46% <100.00%> (+0.37%)`	⬆️

... and 24 files with indirect coverage changes


sarthakaggarwal97 · 2025-04-28T21:10:32Z

src/cluster_legacy.c

@@ -2409,13 +2409,13 @@ void clusterProcessGossipSection(clusterMsg *hdr, clusterLink *link) {
            if (sender) {
                if (flags & (CLUSTER_NODE_FAIL | CLUSTER_NODE_PFAIL)) {
                    if (clusterNodeIsVotingPrimary(sender) && clusterNodeAddFailureReport(node, sender)) {
-                        serverLog(LL_NOTICE, "Node %.40s (%s) reported node %.40s (%s) as not reachable.", sender->name,
+                        serverLog(LL_VERBOSE, "Node %.40s (%s) reported node %.40s (%s) as not reachable.", sender->name,


Should we make the level Warning? I am wondering if users depend on this log for any debugging.

Also, just out of curiosity, does changing log severity count as a breaking change?

hpatro · 2025-04-29

I think it might be helpful for small clusters, but I am unsure how valuable it is to log it for each primary in a large cluster setup.

My suggestion is to log the state periodically, every few seconds, to make debugging easier. #2011

@sarthakaggarwal97 By the way, Warning would increase the severity of the logging. I want to reduce it.

sarthakaggarwal97 · 2025-04-29

ack! #2011 makes sense to me!
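For readers following along, the periodic-summary idea mentioned in this thread could look roughly like the sketch below: count reports as they arrive and emit at most one line per interval. This is only an illustration. `report_summary`, `summary_record`, and `summary_maybe_flush` are hypothetical names, and nothing here reflects what #2011 actually implements.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical accumulator; none of these names come from Valkey or #2011. */
typedef struct {
    unsigned long unreachable_reports; /* "not reachable" reports since last flush */
    unsigned long reachable_reports;   /* "reachable again" reports since last flush */
    time_t last_flush;                 /* time of the last flush attempt */
    time_t interval;                   /* minimum seconds between summary lines */
} report_summary;

static void summary_init(report_summary *s, time_t now, time_t interval) {
    s->unreachable_reports = 0;
    s->reachable_reports = 0;
    s->last_flush = now;
    s->interval = interval;
}

/* Count one gossip report instead of logging it immediately. */
static void summary_record(report_summary *s, int unreachable) {
    if (unreachable)
        s->unreachable_reports++;
    else
        s->reachable_reports++;
}

/* Emit at most one summary line per interval.
 * Returns 1 if a line was written, 0 otherwise. */
static int summary_maybe_flush(report_summary *s, time_t now) {
    if (now - s->last_flush < s->interval) return 0;
    s->last_flush = now;
    if (s->unreachable_reports == 0 && s->reachable_reports == 0) return 0;
    fprintf(stderr, "failure reports since last summary: %lu unreachable, %lu reachable\n",
            s->unreachable_reports, s->reachable_reports);
    s->unreachable_reports = 0;
    s->reachable_reports = 0;
    return 1;
}
```

With this shape the cost per report is an integer increment, and the number of log lines is bounded by elapsed time rather than by cluster size.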

madolson · 2025-04-29T19:35:32Z

I don't feel strongly either way, so would appreciate input @enjoy-binbin once he is back from vacation.

enjoy-binbin · 2025-05-08T07:12:53Z

OK, so in your environment the nodes frequently enter and exit the pfail/fail state? The main reason I made this change was to better track changes in node state.

hpatro · 2025-05-12T06:24:39Z

> OK, so in your environment the nodes frequently enter and exit the pfail/fail state? The main reason I made this change was to better track changes in node state.

My concern is that in a large cluster this log statement becomes quite ineffective, because it logs each primary's report, and it can cause spikes in CPU utilisation.

enjoy-binbin · 2025-05-12T07:00:07Z

> My concern is that in a large cluster this log statement becomes quite ineffective, because it logs each primary's report, and it can cause spikes in CPU utilisation.

Have you already encountered this problem in practice, or is it just an issue you see in the test logs? If the node frequently goes pfail/fail, the log issue should be relatively less serious?

Or are you saying that when a node is actually dying, it is too heavy for the cluster to print the log on each node (for example, with 256 shards)? I suppose that can happen in some cases.

Anyway, I don't have a strong opinion, so let's seek other opinions. @PingXie @zuiderkwast Sorry to ping you about this small (old) change, please feel free to leave a comment.

The loglevel default is notice

# verbose (many rarely useful info, but not a mess like the debug level)
# notice (moderately verbose, what you want in production probably)
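To make the effect of that default concrete, here is a minimal sketch of level-based gating. The level names match the diff in this PR. The numeric values, `configured_verbosity`, and `sketch_log` are assumptions for illustration and are not Valkey's actual `serverLog` implementation.

```c
#include <stdarg.h>
#include <stdio.h>

/* Level names follow the diff in this PR; the numeric values and the
 * gating rule below are assumptions made for this sketch. */
enum { LL_DEBUG = 0, LL_VERBOSE = 1, LL_NOTICE = 2, LL_WARNING = 3 };

/* Hypothetical stand-in for the configured loglevel; defaults to notice. */
static int configured_verbosity = LL_NOTICE;

/* Hypothetical helper: write the message only if `level` is at least the
 * configured verbosity. Returns 1 if a line was written, 0 if suppressed. */
static int sketch_log(int level, const char *fmt, ...) {
    if (level < configured_verbosity) return 0; /* early exit, nothing is formatted */
    va_list ap;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
    return 1;
}
```

Under these assumptions, a message logged at `LL_VERBOSE` is dropped before any formatting or I/O happens when the configured level is notice, and it shows up again as soon as an operator lowers the level to verbose.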

zuiderkwast · 2025-05-12T08:29:02Z

It's hard to guess the scenarios users will encounter. To say yes to a PR like this, I would like a more thorough explanation of the possible scenarios and the effect of this.

When a node is down in a 256-shard cluster, each node will detect that it's down and then they will also receive 254 gossips that say that it's not reachable. It can be too verbose, I guess. So in what scenarios do you get this?

- A node has actually crashed, is overloaded or has a network problem, etc.
- The admin is taking down the node on purpose, for example to upgrade it.
- Anything else?

And do users rely on it? I don't know. Probably it's enough to log as NOTICE when it's marked as FAIL.
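A rough sketch of that split, with per-report lines at VERBOSE and a single NOTICE line when the node is marked as FAIL, is shown here. The `node_state` type, the `on_failure_report` helper, the numeric level values, and the simple-majority quorum are all assumptions for illustration, not the code in `cluster_legacy.c`.

```c
#include <stdio.h>

enum { LL_VERBOSE = 1, LL_NOTICE = 2 }; /* numeric values assumed for the sketch */

/* Hypothetical per-node bookkeeping. */
typedef struct {
    int reports; /* distinct primaries that reported this node unreachable */
    int failed;  /* set once the quorum has been reached */
} node_state;

/* Record one new failure report and return the level at which the
 * corresponding log line would be written. */
static int on_failure_report(node_state *n, int voting_primaries) {
    int quorum = voting_primaries / 2 + 1; /* simple majority, an assumption */
    n->reports++;
    if (!n->failed && n->reports >= quorum) {
        n->failed = 1;
        return LL_NOTICE; /* one line per node for the actual state change */
    }
    return LL_VERBOSE; /* per-report line, hidden at the default level */
}
```

In the 256-shard example, feeding 254 reports through `on_failure_report` would produce one NOTICE-level line and 253 VERBOSE-level lines per node, so the state change stays visible at the default level while the per-primary detail does not.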

sarthakaggarwal97 · 2025-05-13T06:39:09Z

I missed that this PR exists, but I also experienced a lot of compute just going into logging. Sharing the issue here: #2076

hpatro · 2025-06-05T21:22:50Z

> Have you already encountered this problem in practice, or is it just an issue you see in the test logs? If the node frequently goes pfail/fail, the log issue should be relatively less serious?

Most of this comes from testing Valkey in a large cluster setup, and we want to avoid unnecessary resource utilization as much as possible. It's not about a node flip-flopping; in case of an AZ failure we risk logging a lot of information which is not really valuable.

> Or are you saying that when a node is actually dying, it is too heavy for the cluster to print the log on each node (for example, with 256 shards)? I suppose that can happen in some cases.

Yes, a single node going down is still fine. An entire AZ going down is very expensive, based on our testing.
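As a back-of-the-envelope illustration of why the AZ case is so much worse, the sketch below estimates the number of "not reachable" lines under a simplified model: every surviving primary logs one line per (reporting primary, failed primary) pair, and replicas, recovery messages, and report expiry are ignored. The function names and the model itself are assumptions, not measurements from our testing.

```c
#include <assert.h>

/* Lines a single surviving primary would log: one per failed primary
 * for each of the other surviving primaries that reports it. */
static unsigned long lines_per_survivor(unsigned long primaries, unsigned long failed) {
    assert(failed < primaries); /* at least one primary must survive */
    unsigned long reporters = primaries - failed - 1; /* other surviving primaries */
    return failed * reporters;
}

/* Total across all surviving primaries (replicas ignored). */
static unsigned long lines_cluster_wide(unsigned long primaries, unsigned long failed) {
    return (primaries - failed) * lines_per_survivor(primaries, failed);
}
```

With 256 primaries and one failure this gives 254 lines per node, matching the figure mentioned earlier in this thread. With a hypothetical 85 primaries lost at once (roughly a third of the cluster), the same model gives 14450 lines per surviving primary and about 2.47 million lines across the surviving primaries.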

hpatro · 2025-06-05T21:23:43Z

@enjoy-binbin / @zuiderkwast Shall we merge this and work on #2011 for better observability of cluster state?

enjoy-binbin

ok, i think i can take this

madolson

Didn't approve earlier, but I'm okay with it

…alkey-io#2010) After node failure detection/recovery and gossip by each primary, we log about the failure detection/recovery at NOTICE level which can spam the server and the behavior is quite expensive on ec2 burstable instance types. I would prefer us rolling it back to VERBOSE level. Change was introduced in valkey-io#633 Signed-off-by: Harkrishn Patro <[email protected]> Signed-off-by: chzhoo <[email protected]>

…alkey-io#2010) After node failure detection/recovery and gossip by each primary, we log about the failure detection/recovery at NOTICE level which can spam the server and the behavior is quite expensive on ec2 burstable instance types. I would prefer us rolling it back to VERBOSE level. Change was introduced in valkey-io#633 Signed-off-by: Harkrishn Patro <[email protected]> Signed-off-by: shanwan1 <[email protected]>

Avoid log spam about cluster node failure detection by each primary

0772c2f

Signed-off-by: Harkrishn Patro <[email protected]>

hpatro requested review from enjoy-binbin and madolson April 26, 2025 06:47

sarthakaggarwal97 reviewed Apr 28, 2025


enjoy-binbin approved these changes Jun 6, 2025


madolson approved these changes Jun 6, 2025


sarthakaggarwal97 approved these changes Jun 6, 2025


hpatro merged commit 988297d into valkey-io:unstable Jun 6, 2025
51 checks passed

madolson mentioned this pull request Jun 10, 2025

[NEW] Use LTTng for generic server logging #2135

Open

hpatro mentioned this pull request Jun 27, 2025

Support Large Valkey Cluster #2281

Open

15 tasks
