Skip to content

Avoid log spam about cluster node failure detection by each primary #2010

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: unstable
Choose a base branch
from

Conversation

hpatro
Copy link
Collaborator

@hpatro hpatro commented Apr 26, 2025

After node failure detection/recovery and gossip by each primary, we log about the failure detection/recovery at NOTICE level which can spam the server and the behavior is quite expensive on ec2 burstable instance types. I would prefer us rolling it back to VERBOSE level.

Change was introduced in #633

@hpatro hpatro requested review from enjoy-binbin and madolson April 26, 2025 06:47
Copy link

codecov bot commented Apr 26, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.01%. Comparing base (0b94ca6) to head (0772c2f).
Report is 8 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2010      +/-   ##
============================================
- Coverage     71.01%   71.01%   -0.01%     
============================================
  Files           123      123              
  Lines         66033    66113      +80     
============================================
+ Hits          46892    46948      +56     
- Misses        19141    19165      +24     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.46% <100.00%> (+0.37%) ⬆️

... and 24 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@@ -2409,13 +2409,13 @@ void clusterProcessGossipSection(clusterMsg *hdr, clusterLink *link) {
if (sender) {
if (flags & (CLUSTER_NODE_FAIL | CLUSTER_NODE_PFAIL)) {
if (clusterNodeIsVotingPrimary(sender) && clusterNodeAddFailureReport(node, sender)) {
serverLog(LL_NOTICE, "Node %.40s (%s) reported node %.40s (%s) as not reachable.", sender->name,
serverLog(LL_VERBOSE, "Node %.40s (%s) reported node %.40s (%s) as not reachable.", sender->name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make the level Warning? I am wondering if users depend on this log for any debugging.

Also, just out of curiosity, does changing log severity come under breaking change?

Copy link
Collaborator Author

@hpatro hpatro Apr 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be helpful in case of small clusters but unsure how valuable it is to log it for each primary in a large cluster setup.

My suggestion is to log the state periodically every few seconds to debug better. #2011

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sarthakaggarwal97 Btw warning would increase the severity of logging. I want to reduce it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack! #2011 makes sense to me!

@madolson
Copy link
Member

I don't feel strongly either way, so would appreciate input @enjoy-binbin once he is back from vacation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants