Fix rsyslogd memory growth in syncd swss containers over long term #25874

Open
tirupatihemanth wants to merge 1 commit into sonic-net:master from tirupatihemanth:rsyslogd_fix

Conversation

@tirupatihemanth
Contributor

Why I did it

  1. We observed long-term rsyslogd memory growth in the syncd container.
  2. Deep diagnostics (impstats) showed imuxsock.ratelimit.numratelimiters growing continuously (about 2/min) while queue depth stayed near zero, which points to sender/PID churn rather than queue backlog.
  3. phcsync.sh runs every 60 seconds and repeatedly invokes phc_ctl for the /dev/ptp* devices. These short-lived invocations show up in imuxsock as new sender identities, which correlates with the growth in ratelimiter state; memory rises over time because rsyslogd keeps data structures for rate limiting.
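
The churn described in item 3 can be illustrated with a small shell sketch. It is only an illustration under stated assumptions: `count_distinct_child_pids` is a hypothetical helper that is not part of phcsync.sh, and `sh -c 'echo $$'` stands in for any short-lived command such as phc_ctl. Every iteration starts a new process, so anything that tracks senders by PID sees a new identity each time.

```shell
# Hypothetical helper: spawn N short-lived child processes and count how many
# distinct PIDs they ran under.
count_distinct_child_pids() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        # Each `sh -c` is a fresh process; inside it, $$ is that child's own PID.
        sh -c 'echo $$'
        i=$((i + 1))
    done | sort -u | wc -l | tr -d ' '
}

# Usage: count_distinct_child_pids 5
```

In practice the count matches the number of invocations, which is the pattern a once-per-minute script with several child commands produces indefinitely.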
Work item tracking
  • Microsoft ADO (number only):

How I did it

  • Updated phcsync.sh in SONiC to keep successful phc_ctl execution silent:
      • Use phc_ctl -q -Q ... >/dev/null 2>&1
      • Keep explicit error handling and error logs on non-zero exit.
  • Added a stable logger identity in the service debug helpers:
      • logger --id=$$ -- "$1" in syncd_common.sh and swss.sh. This reduces per-call sender churn during the script execution phases (start/wait/stop).
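
A minimal sketch of what such a debug helper could look like, assuming util-linux logger, whose `--id[=<id>]` option accepts an explicit PID. The `LOGGER` override variable is a hypothetical addition for dry runs and is not taken from the SONiC scripts; the real debug() functions in syncd_common.sh and swss.sh may differ.

```shell
# Sketch of a debug helper that logs under the calling script's PID.
# LOGGER is a hypothetical override (defaults to the real logger binary).
debug() {
    # $$ expands to the PID of the top-level shell, even inside subshells,
    # so every message carries the same sender PID.
    "${LOGGER:-logger}" --id="$$" -- "$1"
}

# Usage: debug "Starting swss service..."
```

One caveat from the util-linux logger documentation: the logging daemon may prefer the socket credentials over the PID written in the message, and logger can only set those credentials when it runs as root and a process with the given PID exists.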

logger commands
Before

Mar 04 03:55:44 sonic root[1775781]: Starting swss service...
Mar 04 03:55:44 sonic root[1775785]: Locking /tmp/swss-syncd-lock from swss service
Mar 04 03:55:44 sonic root[1775792]: Locked /tmp/swss-syncd-lock (10) from swss service
Mar 04 03:55:44 sonic root[1775816]: Warm boot flag: swss false.
Mar 04 03:55:44 sonic root[1775822]: Flushing APP, ASIC, COUNTER, CONFIG, and partial STATE databases ...
Mar 04 03:55:45 sonic root[1776045]: Started swss service...
Mar 04 03:55:45 sonic root[1776051]: Unlocking /tmp/swss-syncd-lock (10) from swss service

After

Mar 04 03:58:52 sonic root[1891651]: Starting swss service...
Mar 04 03:58:52 sonic root[1891651]: Locking /tmp/swss-syncd-lock from swss service
Mar 04 03:58:52 sonic root[1891651]: Locked /tmp/swss-syncd-lock (10) from swss service
Mar 04 03:58:52 sonic root[1891651]: Warm boot flag: swss false.
Mar 04 03:58:52 sonic root[1891651]: Flushing APP, ASIC, COUNTER, CONFIG, and partial STATE databases ...
Mar 04 03:58:53 sonic root[1891651]: Started swss service...
Mar 04 03:58:53 sonic root[1891651]: Unlocking /tmp/swss-syncd-lock (10) from swss service
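
The difference between the two captures can be quantified by counting distinct sender PIDs. The following is a small sketch, assuming the traditional syslog line layout shown above (three timestamp fields, then host, then `tag[pid]:`); `count_sender_pids` is a hypothetical helper name.

```shell
# Hypothetical helper: read syslog lines on stdin and print the number of
# distinct sender PIDs found in the "tag[pid]:" field.
count_sender_pids() {
    awk '{
        # Field 5 is "tag[pid]:" in the assumed layout; pull out the digits.
        if (match($5, /\[[0-9]+\]:$/))
            print substr($5, RSTART + 1, RLENGTH - 3)
    }' | sort -u | wc -l | tr -d ' '
}

# Usage: count_sender_pids < captured_log.txt
```

Applied to the "before" capture, each of the seven lines has its own PID; in the "after" capture all seven lines share PID 1891651.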

How to verify it

  • imuxsock.ratelimit.numratelimiters in syncd should stop growing continuously (or grow far more slowly).
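
One way to check this is to compare the first and last counter values in a stretch of impstats output. The sketch below assumes impstats is enabled and writes its default key=value line format; `numratelimiters_growth` and the log path in the usage comment are hypothetical, and JSON-formatted stats would need a different pattern.

```shell
# Hypothetical helper: read impstats lines on stdin and print how much
# ratelimit.numratelimiters changed between the first and last sample.
numratelimiters_growth() {
    # Keep only the counter value from lines that carry it.
    sed -n 's/.*ratelimit\.numratelimiters=\([0-9][0-9]*\).*/\1/p' |
        awk 'NR == 1 { first = $1 }
             { last = $1 }
             END { if (NR > 0) print last - first }'
}

# Usage (path is an assumption): numratelimiters_growth < /var/log/rsyslog-stats.log
```

A value near zero over a long window is the expected result after the fix; a value that keeps pace with elapsed minutes suggests senders are still churning.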

Which release branch to backport (provide reason below if selected)

  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Signed-off-by: Hemanth Kumar Tirupati <tirupatihemanthkumar@gmail.com>
@tirupatihemanth tirupatihemanth requested a review from lguohan as a code owner March 4, 2026 04:05
Copilot AI review requested due to automatic review settings March 4, 2026 04:05
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

This PR addresses rsyslogd memory growth in the syncd and swss containers by reducing PID churn that was causing rsyslog's imuxsock ratelimiter to accumulate entries for short-lived senders. Two strategies are applied: suppressing unnecessary output from phc_ctl in phcsync.sh, and anchoring syslog messages to a stable PID ($$) in syncd_common.sh and swss.sh.

Changes:

  • phcsync.sh now runs phc_ctl with -q -Q flags and redirects stdout to /dev/null to suppress normal output, with explicit error logging on non-zero exit.
  • syncd_common.sh and swss.sh debug() functions use logger --id=$$ to emit all messages under the parent shell's PID, preventing a new ratelimiter entry per logger invocation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File | Description
--- | ---
platform/mellanox/docker-syncd-mlnx/phcsync.sh | Adds -q -Q flags to silence normal phc_ctl output; redirects only stdout to /dev/null, removing the previous 2>/dev/null stderr suppression
files/scripts/syncd_common.sh | Adds --id=$$ to logger in the debug() function to anchor all log messages to the parent shell's PID
files/scripts/swss.sh | Same --id=$$ fix as syncd_common.sh for the debug() function in the swss service script

# set CLOCK_REALTIME
-"$PHC_CTL" "$dev" set 2>/dev/null
+# Keep successful syncs silent to avoid rsyslogd ratelimit memory issue due to PID churn.
+"$PHC_CTL" -q -Q "$dev" set >/dev/null

Copilot AI Mar 4, 2026

The removal of 2>/dev/null from the phc_ctl invocation means that when phc_ctl fails, its own error output on stderr will be forwarded to syslog via supervisord (which has stderr_logfile=syslog in supervisord.conf.j2). Together with the explicit echo "Error: Failed to sync clock for $dev..." to stderr at line 67, a failure will produce duplicate error entries in syslog: one from phc_ctl's own stderr output and one from the explicit echo.

To avoid this double-logging on failure, the 2>/dev/null should be added back (or replaced with 2>&1) so that phc_ctl's own stderr is still suppressed while the explicit error message at line 67 provides a cleaner, controlled error notification. Alternatively, phc_ctl's stderr could be captured and included in the custom error message.
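
The alternative in the last sentence, capturing phc_ctl's stderr and folding it into the custom error message, could be sketched as follows. `sync_clock`, the `PHC_CTL` default, and the exact message wording are assumptions for illustration; the real phcsync.sh may be structured differently.

```shell
# Hypothetical wrapper; PHC_CTL may point at the real phc_ctl binary.
sync_clock() {
    dev=$1
    # `2>&1 >/dev/null` sends stderr into the command substitution and
    # discards stdout, so only the error text is captured.
    if ! err=$("${PHC_CTL:-phc_ctl}" -q -Q "$dev" set 2>&1 >/dev/null); then
        # One controlled log line per failure, including phc_ctl's own text.
        echo "Error: Failed to sync clock for $dev: $err" >&2
        return 1
    fi
}

# Usage: sync_clock /dev/ptp0
```

This keeps the success path silent and emits a single stderr line on failure, so supervisord would forward one entry rather than two.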

Suggested change
-"$PHC_CTL" -q -Q "$dev" set >/dev/null
+"$PHC_CTL" -q -Q "$dev" set >/dev/null 2>&1
