Skip to content

High check latency after 2.12 upgrade / memory usage #8196

Closed
@fbs

Description

@fbs

Describe the bug

After upgrading to 2.12.0 we noticed a very high latency on all checks. The icinga machine reports a high amount of time spent system, before the upgrade this was ~10%, now it's 90%.

In our environment we have 2 icinga nodes doing checks, one is the "config leader" that pushes config (documentation calls it Top Down configuration sync). Both nodes should execute around the same amount of checks. To be able to compare the two we've downgraded the "config follower" to 2.11.5-1. We've only downgraded icinga, it runs the same kernel and other software.

Using bcc's syscount we can see how many syscalls icinga2 executes and how long they take, printing it every 10 seconds.

# /usr/share/bcc/tools/syscount  -T 30 -i 10 -L -m -p 2093

On the 2.12 leader we see (saw similar numbers on follower before downgrade):

[15:33:52]
SYSCALL                   COUNT        TIME (ms)
clone                       299      9215.332333
recvmsg                     579       584.742680
sendto                      578        24.774361
wait4                       279         6.216611
recvfrom                    579         3.767298
close                       897         2.192999

[15:34:02]
SYSCALL                   COUNT        TIME (ms)
clone                       278      9261.375636
recvmsg                     561       579.717236
sendto                      562        24.195989
recvfrom                    561         7.249004
wait4                       284         6.667528
close                       834         1.823393

And on the follower:

[15:33:53]
SYSCALL                   COUNT        TIME (ms)
recvmsg                    1347      9477.930638
clone                       670       333.633895
wait4                       678        13.700700
sendto                     1348         6.523638
recvfrom                   1348         3.931384
close                      2010         2.529605

[15:34:03]
SYSCALL                   COUNT        TIME (ms)
recvmsg                     643      9734.994736
clone                       322       170.889077
wait4                       321         7.154626
sendto                      643         3.640546
recvfrom                    643         2.468145
close                       966         1.337561

The leader seems to max out around 300 clones/forks quite consistently (so 30/s), while I've seen the follower spike up 2k (200/s). Also note how the leader takes almost 10s to do those 300 clones, while the follower spends all it's time in recv.

Using a flamegraph we can dig a bit deeper

Screenshot 2020-08-27 at 16 23 29

With the leader on the left and follower on the right we can see a clear difference. On the follower libc_fork is barely visible (its the 2nd column from the left), while it takes up half the screen on the leader

Comparing the memory usage:

leader:

Groups:	988 990 991 1001
VmPeak:	 3247944 kB
VmSize:	 3247944 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	 1270624 kB
VmRSS:	 1270624 kB
RssAnon:	 1269784 kB
RssFile:	     840 kB
RssShmem:	       0 kB
VmData:	 3099980 kB
VmStk:	     132 kB
VmExe:	   10936 kB
VmLib:	   39376 kB
VmPTE:	    2852 kB
VmSwap:	       0 kB
Threads:	1

follower:

Groups:	988 990 991 1001
VmPeak:	  735216 kB
VmSize:	  735216 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	    2864 kB
VmRSS:	    2864 kB
RssAnon:	    1944 kB
RssFile:	     920 kB
RssShmem:	       0 kB
VmData:	  595696 kB
VmStk:	     132 kB
VmExe:	   10288 kB
VmLib:	   36168 kB
VmPTE:	     268 kB
VmSwap:	       0 kB
Threads:	1

So everything seems to point at the increased memory usage being the culprit.

After downgrading the leader:

Tracing syscalls, printing top 10... Ctrl+C to quit.
[16:30:51]
SYSCALL                   COUNT        TIME (ms)
recvmsg                    3807      7675.037480
clone                      1923      1408.668509
sendto                     3808        66.448496
recvfrom                   3808        58.546543
wait4                      1885        46.540410
close                      5769         8.401268

[16:31:01]
SYSCALL                   COUNT        TIME (ms)
recvmsg                    4533      7531.316624
clone                      2274      1475.591478
wait4                      2258        60.508798
sendto                     4532        56.665480
recvfrom                   4533        31.999594
close                      6822         9.985904

And memory:

VmPeak:	  735216 kB
VmSize:	  735216 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	    2864 kB
VmRSS:	    2864 kB
RssAnon:	    1944 kB
RssFile:	     920 kB
RssShmem:	       0 kB
VmData:	  595696 kB
VmStk:	     132 kB
VmExe:	   10288 kB
VmLib:	   36168 kB
VmPTE:	     268 kB
VmSwap:	       0 kB

Comparing the leader between broken and ok we see a huge memory diff:

join memory.broken memory.ok | awk '/kB/ { if ($2 == 0) {print} else {print ($0, $2/$4, "x")} }' | column -tx
VmPeak:    3247944  kB  735216  kB  4.41767   x
VmSize:    3247944  kB  735216  kB  4.41767   x
VmLck:     0        kB  0       kB
VmPin:     0        kB  0       kB
VmHWM:     1270624  kB  2864    kB  443.654   x
VmRSS:     1270624  kB  2864    kB  443.654   x
RssAnon:   1269784  kB  1944    kB  653.181   x
RssFile:   840      kB  920     kB  0.913043  x
RssShmem:  0        kB  0       kB
VmData:    3099980  kB  595696  kB  5.20396   x
VmStk:     132      kB  132     kB  1         x
VmExe:     10936    kB  10288   kB  1.06299   x
VmLib:     39376    kB  36168   kB  1.0887    x
VmPTE:     2852     kB  268     kB  10.6418   x
VmSwap:    0        kB  0       kB

In ps we see a difference too. On 2.12 we see an extra icinga fork, not sure how relevant that is.

I got a copy of /proc/<pid> for 2.12 too, so if any other info is required I can probably get it (like proc/maps).

To Reproduce

Have quite a few hosts and checks.

Expected behavior

Low check latency, more clones

Your Environment

Include as many relevant details about the environment you experienced the problem in

  • Version used (icinga2 --version):

Leader:

icinga2 - The Icinga 2 network monitoring daemon (version: 2.12.0-1)

Copyright (c) 2012-2020 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-1127.18.2.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-hh8q3bz2-project-322-concurrent-0
  OpenSSL version: OpenSSL 1.0.2k-fips  26 Jan 2017

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
icinga2 - The Icinga 2 network monitoring daemon (version: 2.11.5-1)

Copyright (c) 2012-2020 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-1127.18.2.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-hh8q3bz2-project-322-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
  • Operating System and version:

up to date (as of yesterday) centos 7

  • Enabled features (icinga2 feature list):
Disabled features: compatlog debuglog elasticsearch gelf graphite icingadb influxdb opentsdb perfdata statusdata syslog
Enabled features: api checker command ido-mysql livestatus mainlog notification
  • Config validation (icinga2 daemon -C):
information/cli: Icinga application loader (version: 2.12.0-1)
information/cli: Loading configuration file(s).
information/ConfigItem: Committing config item(s).
warning/CheckerComponent: Attribute 'concurrent_checks' for object 'checker' of type 'CheckerComponent' is deprecated and should not be used.
information/ApiListener: My API identity: n01.master
information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 0, rate: 32.8167/s (1969/min 1969/5min 1969/15min);
information/WorkQueue: #5 (IdoMysqlConnection, ido-mysql) items: 0, rate:  0/s (0/min 0/5min 0/15min);
information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
information/ConfigItem: Instantiated 1 NotificationComponent.
information/ConfigItem: Instantiated 1866 Hosts.
information/ConfigItem: Instantiated 427 Downtimes.
information/ConfigItem: Instantiated 4 NotificationCommands.
information/ConfigItem: Instantiated 1 FileLogger.
information/ConfigItem: Instantiated 2202 Comments.
information/ConfigItem: Instantiated 79360 Notifications.
information/ConfigItem: Instantiated 1 IcingaApplication.
information/ConfigItem: Instantiated 28 HostGroups.
information/ConfigItem: Instantiated 4 EventCommands.
information/ConfigItem: Instantiated 40359 Dependencies.
information/ConfigItem: Instantiated 1 CheckerComponent.
information/ConfigItem: Instantiated 2 Zones.
information/ConfigItem: Instantiated 2 Endpoints.
information/ConfigItem: Instantiated 1 ExternalCommandListener.
information/ConfigItem: Instantiated 1 IdoMysqlConnection.
information/ConfigItem: Instantiated 9 ApiUsers.
information/ConfigItem: Instantiated 1 ApiListener.
information/ConfigItem: Instantiated 241 CheckCommands.
information/ConfigItem: Instantiated 1 LivestatusListener.
information/ConfigItem: Instantiated 4 TimePeriods.
information/ConfigItem: Instantiated 2 Users.
information/ConfigItem: Instantiated 40359 Services.
information/ConfigItem: Instantiated 19 ScheduledDowntimes.
information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
information/cli: Finished validating the configuration file(s).

Metadata

Metadata

Assignees

Labels

area/checksCheck execution and resultsbugSomething isn't working

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions