Description
Describe the bug
After upgrading to 2.12.0 we noticed a very high latency on all checks. The icinga machine reports a high amount of time spent system, before the upgrade this was ~10%, now it's 90%.
In our environment we have 2 icinga nodes doing checks, one is the "config leader" that pushes config (documentation calls it Top Down configuration sync). Both nodes should execute around the same amount of checks. To be able to compare the two we've downgraded the "config follower" to 2.11.5-1. We've only downgraded icinga, it runs the same kernel and other software.
Using bcc's syscount
we can see how many syscalls icinga2 executes and how long they take, printing it every 10 seconds.
# /usr/share/bcc/tools/syscount -T 30 -i 10 -L -m -p 2093
On the 2.12 leader we see (saw similar numbers on follower before downgrade):
[15:33:52]
SYSCALL COUNT TIME (ms)
clone 299 9215.332333
recvmsg 579 584.742680
sendto 578 24.774361
wait4 279 6.216611
recvfrom 579 3.767298
close 897 2.192999
[15:34:02]
SYSCALL COUNT TIME (ms)
clone 278 9261.375636
recvmsg 561 579.717236
sendto 562 24.195989
recvfrom 561 7.249004
wait4 284 6.667528
close 834 1.823393
And on the follower:
[15:33:53]
SYSCALL COUNT TIME (ms)
recvmsg 1347 9477.930638
clone 670 333.633895
wait4 678 13.700700
sendto 1348 6.523638
recvfrom 1348 3.931384
close 2010 2.529605
[15:34:03]
SYSCALL COUNT TIME (ms)
recvmsg 643 9734.994736
clone 322 170.889077
wait4 321 7.154626
sendto 643 3.640546
recvfrom 643 2.468145
close 966 1.337561
The leader seems to max out around 300 clones/forks quite consistently (so 30/s), while I've seen the follower spike up 2k (200/s). Also note how the leader takes almost 10s to do those 300 clones, while the follower spends all it's time in recv.
Using a flamegraph we can dig a bit deeper
With the leader on the left and follower on the right we can see a clear difference. On the follower libc_fork is barely visible (its the 2nd column from the left), while it takes up half the screen on the leader
Comparing the memory usage:
leader:
Groups: 988 990 991 1001
VmPeak: 3247944 kB
VmSize: 3247944 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 1270624 kB
VmRSS: 1270624 kB
RssAnon: 1269784 kB
RssFile: 840 kB
RssShmem: 0 kB
VmData: 3099980 kB
VmStk: 132 kB
VmExe: 10936 kB
VmLib: 39376 kB
VmPTE: 2852 kB
VmSwap: 0 kB
Threads: 1
follower:
Groups: 988 990 991 1001
VmPeak: 735216 kB
VmSize: 735216 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 2864 kB
VmRSS: 2864 kB
RssAnon: 1944 kB
RssFile: 920 kB
RssShmem: 0 kB
VmData: 595696 kB
VmStk: 132 kB
VmExe: 10288 kB
VmLib: 36168 kB
VmPTE: 268 kB
VmSwap: 0 kB
Threads: 1
So everything seems to point at the increased memory usage being the culprit.
After downgrading the leader:
Tracing syscalls, printing top 10... Ctrl+C to quit.
[16:30:51]
SYSCALL COUNT TIME (ms)
recvmsg 3807 7675.037480
clone 1923 1408.668509
sendto 3808 66.448496
recvfrom 3808 58.546543
wait4 1885 46.540410
close 5769 8.401268
[16:31:01]
SYSCALL COUNT TIME (ms)
recvmsg 4533 7531.316624
clone 2274 1475.591478
wait4 2258 60.508798
sendto 4532 56.665480
recvfrom 4533 31.999594
close 6822 9.985904
And memory:
VmPeak: 735216 kB
VmSize: 735216 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 2864 kB
VmRSS: 2864 kB
RssAnon: 1944 kB
RssFile: 920 kB
RssShmem: 0 kB
VmData: 595696 kB
VmStk: 132 kB
VmExe: 10288 kB
VmLib: 36168 kB
VmPTE: 268 kB
VmSwap: 0 kB
Comparing the leader between broken and ok we see a huge memory diff:
join memory.broken memory.ok | awk '/kB/ { if ($2 == 0) {print} else {print ($0, $2/$4, "x")} }' | column -tx
VmPeak: 3247944 kB 735216 kB 4.41767 x
VmSize: 3247944 kB 735216 kB 4.41767 x
VmLck: 0 kB 0 kB
VmPin: 0 kB 0 kB
VmHWM: 1270624 kB 2864 kB 443.654 x
VmRSS: 1270624 kB 2864 kB 443.654 x
RssAnon: 1269784 kB 1944 kB 653.181 x
RssFile: 840 kB 920 kB 0.913043 x
RssShmem: 0 kB 0 kB
VmData: 3099980 kB 595696 kB 5.20396 x
VmStk: 132 kB 132 kB 1 x
VmExe: 10936 kB 10288 kB 1.06299 x
VmLib: 39376 kB 36168 kB 1.0887 x
VmPTE: 2852 kB 268 kB 10.6418 x
VmSwap: 0 kB 0 kB
In ps
we see a difference too. On 2.12 we see an extra icinga fork, not sure how relevant that is.
I got a copy of /proc/<pid>
for 2.12 too, so if any other info is required I can probably get it (like proc/maps).
To Reproduce
Have quite a few hosts and checks.
Expected behavior
Low check latency, more clones
Your Environment
Include as many relevant details about the environment you experienced the problem in
- Version used (
icinga2 --version
):
Leader:
icinga2 - The Icinga 2 network monitoring daemon (version: 2.12.0-1)
Copyright (c) 2012-2020 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
System information:
Platform: CentOS Linux
Platform version: 7 (Core)
Kernel: Linux
Kernel version: 3.10.0-1127.18.2.el7.x86_64
Architecture: x86_64
Build information:
Compiler: GNU 4.8.5
Build host: runner-hh8q3bz2-project-322-concurrent-0
OpenSSL version: OpenSSL 1.0.2k-fips 26 Jan 2017
Application information:
General paths:
Config directory: /etc/icinga2
Data directory: /var/lib/icinga2
Log directory: /var/log/icinga2
Cache directory: /var/cache/icinga2
Spool directory: /var/spool/icinga2
Run directory: /run/icinga2
Old paths (deprecated):
Installation root: /usr
Sysconf directory: /etc
Run directory (base): /run
Local state directory: /var
Internal paths:
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid
icinga2 - The Icinga 2 network monitoring daemon (version: 2.11.5-1)
Copyright (c) 2012-2020 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
System information:
Platform: CentOS Linux
Platform version: 7 (Core)
Kernel: Linux
Kernel version: 3.10.0-1127.18.2.el7.x86_64
Architecture: x86_64
Build information:
Compiler: GNU 4.8.5
Build host: runner-hh8q3bz2-project-322-concurrent-0
Application information:
General paths:
Config directory: /etc/icinga2
Data directory: /var/lib/icinga2
Log directory: /var/log/icinga2
Cache directory: /var/cache/icinga2
Spool directory: /var/spool/icinga2
Run directory: /run/icinga2
Old paths (deprecated):
Installation root: /usr
Sysconf directory: /etc
Run directory (base): /run
Local state directory: /var
Internal paths:
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid
- Operating System and version:
up to date (as of yesterday) centos 7
- Enabled features (
icinga2 feature list
):
Disabled features: compatlog debuglog elasticsearch gelf graphite icingadb influxdb opentsdb perfdata statusdata syslog
Enabled features: api checker command ido-mysql livestatus mainlog notification
- Config validation (
icinga2 daemon -C
):
information/cli: Icinga application loader (version: 2.12.0-1)
information/cli: Loading configuration file(s).
information/ConfigItem: Committing config item(s).
warning/CheckerComponent: Attribute 'concurrent_checks' for object 'checker' of type 'CheckerComponent' is deprecated and should not be used.
information/ApiListener: My API identity: n01.master
information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 0, rate: 32.8167/s (1969/min 1969/5min 1969/15min);
information/WorkQueue: #5 (IdoMysqlConnection, ido-mysql) items: 0, rate: 0/s (0/min 0/5min 0/15min);
information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
information/ConfigItem: Instantiated 1 NotificationComponent.
information/ConfigItem: Instantiated 1866 Hosts.
information/ConfigItem: Instantiated 427 Downtimes.
information/ConfigItem: Instantiated 4 NotificationCommands.
information/ConfigItem: Instantiated 1 FileLogger.
information/ConfigItem: Instantiated 2202 Comments.
information/ConfigItem: Instantiated 79360 Notifications.
information/ConfigItem: Instantiated 1 IcingaApplication.
information/ConfigItem: Instantiated 28 HostGroups.
information/ConfigItem: Instantiated 4 EventCommands.
information/ConfigItem: Instantiated 40359 Dependencies.
information/ConfigItem: Instantiated 1 CheckerComponent.
information/ConfigItem: Instantiated 2 Zones.
information/ConfigItem: Instantiated 2 Endpoints.
information/ConfigItem: Instantiated 1 ExternalCommandListener.
information/ConfigItem: Instantiated 1 IdoMysqlConnection.
information/ConfigItem: Instantiated 9 ApiUsers.
information/ConfigItem: Instantiated 1 ApiListener.
information/ConfigItem: Instantiated 241 CheckCommands.
information/ConfigItem: Instantiated 1 LivestatusListener.
information/ConfigItem: Instantiated 4 TimePeriods.
information/ConfigItem: Instantiated 2 Users.
information/ConfigItem: Instantiated 40359 Services.
information/ConfigItem: Instantiated 19 ScheduledDowntimes.
information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
information/cli: Finished validating the configuration file(s).