Description
As briefly noted in issue #2575 and in PRs that dealt with parallelized scans in nut-scanner: depending on platform defaults, the particular OS deployment, and third-party library specifics, nut-scanner may run out of file descriptors despite already trying to adapt its maximums to `ulimit` information where available.
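To illustrate what "adapting the maximums to `ulimit` information" involves, here is a minimal sketch of querying the process file-descriptor limit and deriving a thread cap from it. The function name and the reserve/fallback constants are illustrative assumptions, not nut-scanner's actual variables:

```c
#include <sys/resource.h>

/* Hypothetical helper: derive a scan-thread cap from the process's
 * file-descriptor soft limit. Each scan thread may consume one or
 * more FDs, so reserve a margin for stdio, logs, library internals. */
static long fd_based_thread_cap(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return 1024;	/* query failed: conservative default */

	if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > 1048576)
		return 1048576;	/* clamp "unlimited" to something sane */

	/* Keep 64 descriptors in reserve for non-scan uses */
	return (long)rl.rlim_cur > 64 ? (long)rl.rlim_cur - 64 : 1;
}
```

Note that even a cap derived this way can be defeated when a third-party library (like libnetsnmp below) opens extra descriptors per thread beyond the one the scanner accounts for.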
As seen recently, culminating in commit 2c3a09e of PR #2539 (issue #2511), certain libnetsnmp builds can consume FDs for network sockets, for local filesystem lookups of per-host configuration files or MIB files, for directory scanning during those searches, etc. This is a variable beyond our control: different implementations and versions of third-party code can behave as they please. Example staged with that commit reverted, scanning a large network range:
```
...
0.321562 [D5] nutscan_ip_ranges_iter_inc: got IP from range: 172.28.67.254
0.321597 [D4] nutscan_scan_ip_range_snmp: max_threads_scantype=0 curr_threads=1022 thread_count=1022 stwST=-1 stwS=0 pass=1
0.321573 [D2] Entering try_SysOID_thready for 172.28.67.253
0.321667 [D5] nutscan_ip_ranges_iter_inc: got IP from range: 172.28.67.255
0.321703 [D4] nutscan_scan_ip_range_snmp: max_threads_scantype=0 curr_threads=1023 thread_count=1023 stwST=-1 stwS=0 pass=1
0.321677 [D2] Entering try_SysOID_thready for 172.28.67.254
0.321782 [D5] nutscan_ip_ranges_iter_inc: got IP from range: 172.28.68.0
0.321817 [D4] nutscan_scan_ip_range_snmp: max_threads_scantype=0 curr_threads=1024 thread_count=1024 stwST=-1 stwS=-1 pass=0
0.321851 [D2] nutscan_scan_ip_range_snmp: Running too many scanning threads (1024), waiting until older ones would finish
0.321796 [D2] Entering try_SysOID_thready for 172.28.67.255
0.475060 [D2] Failed to open SNMP session for 172.28.67.147
/var/lib/snmp/hosts/172.28.66.252.local.conf: Too many open files
/var/lib/snmp/hosts/172.28.65.208.local.conf: Too many open files
<blocks on "too many threads" anyway, but skips a number of hosts>
```
What we can do is not abort the scans upon any hiccup, but check for `errno == EMFILE`, delay, and retry later (or perhaps even actively decrease the thread-maximum variable of the process). We already have a way to detect `Running too many scanning threads (NUM), waiting until older ones would finish`, so this is about detecting the issue and extending those criteria.