Skip to content

Investigate issue #3302 driver behavior when upsd aborts#3368

Open
jimklimov wants to merge 8 commits intonetworkupstools:masterfrom
jimklimov:issue-3302-drvconn
Open

Investigate issue #3302 driver behavior when upsd aborts#3368
jimklimov wants to merge 8 commits intonetworkupstools:masterfrom
jimklimov:issue-3302-drvconn

Conversation

@jimklimov
Copy link
Member

@jimklimov jimklimov commented Mar 24, 2026

Start by poking upsdrvcrl for both WIN32 and POSIX builds...

Includes code from PR #3367 to try reproducing the issue.

UPDATE: Maybe specific to dummy-ups, reproduced both for standalone starts of the driver program directly, one driver via upsdrvctl (note: the latter does not seem to propagate the exit-code and returns 0, at least on Windows, probably should indicate an error), and a swarm of drivers via upsdrvctl (also exits with code 0 even if all drivers died abruptly). Sometimes it took several starts of upsd to be killed a few seconds later.

In all these cases the final words were like:

  • upsd
   0.011162     Connected to UPS [UPS2]: dummy-ups-UPS2
   0.011473     Connected to UPS [UPS1]: dummy-ups-UPS1
   0.011641     Connected to UPS [dummy]: dummy-ups-dummy
   0.011762     Your system limits the maximum number of connections to 64
but you requested 348. The server may handle connections
in smaller groups, maybe affecting efficiency and response time.

   0.011989     [D1:6336:upsd.exe] poll_reload: (p)re-allocate 348 entries for polling FDs and handlers
   0.012186     Found 53 UPS defined in ups.conf
...
   0.186789     [D1:6336:upsd.exe] mainloop: UPS [UPSwarm43] driver is still not connected (FD -1)
   0.186938     [D1:6336:upsd.exe] mainloop: UPS [UPSwarm44] driver is not currently connected, trying to reconnect
   0.187080     [D1:6336:upsd.exe] mainloop: UPS [UPSwarm44] driver is still not connected (FD -1)
(end of log)
  • with a Ctrl+C at this point, there is even no clean-up logged (are signals set up later? shouldn't be...)
  • a longer-living upsd sometimes logs the clean-up:
...
   1.842622     [D1:3404:upsd.exe] mainloop: UPS [UPSwarm50] driver is still not connected (FD -1)
   1.870824     [D1:3404:upsd.exe] upsd_cleanup: starting the end-game
   1.871082     [D1:3404:upsd.exe] driver_free: forgetting UPS [dummy] (FD 644)
   1.871209     [D1:3404:upsd.exe] driver_free: forgetting UPS [UPS1] (FD 636)
   1.871347     [D1:3404:upsd.exe] driver_free: forgetting UPS [UPS2] (FD 628)
   1.871459     [D1:3404:upsd.exe] driver_free: forgetting UPS [UPSwarm1] (FD -1)
   1.871547     [D1:3404:upsd.exe] driver_free: forgetting UPS [UPSwarm2] (FD -1)
...
   1.880913     [D1:3404:upsd.exe] driver_free: forgetting UPS [UPSwarm50] (FD -1)
   1.881051     [D1:3404:upsd.exe] upsd_cleanup: finished
  • On the dummy-ups side it seems to always end with the same entering parse_data_file() call (and exit-code 127) after failing to write to the server:
...
   3.773319     [D5:16220:dummy-ups:UPSwarm11] send_to_one: SETAUX outlet.3.id 32
   3.787747     [D6:16220:dummy-ups:UPSwarm11] send_to_one: write 22 bytes to handle 000000000000011C succeeded (ret=22): SETAUX outlet.3.id 32

   3.787949     [D2:16220:dummy-ups:UPSwarm11] send_to_one: sending SETFLAGS outlet.3.id RW STRING
   3.788031     [D5:16220:dummy-ups:UPSwarm11] send_to_one: SETFLAGS outlet.3.id RW STRING
   3.788133     WARNING: send_to_one: write 31 bytes to handle 000000000000011C failed (ret=0), disconnecting.: No error [The pipe is being closed. ]
   3.788250     [D6:16220:dummy-ups:UPSwarm11] send_to_one: failed write: SETFLAGS outlet.3.id RW STRING

   3.788364     [D3:16220:dummy-ups:UPSwarm11] sock_disconnect: disconnecting named pipe handle 000000000000011C
   3.788502     [D5:16220:dummy-ups:UPSwarm11] sock_disconnect: finishing parsing context
   3.788591     [D5:16220:dummy-ups:UPSwarm11] sock_disconnect: relinking the chain of connections
   3.788667     [D5:16220:dummy-ups:UPSwarm11] sock_disconnect: freeing the conn object
   3.788755     [D5:16220:dummy-ups:UPSwarm11] send_to_all: SETINFO driver.state "updateinfo"
   3.788861     [D1:16220:dummy-ups:UPSwarm11] upsdrv_updateinfo...
   4.793998     [D1:16220:dummy-ups:UPSwarm11] entering parse_data_file()
(end of log)

$ echo $?
127

I don't think I've reproduced nor ruled out the problem on non-Windows builds yet.

Per GDB and added debug-logging traces, it seems to crash around malloc() calls, whether in PCONF context init or in vupslog() a bit before it gets there:

...
   8.255625     [D3:15120:dummy-ups:UPSwarm11] sock_connect: new connection on handle 0000000000000120
   8.255752     [D6:15120:dummy-ups:UPSwarm11] sock_arg: Driver on \\.\pipe\dummy-ups-UPSwarm11 is now handling DUMPALL with 1 args
   8.255874     [D2:15120:dummy-ups:UPSwarm11] send_to_one: sending SETINFO device.mfr "EATON | Powerware"
   8.256006     [D5:15120:dummy-ups:UPSwarm11] send_to_one: SETINFO device.mfr "EATON | Powerware"
   8.256165     WARNING: send_to_one: write 39 bytes to handle 0000000000000120 failed (ret=0), disconnecting.: No error [The pipe is being closed. ]
   8.256322     [D6:15120:dummy-ups:UPSwarm11] send_to_one: failed write: SETINFO device.mfr "EATON | Powerware"

   8.256457     [D3:15120:dummy-ups:UPSwarm11] sock_disconnect: disconnecting named pipe handle 0000000000000120
   8.256585     [D5:15120:dummy-ups:UPSwarm11] sock_disconnect: finishing parsing context
   8.256706     [D5:15120:dummy-ups:UPSwarm11] sock_disconnect: relinking the chain of connections
   8.256798     [D5:15120:dummy-ups:UPSwarm11] sock_disconnect: freeing the conn object

Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007ffb8325d990 in ntdll!RtlAllocateHeap () from C:\Windows\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x00007ffb8325d990 in ntdll!RtlAllocateHeap () from C:\Windows\SYSTEM32\ntdll.dll
#1  0x00007ffb8325b44d in ntdll!RtlAllocateHeap () from C:\Windows\SYSTEM32\ntdll.dll
#2  0x00007ffb83328a18 in ntdll!RtlRegisterSecureMemoryCacheCallback () from C:\Windows\SYSTEM32\ntdll.dll
#3  0x00007ffb8325d255 in ntdll!RtlAllocateHeap () from C:\Windows\SYSTEM32\ntdll.dll
#4  0x00007ffb8325b44d in ntdll!RtlAllocateHeap () from C:\Windows\SYSTEM32\ntdll.dll
#5  0x00007ffb8144a131 in realloc () from C:\Windows\System32\msvcrt.dll
#6  0x00007ffb81449c45 in msvcrt!calloc () from C:\Windows\System32\msvcrt.dll
#7  0x00007ff6b737f0a9 in xcalloc (number=<optimized out>, size=<optimized out>) at common.c:4760
#8  0x00007ff6b737f12e in vupslog (priority=2, fmt=0x5fed50 "[D5:15120:dummy-ups:UPSwarm11] %s: %.*s", va=0x5ff170 "", use_strerror=0) at common.c:3816
#9  0x00007ff6b737f8ce in s_upsdebugx (level=<optimized out>, level@entry=5, fmt=0x5fed50 "[D5:15120:dummy-ups:UPSwarm11] %s: %.*s",
    fmt@entry=0x7ff6b73a0e99 <__func__.16+681> "%s: %.*s") at common.c:4535
#10 0x00007ff6b7378591 in send_to_all (fmt=fmt@entry=0x7ff6b73a0f6b <__func__.16+891> "SETINFO %s \"%s\"\n") at dstate.c:305
#11 0x00007ff6b73786b0 in vdstate_setinfo (var=var@entry=0x7ff6b739cfbf <__func__.4+1047> "driver.state", fmt=fmt@entry=0x7ff6b73a08d1 <__func__.4+15657> "updateinfo",
    ap=ap@entry=0x5ff5a0 "d") at dstate.c:1386
#12 0x00007ff6b737715d in dstate_setinfo (var=var@entry=0x7ff6b739cfbf <__func__.4+1047> "driver.state", fmt=fmt@entry=0x7ff6b73a08d1 <__func__.4+15657> "updateinfo")
    at dstate.c:1412
#13 0x00007ff6b739aa7f in main (argc=<optimized out>, argv=<optimized out>) at main.c:3268

@jimklimov jimklimov added this to the 2.8.5 milestone Mar 24, 2026
@jimklimov jimklimov added service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug Connection stability issues Issues about driver<->device and/or networked connections (upsd<->upsmon...) going AWOL over time labels Mar 24, 2026
@AppVeyorBot
Copy link

@jimklimov jimklimov force-pushed the issue-3302-drvconn branch from fd80697 to 97eca57 Compare March 24, 2026 03:06
@AppVeyorBot
Copy link

jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 24, 2026
…proctag() is called) [networkupstools#3302, networkupstools#3368]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 24, 2026
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 24, 2026
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 24, 2026
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 24, 2026
…ally and flip to specified upsname later [networkupstools#3302, networkupstools#3368]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 24, 2026
…sing setproctag() [networkupstools#3302, networkupstools#3368]

Did not work for parallel scanning threads where it would be most useful,
because they are in same process space...

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 24, 2026
…pthreads so far [networkupstools#3302, networkupstools#3368]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
@AppVeyorBot
Copy link

@jimklimov jimklimov added the CI Entries related to continuous integration infrastructure (here CI = tools + scripts + recipes) label Mar 25, 2026
@jimklimov jimklimov added the portability We want NUT to build and run everywhere possible label Mar 25, 2026
@jimklimov jimklimov moved this to In Progress in NUT for Windows Mar 25, 2026
@jimklimov jimklimov moved this to In Progress in NUT QA Mar 25, 2026
@AppVeyorBot
Copy link

@jimklimov
Copy link
Member Author

jimklimov commented Mar 25, 2026

NOTE: After #3363 it seems that upsd quickly loops (without delay) over the expected connections while they are not established (e.g. drivers still starting). Need to check if it is a regression.

UPDATE: Older Windows builds did similarly (tested with 2.8.4.1572-1572+g69e282b3b+v2.8.5+rc5 and a small swarm of 50 drivers, to be under 64 connections):

  • when no drivers were started, upsd tried to find talkative sockets every 2 seconds, then slept between messages about "driver still not connected".
  • when at least one driver started, the loop was intensive
  • when all drivers stared, no low-verbosity messages appear in logs, but with bumped verbosity I can see the loop runs intensively to walk all connection handles (visually by the stream of text, about 1.2 cycles per second, with no time-skips/sleeps in the time-stamped log)
  • notably, dummy-ups has a chance to crash after an abrupt server disconnection in that older build too (the focus of this investigation)

Older Linux build (2.8.4.1541.9-1550+g7cd79ab73, with 3 dummy devices from NIT):

  • checked every 2 seconds when no drivers were started, but
  • when one driver was started, the loop reported a stack of adding FD handler and polling 3 filedescriptors; some stats: considered 6 connections, wanted to actually poll 3 and was constrained by maxconn=1024 and chunked by sysmaxconn=1024 3 times in the same tenth of a second, then one more after 0.1 second, and repeated after a whole second. Eventually settled on polling every second - I suppose the first burst was with the driver reporting the data it had, etc.
  • when eventually all drivers were started, there are a couple of polls per second, at repetitive but asymmetric (e.g. NUM.2 and NUM.5) timestamps.

jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
…proctag() is called) [networkupstools#3302, networkupstools#3368]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
@jimklimov jimklimov force-pushed the issue-3302-drvconn branch from 5ca8690 to cf14d94 Compare March 25, 2026 17:31
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
…proctag() is called) [networkupstools#3302, networkupstools#3368]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
…ally and flip to specified upsname later [networkupstools#3302, networkupstools#3368]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
…sing setproctag() [networkupstools#3302, networkupstools#3368]

Did not work for parallel scanning threads where it would be most useful,
because they are in same process space...

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
…pthreads so far [networkupstools#3302, networkupstools#3368]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…-check before retry

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…-check before retry - also for WIN32

Also revised WaitForSingleObject() result checking - there
has to be a chance to succeed ;)

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…(driver, client...) [networkupstools#3302]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
@jimklimov jimklimov force-pushed the issue-3302-drvconn branch from 11ae094 to d0a556b Compare March 25, 2026 17:56
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
…isconnect() [networkupstools#3368]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
@jimklimov
Copy link
Member Author

Further evicted some commits grouped into side PRs (fast-merged based on successful tests in earlier iteration here), so changes proposed by this one focus on driver crashing issues again.

jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
…isconnect() [networkupstools#3368]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
@jimklimov jimklimov force-pushed the issue-3302-drvconn branch from d0a556b to 458a7bc Compare March 25, 2026 18:01
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
@jimklimov jimklimov force-pushed the issue-3302-drvconn branch from 458a7bc to 0ade5d4 Compare March 25, 2026 18:08
@jimklimov jimklimov force-pushed the issue-3302-drvconn branch from 0ade5d4 to 7bf84e0 Compare March 25, 2026 19:36
@AppVeyorBot
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI Entries related to continuous integration infrastructure (here CI = tools + scripts + recipes) Connection stability issues Issues about driver<->device and/or networked connections (upsd<->upsmon...) going AWOL over time portability We want NUT to build and run everywhere possible service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug

Projects

Status: In Progress
Status: In Progress

Development

Successfully merging this pull request may close these issues.

2 participants