Failed kill #147
Conversation
```python
lock = None
try:
    try:
        lock = get_lock_fd(self._close_lockfile, timeout=60)
```

Check warning (Code scanning / CodeQL): File is not always closed.
```python
    :return: True on success, False on failure/timeout
    """
    try:
        lock_fd = get_lock_fd(filename, timeout)
```

Check warning (Code scanning / CodeQL): File is not always closed.
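Both warnings flag the same shape: the descriptor returned by get_lock_fd can leak if an exception is raised before it is released. A minimal sketch of the try/finally pattern CodeQL is asking for; the function name and body here are illustrative, not the actual aexpect code:

```python
import fcntl
import os


def run_locked(lock_filename, critical_section, timeout=60):
    """Run `critical_section` under the file lock without leaking the fd.

    `lock_filename`/`timeout` mirror the get_lock_fd call in the diff; the
    body is only a sketch of the defensive shape CodeQL expects.
    """
    lock_fd = os.open(lock_filename, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(lock_fd, fcntl.LOCK_EX)   # acquire the exclusive lock
        critical_section()
        return True
    finally:
        # Always unlock and close, even if locking or the section fails.
        fcntl.flock(lock_fd, fcntl.LOCK_UN)
        os.close(lock_fd)
```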
@pevogam could you please take a look at this? I'm not completely sure we don't rely on the same threads being able to re-enter the critical sections, so thorough testing would be appreciated.
Sure, added to my backlog for the coming days.
I haven't had time to review yet. I'm running my test jobs with this patch. So far, no regression and no occurrence of the timeout issue.
Unfortunately, just after posting this comment I found another hang; this time the last logged error message was "2025-05-06 11:47:11,848 aexpect.client client L0433 WARNI| Failed to get lock, the aexpect_helper process might be left behind. Proceeding anyway..." The hang apparently happened again during session.close for the serial cleanup:
@smitterl could you please add a print of all processes in the system, as well as the free memory, in case it fails to get the lock? Because if …
Sure. But I usually don't have the means to identify the moment it fails to get the lock before our CI kills the test job, because it only reproduces in CI.
Something like #148
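For context, the kind of diagnostics being asked for could look roughly like the following; the helper name and log calls are hypothetical, not what #148 actually adds:

```python
import logging
import subprocess

LOG = logging.getLogger("aexpect.client")


def dump_lock_failure_diagnostics():
    """Log all processes and the free memory when acquiring the lock fails."""
    try:
        processes = subprocess.run(["ps", "auxww"], capture_output=True,
                                   text=True, check=False).stdout
        memory = subprocess.run(["free", "-m"], capture_output=True,
                                text=True, check=False).stdout
        LOG.warning("Failed to get lock; running processes:\n%s", processes)
        LOG.warning("Free memory:\n%s", memory)
    except OSError as details:
        LOG.warning("Could not collect diagnostics: %s", details)
```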
Thank you for the quick response. I've triggered the test job.
I had to retrigger the tests because my CI branch was out of date. Not sure if it's useful: while I was debugging some other test just now, it seemed to reproduce, but not with your patch. Still, maybe this information is useful. While it was stuck: At this point the test was still stuck, so I tried to kill the other one and the test continued. After it finished:
I'm a bit concerned because the recent test runs led to Jenkins temporarily losing connection to the host after some test ends in "FATAL: command execution failed". The timestamps are a bit confusing, but from them it would mean the system has been up since about … This is reproducible, I estimate at about 100%, with the failed-kill*-debug* branch.
Just for some info on my side: I ran integration tests with this pull request and everything passed, so whether or not it actually fixes what it aims to fix, I can at least confirm that it doesn't regress the current functionality.
My last 3 runs with this patch also succeeded and the original issue didn't reproduce. My test runs without this patch had reproduced it rather reliably, so this should work now. I'm just rerunning without this patch to double-check whether it is needed. But in any case I'm in favor of merging this, because from what I understand it will be an improvement to the handling of sessions.
FWIW, without this patch it didn't reproduce now either, but, as said earlier, I'm in favor of accepting this PR to improve the session handling, and there have been no indications of regressions introduced by it.
clebergnu left a comment
Hi @ldoktor,
This looks alright to me, but it doesn't hurt to be extra careful and prevent freezes if for some reason the machine clock changes (see my comment about the monotonic timer use).
PS: thanks for the reproducer, it was invaluable.
aexpect/shared.py (Outdated)

```python
lock_flags = fcntl.LOCK_EX
if timeout > 0:
    lock_flags |= fcntl.LOCK_NB
end_time = time.time() + timeout if timeout > 0 else -1
```
One good practice here is to use time.monotonic() instead of regular "wall clock".
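A minimal sketch of the suggested pattern, assuming the deadline is only compared against the current time (function name is illustrative):

```python
import time


def wait_for(condition, timeout, step=0.1):
    """Poll `condition` until it returns True or `timeout` seconds pass.

    time.monotonic() is unaffected by NTP adjustments or manual clock
    changes, so a clock jump can neither freeze the loop nor cut it short.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(step)
    return False
```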
Thanks for the suggestion, I treated it globally in this project in a separate commit.
In case the aexpect_helper kill fails we end up waiting forever (e.g. when it is stuck in a system call or in an unschedulable state). Let's change this to a wait of up to 60s and then proceed with a warning, which might result in a left-behind aexpect_helper but lets the rest of the testing continue. Signed-off-by: Lukáš Doktor <[email protected]>
The POSIX lockf doesn't protect against other threads accessing the lock, while the BSD flock does. Let's switch to BSD locks to protect the same sections against access from multiple threads. Signed-off-by: Lukáš Doktor <[email protected]>
Spawn.close() contains a critical section that should not be entered multiple times, which might happen when a user uses aexpect from different threads/processes. Let's add a lock to protect it. Signed-off-by: Lukáš Doktor <[email protected]>
To avoid problems with NTP (or iffy clock sources), let's use time.monotonic() for time-difference operations. Signed-off-by: Lukáš Doktor <[email protected]>
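Taken together, the commits amount to a locking pattern roughly like the one below; the function name matches the diff, but the body is an illustrative sketch rather than the merged implementation (for instance, returning None on timeout is an assumption):

```python
import fcntl
import os
import time


def get_lock_fd(lock_filename, timeout=-1):
    """Acquire an exclusive BSD flock on `lock_filename`.

    With a positive timeout, the lock is polled in non-blocking mode until
    a monotonic deadline passes; None is then returned so the caller can
    log a warning and proceed instead of hanging forever.
    """
    fd = os.open(lock_filename, os.O_RDWR | os.O_CREAT, 0o666)
    if timeout <= 0:
        fcntl.flock(fd, fcntl.LOCK_EX)      # block until the lock is free
        return fd
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            time.sleep(0.1)                 # lock busy, retry until deadline
    os.close(fd)                            # give up without leaking the fd
    return None
```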
Changes:
clebergnu left a comment
LGTM, thanks!
Thanks, the failure is in a different file; let's treat it separately.
Oh no, it was actually part of this PR... Anyway, it's already merged, so let me address that.
But I guess I should clarify: it did not appear with Python 3.12 and it does appear in some tests we have with Python 3.13. Does the error this was trying to fix look like this?
Hello @pevogam, I'm not sure I'm following the question. Are you stating this …
In short, I confirmed in the past that this PR does not cause any damage, and I was using Python 3.12 at the time. However, testing currently with Python 3.13, it does cause an issue, and I could confirm that by removing this exact set of patches to get passing tests again (with Python 3.13). So unfortunately it is an issue that this PR causes, and I can only observe it with the newest stable Python 3.13. It seems like a minor issue, since our VT tests succeed for the most part but end up with a FAIL status due to such unclosable aexpect helpers.
@smitterl identified a bug in aexpect where an unkillable `aexpect_worker` results in an indefinite hang of `Spawn.close()`. To mitigate that I added a timeout to our `wait_for_lock` and used it in `close()` (this fix can be tested by commenting out the `kill`). Then I noticed that `close()` can actually be called multiple times when the user uses multiple processes/threads. To address that I'd suggest moving to BSD flock rather than POSIX lockf, as that one protects individual threads, and adding another `close` lock file to protect this critical path. This can be tested by:

One thing I'm not completely sure about is whether we don't rely on the non-thread-safe lockf in aexpect, so please test this thoroughly.
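As a rough illustration of the described approach (the class, attribute, and helper below are simplified stand-ins, not the real `Spawn` code), the close path could be guarded like this:

```python
import fcntl
import logging
import os
import time

LOG = logging.getLogger("aexpect.client")


def get_lock_fd(lock_filename, timeout=60):
    """Try to flock `lock_filename`; return None when the timeout expires."""
    fd = os.open(lock_filename, os.O_RDWR | os.O_CREAT, 0o666)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            time.sleep(0.1)
    os.close(fd)
    return None


class Session:
    """Simplified stand-in for Spawn; only the close-locking idea is shown."""

    def __init__(self, a_id):
        self._close_lockfile = f"/tmp/aexpect_{a_id}.close.lock"

    def close(self):
        # Serialize the close critical section across threads/processes.
        lock = get_lock_fd(self._close_lockfile, timeout=60)
        if lock is None:
            LOG.warning("Failed to get lock, the aexpect_helper process "
                        "might be left behind. Proceeding anyway...")
        try:
            pass  # kill the helper, reap the shell process, remove temp files
        finally:
            if lock is not None:
                fcntl.flock(lock, fcntl.LOCK_UN)
                os.close(lock)
```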