Skip to content

Conversation

@ldoktor
Copy link
Contributor

@ldoktor ldoktor commented Apr 30, 2025

@smitterl identified a bug in aexpect, when unkillable aexpect_worker results in indefinite hang of Spawn.close(). To mitigate that I added a timeout to our wait_for_lock and used it in close() (this fix can be tested by commenting-out the kill). Then I noticed the close() can actually be called multiple times when the user uses multiple processes/threads. To address that I'd suggest moving to BSD flock rather than posix lockf as that one protects individual threads and adding another close lock file to protect this critical path. This can be tested by:

import aexpect
import threading
#import pydevd

def close_session(session):
    """Closes the given session."""
    #pydevd.settrace("127.0.0.1")
    session.close()

# Assuming you have a list of sessions to be closed
session = aexpect.ShellSession("bash")

threads = []
for _ in range(20000):
    thread = threading.Thread(target=close_session, args=(session,))
    threads.append(thread)

for thread in threads:
    thread.start()

for t in threads:
    t.join()

del(session)

print("DONE")

One thing I'm not completely sure about is whether we don't rely on the non-thread-safe lockf in aexpect so please test this thoroughly.

lock = None
try:
try:
lock = get_lock_fd(self._close_lockfile, timeout=60)

Check warning

Code scanning / CodeQL

File is not always closed Warning

File is opened but is not closed.
:return: True on success, False on failure/timeout
"""
try:
lock_fd = get_lock_fd(filename, timeout)

Check warning

Code scanning / CodeQL

File is not always closed Warning

File is opened but is not closed.
@ldoktor
Copy link
Contributor Author

ldoktor commented Apr 30, 2025

@pevogam could you please take a look at this? I'm not completely sure we don't rely on the same threads being able to re-entry the critical sections so thorough testing would be appreciated.

@pevogam
Copy link
Contributor

pevogam commented May 5, 2025

@pevogam could you please take a look at this? I'm not completely sure we don't rely on the same threads being able to re-entry the critical sections so thorough testing would be appreciated.

Sure, added to my backlog in the coming days.

@smitterl
Copy link

smitterl commented May 6, 2025

I haven't had time to review yet. I'm running my test jobs with this patch. So far no regression and no occurrence of the timeout issue.

@smitterl
Copy link

smitterl commented May 6, 2025

I haven't had time to review yet. I'm running my test jobs with this patch. So far no regression and no occurrence of the timeout issue.

Unfortunately, just after posting this comment I found another hang, this time the last logged error message was

"2025-05-06 11:47:11,848 aexpect.client client L0433 WARNI| Failed to get lock, the aexpect_helper process might be left behind. Proceeding anyway..."

The hang apparently happened again during session.close for serial cleanup:

025-05-06 11:45:55,153 avocado.boot_integration boot_integration L0287 DEBUG| Start to cleanup
2025-05-06 11:45:55,166 avocado.virttest.libvirt_vm libvirt_vm       L2364 DEBUG| Destroying VM
2025-05-06 11:45:55,175 avocado.virttest.libvirt_vm libvirt_vm       L2373 DEBUG| Trying to shutdown VM with shell command
2025-05-06 11:45:55,188 avocado.virttest.virt_vm virt_vm          L1504 DEBUG| Attempting to log into 'avocado-vt-vm1' via serial console (timeout 10s)
2025-05-06 11:45:56,760 avocado.virttest.env_process env_process      L1771 DEBUG| avocado-vt-vm1 alive now. Used to failed to get register info from guest 1 times
2025-05-06 11:46:02,757 avocado.virttest.ip_sniffing ip_sniffing      L0058 DEBUG| Updated HWADDR (52:54:00:a6:1b:18)<->(192.168.122.24) IP pair into address cache
2025-05-06 11:46:06,767 aexpect.client client           L1244 DEBUG| Sending command: echo %OS%
2025-05-06 11:46:06,869 aexpect.client client           L1289 DEBUG| Sending command (safe): ifconfig -a
2025-05-06 11:46:06,974 aexpect.client client           L1289 DEBUG| Sending command (safe): ip link | grep -B1 '' -i
2025-05-06 11:46:07,077 aexpect.client client           L1244 DEBUG| Sending command: test -d /sys/class/net/enc1
2025-05-06 11:46:07,178 aexpect.client client           L1244 DEBUG| Sending command: echo $?
2025-05-06 11:46:07,279 aexpect.client client           L1244 DEBUG| Sending command: cat /sys/class/net/enc1/address
2025-05-06 11:46:07,382 aexpect.client client           L1289 DEBUG| Sending command (safe): ifconfig enc1 || ip address show enc1
2025-05-06 11:46:08,741 avocado.virttest.libvirt_vm libvirt_vm       L2382 DEBUG| Shutdown command sent; waiting for VM to go down...
2025-05-06 11:46:11,835 avocado.virttest.libvirt_vm libvirt_vm       L2386 DEBUG| VM is down
2025-05-06 11:46:11,835 avocado.virttest.env_process env_process      L1747 WARNI| registers is not alive. Can't query the avocado-vt-vm1 status
2025-05-06 11:47:11,848 aexpect.client client           L0433 WARNI| Failed to get lock, the aexpect_helper process might be left behind. Proceeding anyway...

@ldoktor
Copy link
Contributor Author

ldoktor commented May 7, 2025

@smitterl could you please add print of all processes in the system as well as free memory in case it fails to get the lock? Because if kill -9 fails it usually means there is something ugly happening in the system. As far as I know it can only happen when the process is in kernel mode (usually not enough memory or in uninterruptable wait in kernel - waiting for HW).

@smitterl
Copy link

smitterl commented May 7, 2025

@smitterl could you please add print of all processes in the system as well as free memory in case it fails to get the lock? Because if kill -9 fails it usually means there is something ugly happening in the system. As far as I know it can only happen when the process is in kernel mode (usually not enough memory or in uninterruptable wait in kernel - waiting for HW).

Sure. But I usually don't have means to identify the moment it fails to get the lock before our CI kills the test job because it only reproduces in CI.
As I can run the test suite anytime with whatever is in your branch, how about you add the traces you are interested in another commit for now?

@ldoktor ldoktor mentioned this pull request May 7, 2025
@ldoktor
Copy link
Contributor Author

ldoktor commented May 7, 2025

Something like #148

@smitterl
Copy link

smitterl commented May 7, 2025

Something like #148

Thank you for the quick response. I've triggered the test job.

@smitterl
Copy link

smitterl commented May 8, 2025

I had to retrigger the tests because my CI branch was out of date.

Not sure if useful: While I was debugging some other test just now it seemed to me to reproduce - but not with your patch. Still maybe this information is useful.

While it was stuck:

# ps aux|grep aex
root      457606  0.0  0.0  32344 27492 pts/1    S+   06:46   0:00 /root/.libvirt-ci-venv-ci-runtest-fPZPi4/bin/python /root/.libvirt-ci-venv-ci-runtest-fPZPi4/bin/aexpect_helper
root      457759  0.1  0.0  32344 27360 pts/1    S+   06:46   0:00 /root/.libvirt-ci-venv-ci-runtest-fPZPi4/bin/python /root/.libvirt-ci-venv-ci-runtest-fPZPi4/bin/aexpect_helper
root      458689  0.0  0.0   6460  1968 pts/5    S+   06:48   0:00 grep --color=auto aex
# kill -9 457606
# ps aux|grep aex
root      457606  0.0  0.0      0     0 pts/1    Z+   06:46   0:00 [aexpect_helper] <defunct>
root      457759  0.0  0.0  32344 27360 pts/1    S+   06:46   0:00 /root/.libvirt-ci-venv-ci-runtest-fPZPi4/bin/python /root/.libvirt-ci-venv-ci-runtest-fPZPi4/bin/aexpect_helper
root      458812  0.0  0.0   6460  2032 pts/5    S+   06:48   0:00 grep --color=auto aex

At this point the test was still stuck, so I tried to kill the other one

# kill -9 457759
# ps aux|grep aex
root      457606  0.0  0.0      0     0 pts/1    Z+   06:46   0:00 [aexpect_helper] <defunct>
root      459044  6.2  0.0  32492 27488 pts/1    S+   06:48   0:00 /root/.libvirt-ci-venv-ci-runtest-fPZPi4/bin/python /root/.libvirt-ci-venv-ci-runtest-fPZPi4/bin/aexpect_helper
root      459091  0.0  0.0   6460  2032 pts/5    S+   06:48   0:00 grep --color=auto aex

and the test continued. After it finished:

# ps aux|grep aex
root      460726  0.0  0.0   6460  2032 pts/5    S+   06:52   0:00 grep --color=auto aex

@smitterl
Copy link

smitterl commented May 8, 2025

I'm a bit concerned because the recent test runs lead to Jenkins losing connection to the host temporarily after some test ends in "FATAL: command execution failed".
No coredump was listed. But I think there must have been a reboot:

# uptime
07:35:56 up  1:31,  2 users,  load average: 0.00, 0.00, 0.00

The time stamps are a bit confusing but from that it would mean the system is up since about 06:24:56 - and my FATAL log before losing connection happens at timestamp 06:25:09 at that system must have been up at least since 06:17:39 according to the previous CI logs.

This is reproducible, I estimate to about 100% with the failed-kill*-debug* branch.

@pevogam
Copy link
Contributor

pevogam commented May 12, 2025

Just for some info on my side - I ran integration tests with this pull request and everything passed - so whether it actually fixes what it aims to fix or not at least I can confirm that it doesn't regress the current functionality.

@smitterl
Copy link

My last 3 runs with this patch also succeeded and the original issue didn't reproduce. My test runs without this patch had reproduced it rather reliably so this should work now. I'm just rerunning without this patch to double check if it is needed. But in any case I'm in favor of merging this because from what I understand it will be an improvement for the handling of sessions.

@smitterl
Copy link

FWIW, without this patch it didn't reproduce now either but, as said earlier, I'm in favor of accepting this PR to improve the session handling and there have been no indications of regressions introduced by it.

Copy link
Contributor

@clebergnu clebergnu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ldoktor,

This looks alright to me, but it doesn't hurt to be extra careful and prevent freezes if for some reason the machine clock changes (see my comment about the monotonic timer use).

PS: thanks for the reproducer, it was invaluable.

lock_flags = fcntl.LOCK_EX
if timeout > 0:
lock_flags |= fcntl.LOCK_NB
end_time = time.time() + timeout if timeout > 0 else -1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One good practice here is to use time.monotonic() instead of regular "wall clock".

@ldoktor
Copy link
Contributor Author

ldoktor commented May 21, 2025

Hi @ldoktor,

This looks alright to me, but it doesn't hurt to be extra careful and prevent freezes if for some reason the machine clock changes (see my comment about the monotonic timer use).

PS: thanks for the reproducer, it was invaluable.

Thanks for the suggestion, I treated is globally in this project in a separate commit.

ldoktor added 4 commits May 27, 2025 13:09
In case aexpect_helper kill fails we end-up waiting for infinity (eg.
when in system call or unschedulable state). Let's change this to
up-to-60s wait and then proceeding with warning, which might result in
left-behind aexpect_helper but should proceed with other testing.

Signed-off-by: Lukáš Doktor <[email protected]>
The POSIX lockf doesn't protect other threads accessing the lock while
BSD lock does. Let's switch to BSD locks to protect against multiple
threads accessing the same sections.

Signed-off-by: Lukáš Doktor <[email protected]>
The Spawn.close() contains critical section that should not be accessed
multiple times, which might happen when a user uses aexpect from
different threads/processes. Let's add a lock to protect it.

Signed-off-by: Lukáš Doktor <[email protected]>
to avoid problems with NTP (or iffy clock sources) let's use
time.monotonic() for time difference operations.

Signed-off-by: Lukáš Doktor <[email protected]>
@ldoktor
Copy link
Contributor Author

ldoktor commented May 27, 2025

Changes:

  • rebased with the black linter

Copy link
Contributor

@clebergnu clebergnu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@ldoktor
Copy link
Contributor Author

ldoktor commented May 28, 2025

Thanks, the failure is in a different file, let's treat it separately

@ldoktor ldoktor merged commit 659ac1e into avocado-framework:main May 28, 2025
2 of 3 checks passed
@ldoktor
Copy link
Contributor Author

ldoktor commented May 28, 2025

Oh no, it was actually part of this PR... Anyway already merged so let me address that.

@pevogam
Copy link
Contributor

pevogam commented Jun 12, 2025

Just for some info on my side - I ran integration tests with this pull request and everything passed - so whether it actually fixes what it aims to fix or not at least I can confirm that it doesn't regress the current functionality.

But I guess I should clarify - it did not appear in Python 3.12 and it does appear in some tests we have with Python 3.13. Does the error this was trying to fix look like this?

[stderr] Traceback (most recent call last):                                                                                                                                                                                                │
[stderr]   File "/usr/lib/python3.13/site-packages/virttest/test_setup/core.py", line 100, in do_cleanup                                                                                                                                   │
[stderr]     self.__setupers.pop().cleanup()                                                                                                                                                                                               │
[stderr]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^                                                                                                                                                                                               │
[stderr]   File "/usr/lib/python3.13/site-packages/virttest/test_setup/networking.py", line 119, in cleanup                                                                                                                                │
[stderr]     self.env.stop_ip_sniffing()                                                                                                                                                                                                   │
[stderr]     ~~~~~~~~~~~~~~~~~~~~~~~~~^^                                                                                                                                                                                                   │
[stderr]   File "/usr/lib/python3.13/site-packages/virttest/utils_env.py", line 274, in stop_ip_sniffing                                                                                                                                   │
[stderr]     self._sniffer.stop()                                                                                                                                                                                                          │
[stderr]     ~~~~~~~~~~~~~~~~~~^^                                                                                                                                                                                                          │
[stderr]   File "/usr/lib/python3.13/site-packages/virttest/ip_sniffing.py", line 237, in stop                                                                                                                                             │
[stderr]     self._process.close()                                                                                                                                                                                                         │
[stderr]     ~~~~~~~~~~~~~~~~~~~^^                                                                                                                                                                                                         │
[stderr]   File "/usr/lib/python3.13/site-packages/aexpect/client.py", line 474, in close                                                                                                                                                  │
[stderr]     unlock_fd(lock)                                                                                                                                                                                                               │
[stderr]     ~~~~~~~~~^^^^^^                                                                                                                                                                                                               │
[stderr]   File "/usr/lib/python3.13/site-packages/aexpect/shared.py", line 46, in unlock_fd                                                                                                                                               │
[stderr]     fcntl.flock(lock_fd, fcntl.LOCK_UN)                                                                                                                                                                                           │
[stderr]     ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                           │
[stderr] OSError: [Errno 9] Bad file descriptor

@ldoktor
Copy link
Contributor Author

ldoktor commented Jun 13, 2025

Just for some info on my side - I ran integration tests with this pull request and everything passed - so whether it actually fixes what it aims to fix or not at least I can confirm that it doesn't regress the current functionality.

But I guess I should clarify - it did not appear in Python 3.12 and it does appear in some tests we have with Python 3.13. Does the error this was trying to fix look like this?

[stderr] Traceback (most recent call last):                                                                                                                                                                                                │
[stderr]   File "/usr/lib/python3.13/site-packages/virttest/test_setup/core.py", line 100, in do_cleanup                                                                                                                                   │
[stderr]     self.__setupers.pop().cleanup()                                                                                                                                                                                               │
[stderr]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^                                                                                                                                                                                               │
[stderr]   File "/usr/lib/python3.13/site-packages/virttest/test_setup/networking.py", line 119, in cleanup                                                                                                                                │
[stderr]     self.env.stop_ip_sniffing()                                                                                                                                                                                                   │
[stderr]     ~~~~~~~~~~~~~~~~~~~~~~~~~^^                                                                                                                                                                                                   │
[stderr]   File "/usr/lib/python3.13/site-packages/virttest/utils_env.py", line 274, in stop_ip_sniffing                                                                                                                                   │
[stderr]     self._sniffer.stop()                                                                                                                                                                                                          │
[stderr]     ~~~~~~~~~~~~~~~~~~^^                                                                                                                                                                                                          │
[stderr]   File "/usr/lib/python3.13/site-packages/virttest/ip_sniffing.py", line 237, in stop                                                                                                                                             │
[stderr]     self._process.close()                                                                                                                                                                                                         │
[stderr]     ~~~~~~~~~~~~~~~~~~~^^                                                                                                                                                                                                         │
[stderr]   File "/usr/lib/python3.13/site-packages/aexpect/client.py", line 474, in close                                                                                                                                                  │
[stderr]     unlock_fd(lock)                                                                                                                                                                                                               │
[stderr]     ~~~~~~~~~^^^^^^                                                                                                                                                                                                               │
[stderr]   File "/usr/lib/python3.13/site-packages/aexpect/shared.py", line 46, in unlock_fd                                                                                                                                               │
[stderr]     fcntl.flock(lock_fd, fcntl.LOCK_UN)                                                                                                                                                                                           │
[stderr]     ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                           │
[stderr] OSError: [Errno 9] Bad file descriptor

Hello @pevogam I'm not sure I'm following the question. Are you stating this Bad file descriptor occurs as a new issue with this PR, or is it something that actually got resolved (unexpectedly) by this PR? Or is it a real issue or just a minor warning?

@pevogam
Copy link
Contributor

pevogam commented Jun 13, 2025

Just for some info on my side - I ran integration tests with this pull request and everything passed - so whether it actually fixes what it aims to fix or not at least I can confirm that it doesn't regress the current functionality.

But I guess I should clarify - it did not appear in Python 3.12 and it does appear in some tests we have with Python 3.13. Does the error this was trying to fix look like this?

[stderr] Traceback (most recent call last):                                                                                                                                                                                                │
[stderr]   File "/usr/lib/python3.13/site-packages/virttest/test_setup/core.py", line 100, in do_cleanup                                                                                                                                   │
[stderr]     self.__setupers.pop().cleanup()                                                                                                                                                                                               │
[stderr]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^                                                                                                                                                                                               │
[stderr]   File "/usr/lib/python3.13/site-packages/virttest/test_setup/networking.py", line 119, in cleanup                                                                                                                                │
[stderr]     self.env.stop_ip_sniffing()                                                                                                                                                                                                   │
[stderr]     ~~~~~~~~~~~~~~~~~~~~~~~~~^^                                                                                                                                                                                                   │
[stderr]   File "/usr/lib/python3.13/site-packages/virttest/utils_env.py", line 274, in stop_ip_sniffing                                                                                                                                   │
[stderr]     self._sniffer.stop()                                                                                                                                                                                                          │
[stderr]     ~~~~~~~~~~~~~~~~~~^^                                                                                                                                                                                                          │
[stderr]   File "/usr/lib/python3.13/site-packages/virttest/ip_sniffing.py", line 237, in stop                                                                                                                                             │
[stderr]     self._process.close()                                                                                                                                                                                                         │
[stderr]     ~~~~~~~~~~~~~~~~~~~^^                                                                                                                                                                                                         │
[stderr]   File "/usr/lib/python3.13/site-packages/aexpect/client.py", line 474, in close                                                                                                                                                  │
[stderr]     unlock_fd(lock)                                                                                                                                                                                                               │
[stderr]     ~~~~~~~~~^^^^^^                                                                                                                                                                                                               │
[stderr]   File "/usr/lib/python3.13/site-packages/aexpect/shared.py", line 46, in unlock_fd                                                                                                                                               │
[stderr]     fcntl.flock(lock_fd, fcntl.LOCK_UN)                                                                                                                                                                                           │
[stderr]     ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                           │
[stderr] OSError: [Errno 9] Bad file descriptor

Hello @pevogam I'm not sure I'm following the question. Are you stating this Bad file descriptor occurs as a new issue with this PR, or is it something that actually got resolved (unexpectedly) by this PR? Or is it a real issue or just a minor warning?

In short, I confirmed in the past this PR does not cause any damage and was using Python 3.12. However, testing currently with Python 3.13 it does cause an issue and I could confirm that by removing this exact set of patches to have passing tests again (with Python 3.13. So unfortunately it is an issue that this PR causes and I can only observe it with the newest stable Python 3.13. It seems like a minor issue since our VT tests succeed for the most part but end up with a FAIL status due to such unclosable aexpect helpers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants