Failed kill #147
Conversation
```python
lock = None
try:
    try:
        lock = get_lock_fd(self._close_lockfile, timeout=60)
```

Check warning (Code scanning / CodeQL): File is not always closed.
```python
    :return: True on success, False on failure/timeout
    """
    try:
        lock_fd = get_lock_fd(filename, timeout)
```

Check warning (Code scanning / CodeQL): File is not always closed.
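Both warnings flag the same shape: the descriptor returned by get_lock_fd can leak if an exception is raised before it is released. A minimal sketch of the try/finally pattern CodeQL is asking for; the function name and body here are illustrative, not the actual aexpect code:

```python
import fcntl
import os


def run_locked(lock_filename, critical_section, timeout=60):
    """Run `critical_section` under the file lock without leaking the fd.

    `lock_filename`/`timeout` mirror the get_lock_fd call in the diff; the
    body is only a sketch of the defensive shape CodeQL expects.
    """
    lock_fd = os.open(lock_filename, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(lock_fd, fcntl.LOCK_EX)   # acquire the exclusive lock
        critical_section()
        return True
    finally:
        # Always unlock and close, even if locking or the section fails.
        fcntl.flock(lock_fd, fcntl.LOCK_UN)
        os.close(lock_fd)
```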
@pevogam could you please take a look at this? I'm not completely sure we don't rely on the same threads being able to re-enter the critical sections, so thorough testing would be appreciated.
Sure, added to my backlog for the coming days.
I haven't had time to review yet. I'm running my test jobs with this patch. So far, no regression and no occurrence of the timeout issue.
Unfortunately, just after posting this comment I found another hang; this time the last logged error message was "2025-05-06 11:47:11,848 aexpect.client client L0433 WARNI| Failed to get lock, the aexpect_helper process might be left behind. Proceeding anyway..." The hang apparently happened again during session.close for the serial cleanup:
@smitterl could you please add a print of all processes in the system, as well as the free memory, in case it fails to get the lock? Because if …
Sure. But I usually don't have the means to identify the moment it fails to get the lock before our CI kills the test job, because it only reproduces in CI.
Something like #148
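For context, the kind of diagnostics being asked for could look roughly like the following; the helper name and log calls are hypothetical, not what #148 actually adds:

```python
import logging
import subprocess

LOG = logging.getLogger("aexpect.client")


def dump_lock_failure_diagnostics():
    """Log all processes and the free memory when acquiring the lock fails."""
    try:
        processes = subprocess.run(["ps", "auxww"], capture_output=True,
                                   text=True, check=False).stdout
        memory = subprocess.run(["free", "-m"], capture_output=True,
                                text=True, check=False).stdout
        LOG.warning("Failed to get lock; running processes:\n%s", processes)
        LOG.warning("Free memory:\n%s", memory)
    except OSError as details:
        LOG.warning("Could not collect diagnostics: %s", details)
```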
Thank you for the quick response. I've triggered the test job.
I had to retrigger the tests because my CI branch was out of date. Not sure if it's useful: while I was debugging some other test just now, it seemed to reproduce, but not with your patch. Still, maybe this information is useful. While it was stuck: At this point the test was still stuck, so I tried to kill the other one and the test continued. After it finished:
I'm a bit concerned because the recent test runs led to Jenkins temporarily losing connection to the host after some test ends in "FATAL: command execution failed". The timestamps are a bit confusing, but from them it would mean the system has been up since about … This is reproducible, I estimate at about 100%, with the failed-kill*-debug* branch.
Just for some info on my side: I ran integration tests with this pull request and everything passed, so whether or not it actually fixes what it aims to fix, I can at least confirm that it doesn't regress the current functionality.
My last 3 runs with this patch also succeeded and the original issue didn't reproduce. My test runs without this patch had reproduced it rather reliably, so this should work now. I'm just rerunning without this patch to double-check whether it is needed. But in any case I'm in favor of merging this, because from what I understand it will be an improvement to the handling of sessions.
FWIW, without this patch it didn't reproduce now either, but, as said earlier, I'm in favor of accepting this PR to improve the session handling, and there have been no indications of regressions introduced by it.
clebergnu left a comment
Hi @ldoktor,
This looks alright to me, but it doesn't hurt to be extra careful and prevent freezes if for some reason the machine clock changes (see my comment about the monotonic timer use).
PS: thanks for the reproducer, it was invaluable.
aexpect/shared.py (Outdated)

```python
lock_flags = fcntl.LOCK_EX
if timeout > 0:
    lock_flags |= fcntl.LOCK_NB
end_time = time.time() + timeout if timeout > 0 else -1
```
One good practice here is to use time.monotonic() instead of regular "wall clock".
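A minimal sketch of the suggested pattern, assuming the deadline is only compared against the current time (function name is illustrative):

```python
import time


def wait_for(condition, timeout, step=0.1):
    """Poll `condition` until it returns True or `timeout` seconds pass.

    time.monotonic() is unaffected by NTP adjustments or manual clock
    changes, so a clock jump can neither freeze the loop nor cut it short.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(step)
    return False
```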
Thanks for the suggestion, I treated it globally in this project in a separate commit.
In case the aexpect_helper kill fails we end up waiting forever (e.g. when it is stuck in a system call or in an unschedulable state). Let's change this to a wait of up to 60s and then proceed with a warning, which might result in a left-behind aexpect_helper but lets the rest of the testing continue. Signed-off-by: Lukáš Doktor <[email protected]>
The POSIX lockf doesn't protect against other threads accessing the lock, while the BSD flock does. Let's switch to BSD locks to protect the same sections against access from multiple threads. Signed-off-by: Lukáš Doktor <[email protected]>
Spawn.close() contains a critical section that should not be entered multiple times, which might happen when a user uses aexpect from different threads/processes. Let's add a lock to protect it. Signed-off-by: Lukáš Doktor <[email protected]>
To avoid problems with NTP (or iffy clock sources), let's use time.monotonic() for time-difference operations. Signed-off-by: Lukáš Doktor <[email protected]>
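Taken together, the commits amount to a locking pattern roughly like the one below; the function name matches the diff, but the body is an illustrative sketch rather than the merged implementation (for instance, returning None on timeout is an assumption):

```python
import fcntl
import os
import time


def get_lock_fd(lock_filename, timeout=-1):
    """Acquire an exclusive BSD flock on `lock_filename`.

    With a positive timeout, the lock is polled in non-blocking mode until
    a monotonic deadline passes; None is then returned so the caller can
    log a warning and proceed instead of hanging forever.
    """
    fd = os.open(lock_filename, os.O_RDWR | os.O_CREAT, 0o666)
    if timeout <= 0:
        fcntl.flock(fd, fcntl.LOCK_EX)      # block until the lock is free
        return fd
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            time.sleep(0.1)                 # lock busy, retry until deadline
    os.close(fd)                            # give up without leaking the fd
    return None
```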
Changes:
clebergnu left a comment
LGTM, thanks!
Thanks, the failure is in a different file; let's treat it separately.
Oh no, it was actually part of this PR... Anyway, it's already merged, so let me address that.
But I guess I should clarify: it did not appear with Python 3.12 and it does appear in some tests we have with Python 3.13. Does the error this was trying to fix look like this?
Hello @pevogam, I'm not sure I'm following the question. Are you stating this …
In short, I confirmed in the past that this PR does not cause any damage, and I was using Python 3.12 at the time. However, testing currently with Python 3.13, it does cause an issue, and I could confirm that by removing this exact set of patches to get passing tests again (with Python 3.13). So unfortunately it is an issue that this PR causes, and I can only observe it with the newest stable Python 3.13. It seems like a minor issue, since our VT tests succeed for the most part but end up with a FAIL status due to such unclosable aexpect helpers.
@smitterl identified a bug in aexpect where an unkillable `aexpect_worker` results in an indefinite hang of `Spawn.close()`. To mitigate that I added a timeout to our `wait_for_lock` and used it in `close()` (this fix can be tested by commenting out the `kill`). Then I noticed that `close()` can actually be called multiple times when the user uses multiple processes/threads. To address that I'd suggest moving to BSD flock rather than POSIX lockf, as that one protects individual threads, and adding another `close` lock file to protect this critical path. This can be tested by:

One thing I'm not completely sure about is whether we don't rely on the non-thread-safe lockf in aexpect, so please test this thoroughly.
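As a rough illustration of the described approach (the class, attribute, and helper below are simplified stand-ins, not the real `Spawn` code), the close path could be guarded like this:

```python
import fcntl
import logging
import os
import time

LOG = logging.getLogger("aexpect.client")


def get_lock_fd(lock_filename, timeout=60):
    """Try to flock `lock_filename`; return None when the timeout expires."""
    fd = os.open(lock_filename, os.O_RDWR | os.O_CREAT, 0o666)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            time.sleep(0.1)
    os.close(fd)
    return None


class Session:
    """Simplified stand-in for Spawn; only the close-locking idea is shown."""

    def __init__(self, a_id):
        self._close_lockfile = f"/tmp/aexpect_{a_id}.close.lock"

    def close(self):
        # Serialize the close critical section across threads/processes.
        lock = get_lock_fd(self._close_lockfile, timeout=60)
        if lock is None:
            LOG.warning("Failed to get lock, the aexpect_helper process "
                        "might be left behind. Proceeding anyway...")
        try:
            pass  # kill the helper, reap the shell process, remove temp files
        finally:
            if lock is not None:
                fcntl.flock(lock, fcntl.LOCK_UN)
                os.close(lock)
```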