Fix race condition of un-expired cache in local workers #16256
Merged



SUMMARY
Still seeing in live PRs:
#16255
I'm fairly sure I know what's happening in the case of the GitHub runner.
This clears the setting value in both forms of cache:
The problem, however, is that the project update runs in a worker, of which there are 4 by default. The in-memory cache expires after 5 seconds, but I expect the first project update to start in less than 5 seconds in most cases. That means that in some cases it will pick up the stale value of `AWX_ISOLATION_SHOW_PATHS`. That would predict a 3 out of 4 chance of hitting the failure, but we have been seeing it much less frequently than that. So what gives? Out of the 4 workers, it's possible (likely, actually) that many of them won't have a pre-existing value. That is to say, some workers have run a job in the last 5 seconds, but others have not. This matches the observed frequency very well and gives a concrete recipe for reproduction: if you first saturate the workers with unrelated jobs, you populate their in-memory caches, and then you could get a 3 of 4 or 4 of 4 chance of hitting the failure due to the in-memory cache. Specifically, that in-memory cache is:
awx/awx/conf/settings.py, line 36 and line 260 (both at commit 99dce79).
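To make the mechanism concrete, here is a minimal simulation, assuming each worker keeps a private TTL cache in front of a shared store. The shared dict stands in for both the database and the shared cache, which the test setup does update. `Worker`, `FakeClock`, and `run_scenario` are hypothetical names, not AWX code, and the fake clock replaces real sleeps.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model of per-worker settings caching; not AWX's actual classes.
CACHE_TTL = 5.0  # seconds, matching the in-memory expiration described above
KEY = "AWX_ISOLATION_SHOW_PATHS"


class FakeClock:
    """Manually advanced clock so the demo needs no real sleeps."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


class Worker:
    """One worker process: a private TTL cache in front of a shared store."""

    def __init__(self, shared: Dict[str, object], clock: Callable[[], float], ttl: float = CACHE_TTL) -> None:
        self.shared = shared
        self.clock = clock
        self.ttl = ttl
        self.local: Dict[str, Tuple[object, float]] = {}  # key -> (value, expires_at)

    def get(self, key: str) -> object:
        entry = self.local.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]  # served from this worker's memory, possibly stale
        value = self.shared.get(key)  # miss or expired: re-read the shared store
        self.local[key] = (value, self.clock() + self.ttl)
        return value


def run_scenario(warm_workers: int, total_workers: int = 4, wait: float = 0.0) -> List[bool]:
    """Return, per worker, whether it would still serve the old value."""
    clock = FakeClock()
    shared: Dict[str, object] = {KEY: ["old"]}
    workers = [Worker(shared, clock) for _ in range(total_workers)]
    # Unrelated jobs ran recently on some workers, populating their local caches.
    for worker in workers[:warm_workers]:
        worker.get(KEY)
    # Test setup changes the setting; only the shared store sees the new value.
    shared[KEY] = ["new"]
    clock.advance(wait)  # the fixture's delay before the project update starts
    return [worker.get(KEY) == ["old"] for worker in workers]
```

With three of four workers warm, three of four would hand a project update the stale value. A 0.2 second wait changes nothing, while waiting the full 5 seconds lets every local entry expire.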
So I hope I've convinced you of the mechanism now. Accepting this theory, the solution is clear: the sleep must be at least as long as the in-memory cache expiration time (absent some better way of telling workers to clear their caches).
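A minimal sketch of that fix, assuming the fixture's guard compares the current setting against the desired value (the real condition in the `live_tmp_folder` fixture isn't reproduced here). `ensure_setting`, `store`, and `clear_cache` are hypothetical stand-ins, and `sleep` is injectable so the delay can be skipped when exercising the logic.

```python
import time
from typing import Callable, Dict

# Hypothetical stand-in for the fixture's guarded setup; names are illustrative.
CACHE_TTL = 5.0  # must be at least the workers' in-memory cache expiration


def ensure_setting(
    store: Dict[str, object],
    key: str,
    desired: object,
    clear_cache: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
    ttl: float = CACHE_TTL,
) -> bool:
    """Apply `desired` and wait out worker caches; return True if work was done."""
    if store.get(key) == desired:
        # Already applied earlier in this test run: no clearing, no delay.
        return False
    store[key] = desired
    clear_cache()  # clears the shared cache and this process's local cache
    sleep(ttl)  # other workers' local caches can't be cleared from here, so outlast them
    return True
```

The delay is paid only once: a second call finds the value already in place and returns immediately.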
This looks very bad for test performance, since this fixture is used in many tests. However, the `if` block we are under should only be hit on the first application of the fixture, that is, once per test run. So I'll accept the 5 additional seconds to eliminate this failure we are still seeing.
ISSUE TYPE
COMPONENT NAME
Note
Low Risk
Low risk: change is limited to live test setup timing; primary impact is slightly longer test runtime and potential for added flakiness if 5s is still insufficient on slow CI.
Overview
Reduces live-test flakiness caused by a stale per-worker in-memory settings cache for `AWX_ISOLATION_SHOW_PATHS` by increasing the delay after `clear_setting_cache` from `0.2s` to `5.0s` in the `live_tmp_folder` session fixture, ensuring worker-local TTL caches have time to expire before project updates start.
Written by Cursor Bugbot for commit 9f365bc.