Please confirm the following
Bug Summary
Summary
When running AWX with multiple task replicas (>1), jobs fail immediately
with FileExistsError on /var/lib/awx/projects when triggered in parallel.
The root cause is a TOCTOU race condition in acquire_lock().
AWX Version
24.6.1 (also present in latest devel branch as of 2026-05-05)
Steps to Reproduce
- Deploy AWX with replicas: 3 and a RWX PVC for projects (CephFS/NFS)
- Trigger 2+ jobs simultaneously targeting different projects
- Observe immediate failure on some jobs
Error
  File "awx/main/tasks/jobs.py", line 379, in acquire_lock
    os.mkdir(settings.PROJECTS_ROOT)
FileExistsError: [Errno 17] File exists: '/var/lib/awx/projects'
Root Cause
In awx/main/tasks/jobs.py, the acquire_lock() function uses a
non-atomic check-then-act pattern on PROJECTS_ROOT:
# Current code - TOCTOU race condition
if not os.path.exists(settings.PROJECTS_ROOT):
    os.mkdir(settings.PROJECTS_ROOT)
With multiple task pods running concurrently, all pods can pass the
os.path.exists() check simultaneously before any of them creates
the directory, causing all but the first to raise FileExistsError.
Note: the per-project locking mechanism using fcntl.lockf() is
correctly implemented and unaffected by this bug.
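The interleaving described above can be forced deterministically in a small standalone script. This is only a sketch, not AWX code: `racy_ensure_dir`, `run_race`, and the temporary directory are hypothetical stand-ins. Threads play the role of task pods, and a `threading.Barrier` holds every worker between the existence check and the `os.mkdir()` call.

```python
import os
import tempfile
import threading


def racy_ensure_dir(path, barrier, errors):
    """Check-then-act directory creation, mirroring the pattern in the report."""
    if not os.path.exists(path):
        # Hold every worker here so that all of them have passed the check
        # before any of them calls mkdir (the worst-case interleaving).
        barrier.wait()
        try:
            os.mkdir(path)
        except FileExistsError as exc:
            errors.append(exc)


def run_race(workers=3):
    """Return how many workers hit FileExistsError."""
    errors = []
    barrier = threading.Barrier(workers)
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "projects")
        threads = [
            threading.Thread(target=racy_ensure_dir, args=(target, barrier, errors))
            for _ in range(workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return len(errors)


if __name__ == "__main__":
    # With 3 workers, one mkdir succeeds and the other two fail.
    print(run_race(3))
```

In a real deployment the window between the check and the `mkdir` is much narrower, so only some jobs fail. The barrier simply widens that window so that every worker but one loses.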
Proposed Fix
Replace the check-then-act pattern with a single os.makedirs() call:
# Fix - idempotent; tolerates another pod creating the directory first
os.makedirs(settings.PROJECTS_ROOT, exist_ok=True)
This is a one-line fix. With exist_ok=True the call succeeds when the
directory already exists, including when another pod creates it
concurrently, so losing the race no longer raises FileExistsError.
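A minimal sketch of the proposed call under the contention described in this report, assuming threads as a stand-in for task pods. `safe_ensure_dir` and `run_safe` are hypothetical names, not AWX functions, and the temporary directory stands in for PROJECTS_ROOT.

```python
import os
import tempfile
import threading


def safe_ensure_dir(path, barrier, errors):
    """Create the directory without a prior existence check."""
    barrier.wait()  # release all workers at once to maximise contention
    try:
        os.makedirs(path, exist_ok=True)
    except OSError as exc:
        errors.append(exc)


def run_safe(workers=3):
    """Return (number of errors, whether the directory exists afterwards)."""
    errors = []
    barrier = threading.Barrier(workers)
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "projects")
        threads = [
            threading.Thread(target=safe_ensure_dir, args=(target, barrier, errors))
            for _ in range(workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        created = os.path.isdir(target)
    return len(errors), created


if __name__ == "__main__":
    print(run_safe(3))
```

One caveat: `os.makedirs(..., exist_ok=True)` still raises FileExistsError if the path exists but is not a directory, so a genuinely misconfigured PROJECTS_ROOT would still surface as an error.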
Workaround
Reduce task replicas to 1. This eliminates the race condition but
removes task HA.
Additional Context
- Confirmed present in devel branch as of 2026-05-05
- PVC access mode: ReadWriteMany (CephFS)
- Operator version: 2.19.1
- The bug is triggered even when parallel jobs target different
projects, since all jobs pass through this PROJECTS_ROOT check
before reaching their individual project lock path
AWX version
24.6.1
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
Steps to Reproduce
- Deploy AWX with replicas: 3 on Kubernetes
- Configure a RWX PVC for projects storage (CephFS or NFS)
- Create 2+ job templates pointing to different projects
- Trigger all jobs simultaneously (e.g. via scheduled jobs
or API calls at the same time)
- Observe that some jobs fail immediately before playbook execution
Expected Behavior
All jobs should start normally regardless of how many task replicas
are running or how many jobs are triggered simultaneously.
Actual Behavior
Some jobs fail immediately with:
  File "awx/main/tasks/jobs.py", line 379, in acquire_lock
    os.mkdir(settings.PROJECTS_ROOT)
FileExistsError: [Errno 17] File exists: '/var/lib/awx/projects'
The failure rate increases with the number of task replicas and the number of simultaneous jobs.
Expected results
All jobs start normally, with no FileExistsError raised from acquire_lock(),
regardless of the number of task replicas or simultaneous jobs.
Actual results
  File "awx/main/tasks/jobs.py", line 379, in acquire_lock
    os.mkdir(settings.PROJECTS_ROOT)
FileExistsError: [Errno 17] File exists: '/var/lib/awx/projects'
Additional information
No response