Commit 1538e9f

chrisguidry and claude committed
Serialize cluster image builds with file lock
The `AlreadyExists` fix in #337 handled one symptom of parallel xdist workers racing to build the same cluster image, but there's a second failure mode showing up in CI:

https://github.com/chrisguidry/docket/actions/runs/22025132964/job/63640478732

When concurrent builds target the same tag, the Docker SDK's `build()` completes successfully in the daemon, then tries to inspect the resulting image by its short ID. If another worker's build re-tagged the image in the meantime, the first image ID gets orphaned and the inspect 404s. This knocked out 485 of 686 tests in the cluster job.

Rather than catching yet another exception type, this serializes the builds with `fcntl.flock` so only one worker builds at a time. The others wait and find it already built. Eliminates both the `AlreadyExists` and `ImageNotFound` races structurally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
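The lock, re-check, build sequence described in the message can be sketched without Docker. This is a minimal illustration of the pattern, not the code in this commit: `build_once`, `is_built`, and `build` are hypothetical names standing in for the image lookup and the image build, and `fcntl` is only available on Unix-like systems.

```python
import fcntl
from pathlib import Path
from typing import Callable


def build_once(
    lock_path: Path,
    is_built: Callable[[], bool],
    build: Callable[[], None],
) -> bool:
    """Run build() at most once across concurrent callers.

    Returns True if this caller performed the build, False if the
    artifact already existed (hypothetical helper, for illustration).
    """
    # Fast path: skip the lock entirely when the artifact already exists.
    if is_built():
        return False

    with open(lock_path, "w") as lock_file:
        # Blocks until no other open file description holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)

        # Re-check under the lock: a caller that held it before us
        # may have finished the build while we were waiting.
        if is_built():
            return False

        build()
        return True
    # Closing the file on exit from the `with` block releases the lock.
```

Every caller that loses the race blocks in `flock`, then sees the finished artifact on the second check and returns without building. That is why a single lock can remove both failure modes: no two builds for the same target ever overlap.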
1 parent aace0e1 commit 1538e9f

1 file changed: +21 -8 lines changed

tests/_container.py

Lines changed: 21 additions & 8 deletions
@@ -4,8 +4,10 @@
 including single-node Redis, Redis Cluster, and Valkey variants.
 """
 
+import fcntl
 import os
 import socket
+import tempfile
 import time
 from contextlib import contextmanager
 from datetime import datetime, timedelta, timezone
@@ -167,7 +169,13 @@ def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
 
 
 def build_cluster_image(client: DockerClient, base_image: str) -> str:
-    """Build cluster image from base image, return image tag."""
+    """Build cluster image from base image, return image tag.
+
+    Uses a file lock to serialize builds across parallel xdist workers.
+    Without the lock, concurrent builds for the same tag race: one build
+    re-tags the image and the Docker SDK's post-build inspect on the
+    loser's image ID gets a 404.
+    """
     tag = f"docket-cluster:{base_image.replace('/', '-').replace(':', '-')}"
 
     try:
@@ -176,18 +184,23 @@ def build_cluster_image(client: DockerClient, base_image: str) -> str:
     except docker.errors.ImageNotFound:
         pass
 
-    cluster_dir = Path(__file__).parent / "cluster"
-    try:
+    lock_path = Path(tempfile.gettempdir()) / f"docket-{tag.replace(':', '-')}.lock"
+    with open(lock_path, "w") as lock_file:
+        fcntl.flock(lock_file, fcntl.LOCK_EX)
+
+        # Re-check after acquiring lock; another worker may have built it
+        try:
+            client.images.get(tag)
+            return tag
+        except docker.errors.ImageNotFound:
+            pass
+
+        cluster_dir = Path(__file__).parent / "cluster"
         client.images.build(
             path=str(cluster_dir),
             tag=tag,
             buildargs={"BASE_IMAGE": base_image},
         )
-    except docker.errors.BuildError as e:
-        if "AlreadyExists" in str(e):
-            pass
-        else:
-            raise
     return tag
 
 

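The lock file name is derived from the image tag, so the lock is scoped per base image rather than global. The sketch below copies the two expressions from the diff into standalone helpers to show what they evaluate to. The helper names `cluster_tag` and `lock_path_for` and the sample base images are assumptions made for illustration; they do not appear in the commit.

```python
import tempfile
from pathlib import Path


def cluster_tag(base_image: str) -> str:
    """Same expression as `tag` in build_cluster_image (hypothetical helper)."""
    return f"docket-cluster:{base_image.replace('/', '-').replace(':', '-')}"


def lock_path_for(tag: str) -> Path:
    """Same expression as `lock_path` in build_cluster_image (hypothetical helper)."""
    return Path(tempfile.gettempdir()) / f"docket-{tag.replace(':', '-')}.lock"


# Sample base images, chosen only for illustration.
for base_image in ("redis:7.4.2", "valkey/valkey:8.0"):
    tag = cluster_tag(base_image)
    print(tag, lock_path_for(tag).name)
```

For these inputs the tags are `docket-cluster:redis-7.4.2` and `docket-cluster:valkey-valkey-8.0`, and the lock files are `docket-docket-cluster-redis-7.4.2.lock` and `docket-docket-cluster-valkey-valkey-8.0.lock`. Because the paths differ, workers building different base images would not block each other; only builds for the same tag are serialized.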