
Commit f410dfb

chrisguidry and claude authored
Serialize cluster image builds and split CLI tests into separate CI job (#338)
Two changes to improve CI reliability:

**Serialize cluster image builds with file lock**

The `AlreadyExists` fix in #337 handled one symptom of parallel xdist workers racing to build the same cluster image, but a second failure mode has been showing up in CI:
https://github.com/chrisguidry/docket/actions/runs/22025132964/job/63640478732

When concurrent builds target the same tag, the Docker SDK's `build()` completes successfully in the daemon, then tries to inspect the resulting image by its short ID. If another worker's build re-tagged the image in the meantime, the first image ID is orphaned and the inspect 404s. This knocked out 485 of 686 tests in the cluster job.

Rather than catching yet another exception type, this change serializes the builds with `fcntl.flock` so only one worker builds at a time. The others wait, then find the image already built. This eliminates both the `AlreadyExists` and `ImageNotFound` races structurally.

**Split CLI tests into separate CI job**

Cluster CI jobs consistently run right at the 4-minute timeout, and when any test runs slightly slow the whole job is cancelled. This has been showing up in roughly a third of recent CI runs:
https://github.com/chrisguidry/docket/actions/runs/22025359927/job/63641245074

The 91 CLI tests are subprocess-based and don't exercise backend-specific behavior: they spawn `python -m docket ...` processes and check the output. Running them against every Python × backend combination (30 matrix entries) is wasted effort.

This moves the CLI tests to their own job that varies by Python version but uses a single Redis backend (8.0). The main test matrix now passes `--ignore=tests/cli` so the cluster/valkey/memory jobs only run the tests that actually care about the backend. Local `pytest` runs are unaffected.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
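The flock-based serialization described above can be sketched as a minimal, self-contained pattern. Everything here is a hypothetical stand-in for the real Docker image build: `build_once`, the artifact file, and the worker names are illustration only, and threads stand in for xdist worker processes (`fcntl.flock` serializes separate open file descriptions either way on Linux).

```python
import fcntl
import tempfile
import threading
from pathlib import Path

built = []  # records which workers actually performed the "build"

def build_once(worker: str, artifact: Path, lock_path: Path) -> None:
    # Fast path: if the artifact already exists, skip the lock entirely.
    if artifact.exists():
        return
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until exclusive
        # Re-check after acquiring the lock: another worker may have
        # finished the build while we were waiting.
        if artifact.exists():
            return
        built.append(worker)
        artifact.write_text("image")  # the "build" happens exactly once
    # the lock is released when lock_file is closed

tmp = Path(tempfile.mkdtemp())
artifact, lock_path = tmp / "image.tag", tmp / "build.lock"
threads = [
    threading.Thread(target=build_once, args=(f"w{i}", artifact, lock_path))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(built))  # → 1: exactly one worker built, the rest found it built
```

The check-lock-recheck shape is the whole trick: the second existence check inside the lock is what prevents a waiting worker from repeating the build after the winner finishes.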
1 parent aace0e1 · commit f410dfb

File tree

14 files changed (+126, -46 lines)


.coveragerc-cli

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+[run]
+branch = true
+parallel = true
+
+[report]
+include =
+    src/docket/cli/*
+    tests/cli/*
+exclude_also = ["\\.\\.\\.$"]

.coveragerc-core

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+[run]
+branch = true
+parallel = true
+omit =
+    src/docket/__main__.py
+    src/docket/_uuid7.py
+    src/docket/_prometheus_exporter.py
+    src/docket/_result_store.py
+    src/docket/cli/*.py
+    **/conftest.py
+    tests/_key_leak_checker.py
+    tests/_container.py
+    tests/cli/*.py
+
+[report]
+exclude_also = ["\\.\\.\\.$"]

.coveragerc-memory

Lines changed: 0 additions & 17 deletions
This file was deleted.

.github/codecov.yml

Lines changed: 9 additions & 0 deletions
@@ -1,3 +1,12 @@
+ignore:
+  - "src/docket/__main__.py"
+  - "src/docket/_uuid7.py"
+  - "src/docket/_prometheus_exporter.py"
+  - "src/docket/_result_store.py"
+  - "**/conftest.py"
+  - "tests/_key_leak_checker.py"
+  - "tests/_container.py"
+
 coverage:
   status:
     project:

.github/workflows/ci.yml

Lines changed: 45 additions & 6 deletions
@@ -11,7 +11,7 @@ permissions:

 jobs:
   test:
-    name: Test Python ${{ matrix.python-version }}, ${{ matrix.backend.name }}
+    name: Core Tests, Python ${{ matrix.python-version }}, ${{ matrix.backend.name }}
     runs-on: ubuntu-latest
     timeout-minutes: 4
     strategy:
@@ -54,15 +54,14 @@ jobs:
          - python-version: "3.14"
            cov-threshold: 100
            pytest-args: ""
-          # Memory backend: CLI tests are skipped via pytest skip markers because
-          # CLI rejects memory:// URLs. Use separate coverage config to exclude CLI.
-          # Coverage threshold is lower due to pytest-xdist coverage data combination issues.
+          # Memory backend: coverage threshold is lower due to pytest-xdist
+          # coverage data combination issues.
          - backend:
              name: "Memory (in-memory backend)"
              redis-version: "memory"
              redis-py-version: ">=5"
            cov-threshold: 95
-            pytest-args: "--cov-config=.coveragerc-memory"
+            pytest-args: ""
          # Debug hanging tests in 3.12 cluster mode
          - python-version: "3.12"
            backend:
@@ -129,14 +128,54 @@
      - name: Run tests
        env:
          REDIS_VERSION: ${{ matrix.backend.redis-version }}
-        run: uv run pytest --cov-branch --cov-fail-under=${{ matrix.cov-threshold }} --cov-report=xml --cov-report=term-missing:skip-covered ${{ matrix.pytest-args }}
+        run: uv run pytest --ignore=tests/cli --cov-branch --cov-config=.coveragerc-core --cov-fail-under=${{ matrix.cov-threshold }} --cov-report=xml --cov-report=term-missing:skip-covered ${{ matrix.pytest-args }}

      - name: Upload coverage reports to Codecov
        uses: codecov/codecov-action@v5
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          flags: python-${{ matrix.python-version }}

+  test-cli:
+    name: CLI Tests, Python ${{ matrix.python-version }}
+    runs-on: ubuntu-latest
+    timeout-minutes: 4
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
+    steps:
+      - uses: actions/checkout@v6
+
+      - name: Install uv and set Python version
+        uses: astral-sh/setup-uv@v7
+        with:
+          python-version: ${{ matrix.python-version }}
+          enable-cache: true
+          cache-dependency-glob: "pyproject.toml"
+
+      - name: Install dependencies
+        run: uv sync --upgrade-package 'redis>=5'
+
+      - name: Run CLI tests
+        env:
+          REDIS_VERSION: "8.0"
+        run: >
+          uv run pytest tests/cli/
+          -o 'addopts=--import-mode=importlib'
+          --numprocesses=logical --maxprocesses=4
+          --cov=src/docket/cli --cov=tests/cli
+          --cov-config=.coveragerc-cli
+          --cov-branch --cov-fail-under=100
+          --cov-report=xml --cov-report=term-missing:skip-covered
+          --timeout=30
+
+      - name: Upload coverage reports to Codecov
+        uses: codecov/codecov-action@v5
+        with:
+          token: ${{ secrets.CODECOV_TOKEN }}
+          flags: cli-python-${{ matrix.python-version }}
+
   prek:
     name: Prek checks
     runs-on: ubuntu-latest
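A rough sanity check on the savings from this split, using the 30-entry figure quoted in the commit message (the per-Python backend count of 6 is inferred from that figure, not stated in the diff):

```python
python_versions = ["3.10", "3.11", "3.12", "3.13", "3.14"]  # from the test-cli matrix
total_matrix_entries = 30  # figure quoted in the commit message
backends = total_matrix_entries // len(python_versions)  # inferred: 6 backend variants

cli_runs_before = len(python_versions) * backends  # CLI tests ran in every matrix entry
cli_runs_after = len(python_versions)  # now one CLI job per Python version
print(cli_runs_before, cli_runs_after)  # → 30 5
```

So each push runs the 91 subprocess-heavy CLI tests 5 times instead of 30, while the backend-specific jobs shed test time they did not need.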

loq.toml

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ path = "src/docket/worker.py"
 max_lines = 1126

 [[rules]]
-path = "src/docket/cli.py"
+path = "src/docket/cli/__init__.py"
 max_lines = 945

 [[rules]]

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -115,6 +115,7 @@ addopts = [
     "--cov=tests",
     "--cov-report=term-missing:skip-covered",
     "--cov-branch",
+    "--cov-fail-under=100",
     "--timeout=30",
     "--import-mode=importlib",
 ]
Lines changed: 6 additions & 6 deletions
@@ -19,8 +19,8 @@
 )
 from rich.table import Table

-from . import __version__, tasks
-from ._cli_support import (
+from docket import __version__, tasks
+from docket.cli._support import (
     LogLevel,
     LogFormat,
     default_worker_name,
@@ -36,10 +36,10 @@
     set_logging_level,
     validate_url,
 )
-from .docket import Docket, DocketSnapshot, WorkerInfo
-from .execution import ExecutionState
-from .strikelist import Operator
-from .worker import Worker
+from docket.docket import Docket, DocketSnapshot, WorkerInfo
+from docket.execution import ExecutionState
+from docket.strikelist import Operator
+from docket.worker import Worker


 app: typer.Typer = typer.Typer(
Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
 from rich.console import Console
 from rich.table import Table

-from .docket import DocketSnapshot, WorkerInfo
+from docket.docket import DocketSnapshot, WorkerInfo


 async def iterate_with_timeout(

tests/_container.py

Lines changed: 21 additions & 8 deletions
@@ -4,8 +4,10 @@
 including single-node Redis, Redis Cluster, and Valkey variants.
 """

+import fcntl
 import os
 import socket
+import tempfile
 import time
 from contextlib import contextmanager
 from datetime import datetime, timedelta, timezone
@@ -167,7 +169,13 @@ def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:


 def build_cluster_image(client: DockerClient, base_image: str) -> str:
-    """Build cluster image from base image, return image tag."""
+    """Build cluster image from base image, return image tag.
+
+    Uses a file lock to serialize builds across parallel xdist workers.
+    Without the lock, concurrent builds for the same tag race: one build
+    re-tags the image and the Docker SDK's post-build inspect on the
+    loser's image ID gets a 404.
+    """
     tag = f"docket-cluster:{base_image.replace('/', '-').replace(':', '-')}"

     try:
@@ -176,18 +184,23 @@ def build_cluster_image(client: DockerClient, base_image: str) -> str:
     except docker.errors.ImageNotFound:
         pass

-    cluster_dir = Path(__file__).parent / "cluster"
-    try:
+    lock_path = Path(tempfile.gettempdir()) / f"docket-{tag.replace(':', '-')}.lock"
+    with open(lock_path, "w") as lock_file:
+        fcntl.flock(lock_file, fcntl.LOCK_EX)
+
+        # Re-check after acquiring lock; another worker may have built it
+        try:
+            client.images.get(tag)
+            return tag
+        except docker.errors.ImageNotFound:
+            pass
+
+        cluster_dir = Path(__file__).parent / "cluster"
         client.images.build(
             path=str(cluster_dir),
             tag=tag,
             buildargs={"BASE_IMAGE": base_image},
         )
-    except docker.errors.BuildError as e:
-        if "AlreadyExists" in str(e):
-            pass
-        else:
-            raise
     return tag
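The tag and lock-path naming in `build_cluster_image` flattens the base image reference into strings that are safe as a Docker tag and a filename. A quick illustration of that derivation (the `valkey/valkey:8.0` base image here is a hypothetical example, not taken from the diff):

```python
import tempfile
from pathlib import Path

base_image = "valkey/valkey:8.0"  # hypothetical base image

# '/' and ':' in the base image become '-' so the result is a valid tag suffix
tag = f"docket-cluster:{base_image.replace('/', '-').replace(':', '-')}"

# the remaining ':' (tag separator) becomes '-' so the lock name is a valid filename
lock_path = Path(tempfile.gettempdir()) / f"docket-{tag.replace(':', '-')}.lock"

print(tag)             # → docket-cluster:valkey-valkey-8.0
print(lock_path.name)  # → docket-docket-cluster-valkey-valkey-8.0.lock
```

Because the lock path is derived from the tag, builds for different base images use different lock files and can still proceed in parallel; only builds of the same image serialize.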
