Commit cbc15a0

ci: parallelize unit:pytest:core with pytest-xdist (#56)
Reintroduces `-n auto` for the core test suite after diagnosing and
fixing the root cause of the earlier xdist race.

Root cause
----------
`StackManager.__init__` and `StackManager.deploy` both invoke
self-healing Lambda-package builders (`_ensure_lambda_build` and
`_rebuild_lambda_packages`) which `rm -rf` and repopulate the real
`lambda/kubectl-applier-simple-build/` tree. Those are correct
behaviors for `gco stacks deploy`, but during tests they race with
CDK's `Code.from_asset()` on other xdist workers — one worker's
mid-`rm -rf` window intersects another worker's `copyDirectory`,
producing the sporadic
`ENOENT ... lstat '...botocore/data/sagemaker'` failures we saw before.

A rm+pip-install cycle on the real tree is ~25 s, and tracing showed
every `test_deploy_*` plus every `StackManager(config)` construction
without a `project_root` kwarg was triggering it. At four workers on a
2-vCPU CI runner, that window intersects a CDK synth with
near-certainty.

Fix
---
A session-scoped autouse fixture in `tests/conftest.py`
(`_neutralize_lambda_build`) patches both methods to short-circuit when
`self.project_root` resolves to the real repo root. Tests that
intentionally exercise these methods against a `tmp_path` keep working
because the guard only skips real-root calls. The composite action
`.github/actions/build-lambda-package` handles the population in CI;
the existing `ensure_lambda_build_dirs` fixture covers local-dev.

Bumped hypothesis per-example deadlines on the four analytics property
tests that run full-app CDK synths
(`test_analytics_bucket_isolation_property`,
`test_analytics_cluster_shared_configmap_property`,
`test_analytics_configmap_property`,
`test_analytics_roundtrip_property`) to 20 s (and 10 s for the cheaper
ones that were at 5 s). CDK synth contention under xdist was pushing
the first (uncached) example over the old limit.

Verification
------------
Full suite with `-n auto --dist=load` on an 8-core local machine:

    3980 passed, 1 skipped in 353s (first run)
    3980 passed, 1 skipped in 320s (second run)

Both runs clean, no flakes.

Workflow change
---------------
`unit:pytest:core` now runs with `-n auto --dist=load --maxfail=1`.
`--maxfail=1` preserves the previous `-x` semantics (stop at first
failure); `-x` itself isn't compatible with xdist.
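The race described under "Root cause" can be reproduced in miniature with the standard library alone. The sketch below is only an illustration, not the project's code: `simulate_race`, the `build`/`staged` directory names, and the `copy_after_delete` hook are hypothetical, and `shutil.copytree` stands in for CDK's asset staging. The hook deletes the source file after the directory has been listed but before the file is copied, which is the window another worker's `rm -rf` would hit.

```python
import shutil
import tempfile
from pathlib import Path


def simulate_race() -> str:
    """Copy a tree while a simulated second worker deletes a source file mid-copy."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "build"      # hypothetical stand-in for the Lambda build tree
        src.mkdir()
        (src / "sagemaker.json").write_text("{}")
        dst = Path(tmp) / "staged"     # hypothetical stand-in for the staging destination

        def copy_after_delete(source: str, target: str) -> None:
            # Stand-in for the other worker's rm -rf landing between the
            # directory listing and the file copy.
            Path(source).unlink()
            shutil.copy2(source, target)

        try:
            shutil.copytree(src, dst, copy_function=copy_after_delete)
        except shutil.Error as exc:
            # copytree collects per-file OSErrors and raises them together.
            return str(exc)
        return "copied cleanly"


result = simulate_race()
print(result)
```

In the real suite the deletion is not deterministic, which is why the failure only showed up sporadically; the sketch forces the interleaving so the missing-file error appears on every run.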
1 parent 7c95341 commit cbc15a0

6 files changed

Lines changed: 91 additions & 20 deletions

.github/workflows/unit-tests.yml

Lines changed: 16 additions & 13 deletions
@@ -91,18 +91,20 @@ jobs:
           pip install -e ".[dev,mcp]"
       - uses: ./.github/actions/build-lambda-package
       - name: Run pytest with coverage
-        # Parallelism (pytest-xdist -n auto) was attempted here but the
-        # CDK-heavy stack tests (test_regional_stack, test_stacks, etc.)
-        # race on CDK's in-process asset-staging cache: two workers both
-        # stage ``lambda/kubectl-applier-simple-build`` into the shared
-        # ``cdk.out/asset.<hash>/`` destination and one hits ENOENT on the
-        # source mid-copy. Neither ``--dist=loadfile`` nor ``--dist=loadscope``
-        # fixes it because the race is cross-file (multiple test modules
-        # instantiate GCORegionalStack, which uses the same asset). Running
-        # serially with ``-x`` preserves correctness at the cost of the
-        # xdist wall-clock speedup. The two dedicated CDK jobs
-        # (unit:cdk:config-matrix + unit:cdk:nag-compliance) do benefit
-        # from parallelism and are wired for it in their own workflows.
+        # -n auto distributes tests across all available CPU cores via
+        # pytest-xdist. Every test is xdist-safe because
+        # tests/conftest.py::_neutralize_lambda_build patches
+        # StackManager._ensure_lambda_build and _rebuild_lambda_packages
+        # so tests can't rebuild the real lambda/kubectl-applier-simple-build
+        # tree mid-run — that rebuild is what CDK's Code.from_asset() races
+        # against when two workers synthesize stacks concurrently. The
+        # session-wide patch guards on `project_root` so the handful of
+        # tests that legitimately exercise these methods against a
+        # `tmp_path` keep working.
+        #
+        # --dist=load (xdist's default) round-robins individual test items
+        # across workers. --maxfail=1 matches the previous -x "stop at
+        # first failure" behavior — `-x` itself isn't compatible with xdist.
         run: |
           pytest tests/ -v \
             --ignore=tests/test_integration.py \
@@ -112,7 +114,8 @@ jobs:
             --cov-report=xml --cov-report=html --cov-report=json \
             --cov-report=term-missing \
             --cov-fail-under=90 \
-            --junitxml=report.xml -x
+            --junitxml=report.xml \
+            -n auto --maxfail=1
       - name: Upload coverage artifacts
         if: always()
         uses: actions/upload-artifact@v7

tests/conftest.py

Lines changed: 66 additions & 0 deletions
@@ -80,6 +80,72 @@ def ensure_lambda_build_dirs():
         shutil.rmtree(pycache)
 
 
+# ============================================================================
+# Session-scoped: neutralize StackManager's self-healing Lambda rebuild during tests
+# ============================================================================
+#
+# ``StackManager.__init__`` calls ``_ensure_lambda_build()`` (and its downstream
+# ``_build_kubectl_lambda``) as a self-healing step so any ``gco stacks
+# deploy`` succeeds even when a contributor's build tree is stale. That's the
+# right behavior at runtime, but it's destructive during tests:
+#
+# 1. ``_build_kubectl_lambda`` does ``_safe_rmtree(build_dir)`` on the *real*
+#    ``lambda/kubectl-applier-simple-build/`` whenever its guard (``yaml/``
+#    missing) trips.
+# 2. Under pytest-xdist, one worker's rebuild races with another worker's
+#    CDK ``Code.from_asset()`` mid-copy, producing the sporadic
+#    ``ENOENT: … lstat '…lambda/kubectl-applier-simple-build/botocore/data/…``
+#    failures we see on the 2-vCPU CI runner.
+# 3. Any test that mocks ``subprocess.run`` while constructing a
+#    ``StackManager`` can silently short-circuit the pip-install step and
+#    leave the build tree partially populated, which then trips the guard
+#    on the NEXT construction and cascades a rebuild.
+# 4. ``deploy()`` calls ``_rebuild_lambda_packages()`` which rm-trees and
+#    pip-installs into the real build dir even when ``_run_cdk`` is
+#    mocked — so every ``test_deploy_*`` hits the real filesystem too.
+#
+# Tests should never rebuild the *real* Lambda tree. The composite action
+# (``.github/actions/build-lambda-package``) populates it before pytest runs
+# in CI, and ``ensure_lambda_build_dirs`` above handles the local-dev case.
+# Patching ``_ensure_lambda_build`` and ``_rebuild_lambda_packages`` to skip
+# when ``project_root`` points at the real repo makes xdist safe; tests that
+# intentionally exercise these methods against a ``tmp_path`` keep working
+# because the guard lets them through.
+@pytest.fixture(scope="session", autouse=True)
+def _neutralize_lambda_build(ensure_lambda_build_dirs):  # noqa: ARG001 — dep order only
+    from cli import stacks as _stacks
+
+    real_root = PROJECT_ROOT.resolve()
+    orig_ensure = _stacks.StackManager._ensure_lambda_build
+    orig_rebuild = _stacks.StackManager._rebuild_lambda_packages
+
+    def _guarded_ensure(self):
+        try:
+            same = Path(self.project_root).resolve() == real_root
+        except OSError:
+            same = False
+        if same:
+            return
+        return orig_ensure(self)
+
+    def _guarded_rebuild(self):
+        try:
+            same = Path(self.project_root).resolve() == real_root
+        except OSError:
+            same = False
+        if same:
+            return
+        return orig_rebuild(self)
+
+    _stacks.StackManager._ensure_lambda_build = _guarded_ensure
+    _stacks.StackManager._rebuild_lambda_packages = _guarded_rebuild
+    try:
+        yield
+    finally:
+        _stacks.StackManager._ensure_lambda_build = orig_ensure
+        _stacks.StackManager._rebuild_lambda_packages = orig_rebuild
+
+
 # ============================================================================
 # Model Fixtures
 # ============================================================================
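The guard-on-`project_root` pattern used by `_neutralize_lambda_build` can be exercised in isolation. The following is a minimal sketch under stated assumptions: `FakeStackManager` and `guard_real_root` are hypothetical names invented for illustration (the commit patches the real `StackManager` directly and repeats the guard once per method), and two temporary directories stand in for the repo root and a test's `tmp_path`.

```python
import tempfile
from pathlib import Path


class FakeStackManager:
    """Hypothetical stand-in for StackManager; it only records rebuild calls."""

    rebuilt = []  # project roots for which the "rebuild" actually ran

    def __init__(self, project_root):
        self.project_root = project_root

    def _ensure_lambda_build(self):
        FakeStackManager.rebuilt.append(Path(self.project_root).resolve())


def guard_real_root(cls, method_name, real_root):
    """Patch cls.method_name to no-op for real_root; return the original method."""
    original = getattr(cls, method_name)

    def guarded(self):
        try:
            same = Path(self.project_root).resolve() == real_root
        except OSError:
            same = False
        if same:
            return None  # skip: never touch the real build tree
        return original(self)

    setattr(cls, method_name, guarded)
    return original


with tempfile.TemporaryDirectory() as repo, tempfile.TemporaryDirectory() as scratch:
    real_root = Path(repo).resolve()
    original = guard_real_root(FakeStackManager, "_ensure_lambda_build", real_root)
    try:
        FakeStackManager(repo)._ensure_lambda_build()     # skipped by the guard
        FakeStackManager(scratch)._ensure_lambda_build()  # passes through
    finally:
        # Restore the unpatched method, as the fixture does after `yield`.
        FakeStackManager._ensure_lambda_build = original
    scratch_root = Path(scratch).resolve()

print(FakeStackManager.rebuilt == [scratch_root])
```

Factoring the guard into one helper is just one way to avoid duplicating the `try`/`except` body; the fixture in the diff keeps two explicit closures, which is equally valid and arguably easier to read in a conftest.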

tests/test_analytics_bucket_isolation_property.py

Lines changed: 4 additions & 2 deletions
@@ -35,7 +35,9 @@
 combinations)`` strategy space is small enough that
 :func:`functools.cache` on
 ``(enabled, hyperpod, tuple(sorted(regions)))`` keeps the hot loop
-under the ``deadline=10000`` ms per-example budget.
+under the ``deadline=20000`` ms per-example budget, which also leaves
+headroom for the first (uncached) synth in each worker when the suite
+runs under pytest-xdist contention.
 ``max_examples=50`` with caching completes in under 90 s on the
 benchmark workstation.
 """
@@ -297,7 +299,7 @@ def setup_class(cls) -> None:
 
     @settings(
         max_examples=50,
-        deadline=10000,
+        deadline=20000,
         suppress_health_check=[
             HealthCheck.too_slow,
             HealthCheck.function_scoped_fixture,
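The docstring's caching argument, a `functools.cache` keyed on `(enabled, hyperpod, tuple(sorted(regions)))`, can be sketched without CDK. In the example below, `synth_template` is a hypothetical stand-in for the expensive full-app synth and `cached_synth` is an assumed wrapper name; neither appears in the commit. The counter shows that only the first call per distinct key pays the synth cost, which is the call the raised deadline has to accommodate. Since each xdist worker is a separate process, each worker would pay that first-call cost on its own.

```python
import functools

synth_calls = 0  # counts how often the expensive step really runs


@functools.cache
def synth_template(enabled: bool, hyperpod: bool, regions: tuple) -> dict:
    """Hypothetical stand-in for a full-app CDK synth (the expensive step)."""
    global synth_calls
    synth_calls += 1
    return {"enabled": enabled, "hyperpod": hyperpod, "regions": regions}


def cached_synth(enabled: bool, hyperpod: bool, regions: list) -> dict:
    # Normalize the region list so ordering does not create new cache keys.
    return synth_template(enabled, hyperpod, tuple(sorted(regions)))


first = cached_synth(True, False, ["us-west-2", "us-east-1"])
second = cached_synth(True, False, ["us-east-1", "us-west-2"])  # same key, cache hit
third = cached_synth(False, False, ["us-east-1"])               # new key, cache miss

print(synth_calls)
```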

tests/test_analytics_cluster_shared_configmap_property.py

Lines changed: 2 additions & 2 deletions
@@ -257,7 +257,7 @@ def setup_class(cls) -> None:
 
     @settings(
         max_examples=50,
-        deadline=5000,
+        deadline=10000,
         suppress_health_check=[
             HealthCheck.too_slow,
             HealthCheck.function_scoped_fixture,
@@ -335,7 +335,7 @@ class TestComputeKubectlClusterSharedReplacementsRoundTrip:
 
     @settings(
         max_examples=50,
-        deadline=5000,
+        deadline=10000,
         suppress_health_check=[
             HealthCheck.too_slow,
             HealthCheck.function_scoped_fixture,

tests/test_analytics_configmap_property.py

Lines changed: 2 additions & 2 deletions
@@ -29,7 +29,7 @@
 
 ## Runtime budget
 
-``max_examples=20, deadline=10000`` keeps the test under ~2 min even
+``max_examples=20, deadline=20000`` keeps the test under ~2 min even
 without caching. With :func:`functools.cache` keyed on ``enabled``
 (cardinality 2) the hot loop reuses one cached synth per toggle value
 and completes in ~15 s.
@@ -203,7 +203,7 @@ def setup_class(cls) -> None:
 
     @settings(
         max_examples=20,
-        deadline=10000,
+        deadline=20000,
         suppress_health_check=[
             HealthCheck.too_slow,
             HealthCheck.function_scoped_fixture,

tests/test_analytics_roundtrip_property.py

Lines changed: 1 addition & 1 deletion
@@ -98,7 +98,7 @@ def setup_class(cls) -> None:
 
     @settings(
         max_examples=4,
-        deadline=10000,
+        deadline=20000,
         suppress_health_check=[
             HealthCheck.too_slow,
             HealthCheck.function_scoped_fixture,
