Skip to content

[iris] Stop controller-test RemoteLogHandler leak across test session#5045

Merged
rjpower merged 3 commits intomainfrom
rjpower/controller-test-log-handler-leak
Apr 22, 2026
Merged

[iris] Stop controller-test RemoteLogHandler leak across test session#5045
rjpower merged 3 commits intomainfrom
rjpower/controller-test-log-handler-leak

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Apr 22, 2026

Controller.init attaches a RemoteLogHandler to the iris logger and spawns a LogPusher drain thread; tests that constructed a Controller without calling stop() leaked both across the session, so later tests that monkeypatched LogServiceClientSync could receive the stale backlog (seen as flaky CI failures with tens of KB of protobuf repr). Introduces a shared make_controller fixture that tracks and stops every controller at teardown, migrates per-file helpers to use it, and makes Controller.stop() idempotent. Also caps pytest assertion truncation so a future leak cannot flood CI logs again.

Controller.__init__ attaches a RemoteLogHandler to the iris logger and
spawns a LogPusher drain thread; tests that constructed a Controller
without calling stop() leaked both across the session, corrupting
later tests that monkeypatch LogServiceClientSync. Introduce a shared
make_controller fixture that tracks and stops every controller at
teardown, migrate the per-file helpers, and make Controller.stop()
idempotent. Also cap pytest assertion truncation so a future leak can
no longer flood CI logs with protobuf reprs.
@rjpower rjpower added the agent-generated Created by automation/agent label Apr 22, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 22, 2026

Claude finished @rjpower's task in 4m 21s —— View job


Review

Overall: This is a solid, well-scoped fix for a real test-isolation bug. The shared make_controller fixture + stop() idempotency is the right design, and capping pytest's diff truncation as a CI guardrail is a nice belt-and-braces touch. A few correctness/cleanliness items worth addressing before merge:


1. Vestigial controller._db.close() in test_checkpoint.py slows the suite

test_checkpoint.py still calls controller._db.close() inside six test bodies (lines 34, 47, 62, 104, 126, 216), even though the fixture's teardown now calls Controller.stop(), which itself closes the DB. The double-close is not a correctness bug, but ControllerDB.close() (lib/iris/src/iris/cluster/controller/db.py:442) tries to drain _read_pool (size 32) with get(timeout=1) and only breaks once it sees Empty. On the second close the pool is already empty, so it blocks for ~1 second per redundant close before bailing out — adding ~6 s of pointless wait per test session.

Either remove the explicit _db.close() calls in test_checkpoint.py, or make ControllerDB.close() idempotent (track a _closed flag and short-circuit, mirroring what you just did for Controller.stop). The first is the simpler change.

Fix this →

2. Bump pytest>=8.4 for the new ini options

truncation_limit_lines / truncation_limit_chars were introduced in pytest 8.4. The constraint in lib/iris/pyproject.toml is still pytest>=8.3.2 (line 66). Today's lockfile resolves to 8.4.2 so this works, but a fresh resolve onto 8.3.x would silently warn and ignore both keys — defeating the CI-protection rationale in the docstring above the keys. Bump the spec to pytest>=8.4.

3. Initialize self._stopped in __init__ rather than via getattr

In controller.py:1290:

if getattr(self, "_stopped", False):
    return
self._stopped = True

The getattr form gracefully handles a partial-init case (a stop() triggered before the attribute was set), but for normal flow it trips type-checkers and obscures the attribute's existence. Setting self._stopped = False early in __init__ (before any side-effecting setup) is more explicit and equally robust against partial init. Minor stylistic call — fine to keep as-is if preferred for the partial-init safety property.

4. Fixture teardown stops on first failure

In conftest.py:245-246:

yield _factory
for controller in created:
    controller.stop()

If one controller's stop() raises, subsequent controllers in created are not stopped — exactly the leak this PR is trying to prevent. Tests today only construct one controller each, so this is defensive, but worth wrapping:

yield _factory
errors = []
for controller in created:
    try:
        controller.stop()
    except Exception as exc:
        errors.append(exc)
if errors:
    raise errors[0]

5. make_controller docstring doesn't show the db= form

The docstring example only shows make_controller(remote_state_dir=...), but test_heartbeat.py uses make_controller(remote_state_dir=..., local_state_dir=tmp_path, db=db). A one-line note in the usage block ("pass db= to inject a pre-built ControllerDB") would help future readers.

6. Import ordering nit in conftest.py

The new line from iris.cluster.controller.controller import Controller, ControllerConfig is inserted at line 50, between iris.cluster.controller.schema and iris.cluster.controller.provider, which is out of alphabetical order. The pre-existing import block isn't strictly sorted either, so isort/ruff likely won't flag it — purely cosmetic.


Things that look good

  • Controller.stop() idempotency is the right fix for the test_dry_run case where test_dry_run_controller_starts_and_stops calls stop() itself, then teardown calls it again.
  • Removing the local tempfile/shutil.rmtree boilerplate from test_scheduling_fairness.py and the per-file factories in test_reservation.py / test_checkpoint.py is a nice deduplication.
  • Switching the tracked_log_service_client assertion in test_remote_log_handler.py to length-only checks is exactly right for "guard against the log spam, but still produce a useful message."

TL;DR

The mechanism is correct. Address (1) — the redundant _db.close() calls — and (2) — the pytest version bump — before merging; (3)–(6) are optional polish.

  • Read PR diff and changed files
  • Examine make_controller fixture and Controller.stop() idempotency
  • Verify migrated test files still cover their original behavior
  • Check pytest assertion truncation cap
  • Post review
    Branch

@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Apr 22, 2026

@claude please address PR comments

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 22, 2026

Claude finished @rjpower's task in 9m 7s —— View job


Addressed all PR review comments

Pushed as 8beb55b41 on rjpower/controller-test-log-handler-leak.

  • (1) Removed six redundant controller._db.close() calls in test_checkpoint.py (lines 34, 47, 62, 104, 126, 216). The make_controller fixture already stops the controller at teardown, which closes the DB; each double-close was blocking ~1s on the read-pool drain.
  • (2) Bumped pytest>=8.4 in lib/iris/pyproject.toml (updated uv.lock with a minimal specifier-only edit — no unrelated dependency re-resolution noise) so truncation_limit_lines / truncation_limit_chars aren't silently dropped on fresh resolves.
  • (3) Noted — self._stopped = False in Controller.__init__ was already in place from commit 20d9d20 before the review landed.
  • (4) Hardened make_controller teardown in conftest.py — it now collects errors from every controller.stop() call and raises the first one after stopping all controllers, so one failure can't leak the rest.
  • (5) Expanded the make_controller docstring to document the db= and provider= injection forms with a concrete usage snippet.
  • (6) Alphabetized the iris.cluster.controller.* imports in conftest.py.

Testing

  • ./infra/pre-commit.py --all-files --fix — OK (all 18 hooks clean).
  • uv run pytest tests/cluster/controller/test_checkpoint.py tests/cluster/controller/test_dry_run.py tests/cluster/controller/test_heartbeat.py tests/cluster/controller/test_reservation.py tests/cluster/controller/test_scheduling_fairness.py tests/test_remote_log_handler.py --timeout=30 -p no:xdist -o addopts= from lib/iris109 passed in 22.43s (covers every file touched in this PR plus the remote-log-handler suite that motivated the fix).
    Branch

…en teardown

- Remove six vestigial controller._db.close() calls in test_checkpoint.py now
  that the make_controller fixture stops the controller (which closes the DB)
  at teardown. Each redundant close blocked for ~1s draining the read pool.
- Bump pytest>=8.4 in lib/iris dev deps so truncation_limit_lines/chars aren't
  silently dropped on fresh resolves that pick 8.3.x.
- Make the make_controller teardown robust to per-controller stop() failures:
  collect errors, stop every controller, then re-raise.
- Expand the make_controller docstring to cover db= / provider= injection.
- Alphabetize the iris.cluster.controller.* import block in conftest.py.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
@rjpower rjpower requested a review from yonromai April 22, 2026 22:37
@rjpower rjpower enabled auto-merge (squash) April 22, 2026 22:37
@rjpower rjpower merged commit 2dfccb1 into main Apr 22, 2026
41 of 42 checks passed
@rjpower rjpower deleted the rjpower/controller-test-log-handler-leak branch April 22, 2026 22:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants