Skip to content

refactor(iris): replace LocalController with LocalCluster, decouple from Platform#3632

Merged
rjpower merged 2 commits intomainfrom
multi/df925d72
Mar 13, 2026
Merged

refactor(iris): replace LocalController with LocalCluster, decouple from Platform#3632
rjpower merged 2 commits intomainfrom
multi/df925d72

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Mar 13, 2026

Summary

  • Break circular dependency between controller/local.pyplatform/local.py by moving LocalController to a new top-level local_cluster.py module (renamed LocalCluster)
  • LocalPlatform no longer owns controller lifecycle — callers use LocalCluster directly for local mode
  • Fix stop()close() in several callers to prevent _db_dir temp directory leaks

Changes

File Change
cluster/local_cluster.py NewLocalCluster + create_local_autoscaler + make_local_cluster_config
cluster/controller/local.py Deleted — replaced by local_cluster.py
cluster/platform/local.py Remove start/stop/restart_controller, wait_for_controller, auto_login_token; inline stop_all
cluster/manager.py connect_cluster() detects local mode and uses LocalCluster directly
cli/cluster.py cluster start uses LocalCluster for local mode
cli/main.py CLI setup uses LocalCluster; fix stop()close()
client/client.py Import from new location; fix stop()close()
Tests (4 files) Update imports LocalControllerLocalCluster; fix stop()close()

Test plan

  • All 1276 unit tests pass (uv run pytest lib/iris/tests/ -m "not e2e")
  • Pre-commit hooks pass (./infra/pre-commit.py --all-files --fix)
  • E2E tests (require cluster)

Closes #3585

🤖 Generated with Claude Code

…rom Platform

Break the circular dependency between controller/local.py and platform/local.py
by moving LocalController to a new top-level local_cluster.py module (renamed
to LocalCluster). LocalPlatform no longer owns controller lifecycle — callers
use LocalCluster directly for local mode.

- New: lib/iris/src/iris/cluster/local_cluster.py (LocalCluster + create_local_autoscaler)
- Delete: lib/iris/src/iris/cluster/controller/local.py
- Simplify: LocalPlatform — remove start/stop/restart_controller, wait_for_controller, auto_login_token
- Update: connect_cluster(), CLI cluster start, CLI main, IrisClient, all tests
- Fix: stop() → close() in several callers to prevent _db_dir temp dir leaks

Closes #3585

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rjpower rjpower added the agent-generated Created by automation/agent label Mar 13, 2026
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Mar 13, 2026

Claude finished @rjpower's task in 1m 30s —— View job


PR Review: refactor(iris): replace LocalController with LocalCluster, decouple from Platform

  • Read changed files and understand architecture
  • Analyze correctness and potential issues
  • Post review feedback

Overall: Clean refactor that breaks a real circular dependency and improves resource cleanup. The rename from LocalControllerLocalCluster is well-motivated since the class manages both Controller and Autoscaler+Workers. The stop() vs close() distinction (stop = pause, close = full teardown including DB dir) is clear and well-documented.

Issues Found

1. _db_dir leak in test_demo_notebook_submit.py (bug)

lib/iris/tests/examples/test_demo_notebook_submit.py:40 — The demo_client fixture calls controller.stop() instead of controller.close(). This leaks the _db_dir TemporaryDirectory. The PR description explicitly lists fixing stop()close() as a goal, but this call site was missed.

2. _db_dir leak in test_local_cluster.py (bug)

lib/iris/tests/cli/test_local_cluster.py:145 — Same issue: the finally block calls controller.stop() instead of controller.close(). Since this test doesn't exercise restart semantics, close() is the right call here.

3. Stale doc reference to LocalController

lib/iris/docs/testing-design.md still references LocalController in two places (lines 10 and 128). Should be updated to LocalCluster for consistency.

Suggestions (non-blocking)

  • LocalCluster could implement __enter__/__exit__ (calling start()/close()) to make context-manager usage less error-prone — though this can come later.

What looks good

  • The module docstring at the top of local_cluster.py clearly explains why it lives outside controller/ and platform/
  • close() wrapping stop() + _db_dir.cleanup() is a clean two-level teardown
  • manager.py correctly uses close() in the finally of connect_cluster()
  • cli/main.py correctly registers cluster.close via ctx.call_on_close
  • client.py correctly calls close() in shutdown()
  • Test infrastructure (E2ECluster.__exit__, _docker_cluster.py) correctly uses close() for LocalCluster and stop() for raw Controller

@rjpower rjpower requested a review from yonromai March 13, 2026 22:40
@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Mar 13, 2026

@claude fix the review issues

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Mar 13, 2026

Claude finished @rjpower's task in 2m 2s —— View job


Fixed review issues

  • Fix stop()close() in test_demo_notebook_submit.py — prevents _db_dir temp directory leak
  • Fix stop()close() in test_local_cluster.py — same fix
  • Update LocalControllerLocalCluster in testing-design.md (2 references)
  • Pre-commit passes (./infra/pre-commit.py --all-files --fix)
  • Tests pass (4/4 in test_demo_notebook_submit.py and test_local_cluster.py)
  • Committed and pushed

View job

…ller refs

Fix _db_dir temp directory leaks in test_demo_notebook_submit.py and
test_local_cluster.py by using close() instead of stop(). Update stale
LocalController references in testing-design.md to LocalCluster.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rjpower rjpower enabled auto-merge (squash) March 13, 2026 22:53
Copy link
Copy Markdown
Contributor

@yonromai yonromai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This refactor matches #3585's main goal: it cleanly extracts local-mode orchestration into LocalCluster and removes the circular dependency between LocalPlatform and the old LocalController.

Specification check: the module split and call-site updates follow the issue's proposed architecture, and the earlier stop() -> close() misses were fixed. The remaining regression is that the new LocalCluster(config) entry points no longer preserve the old thread-container isolation, so shutting down a local cluster can now stop unrelated managed threads in the same process.

Generated with Codex

if is_local:
from iris.cluster.local_cluster import LocalCluster

cluster = LocalCluster(config)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This changes local-mode thread ownership in a way that can stop unrelated managed threads in the same process. Before this PR, connect_cluster() went through LocalPlatform.start_controller(), which passed the platform's private ThreadContainer into LocalController. Now LocalCluster(config) is created with threads=None, so Controller/Autoscaler fall back to the ambient default container via get_thread_container(). On shutdown, Controller.stop() -> Autoscaler.shutdown() calls self._threads.stop(), which now stops unrelated threads sharing that container.

I reproduced this directly with a thread_container_scope("probe") probe: on this PR, an unrelated thread is dead after with connect_cluster(...):; on origin/main it stays alive. iris cluster start --local has the same regression for the same reason.

Recommended fix: keep the new LocalCluster split, but preserve the old isolation by creating an explicit ThreadContainer/platform-owned container at these new call sites and passing it into LocalCluster(threads=...).

Generated with Codex

@rjpower rjpower merged commit 34594b2 into main Mar 13, 2026
21 of 22 checks passed
@rjpower rjpower deleted the multi/df925d72 branch March 13, 2026 22:55
@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Mar 13, 2026

@claude can you fix yonromai's issue in a separate PR if you think it's appropraite. turn on automerge, tag yonromai for review

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Mar 13, 2026

Claude finished @rjpower's task in 6m 20s —— View jobclaude/pr-3632-20260313-2305 • [Create PR ➔](https://github.com/marin-community/marin/compare/main...claude/pr-3632-20260313-2305?quick_pull=1&title=PR%20%233632%3A%20Changes%20from%20Claude&body=This%20PR%20addresses%20pr%20%233632%0A%0AGenerated%20with%20%5BClaude%20Code%5D(https%3A%2F%2Fclaude.ai%2Fcode)


Fix thread container isolation regression in LocalCluster

  • Read relevant source files to understand the thread container pattern
  • Identify all call sites that create LocalCluster without explicit ThreadContainer
    • manager.py:36connect_cluster()
    • cli/main.py:119require_controller_url()
    • cli/cluster.py:251cluster_start()
    • client/client.py:524IrisClient.local()
  • Implement fix: pass dedicated ThreadContainer at each call site
  • Run pre-commit and tests — all 1274 unit tests pass, pre-commit clean
  • Create PR, enable automerge, request review from yonromai

Created #3637 with automerge enabled and @yonromai requested for review.

Fix: Each LocalCluster call site now creates a dedicated ThreadContainer("local-cluster") so that stopping a local cluster only stops threads it owns, not unrelated threads sharing the ambient default container.

claude bot pushed a commit that referenced this pull request Mar 13, 2026
…ites

LocalCluster created with threads=None falls back to the ambient default
ThreadContainer. On shutdown, stopping the controller/autoscaler would
stop unrelated threads sharing that container. Fix by creating a dedicated
ThreadContainer("local-cluster") at each call site.

Fixes the regression identified in #3632 review.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
ruili33 pushed a commit that referenced this pull request Mar 25, 2026
…rom Platform (#3632)

- Break circular dependency between `controller/local.py` ↔
`platform/local.py` by moving `LocalController` to a new top-level
`local_cluster.py` module (renamed `LocalCluster`)
- `LocalPlatform` no longer owns controller lifecycle — callers use
`LocalCluster` directly for local mode
- Fix `stop()` → `close()` in several callers to prevent `_db_dir` temp
directory leaks

## Changes

| File | Change |
|------|--------|
| `cluster/local_cluster.py` | **New** — `LocalCluster` +
`create_local_autoscaler` + `make_local_cluster_config` |
| `cluster/controller/local.py` | **Deleted** — replaced by
`local_cluster.py` |
| `cluster/platform/local.py` | Remove `start/stop/restart_controller`,
`wait_for_controller`, `auto_login_token`; inline `stop_all` |
| `cluster/manager.py` | `connect_cluster()` detects local mode and uses
`LocalCluster` directly |
| `cli/cluster.py` | `cluster start` uses `LocalCluster` for local mode
|
| `cli/main.py` | CLI setup uses `LocalCluster`; fix `stop()` →
`close()` |
| `client/client.py` | Import from new location; fix `stop()` →
`close()` |
| Tests (4 files) | Update imports `LocalController` → `LocalCluster`;
fix `stop()` → `close()` |
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor local cluster setup: replace LocalController with flat orchestrator, decouple Platform from controller lifecycle

2 participants