refactor(BA-5858): Reuse a long-lived aiodocker client across container operations#11226
refactor(BA-5858): Reuse a long-lived aiodocker client across container operations#11226rapsealk wants to merge 5 commits into
Conversation
…rations Own a single `aiodocker.Docker` instance on `DockerAgent` and pass it to `DockerKernelCreationContext`. Per-op sites — `apply_accelerator_allocation`, `start_container`, `get_intrinsic_mounts`, `destroy_kernel`, `clean_kernel`, `extract_image_command`, `enumerate_containers`, `scan_images`, `push_image`, `pull_image`, `purge_images`, `check_image`, `create_local_network`, `destroy_local_network`, and `resolve_image_distro` — now use the shared client instead of opening/closing a fresh `Docker()` per call. A typical kernel start now uses one connection instead of ~5-10 TCP/UDS setups. The long-running `monitor_docker_events()` keeps its own dedicated subscription client since it re-establishes on connection drops. Closes #11218 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR reduces Docker daemon connection churn by creating and reusing a single long-lived aiodocker.Docker client in DockerAgent, and threading it through container/mount/accelerator/image operations that previously opened a new client per call.
Changes:
- Add a shared
self.docker: DockeronDockerAgentand pass it down intoDockerKernelCreationContext. - Replace multiple
async with closing_async(Docker())call sites withself.dockerusage across container start, accelerator arg generation, image operations, and cleanup paths. - Add a changelog entry describing the connection reuse.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/ai/backend/agent/docker/agent.py |
Introduces/propagates a shared Docker client and refactors numerous call sites to reuse it. |
changes/11218.enhance.md |
Documents the behavior change (reuse long-lived aiodocker client). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| self.docker = Docker() | ||
| docker = self.docker | ||
| docker_host = "" | ||
| match docker.connector: | ||
| case aiohttp.TCPConnector(): | ||
| if docker.docker_host is None: | ||
| raise InvalidArgumentError("docker_host is not set for TCP connector") | ||
| docker_host = docker.docker_host | ||
| case aiohttp.NamedPipeConnector() | aiohttp.UnixConnector() as connector: | ||
| docker_host = connector.path | ||
| case _: | ||
| docker_host = "(unknown)" | ||
| log.info("accessing the local Docker daemon via {}", docker_host) | ||
| docker_version = await docker.version() | ||
| log.info( |
There was a problem hiding this comment.
self.docker = Docker() is now created without an async with/closing_async guard. If any of the subsequent awaits in __ainit__ (e.g., docker.version() / docker.system.info() / recovery adapt) raises, the aiohttp session inside the Docker client will not be closed, leading to unclosed-session warnings and a small resource leak on startup failure. Consider wrapping the initialization block in try/except (or AsyncExitStack) and await self.docker.close() on failure before re-raising.
| image_info = await self.docker.images.inspect(image_ref.canonical) | ||
| if auto_pull == AutoPullBehavior.DIGEST: | ||
| if image_info["Id"] != image_id: | ||
| return True |
There was a problem hiding this comment.
check_image() now uses the long-lived self.docker client. The existing component tests (e.g., tests/component/agent/docker/test_agent.py) patch ai.backend.agent.docker.agent.Docker expecting check_image() to instantiate a new client per call; those patches will no longer affect this method and will cause the tests to hit the real Docker client. Update the tests to patch agent.docker / agent.docker.images.inspect (or patch Docker before creating the agent) so the behavior stays isolated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The auto-pull tests patched the module-level `Docker` constructor, but `DockerAgent.check_image()` now reads from `self.docker`, which is already constructed during the `agent` fixture setup. Patch `agent.docker.images.inspect` directly so the shared-client path is what gets exercised and inspected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ocker Addresses review feedback on PR #11228: - Add a pickle round-trip test asserting __getstate__ excludes _docker, locking the RPC-marshalling invariant for future contributors. - Document on DockerKernel that _docker is local-only and must be re-attached by the agent via attach_docker(). - Warn (but still overwrite) when attach_docker is called with a connector different from the currently-attached one; same-connector reattach remains silent. - Align the self.docker type annotation on DockerAgent with the one introduced in #11226 (no-op if already present). Refs #11227 Refs #11228 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…boot failure Addresses review feedback on PR #11226: - Wrap shutdown() so self.docker.close() runs in a finally block, preventing an aiohttp.ClientSession leak if super().shutdown() raises. - Release the shared Docker client if __ainit__ raises after the session is constructed. - Add regression tests asserting the session is closed both on successful shutdown and when inner shutdown raises. Refs #11218 Refs #11226 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Review summary + tradeoff notes. Resolutions pushed to this branch. Verdict: ship with changes — addressed.
Tradeoffs vs. the original What the original gave up that this PR lost: automatic cleanup (no explicit lifecycle to get wrong — the shutdown-leak bug only exists because we now own it), fault isolation between ops, transparent daemon-restart recovery (shared What the new pattern wins: ~5–10× fewer connections per kernel start (TCP/Unix-socket handshake + HTTP parser + connector init amortized over agent lifetime), keep-alive socket reuse, lower resource footprint. Net: favorable for a long-lived agent. |
- Narrow `except BaseException` to `except Exception` in __ainit__ cleanup: don't intercept KeyboardInterrupt/SystemExit propagation. - Drop the impossible `if self.docker is not None` guard in shutdown() (self.docker is non-Optional and set before any await in __ainit__). - Remove the ornamental docker parameter on _purge_image; its sole caller already passes self.docker. Refs #11218 Refs #11226 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #11218 (BA-5858)
Refs #11216
Summary
aiodocker.Dockeron the agent and thread it into all per-op sites (start_container,apply_accelerator_allocation,generate_accelerator_configs, recovery/inspect, image / container / network ops).async with closing_async(Docker())removed).shutdown()wraps its body intry/finallyfor unconditionaldocker.close()— a raisingsuper().shutdown()no longer leaks the aiohttp session.__ainit__boot-failure path releases the client if startup raises (narrowed toexcept Exception:to letKeyboardInterrupt/SystemExitpropagate).monitor_docker_events()intentionally keeps its ownclosing_async(Docker())— needs per-op isolation for reconnect semantics.Why
Part of epic #11216. A shared client amortizes TCP/Unix-socket + HTTP parser + connector init across the agent's lifetime and lets aiohttp reuse keepalive sockets across ops.
Known trade-offs (tracked as follow-ups)
aiohttp.ClientConnectionErroron the first post-restart call. Follow-up: Retry once on stale-connection errors from shared aiodocker client after dockerd restart #11233 / PR refactor(BA-5862): Retry once on stale aiodocker connections after dockerd restart #11235.src/ai/backend/accelerator/*) still instantiate their ownDocker()insideinit(). Deferred under epic Improve agent container resource-mounting performance (aiodocker + runtime layer) #11216.Test plan
pants test tests/component/agent/docker::passes (shutdown-lifecycle tests + existing image-pull / purge coverage)agent.shutdown(); confirm noUnclosed client sessionwarningsuper().shutdown(); confirm the client is still closed