Skip to content

refactor(BA-5858): Reuse a long-lived aiodocker client across container operations#11226

Open
rapsealk wants to merge 5 commits into
mainfrom
perf/11218-docker-client-pool
Open

refactor(BA-5858): Reuse a long-lived aiodocker client across container operations#11226
rapsealk wants to merge 5 commits into
mainfrom
perf/11218-docker-client-pool

Conversation

@rapsealk
Copy link
Copy Markdown
Member

@rapsealk rapsealk commented Apr 22, 2026

Closes #11218 (BA-5858)
Refs #11216

Summary

  • Own a single aiodocker.Docker on the agent and thread it into all per-op sites (start_container, apply_accelerator_allocation, generate_accelerator_configs, recovery/inspect, image / container / network ops).
  • Kernel start now opens ~1 connection instead of ~5-10 (per-op async with closing_async(Docker()) removed).
  • shutdown() wraps its body in try/finally for unconditional docker.close() — a raising super().shutdown() no longer leaks the aiohttp session.
  • __ainit__ boot-failure path releases the client if startup raises (narrowed to except Exception: to let KeyboardInterrupt / SystemExit propagate).
  • monitor_docker_events() intentionally keeps its own closing_async(Docker()) — needs per-op isolation for reconnect semantics.

Why

Part of epic #11216. A shared client amortizes TCP/Unix-socket + HTTP parser + connector init across the agent's lifetime and lets aiohttp reuse keepalive sockets across ops.

Known trade-offs (tracked as follow-ups)

Test plan

  • pants test tests/component/agent/docker:: passes (shutdown-lifecycle tests + existing image-pull / purge coverage)
  • Start a kernel, confirm it launches end-to-end
  • Run agent.shutdown(); confirm no Unclosed client session warning
  • Simulate a raising super().shutdown(); confirm the client is still closed

…rations

Own a single `aiodocker.Docker` instance on `DockerAgent` and pass it to
`DockerKernelCreationContext`. Per-op sites — `apply_accelerator_allocation`,
`start_container`, `get_intrinsic_mounts`, `destroy_kernel`, `clean_kernel`,
`extract_image_command`, `enumerate_containers`, `scan_images`, `push_image`,
`pull_image`, `purge_images`, `check_image`, `create_local_network`,
`destroy_local_network`, and `resolve_image_distro` — now use the shared
client instead of opening/closing a fresh `Docker()` per call. A typical
kernel start now uses one connection instead of ~5-10 TCP/UDS setups.

The long-running `monitor_docker_events()` keeps its own dedicated
subscription client since it re-establishes on connection drops.

Closes #11218

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk rapsealk added this to the 26.5 milestone Apr 22, 2026
Copilot AI review requested due to automatic review settings April 22, 2026 04:34
@github-actions github-actions Bot added size:XL 500~ LoC comp:agent Related to Agent component labels Apr 22, 2026
@rapsealk rapsealk changed the title enhance(agent): reuse a long-lived aiodocker client across container operations refactor(agent): Reuse a long-lived aiodocker client across container operations Apr 22, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces Docker daemon connection churn by creating and reusing a single long-lived aiodocker.Docker client in DockerAgent, and threading it through container/mount/accelerator/image operations that previously opened a new client per call.

Changes:

  • Add a shared self.docker: Docker on DockerAgent and pass it down into DockerKernelCreationContext.
  • Replace multiple async with closing_async(Docker()) call sites with self.docker usage across container start, accelerator arg generation, image operations, and cleanup paths.
  • Add a changelog entry describing the connection reuse.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/ai/backend/agent/docker/agent.py Introduces/propagates a shared Docker client and refactors numerous call sites to reuse it.
changes/11218.enhance.md Documents the behavior change (reuse long-lived aiodocker client).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/ai/backend/agent/docker/agent.py Outdated
Comment on lines +1483 to +1497
self.docker = Docker()
docker = self.docker
docker_host = ""
match docker.connector:
case aiohttp.TCPConnector():
if docker.docker_host is None:
raise InvalidArgumentError("docker_host is not set for TCP connector")
docker_host = docker.docker_host
case aiohttp.NamedPipeConnector() | aiohttp.UnixConnector() as connector:
docker_host = connector.path
case _:
docker_host = "(unknown)"
log.info("accessing the local Docker daemon via {}", docker_host)
docker_version = await docker.version()
log.info(
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.docker = Docker() is now created without an async with/closing_async guard. If any of the subsequent awaits in __ainit__ (e.g., docker.version() / docker.system.info() / recovery adapt) raises, the aiohttp session inside the Docker client will not be closed, leading to unclosed-session warnings and a small resource leak on startup failure. Consider wrapping the initialization block in try/except (or AsyncExitStack) and await self.docker.close() on failure before re-raising.

Copilot uses AI. Check for mistakes.
Comment on lines +1954 to +1957
image_info = await self.docker.images.inspect(image_ref.canonical)
if auto_pull == AutoPullBehavior.DIGEST:
if image_info["Id"] != image_id:
return True
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_image() now uses the long-lived self.docker client. The existing component tests (e.g., tests/component/agent/docker/test_agent.py) patch ai.backend.agent.docker.agent.Docker expecting check_image() to instantiate a new client per call; those patches will no longer affect this method and will cause the tests to hit the real Docker client. Update the tests to patch agent.docker / agent.docker.images.inspect (or patch Docker before creating the agent) so the behavior stays isolated.

Copilot uses AI. Check for mistakes.
rapsealk and others added 2 commits April 22, 2026 14:18
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The auto-pull tests patched the module-level `Docker` constructor, but
`DockerAgent.check_image()` now reads from `self.docker`, which is
already constructed during the `agent` fixture setup. Patch
`agent.docker.images.inspect` directly so the shared-client path is what
gets exercised and inspected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rapsealk added a commit that referenced this pull request Apr 22, 2026
…ocker

Addresses review feedback on PR #11228:
- Add a pickle round-trip test asserting __getstate__ excludes _docker,
  locking the RPC-marshalling invariant for future contributors.
- Document on DockerKernel that _docker is local-only and must be
  re-attached by the agent via attach_docker().
- Warn (but still overwrite) when attach_docker is called with a
  connector different from the currently-attached one; same-connector
  reattach remains silent.
- Align the self.docker type annotation on DockerAgent with the one
  introduced in #11226 (no-op if already present).

Refs #11227
Refs #11228

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…boot failure

Addresses review feedback on PR #11226:
- Wrap shutdown() so self.docker.close() runs in a finally block,
  preventing an aiohttp.ClientSession leak if super().shutdown() raises.
- Release the shared Docker client if __ainit__ raises after the
  session is constructed.
- Add regression tests asserting the session is closed both on
  successful shutdown and when inner shutdown raises.

Refs #11218
Refs #11226

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk
Copy link
Copy Markdown
Member Author

Review summary + tradeoff notes. Resolutions pushed to this branch.

Verdict: ship with changes — addressed.

Finding Resolution
shutdown() closed self.docker outside the try/finally wrapping super().shutdown() — a raising inner shutdown leaked the aiohttp.ClientSession Wrapped the full shutdown() body in try: ... finally: await self.docker.close() (commit 801ed1e90)
__ainit__ raising after self.docker = Docker() also leaked Narrow try/except BaseException: await self.docker.close(); raise around post-construction startup (includes KeyboardInterrupt/SystemExit)
test_auto_pull_* broke because class-level patch("...Docker") no longer intercepts the shared instance Realigned tests to patch agent.docker.images.inspect directly (commit 27f9c3616)
No regression coverage for the shutdown lifecycle Added tests asserting session closed after normal shutdown() AND after a raising super().shutdown()

Tradeoffs vs. the original async with closing_async(Docker()) pattern:

What the original gave up that this PR lost: automatic cleanup (no explicit lifecycle to get wrong — the shutdown-leak bug only exists because we now own it), fault isolation between ops, transparent daemon-restart recovery (shared ClientSession survives but a stale keep-alive socket can surface a one-shot ClientConnectionError), trivial class-level mocking (hence the test realignment), no pickling hazard.

What the new pattern wins: ~5–10× fewer connections per kernel start (TCP/Unix-socket handshake + HTTP parser + connector init amortized over agent lifetime), keep-alive socket reuse, lower resource footprint.

Net: favorable for a long-lived agent. monitor_docker_events() correctly keeps its own closing_async(Docker()) since stream reconnect semantics need per-op isolation. Accelerator plugins (src/ai/backend/accelerator/*/plugin.py) still instantiate Docker() inside init() — deferred to a follow-up within epic #11216.

@rapsealk rapsealk requested review from a team and achimnol April 22, 2026 06:17
- Narrow `except BaseException` to `except Exception` in __ainit__
  cleanup: don't intercept KeyboardInterrupt/SystemExit propagation.
- Drop the impossible `if self.docker is not None` guard in shutdown()
  (self.docker is non-Optional and set before any await in __ainit__).
- Remove the ornamental docker parameter on _purge_image; its sole
  caller already passes self.docker.

Refs #11218
Refs #11226

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk rapsealk changed the title refactor(agent): Reuse a long-lived aiodocker client across container operations refactor(BA-5858): Reuse a long-lived aiodocker client across container operations Apr 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:agent Related to Agent component size:XL 500~ LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pool the Docker daemon connection in the agent

3 participants