fix(agent): Scope aiodocker client in DockerKernel.commit() via closing_async#11228

Closed
rapsealk wants to merge 1 commit into main from perf/11227-dockerkernel-shared-client

Conversation

@rapsealk
Member

@rapsealk rapsealk commented Apr 22, 2026

Summary

  • Wrap the bare Docker() in DockerKernel.commit() with closing_async(...) so the aiohttp.ClientSession is always closed.
  • Matches the pattern already used by sibling methods (get_logs, download_file, download_single).
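
The scoping pattern described above can be sketched as follows. This is a minimal illustration, not the Backend.AI code: `FakeDocker` is a hypothetical stand-in for `aiodocker.Docker`, and the `closing_async` defined here is only an assumed equivalent of the helper named in this PR (an async context manager that awaits `close()` on exit).

```python
import asyncio
from contextlib import asynccontextmanager


class FakeDocker:
    """Hypothetical stand-in for aiodocker.Docker; records whether close() ran."""

    def __init__(self) -> None:
        self.closed = False

    async def commit(self, fail: bool = False) -> str:
        if fail:
            raise RuntimeError("commit failed")
        return "image-id"

    async def close(self) -> None:
        self.closed = True


@asynccontextmanager
async def closing_async(thing):
    # Assumed behaviour of the helper: yield the object, then always await
    # its close() coroutine, even when the body raises.
    try:
        yield thing
    finally:
        await thing.close()


async def commit_scoped(client: FakeDocker, fail: bool = False) -> str:
    # The client's lifetime is bound to this block.
    async with closing_async(client) as docker:
        return await docker.commit(fail=fail)


async def demo() -> tuple[bool, bool]:
    ok_client, bad_client = FakeDocker(), FakeDocker()
    await commit_scoped(ok_client)
    try:
        await commit_scoped(bad_client, fail=True)
    except RuntimeError:
        pass
    return ok_client.closed, bad_client.closed


result = asyncio.run(demo())
```

In this sketch both clients end up closed, whether the commit succeeds or raises.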

Why not the broader shared-client refactor?

The original scope of #11227 threaded a shared Docker client into DockerKernel and its recovery path. After review, the ongoing invariant-maintenance cost (pickle exclusion, attach_docker() ordering on recovery, constructor plumbing) outweighed the latency gains on these rarely-called methods. The only real bug — a leaked session in commit() — is fixed here in isolation. See the prior discussion on this PR for the full tradeoff analysis.

Test plan

  • pants check src/ai/backend/agent/docker:: passes
  • Commit a kernel and verify no aiohttp-session leak warnings on agent shutdown.

Closes #11227
Refs #11216

@rapsealk rapsealk added this to the 26.5 milestone Apr 22, 2026
Copilot AI review requested due to automatic review settings April 22, 2026 05:07
@github-actions github-actions Bot added size:L 100~500 LoC comp:agent Related to Agent component labels Apr 22, 2026
Contributor

Copilot AI left a comment

Pull request overview

Refactors the Docker agent/kernel integration so DockerKernel operations reuse the agent-owned long-lived aiodocker.Docker client (instead of creating a new client per call), while ensuring recovery/unpickling paths re-attach the shared client and RPC/pickle payloads don’t carry a live Docker session.

Changes:

  • Inject a shared Docker client into DockerKernel and route get_logs, commit, download_file, and download_single through it.
  • Update kernel recovery/loader paths to pass/re-attach the shared Docker client; exclude _docker from pickled kernel state and provide attach_docker().
  • Move shared client initialization earlier in DockerAgent.__ainit__ and adjust relevant constructors/tests accordingly.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/unit/agent/test_container_id_sync.py | Updates test fixture to pass the newly required docker argument to DockerKernel. |
| src/ai/backend/agent/stage/kernel_lifecycle/docker/kernel_object.py | Extends kernel object stage spec to carry the shared Docker client into DockerKernel construction. |
| src/ai/backend/agent/kernel_registry/types.py | Changes recovery-to-kernel conversion to require an injected Docker client. |
| src/ai/backend/agent/kernel_registry/loader/container.py | Passes the agent’s shared Docker client into recovered DockerKernel instances during load. |
| src/ai/backend/agent/docker/kernel.py | Stores injected Docker client on DockerKernel, avoids pickling it, and uses it for logs/commit/download operations. |
| src/ai/backend/agent/docker/agent.py | Initializes shared Docker client earlier, passes it into kernel creation contexts, and re-attaches it to recovered kernels. |
| changes/11227.enhance.md | Adds changelog entry describing the refactor. |

Comment thread src/ai/backend/agent/docker/kernel.py Outdated
Comment on lines +261 to +268
# The shared aiodocker session's timeout is mutated for the duration of
# this commit call and restored afterwards, so concurrent users of the
# shared client are not affected.
commit_timeout = aiohttp.ClientTimeout(
    total=self.agent_config["api"]["commit-timeout"]
)
previous_timeout = docker.session._timeout
docker.session._timeout = commit_timeout

Copilot AI Apr 22, 2026

commit() mutates the shared aiodocker session's private _timeout field. Even if you restore it in finally, this change is still visible to other concurrent operations using the same shared client while the commit is in-flight, and nested/overlapping commits can temporarily apply the wrong timeout. Prefer a per-call timeout strategy (e.g., wrap the specific awaitable(s) in asyncio.wait_for, or use a separate short-lived Docker client just for commit-timeout tweaking) and update the comment that claims concurrent users are not affected.
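
One of the per-call strategies suggested above, wrapping the awaitable in `asyncio.wait_for`, can be sketched like this. `SharedClient`, `slow_commit`, and `commit_with_deadline` are hypothetical names used only for illustration; they are not aiodocker or Backend.AI APIs.

```python
import asyncio


class SharedClient:
    """Hypothetical shared client holding one session-wide timeout value."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout

    async def slow_commit(self, duration: float) -> str:
        # Simulates a long-running commit request.
        await asyncio.sleep(duration)
        return "committed"


async def commit_with_deadline(
    client: SharedClient, duration: float, deadline: float
) -> str:
    # The deadline applies to this awaitable only. client.timeout is never
    # touched, so concurrent callers keep seeing the value they expect.
    return await asyncio.wait_for(client.slow_commit(duration), timeout=deadline)


async def demo() -> tuple[str, bool, float]:
    client = SharedClient(timeout=30.0)
    fast = await commit_with_deadline(client, duration=0.01, deadline=1.0)
    try:
        await commit_with_deadline(client, duration=1.0, deadline=0.05)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return fast, timed_out, client.timeout


fast, timed_out, shared_timeout = asyncio.run(demo())
```

Note that `asyncio.wait_for` cancels the wrapped awaitable when the deadline passes, so a real call site would need to tolerate cancellation of the in-flight request.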

Comment thread src/ai/backend/agent/docker/agent.py Outdated
Comment on lines +1488 to +1493
# Long-lived shared aiodocker client; must be available before kernel recovery
# loads any DockerKernel instances via `super().__ainit__()`.
self.docker = Docker()
docker = self.docker
docker_host = ""
match docker.connector:

Copilot AI Apr 22, 2026

self.docker = Docker() is created early in __ainit__(), but if any later step in __ainit__ raises (e.g., docker.version()/system.info()), the client session will be leaked because shutdown may never run. Consider wrapping the init sequence in a try/except that closes self.docker on failure (and/or initialize self.docker to None in __init__ so later cleanup paths can safely check it).
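
A minimal sketch of the suggested guard, assuming a simplified agent: `FakeDocker`, `Agent`, and `ainit` are hypothetical stand-ins for `aiodocker.Docker`, `DockerAgent`, and `__ainit__`, and the failing `version()` call simulates an unreachable daemon.

```python
import asyncio
from typing import Callable, Optional


class FakeDocker:
    """Hypothetical stand-in for aiodocker.Docker."""

    def __init__(self) -> None:
        self.closed = False

    async def version(self) -> dict:
        raise ConnectionError("daemon unreachable")

    async def close(self) -> None:
        self.closed = True


class Agent:
    def __init__(self, client_factory: Callable[[], FakeDocker]) -> None:
        self._client_factory = client_factory
        # Initialised to None so cleanup paths can safely check it.
        self.docker: Optional[FakeDocker] = None

    async def ainit(self) -> None:
        self.docker = self._client_factory()
        try:
            await self.docker.version()
        except BaseException:
            # Any failure (including cancellation) after the client exists
            # must close it here, because shutdown may never run.
            await self.docker.close()
            self.docker = None
            raise


async def demo() -> tuple[bool, Optional[FakeDocker]]:
    created: list[FakeDocker] = []

    def factory() -> FakeDocker:
        client = FakeDocker()
        created.append(client)
        return client

    agent = Agent(factory)
    try:
        await agent.ainit()
    except ConnectionError:
        pass
    return created[0].closed, agent.docker


result = asyncio.run(demo())
```

Here the client created during the failed initialisation is closed and the attribute is reset, so no session outlives the failed agent.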

rapsealk added a commit that referenced this pull request Apr 22, 2026
…ocker

Addresses review feedback on PR #11228:
- Add a pickle round-trip test asserting __getstate__ excludes _docker,
  locking the RPC-marshalling invariant for future contributors.
- Document on DockerKernel that _docker is local-only and must be
  re-attached by the agent via attach_docker().
- Warn (but still overwrite) when attach_docker is called with a
  connector different from the currently-attached one; same-connector
  reattach remains silent.
- Align the self.docker type annotation on DockerAgent with the one
  introduced in #11226 (no-op if already present).

Refs #11227
Refs #11228

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk
Member Author

Review summary + tradeoff notes — worth considering whether to close this PR in favor of a narrower fix.

Resolutions pushed to this branch (commit 6a312195e):

  • Pickle round-trip test locking __getstate__ exclusion of _docker (invariant most at risk from future contributors).
  • Class-level docstring documenting the _docker-is-local-only / must-re-attach invariant.
  • attach_docker() logs log.warning on client-swap (mismatched connector); same-connector reattach silent.
  • DockerAgent.docker: Docker class annotation aligned with refactor(BA-5858): Reuse a long-lived aiodocker client across container operations #11226.
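
The pickle-exclusion invariant and the re-attach step listed above could look roughly like the sketch below. `Kernel` and `FakeDocker` are hypothetical simplifications, not the actual `DockerKernel` code; only the `__getstate__`, `_docker`, and `attach_docker()` names come from this thread.

```python
import pickle


class FakeDocker:
    """Hypothetical live client; refuses to be pickled."""

    def __reduce__(self):
        raise TypeError("live client must not be pickled")


class Kernel:
    def __init__(self, kernel_id: str, docker: FakeDocker) -> None:
        self.kernel_id = kernel_id
        self._docker = docker

    def __getstate__(self) -> dict:
        # Drop the local-only client so the pickled payload never carries it.
        state = self.__dict__.copy()
        state.pop("_docker", None)
        return state

    def __setstate__(self, state: dict) -> None:
        # The restored object has no _docker until attach_docker() is called.
        self.__dict__.update(state)

    def attach_docker(self, docker: FakeDocker) -> None:
        self._docker = docker


original = Kernel("k1", FakeDocker())
restored = pickle.loads(pickle.dumps(original))
has_client_before_attach = hasattr(restored, "_docker")
restored.attach_docker(FakeDocker())
has_client_after_attach = hasattr(restored, "_docker")
```

The gap between unpickling and `attach_docker()` is the window in which any method touching `_docker` would raise `AttributeError`.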

Why this PR is worth a second look on scope:

Replacing the original async with closing_async(Docker()) as docker: in DockerKernel.get_logs/download_file/download_single gives up more than it appears at first:

| Property preserved by original | Cost paid by this PR |
| --- | --- |
| No pickling hazard | DockerKernel is RPC-marshalled → __getstate__/__setstate__ exclusion is now load-bearing; one wrong line in a future PR silently breaks RPC |
| No recovery-ordering window | Unpickled kernels have no _docker until _load_kernel_registry_from_recovery runs attach_docker() on each; any method fired in that window raises AttributeError |
| Simple construction | Kw-only docker= threaded through DockerKernel.__init__, DockerKernelCreationContext, KernelObjectSpec, and the recovery loader (plus a cast("DockerAgent", ...)) |
| Trivial test seam | Tests mocking Docker now need to reach through kernel._docker |
| Fault isolation | A bad session state now fails every subsequent op on that kernel |

Gains: get_logs, download_file, and download_single are low-frequency calls, not on the per-kernel-start hot path. The only unambiguous win is the commit() leak fix: the old commit() used a bare Docker() with no closing_async, a real bug on every commit.


Suggested alternative: close this PR and open a minimal one that:

  1. Wraps the existing commit() bare Docker() in closing_async(...) — fixes the real leak in ~3 lines.
  2. Leaves get_logs, download_file, download_single on their current closing_async(Docker()) pattern.

That fixes the actual bug, keeps the fault-isolation / no-pickling-hazard / simple-testing properties of the original design for rarely-called methods, and avoids paying ongoing invariant-maintenance cost for marginal latency wins.

Arguments for keeping this PR:

My lean: narrow fix is cleaner, but the call belongs to @lablup maintainers. Happy to spin up the commit()-only alternative if preferred.

…ng_async

DockerKernel.commit() previously instantiated aiodocker.Docker() bare
and relied on ad-hoc session cleanup, which leaked the underlying
aiohttp.ClientSession on every commit. Wrap construction in
closing_async(...) so the session is always closed, matching the
pattern already used in get_logs/download_file/download_single.

Closes #11227
Refs #11216

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk rapsealk force-pushed the perf/11227-dockerkernel-shared-client branch from 6a31219 to 3be2af5 Compare April 22, 2026 06:09
@github-actions github-actions Bot added size:XS ~10 LoC and removed size:L 100~500 LoC labels Apr 22, 2026
@rapsealk rapsealk changed the title refactor(agent): Reuse the shared aiodocker client in DockerKernel methods fix(agent): Scope aiodocker client in DockerKernel.commit() via closing_async Apr 22, 2026
@rapsealk rapsealk requested review from a team and achimnol April 22, 2026 06:17
@rapsealk
Member Author

Closing as not planned.

Rereading the original DockerKernel.commit() on main: it was already closing the session correctly via try/finally: await docker.close() — not leaking. The narrow async with closing_async(...) rewrite in the current HEAD of this PR is cosmetic (style/idiom consistency with sibling methods), not a fix. The fix(agent): ... title over-promises.

The broader shared-client extension into DockerKernel (the original scope of this PR: pickling exclusion, attach_docker(), recovery ordering, constructor plumbing through ~4 sites) was judged unfavorable on cost/benefit for get_logs / download_file / download_single / commit, which are not on the per-kernel-start hot path. See the prior discussion in this thread.

If the try/finally → async with closing_async(...) idiom cleanup is worth doing, it can be swept in organically during a future touch to docker/kernel.py, not as its own PR.

Closing #11227 (not planned) alongside this PR.

@rapsealk rapsealk closed this Apr 22, 2026
@rapsealk rapsealk deleted the perf/11227-dockerkernel-shared-client branch April 22, 2026 06:25
Labels

comp:agent Related to Agent component size:XS ~10 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reuse the shared aiodocker client in DockerKernel methods

2 participants