fix(agent): Scope aiodocker client in DockerKernel.commit() via closing_async#11228
fix(agent): Scope aiodocker client in DockerKernel.commit() via closing_async#11228rapsealk wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
Refactors the Docker agent/kernel integration so DockerKernel operations reuse the agent-owned long-lived aiodocker.Docker client (instead of creating a new client per call), while ensuring recovery/unpickling paths re-attach the shared client and RPC/pickle payloads don’t carry a live Docker session.
Changes:
- Inject a shared
Dockerclient intoDockerKerneland routeget_logs,commit,download_file, anddownload_singlethrough it. - Update kernel recovery/loader paths to pass/re-attach the shared Docker client; exclude
_dockerfrom pickled kernel state and provideattach_docker(). - Move shared client initialization earlier in
DockerAgent.__ainit__and adjust relevant constructors/tests accordingly.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/agent/test_container_id_sync.py | Updates test fixture to pass the newly required docker argument to DockerKernel. |
| src/ai/backend/agent/stage/kernel_lifecycle/docker/kernel_object.py | Extends kernel object stage spec to carry the shared Docker client into DockerKernel construction. |
| src/ai/backend/agent/kernel_registry/types.py | Changes recovery-to-kernel conversion to require an injected Docker client. |
| src/ai/backend/agent/kernel_registry/loader/container.py | Passes the agent’s shared Docker client into recovered DockerKernel instances during load. |
| src/ai/backend/agent/docker/kernel.py | Stores injected Docker client on DockerKernel, avoids pickling it, and uses it for logs/commit/download operations. |
| src/ai/backend/agent/docker/agent.py | Initializes shared Docker client earlier, passes it into kernel creation contexts, and re-attaches it to recovered kernels. |
| changes/11227.enhance.md | Adds changelog entry describing the refactor. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # The shared aiodocker session's timeout is mutated for the duration of | ||
| # this commit call and restored afterwards, so concurrent users of the | ||
| # shared client are not affected. | ||
| commit_timeout = aiohttp.ClientTimeout( | ||
| total=self.agent_config["api"]["commit-timeout"] | ||
| ) | ||
| previous_timeout = docker.session._timeout | ||
| docker.session._timeout = commit_timeout |
There was a problem hiding this comment.
commit() mutates the shared aiodocker session's private _timeout field. Even if you restore it in finally, this change is still visible to other concurrent operations using the same shared client while the commit is in-flight, and nested/overlapping commits can temporarily apply the wrong timeout. Prefer a per-call timeout strategy (e.g., wrap the specific awaitable(s) in asyncio.wait_for, or use a separate short-lived Docker client just for commit-timeout tweaking) and update the comment that claims concurrent users are not affected.
| # Long-lived shared aiodocker client; must be available before kernel recovery | ||
| # loads any DockerKernel instances via `super().__ainit__()`. | ||
| self.docker = Docker() | ||
| docker = self.docker | ||
| docker_host = "" | ||
| match docker.connector: |
There was a problem hiding this comment.
self.docker = Docker() is created early in __ainit__(), but if any later step in __ainit__ raises (e.g., docker.version()/system.info()), the client session will be leaked because shutdown may never run. Consider wrapping the init sequence in a try/except that closes self.docker on failure (and/or initialize self.docker to None in __init__ so later cleanup paths can safely check it).
…ocker Addresses review feedback on PR #11228: - Add a pickle round-trip test asserting __getstate__ excludes _docker, locking the RPC-marshalling invariant for future contributors. - Document on DockerKernel that _docker is local-only and must be re-attached by the agent via attach_docker(). - Warn (but still overwrite) when attach_docker is called with a connector different from the currently-attached one; same-connector reattach remains silent. - Align the self.docker type annotation on DockerAgent with the one introduced in #11226 (no-op if already present). Refs #11227 Refs #11228 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Review summary + tradeoff notes — worth considering whether to close this PR in favor of a narrower fix. Resolutions pushed to this branch (commit
Why this PR is worth a second look on scope: The original
Gains: Suggested alternative: close this PR and open a minimal one that:
That fixes the actual bug, keeps the fault-isolation / no-pickling-hazard / simple-testing properties of the original design for rarely-called methods, and avoids paying ongoing invariant-maintenance cost for marginal latency wins. Arguments for keeping this PR:
My lean: narrow fix is cleaner, but the call belongs to @lablup maintainers. Happy to spin up the |
…ng_async DockerKernel.commit() previously instantiated aiodocker.Docker() bare and relied on ad-hoc session cleanup, which leaked the underlying aiohttp.ClientSession on every commit. Wrap construction in closing_async(...) so the session is always closed, matching the pattern already used in get_logs/download_file/download_single. Closes #11227 Refs #11216 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6a31219 to
3be2af5
Compare
|
Closing as not planned. Rereading the original The broader shared-client extension into If the Closing #11227 (not planned) alongside this PR. |
Summary
Docker()inDockerKernel.commit()withclosing_async(...)so theaiohttp.ClientSessionis always closed.get_logs,download_file,download_single).Why not the broader shared-client refactor?
The original scope of #11227 threaded a shared
Dockerclient intoDockerKerneland its recovery path. After review, the ongoing invariant-maintenance cost (pickle exclusion,attach_docker()ordering on recovery, constructor plumbing) outweighed the latency gains on these rarely-called methods. The only real bug — a leaked session incommit()— is fixed here in isolation. See the prior discussion on this PR for the full tradeoff analysis.Test plan
pants check src/ai/backend/agent/docker::passesCloses #11227
Refs #11216