Parent epic: #11216
Follow-up to: #11226
Main idea
#11226 pooled a single long-lived aiodocker.Docker (wrapping an aiohttp.ClientSession) on DockerAgent, replacing the per-op async with closing_async(Docker()) as docker: pattern. That's a big latency win, but it introduces one production failure mode that the old pattern didn't have:
After systemctl restart docker (or any dockerd bounce), the agent's long-lived aiohttp.ClientSession still holds keepalive sockets in its connector pool. The first call to dockerd after the restart picks a stale socket, fails with aiohttp.ClientConnectionError or aiohttp.ServerDisconnectedError, and surfaces as a spurious failure in purge_images, scan_images, check_image, or any other per-op site. aiohttp reconnects on the next call, so the error is one-shot, but that one shot is enough to fail a user-visible operation.
Per-op Docker() never hit this because each call opened a fresh socket.
Design
Add a thin retry-once wrapper around the per-op aiodocker sites that catches aiohttp.ClientConnectionError / aiohttp.ServerDisconnectedError (the latter is a subclass of the former, so catching the base class covers both) exactly once, then re-runs the call. If the second attempt also fails, the exception propagates normally. No exponential backoff: the retry is specifically for the stale-connection case, which resolves on the next attempt.
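A minimal sketch of what such a wrapper could look like, not the actual implementation. The names retry_once, retry_on, and FlakyClient are hypothetical. To keep the sketch runnable with only the standard library, the demo retries on the builtin ConnectionError; in the agent the tuple would presumably be (aiohttp.ClientConnectionError,).

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def retry_once(
    op: Callable[[], Awaitable[T]],
    *,
    retry_on: tuple[type[BaseException], ...],
) -> T:
    """Run op(); if it raises one of retry_on, run it exactly one more time."""
    try:
        return await op()
    except retry_on:
        # First failure: assumed to be a stale keepalive socket. No backoff,
        # since the pooled session is expected to reconnect on the next call.
        pass
    # Second attempt: whatever this raises propagates to the caller.
    return await op()


class FlakyClient:
    """Hypothetical stand-in for the pooled client: fails N times, then works."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    async def list_images(self) -> list[str]:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("stale socket")
        return ["example:latest"]  # made-up image reference


async def demo() -> tuple[list[str], int]:
    client = FlakyClient(failures=1)
    images = await retry_once(client.list_images, retry_on=(ConnectionError,))
    return images, client.calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The retry runs outside the except block so that a second failure is raised on its own rather than chained to the first one.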
Sites to cover (grep self.docker. in src/ai/backend/agent/docker/agent.py):
apply_accelerator_allocation
start_container
get_intrinsic_mounts
destroy_kernel / clean_kernel
extract_image_command
enumerate_containers
scan_images
push_images / pull_images / purge_images
check_image
create_local_network / destroy_local_network
resolve_image_distro
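One way the listed sites could be covered is a decorator on each method. The sketch below assumes that approach; retry_once_on, StaleOnceDocker, and SketchAgent are hypothetical names, and the builtin ConnectionError stands in for aiohttp.ClientConnectionError so the example runs without aiohttp installed.

```python
import asyncio
import functools
from collections.abc import Awaitable, Callable
from typing import Any

AsyncMethod = Callable[..., Awaitable[Any]]


def retry_once_on(
    *exc_types: type[BaseException],
) -> Callable[[AsyncMethod], AsyncMethod]:
    """Decorator factory: re-run the coroutine function once on the given errors."""

    def decorator(fn: AsyncMethod) -> AsyncMethod:
        @functools.wraps(fn)
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return await fn(*args, **kwargs)
            except exc_types:
                pass  # first failure only; fall through to the single retry
            return await fn(*args, **kwargs)

        return wrapper

    return decorator


class StaleOnceDocker:
    """Hypothetical stand-in for the pooled client: the first request fails."""

    def __init__(self) -> None:
        self.requests = 0

    async def inspect(self, ref: str) -> dict[str, str]:
        self.requests += 1
        if self.requests == 1:
            raise ConnectionError("stale keepalive socket")
        return {"ref": ref}


class SketchAgent:
    """Hypothetical agent; the method name mirrors one of the listed sites."""

    def __init__(self) -> None:
        self.docker = StaleOnceDocker()

    @retry_once_on(ConnectionError)
    async def check_image(self, ref: str) -> bool:
        info = await self.docker.inspect(ref)
        return info["ref"] == ref


async def demo() -> tuple[bool, int]:
    agent = SketchAgent()
    ok = await agent.check_image("example:latest")
    return ok, agent.docker.requests


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Decorating a whole method re-runs everything in its body on retry, so this form only fits methods whose bodies are safe to repeat. For the others, wrapping the individual self.docker call is the narrower option.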
Do NOT wrap:
monitor_docker_events(): it already has its own reconnect loop over closing_async(Docker()).
DockerStatsStreamer (introduced in #11224): it already has bounded-backoff reconnect.
Alternative ideas
force_close=True on the connector so every request uses a fresh socket. Simpler, but negates #11226's latency win. Rejected.
Out of scope
Docker() usage: tracked separately within epic #11216.
JIRA Issue: BA-5862