Retry once on stale-connection errors from shared aiodocker client after dockerd restart #11233

@rapsealk

Description

Parent epic: #11216
Follow-up to: #11226

Main idea

#11226 pooled a single long-lived aiodocker.Docker (wrapping an aiohttp.ClientSession) on DockerAgent, replacing the per-op async with closing_async(Docker()) as docker: pattern. That's a big latency win, but it introduces one production failure mode that the old pattern didn't have:

After systemctl restart docker (or any dockerd bounce), the agent's long-lived aiohttp.ClientSession still holds keepalive sockets in its connector pool. The first call to dockerd after the restart picks a stale socket, fails with aiohttp.ClientConnectionError or aiohttp.ServerDisconnectedError, and surfaces as a spurious failure in purge_images, scan_images, check_image, or any other per-op site. aiohttp reconnects on the next call, so the error is one-shot, but that one shot is enough to fail a user-visible operation.

Per-op Docker() never hit this because each call opened a fresh socket.

Design

Add a thin retry-once wrapper around the per-op aiodocker sites that catches aiohttp.ClientConnectionError / aiohttp.ServerDisconnectedError exactly once, then re-runs the call. (ServerDisconnectedError is a subclass of ClientConnectionError in aiohttp, so catching the latter covers both.) On the second failure, propagate normally. No exponential backoff: the retry is specifically for the stale-connection case, which resolves on the next attempt.
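A minimal sketch of what such a wrapper could look like, not the actual implementation. The names retry_once, FlakyOnce, and demo are hypothetical, and the stand-in client raises the built-in ConnectionResetError so the example runs without aiohttp installed. In the agent, the retry_on tuple would presumably be (aiohttp.ClientConnectionError,).

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def retry_once(
    op: Callable[[], Awaitable[T]],
    retry_on: tuple[type[BaseException], ...],
) -> T:
    """Run op(); if it raises one of retry_on, run it exactly one more time.

    Hypothetical helper. op is a zero-argument callable rather than a
    coroutine object because a coroutine can only be awaited once.
    """
    try:
        return await op()
    except retry_on:
        # Stale-connection case: the second attempt is expected to get a
        # fresh socket. No loop and no backoff; a second failure propagates.
        return await op()


class FlakyOnce:
    """Stand-in (assumption) for a pooled client holding stale sockets."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    async def list_images(self) -> list[str]:
        self.calls += 1
        if self.calls <= self.failures:
            # Built-in stand-in for aiohttp.ServerDisconnectedError.
            raise ConnectionResetError("stale socket")
        return ["python:3.12"]


async def demo() -> tuple[list[str], int]:
    client = FlakyOnce(failures=1)
    # In the agent this would wrap a self.docker.* call and pass
    # (aiohttp.ClientConnectionError,) as retry_on (assumption).
    result = await retry_once(client.list_images, (ConnectionError,))
    return result, client.calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With one simulated stale socket the call succeeds on the second attempt. With two consecutive failures the second exception reaches the caller unchanged, which matches the "propagate normally" rule above.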

Sites to cover (grep self.docker. in src/ai/backend/agent/docker/agent.py):

  • apply_accelerator_allocation
  • start_container
  • get_intrinsic_mounts
  • destroy_kernel / clean_kernel
  • extract_image_command
  • enumerate_containers
  • scan_images
  • push_images / pull_images / purge_images
  • check_image
  • create_local_network / destroy_local_network
  • resolve_image_distro

Do NOT wrap:

Alternative ideas

Out of scope

JIRA Issue: BA-5862
