Retry once on stale-connection errors from shared aiodocker client after dockerd restart

Parent epic: #11216
Follow-up to: #11226

## Main idea

#11226 pooled a single long-lived `aiodocker.Docker` (wrapping an `aiohttp.ClientSession`) on `DockerAgent`, replacing the per-op `async with closing_async(Docker()) as docker:` pattern. That's a big latency win, but it introduces one production failure mode that the old pattern didn't have:

After `systemctl restart docker` (or any dockerd bounce), the agent's long-lived `aiohttp.ClientSession` still holds keepalive sockets in its connector pool. The first call to dockerd after the restart picks a stale socket, fails with `aiohttp.ClientConnectionError` or `aiohttp.ServerDisconnectedError`, and surfaces as a spurious failure in `purge_images`, `scan_images`, `check_image`, or any other per-op call site. aiohttp reconnects on the _next_ call, so the error is one-shot, but that one shot is enough to fail a user-visible operation.

Per-op `Docker()` never hit this because each call opened a fresh socket.

## Design

Add a thin once-retry wrapper around the per-op aiodocker sites that catches `aiohttp.ClientConnectionError` (which also covers its subclass `aiohttp.ServerDisconnectedError`) exactly once, then re-runs the call. On the second failure, propagate normally. No exponential backoff: the retry is specifically for the stale-connection case, which resolves on the next attempt.
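A minimal sketch of the wrapper described above, not the final implementation. The names `retry_once`, `STALE_CONNECTION_ERRORS`, and `FakeDocker` are hypothetical. To keep the sketch runnable without aiohttp installed, the built-in `ConnectionError` stands in for `aiohttp.ClientConnectionError`; in the agent the tuple would hold the aiohttp exception instead.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")

# Assumption: in the agent this would be (aiohttp.ClientConnectionError,),
# which also matches aiohttp.ServerDisconnectedError (a subclass).
# The built-in ConnectionError is only a stand-in for this sketch.
STALE_CONNECTION_ERRORS: tuple[type[BaseException], ...] = (ConnectionError,)


async def retry_once(
    op: Callable[[], Awaitable[T]],
    retry_on: tuple[type[BaseException], ...] = STALE_CONNECTION_ERRORS,
) -> T:
    """Run op(); on a stale-connection error, run it exactly one more time."""
    try:
        return await op()
    except retry_on:
        # No backoff: the second attempt is expected to get a fresh socket.
        # If it fails too, the exception propagates to the caller unchanged.
        return await op()


class FakeDocker:
    """Hypothetical stand-in whose first call fails like a stale socket."""

    def __init__(self) -> None:
        self.calls = 0

    async def list_images(self) -> list[str]:
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("server disconnected")
        return ["python:3.12"]


async def demo() -> tuple[list[str], int]:
    docker = FakeDocker()
    images = await retry_once(docker.list_images)
    return images, docker.calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Taking a zero-argument callable rather than an awaitable matters here: a coroutine object can only be awaited once, so the retry has to create a new one.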

Sites to cover (grep `self.docker.` in `src/ai/backend/agent/docker/agent.py`):

- `apply_accelerator_allocation`
- `start_container`
- `get_intrinsic_mounts`
- `destroy_kernel` / `clean_kernel`
- `extract_image_command`
- `enumerate_containers`
- `scan_images`
- `push_images` / `pull_images` / `purge_images`
- `check_image`
- `create_local_network` / `destroy_local_network`
- `resolve_image_distro`
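With this many sites, a decorator may be less invasive than editing each call. The following is a sketch under the same assumptions as the design: `retry_once_on` and `FakeAgent` are hypothetical names, and the built-in `ConnectionError` stands in for `aiohttp.ClientConnectionError` so the example runs without aiohttp.

```python
import asyncio
import functools
from typing import Any


def retry_once_on(*retry_on: type[BaseException]):
    """Decorator: re-run an async method once if it raises one of retry_on."""

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return await func(*args, **kwargs)
            except retry_on:
                # Single retry, no backoff; a second failure propagates.
                return await func(*args, **kwargs)

        return wrapper

    return decorator


class FakeAgent:
    """Hypothetical stand-in for an agent with one per-op Docker site."""

    def __init__(self) -> None:
        self.attempts = 0

    @retry_once_on(ConnectionError)
    async def check_image(self, ref: str) -> bool:
        self.attempts += 1
        if self.attempts == 1:
            # Simulates the first call after a dockerd restart.
            raise ConnectionError("stale keepalive socket")
        return True


if __name__ == "__main__":
    agent = FakeAgent()
    print(asyncio.run(agent.check_image("python:3.12")), agent.attempts)
```

One caveat: decorating a whole method re-runs everything in it, including any side effects that happened before the failing Docker call. For multi-step sites it may be safer to wrap only the individual `self.docker.` calls.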

Do NOT wrap:

- `monitor_docker_events()` — it already has its own reconnect loop over `closing_async(Docker())`.
- Streaming reads (`DockerStatsStreamer`, introduced in #11224) — those already have bounded-backoff reconnect.

## Alternative ideas

- Set `force_close=True` on the connector so every request uses a fresh socket. Simpler, but negates #11226's latency win. Rejected.
- Health-probe on an interval and recreate the client when the dockerd version endpoint fails. More infrastructure, catches the same case. Overkill for now.

## Out of scope

- Accelerator plugins' `Docker()` usage — tracked separately within epic #11216.


JIRA Issue: BA-5862

Retry once on stale-connection errors from shared aiodocker client after dockerd restart #11233
