-
Notifications
You must be signed in to change notification settings - Fork 175
refactor(BA-5859): Stream container stats from Docker instead of polling #11224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 1 commit
7178d9a
7fc6ac9
43e9ff7
339d6a8
dda704d
1dd6f3e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| Stream container stats from Docker via a long-lived reader instead of polling `container.stats(stream=False)` per collection cycle. |
| Original file line number | Diff line number | Diff line change | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -131,6 +131,12 @@ def read_netns_net_dev(ns_path: Path) -> ContainerNetStat: | |||||||||||
| return _parse_proc_net_dev(content) | ||||||||||||
|
|
||||||||||||
|
|
||||||||||||
| def _validate_stats_entry(entry: dict[str, Any]) -> dict[str, Any] | None: | ||||||||||||
| if entry["read"].startswith("0001-01-01") or entry["preread"].startswith("0001-01-01"): | ||||||||||||
| return None | ||||||||||||
| return entry | ||||||||||||
|
|
||||||||||||
|
|
||||||||||||
| async def fetch_api_stats(container: DockerContainer) -> dict[str, Any] | None: | ||||||||||||
| short_cid = container.id[:7] | ||||||||||||
| try: | ||||||||||||
|
|
@@ -149,7 +155,7 @@ async def fetch_api_stats(container: DockerContainer) -> dict[str, Any] | None: | |||||||||||
| ) | ||||||||||||
| return None | ||||||||||||
| else: | ||||||||||||
| entry = {"read": "0001-01-01"} | ||||||||||||
| entry: dict[str, Any] = {"read": "0001-01-01"} | ||||||||||||
| # aiodocker 0.16 or later returns a list of dict, even when not streaming. | ||||||||||||
| match ret: | ||||||||||||
| case list() if ret: | ||||||||||||
|
|
@@ -164,9 +170,102 @@ async def fetch_api_stats(container: DockerContainer) -> dict[str, Any] | None: | |||||||||||
| ret, | ||||||||||||
| ) | ||||||||||||
| return None | ||||||||||||
| if entry["read"].startswith("0001-01-01") or entry["preread"].startswith("0001-01-01"): | ||||||||||||
| return _validate_stats_entry(entry) | ||||||||||||
|
|
||||||||||||
|
|
||||||||||||
| class DockerStatsStreamer: | ||||||||||||
| """ | ||||||||||||
| Maintains one long-lived `container.stats(stream=True)` reader per container and | ||||||||||||
| exposes the most recent decoded sample from an in-memory cache. | ||||||||||||
|
|
||||||||||||
| Callers read the cached sample via :meth:`get_latest` instead of issuing a new | ||||||||||||
| HTTP round-trip every collection cycle. The first call for a previously-unseen | ||||||||||||
| container lazily spawns the reader; no sample is returned until dockerd has | ||||||||||||
| emitted at least one frame. | ||||||||||||
|
|
||||||||||||
| Readers self-terminate when the upstream iterator ends (container removed) or | ||||||||||||
| raises a connection error. A subsequent :meth:`get_latest` call starts a fresh | ||||||||||||
| reader if the container reappears. | ||||||||||||
|
|
||||||||||||
| TODO(#11219): Wire start/stop into the agent's container-lifecycle hooks | ||||||||||||
| (_handle_start_event / _handle_clean_event) to avoid relying on lazy startup. | ||||||||||||
| """ | ||||||||||||
|
|
||||||||||||
| _docker: Docker | ||||||||||||
| _latest: dict[str, dict[str, Any]] | ||||||||||||
| _tasks: dict[str, asyncio.Task[None]] | ||||||||||||
| _closed: bool | ||||||||||||
|
|
||||||||||||
| def __init__(self, docker: Docker) -> None: | ||||||||||||
| self._docker = docker | ||||||||||||
| self._latest = {} | ||||||||||||
| self._tasks = {} | ||||||||||||
| self._closed = False | ||||||||||||
|
|
||||||||||||
| def get_latest(self, container_id: str) -> dict[str, Any] | None: | ||||||||||||
| """Return the most recent cached sample for `container_id`, starting a | ||||||||||||
| reader if none is running. Returns None until the first frame arrives.""" | ||||||||||||
| if self._closed: | ||||||||||||
| return None | ||||||||||||
| return entry | ||||||||||||
| task = self._tasks.get(container_id) | ||||||||||||
| if task is None or task.done(): | ||||||||||||
| self._tasks[container_id] = asyncio.create_task( | ||||||||||||
| self._read_stream(container_id), | ||||||||||||
| name=f"docker-stats-stream:{container_id[:7]}", | ||||||||||||
| ) | ||||||||||||
| return self._latest.get(container_id) | ||||||||||||
|
|
||||||||||||
| async def stop(self, container_id: str) -> None: | ||||||||||||
| task = self._tasks.pop(container_id, None) | ||||||||||||
| self._latest.pop(container_id, None) | ||||||||||||
| if task is not None and not task.done(): | ||||||||||||
| task.cancel() | ||||||||||||
| try: | ||||||||||||
| await task | ||||||||||||
| except (asyncio.CancelledError, Exception): | ||||||||||||
| pass | ||||||||||||
|
|
||||||||||||
| async def close(self) -> None: | ||||||||||||
| self._closed = True | ||||||||||||
| tasks = list(self._tasks.values()) | ||||||||||||
| self._tasks.clear() | ||||||||||||
| self._latest.clear() | ||||||||||||
| for task in tasks: | ||||||||||||
| if not task.done(): | ||||||||||||
| task.cancel() | ||||||||||||
| for task in tasks: | ||||||||||||
| try: | ||||||||||||
| await task | ||||||||||||
| except (asyncio.CancelledError, Exception): | ||||||||||||
| pass | ||||||||||||
|
|
||||||||||||
| async def _read_stream(self, container_id: str) -> None: | ||||||||||||
| short_cid = container_id[:7] | ||||||||||||
| container = DockerContainer(self._docker, id=container_id) | ||||||||||||
| try: | ||||||||||||
| async for frame in container.stats(stream=True): | ||||||||||||
| validated = _validate_stats_entry(frame) | ||||||||||||
| if validated is not None: | ||||||||||||
| self._latest[container_id] = validated | ||||||||||||
| except asyncio.CancelledError: | ||||||||||||
| raise | ||||||||||||
| except RuntimeError as e: | ||||||||||||
| msg = str(e.args[0]).lower() if e.args else "" | ||||||||||||
| if "event loop is closed" in msg or "session is closed" in msg: | ||||||||||||
| return | ||||||||||||
| log.warning( | ||||||||||||
| "stats stream stopped unexpectedly (cid:{}): {!r}", | ||||||||||||
| short_cid, | ||||||||||||
| e, | ||||||||||||
| ) | ||||||||||||
| except (DockerError, aiohttp.ClientError) as e: | ||||||||||||
| log.debug( | ||||||||||||
| "stats stream ended (cid:{}): {!r}", | ||||||||||||
| short_cid, | ||||||||||||
| e, | ||||||||||||
| ) | ||||||||||||
| finally: | ||||||||||||
| self._latest.pop(container_id, None) | ||||||||||||
|
||||||||||||
| self._latest.pop(container_id, None) | |
| self._latest.pop(container_id, None) | |
| current_task = asyncio.current_task() | |
| if current_task is not None and self._tasks.get(container_id) is current_task: | |
| self._tasks.pop(container_id, None) |
Copilot
AI
Apr 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CPUPlugin creates its own DockerStatsStreamer (and its own Docker client). Since MemoryPlugin does the same, this will open two long-lived stats(stream=True) connections per container (one per plugin), which can negate the intended resource savings and put unnecessary load on dockerd for hosts with many containers. Consider sharing a single streamer (and ideally a single Docker client) across intrinsic plugins, e.g., via an agent-level/shared context or a module-level singleton keyed by the underlying Docker client.
Copilot
AI
Apr 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MemoryPlugin also instantiates its own DockerStatsStreamer, which (together with CPUPlugin’s streamer) results in duplicate per-container stats streams. To avoid 2× open connections per container, consider sharing a single streamer across intrinsic plugins (and/or reusing a shared Docker client).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DockerStatsStreamerintroduces new lifecycle/caching behavior (lazy task spawn, stream termination handling,close()cancellation), but the unit tests only mock it rather than exercising its real behavior. Adding focused tests forget_latest()/close()(and for stream termination cleaning up internal state like_tasks) would help prevent regressions and catch resource-leak scenarios early.