refactor(BA-5860): Default to sysfs-first CPU/memory stats on native Linux#11223
refactor(BA-5860): Default to sysfs-first CPU/memory stats on native Linux#11223rapsealk wants to merge 5 commits into
Conversation
On Linux native hosts, the sysfs (cgroup) stat collection path is strictly faster (<10 ms) than the Docker API path (100-500 ms), yet the previous default selected the slow path for most cases. - Change the `container.stats-type` default from `docker` to unset so that the agent auto-selects the fastest available path (`cgroup` on native Linux, `docker` elsewhere). - Wrap the cgroup CPU and memory readers with a per-read Docker API fallback: when a sysfs read returns None (missing path or unreadable), emit a rate-limited warning and fall back to the Docker API instead of silently dropping the sample. - Force the Docker API path on linuxkit / Docker Desktop hosts even when `StatModes.CGROUP` is selected, since cgroup paths are not reliably accessible there. Network and block I/O stats continue to use the Docker API path. `StatModes.DOCKER` remains available as an explicit config override. Closes #11220 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR changes the agent’s default container stats collection strategy so that native Linux hosts prefer the faster cgroup/sysfs path for CPU/memory, while keeping Docker API usage for linuxkit/Docker Desktop and as a runtime fallback when cgroup reads fail.
Changes:
- Switch
stats_typedefault fromdockerto “auto” (unset/None) soStatContextselects the preferred mode per host. - Add per-container cgroup→Docker-API fallback behavior (and one-time warning) for CPU and memory stats collection; force API on linuxkit.
- Add unit tests covering CPU fallback-on-failure and linuxkit forcing API.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
tests/unit/agent/test_docker_intrinsic.py |
Adds CPUPlugin unit tests for cgroup sysfs failure fallback and linuxkit forcing API. |
src/ai/backend/agent/docker/intrinsic.py |
Implements cgroup-first fallback logic for CPU/Memory and linuxkit forcing API; adds warning suppression helper. |
src/ai/backend/agent/config/unified.py |
Changes stats_type default to None (auto-select) and updates help text/validator typing. |
changes/11220.enhance.md |
Adds changelog entry for the new default behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Tracks containers for which we have already logged a cgroup->Docker-API | ||
| # fallback warning, to avoid log spam on persistent read failures. | ||
| _cgroup_fallback_warned: set[str] = set() | ||
|
|
||
|
|
||
| def _warn_cgroup_fallback_once(plugin: str, container_id: str) -> None: | ||
| key = f"{plugin}:{container_id}" | ||
| if key in _cgroup_fallback_warned: | ||
| return | ||
| _cgroup_fallback_warned.add(key) | ||
| log.warning( | ||
| "{0}: cgroup sysfs read failed for container {1}; falling back to Docker API", | ||
| plugin, | ||
| container_id[:7], | ||
| ) |
There was a problem hiding this comment.
_cgroup_fallback_warned is a process-global set that only grows and is never pruned. On agents with high container churn this can become an unbounded memory sink over time. Consider bounding it (e.g., LRU/TTL cache) or clearing entries when containers are removed / after some time window.
| def _warn_cgroup_fallback_once(plugin: str, container_id: str) -> None: | ||
| key = f"{plugin}:{container_id}" | ||
| if key in _cgroup_fallback_warned: | ||
| return | ||
| _cgroup_fallback_warned.add(key) | ||
| log.warning( | ||
| "{0}: cgroup sysfs read failed for container {1}; falling back to Docker API", | ||
| plugin, | ||
| container_id[:7], | ||
| ) | ||
|
|
There was a problem hiding this comment.
_warn_cgroup_fallback_once() is meant to prevent log spam on persistent sysfs failures, but sysfs_impl() already logs a warning on every OSError before returning None. With the new fallback, persistent failures will still spam the existing sysfs warning (and add one more warning once). Consider moving the per-container warning solely to the fallback layer (or otherwise suppressing/reducing repeated sysfs warnings) so the anti-spam behavior is effective.
| "Method for collecting container resource statistics. " | ||
| "'docker' uses Docker's stats API (most compatible). " | ||
| "'cgroup' reads from cgroup filesystem directly (more accurate, requires root). " | ||
| "'null' disables statistics collection." | ||
| "'cgroup' reads from cgroup filesystem directly (faster and more accurate, " | ||
| "requires root on Linux). " | ||
| "When unset, the agent auto-selects: cgroup on native Linux hosts " | ||
| "(with Docker API fallback on read failure) and docker elsewhere." |
There was a problem hiding this comment.
The PR description says there are no changes to network/IO stat collection, but switching the default stats_type to auto-select cgroup on native Linux will cause MemoryPlugin.gather_container_measures() to use its sysfs implementation by default (which collects io/net via cgroup + /proc), rather than Docker stats API. If the intent is to keep network/IO coming from the Docker stats API by default, the mode selection likely needs to be split (CPU/memory via cgroup, net/IO via API) or the PR description should be updated to reflect the behavior change.
| async def cgroup_first_impl( | ||
| container_id: str, | ||
| ) -> tuple[int, int, int, int, int, int, int] | None: | ||
| result = await sysfs_impl(container_id) | ||
| if result is None: | ||
| _warn_cgroup_fallback_once("MemoryPlugin", container_id) | ||
| return await api_impl(container_id) | ||
| return result | ||
|
|
||
| match ctx.mode: | ||
| case StatModes.CGROUP if not _is_linuxkit(self.local_config): | ||
| impl = cgroup_first_impl | ||
| case StatModes.CGROUP | StatModes.DOCKER: | ||
| impl = api_impl | ||
| case _: | ||
| raise RuntimeError("should not reach here") |
There was a problem hiding this comment.
MemoryPlugin now has a cgroup-first fallback-to-API path (and linuxkit forcing API) similar to CPUPlugin, but the new behavior isn't covered by unit tests (unlike CPU where new tests were added). Please add tests that (1) verify MemoryPlugin falls back to fetch_api_stats() when sysfs_impl() returns None and (2) verify linuxkit forces API even when ctx.mode is CGROUP.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Extend changelog to call out the memory metric shift (sysfs
excludes inactive file cache; dashboards may step-down).
- Bound the _cgroup_fallback_warned set to prevent unbounded growth
on hosts with high container churn.
- Replace RuntimeError("should not reach here") with assert_never()
to satisfy the BackendAIError-only rule and strengthen exhaustive
match-statement type checking.
- Add parallel MemoryPlugin fallback + linuxkit tests and a
dedup-warn test.
Refs #11220
Refs #11223
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Review summary from an independent pass + resolutions pushed to this branch (commit Verdict: ship with changes — addressed.
Cgroup v1/v2 both handled (pre-existing). Network/IO still goes through Docker API per spec. Selection via |
- Extend changes/11223.enhance.md with a short note that block-I/O readings may also step down on cgroup v1 hosts (the sysfs path reads blkio.throttle.io_service_bytes; the API path reads io_service_bytes_recursive). - Drop TestWarnCgroupFallbackOnce::test_evicts_beyond_limit — it was re-verifying cachetools.LRUCache eviction, not project logic. test_deduplicates_per_container already covers the project contract. Refs #11220 Refs #11223 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add TODO(#11223) notes on CPU/Memory api_impl paths: after sysfs-first lands, the Docker stream on these plugins is only needed for network and blkio (which neither consumes), so the stream should migrate to a network/IO consumer. - Add TODO(#11232) at the two per-plugin DockerStatsStreamer instantiation sites flagging the shared-streamer refactor. - Add a test covering the DockerError 404 branch of _read_stream: reader exits cleanly on container-gone without spinning retries. Refs #11219 Refs #11223 Refs #11224 Refs #11232 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #11220 (BA-5860)
Refs #11216
Summary
StatModes.CGROUPon native Linux (sysfs path, ~100-500 ms → <10 ms per sample).StatModes.DOCKERas the forced path on linuxkit / Docker Desktop and as a per-read fallback when a cgroup sysfs read fails.(plugin, container_id)via a boundedcachetools.LRUCacheso a systemic sysfs failure doesn't storm logs.raise RuntimeError("should not reach here")in exhaustivematchdefaults withtyping.assert_never(satisfies the agent rule banning built-in exceptions in business logic + strengthens mypy exhaustiveness).Operator note — possible metric step-down on upgrade
Hosts previously running
stats-type: dockermay see:docker statsCLI semantics).blkio.throttle.io_service_byteswhile the Docker API path readsblkio_stats.io_service_bytes_recursive.Dashboards and autoscaling thresholds tuned to the old values should be re-evaluated.
Why
Agent-runtime performance epic #11216; sysfs is an order of magnitude faster for CPU/memory stats than the Docker API round-trip.
Test plan
pants test tests/unit/agent::passes (CPU + Memory plugin fallback, linuxkit guard, warn dedup)