fix(BA-5519): read container net stats from /proc/[pid]/net/dev with netns fallback#10696

Open
fregataa wants to merge 10 commits into main from BA-5519

Member

@fregataa fregataa commented Mar 31, 2026

Summary

  • Replace netstat_ns() + setns() with direct /proc/[container_pid]/net/dev reading for container network stats in cgroup mode
  • Root cause: agent runs as daemon subprocess → ProcessPoolExecutor unavailable → thread pool fallback → setns() changes thread namespace but psutil reads /proc/self/net/dev (process-level = host namespace)
  • Remove dead code: netstat_ns(), netstat_ns_work(), unused imports (multiprocessing, ProcessPoolExecutor, nsenter)
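
To make the first bullet concrete, here is a minimal sketch of summing byte counters from the `/proc/[pid]/net/dev` format. It is not the PR's actual `read_proc_net_dev()`: the names `NetDevTotals`, `parse_net_dev`, and `read_net_dev_for_pid` are hypothetical, and the sample text is made up. It assumes the standard layout (two header lines, then one line per interface with 8 receive counters followed by 8 transmit counters) and skips `lo`, as the PR's tests describe.

```python
from pathlib import Path
from typing import NamedTuple


class NetDevTotals(NamedTuple):  # hypothetical name, for illustration only
    rx_bytes: int
    tx_bytes: int


def parse_net_dev(text: str) -> NetDevTotals:
    """Sum rx/tx byte counters over all non-loopback interfaces."""
    rx_total = 0
    tx_total = 0
    # The first two lines are column headers.
    for line in text.splitlines()[2:]:
        name, sep, counters = line.partition(":")
        if not sep or name.strip() == "lo":
            continue  # not an interface line, or loopback
        fields = counters.split()
        if len(fields) < 9:
            continue  # malformed line: no transmit byte counter
        rx_total += int(fields[0])  # first receive column: bytes
        tx_total += int(fields[8])  # first transmit column: bytes
    return NetDevTotals(rx_total, tx_total)


def read_net_dev_for_pid(pid: int) -> NetDevTotals:
    # Assumption: pid is a live process inside the container's network
    # namespace; a missing /proc entry raises OSError to the caller.
    return parse_net_dev(Path(f"/proc/{pid}/net/dev").read_text())


# Made-up sample in the kernel's layout.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:   50000     100    0    0    0     0          0         0    80000     200    0    0    0     0       0          0
"""

totals = parse_net_dev(SAMPLE)
```

Because the path names the container's PID explicitly, the read does not depend on which namespace the calling thread is in.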

Fixes BA-5519

Test plan

  • Verify container net_rx/net_tx values match docker stats output
  • Verify /proc/[container_pid]/net/dev shows only container interfaces (lo, eth0)
  • Verify PID=0 case (stopped container) returns 0 with warning log
  • Run existing unit tests for agent intrinsic plugins

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings March 31, 2026 15:20
@github-actions github-actions bot added size:L 100~500 LoC comp:agent Related to Agent component labels Mar 31, 2026
@fregataa fregataa marked this pull request as draft March 31, 2026 15:23
fregataa added a commit that referenced this pull request Mar 31, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the agent’s container network-stat collection in CGROUP mode to avoid unreliable per-thread namespace switching by reading /proc/<container_pid>/net/dev directly.

Changes:

  • Replace netstat_ns()/setns()/psutil-based namespace reads with a /proc/<pid>/net/dev parser (read_proc_net_dev()).
  • Update MemoryPlugin CGROUP-mode net stat collection to use container PID from container.show()["State"]["Pid"].
  • Refactor unit tests to target PID-based net stat collection and remove the old namespace-switching test coverage.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File | Description
src/ai/backend/agent/docker/intrinsic.py | Introduces /proc/<pid>/net/dev parsing and switches MemoryPlugin CGROUP net stats to use container PID instead of SandboxKey/netns.
tests/unit/agent/test_docker_intrinsic.py | Updates tests for PID-based net stats, adds PID validation cases, and removes netns work tests (but still contains outdated netns-based fixtures/tests that need updating).
Comments suppressed due to low confidence (1)

tests/unit/agent/test_docker_intrinsic.py:196

  • memory_cgroup_context mocks current_loop().run_in_executor to always return 0, but sysfs_impl() now awaits run_in_executor(..., read_proc_net_dev, pid) and unpacks a (rx, tx) tuple. With the current mock this will raise during tuple-unpacking. Adjust the mock to call the provided function (fn(*args)) or return a (0, 0) tuple for read_proc_net_dev while still returning an int for get_scratch_size.
            patch(
                "ai.backend.agent.docker.intrinsic.read_proc_net_dev",
                return_value=(0, 0),
            ),
            patch(
                "ai.backend.agent.docker.intrinsic.current_loop",
            ) as mock_loop,
        ):
            mock_container_instance = AsyncMock()
            mock_container_instance.show.return_value = mock_container_data
            mock_container_cls.return_value = mock_container_instance
            mock_loop.return_value.run_in_executor = AsyncMock(return_value=0)
            yield ctx

@fregataa fregataa added this to the 25.15 milestone Mar 31, 2026
@fregataa fregataa changed the title from "fix(agent): read /proc/[pid]/net/dev for container net stats instead of setns+psutil" to "fix(BA-5519): read /proc/[pid]/net/dev for container net stats instead of setns+psutil" Mar 31, 2026
fregataa added a commit that referenced this pull request Mar 31, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fregataa added a commit that referenced this pull request Apr 1, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@fregataa fregataa changed the title from "fix(BA-5519): read /proc/[pid]/net/dev for container net stats instead of setns+psutil" to "fix(BA-5519): read container net stats from /proc/[pid]/net/dev with netns fallback" Apr 1, 2026
fregataa added a commit that referenced this pull request Apr 1, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@fregataa fregataa marked this pull request as ready for review April 1, 2026 08:02
@fregataa fregataa requested a review from a team April 1, 2026 08:02
Contributor

@seedspirit seedspirit left a comment

The logic seems fine, but I think we'll need @HyeockJinKim's review as well.

Comment on lines +583 to +618
    def test_parses_standard_format(self, tmp_path: Path) -> None:
        """Parses standard /proc/net/dev format: skips headers, excludes lo,
        returns correct rx/tx sums."""
        net_dev = tmp_path / "net_dev"
        net_dev.write_text(self.SAMPLE_NET_DEV)
        with patch(
            "ai.backend.agent.docker.intrinsic.Path",
            return_value=net_dev,
        ):
            result = read_proc_net_dev(42)
        assert result.rx_bytes == 50000
        assert result.tx_bytes == 80000

    def test_sums_multiple_interfaces(self, tmp_path: Path) -> None:
        """Sums rx/tx bytes across all non-loopback interfaces."""
        net_dev = tmp_path / "net_dev"
        net_dev.write_text(self.MULTI_IFACE_NET_DEV)
        with patch(
            "ai.backend.agent.docker.intrinsic.Path",
            return_value=net_dev,
        ):
            result = read_proc_net_dev(42)
        assert result.rx_bytes == 40000  # 10000 + 30000
        assert result.tx_bytes == 60000  # 20000 + 40000

    def test_loopback_only_returns_zero(self, tmp_path: Path) -> None:
        """When only loopback is present, returns (0, 0)."""
        net_dev = tmp_path / "net_dev"
        net_dev.write_text(self.LO_ONLY_NET_DEV)
        with patch(
            "ai.backend.agent.docker.intrinsic.Path",
            return_value=net_dev,
        ):
            result = read_proc_net_dev(42)
        assert result.rx_bytes == 0
        assert result.tx_bytes == 0
Contributor

I think these three tests can be generalized into one with pytest.mark.parametrize.
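
A rough sketch of what that could look like, assuming pytest is available. To stay self-contained it parses text directly through a hypothetical stand-in (`sum_non_loopback`) rather than patching `Path` and calling the PR's `read_proc_net_dev`; the helper `iface`, the `HEADER` constant, and all sample values are made up for illustration.

```python
import pytest


def sum_non_loopback(text: str) -> tuple[int, int]:
    """Hypothetical stand-in for the function under test."""
    rx = tx = 0
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, _, counters = line.partition(":")
        fields = counters.split()
        if name.strip() == "lo" or len(fields) < 9:
            continue
        rx += int(fields[0])
        tx += int(fields[8])
    return rx, tx


def iface(name: str, rx: int, tx: int) -> str:
    """Build one interface line with only the byte counters non-zero."""
    return f"{name}: {rx} 0 0 0 0 0 0 0 {tx} 0 0 0 0 0 0 0\n"


HEADER = (
    "Inter-|   Receive                        |  Transmit\n"
    " face |bytes packets errs drop ...       |bytes packets errs drop ...\n"
)

# One (input text, expected totals) pair per former test method.
CASES = [
    (HEADER + iface("lo", 1000, 1000) + iface("eth0", 50000, 80000), (50000, 80000)),
    (
        HEADER
        + iface("lo", 1000, 1000)
        + iface("eth0", 10000, 20000)
        + iface("eth1", 30000, 40000),
        (40000, 60000),
    ),
    (HEADER + iface("lo", 1000, 1000), (0, 0)),
]
IDS = ["standard-format", "multiple-interfaces", "loopback-only"]


@pytest.mark.parametrize(("text", "expected"), CASES, ids=IDS)
def test_net_dev_totals(text: str, expected: tuple[int, int]) -> None:
    assert sum_non_loopback(text) == expected
```

With the real tests, each case would carry the fixture text (for example `SAMPLE_NET_DEV`) and the expected `rx_bytes`/`tx_bytes`, and the shared body would keep the existing `tmp_path` and `patch` setup.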

fregataa and others added 9 commits April 2, 2026 11:10
…d of using setns

Replace netstat_ns()/netstat_ns_work() namespace-switching approach with
direct /proc/[container_pid]/net/dev reads. The old approach was unreliable
because setns() only changes the calling thread's namespace, but
psutil.net_io_counters() reads /proc/self/net/dev which reflects the
process-level (host) namespace, causing inflated network statistics.

The new read_proc_net_dev() function reads the container's /proc entry
directly using the container PID from Docker inspect (State.Pid), which
works correctly from any thread without namespace switching.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…at_ns

- Remove TestNetstatNsWork class (tested dead code that used setns)
- Replace TestMemoryPluginNamespaceValidation with
  TestMemoryPluginContainerPidValidation (PID=0, valid PID, OSError cases)
- Update _SysfsMocks to use State.Pid + read_proc_net_dev instead of
  SandboxKey + netstat_ns
- Add TestReadProcNetDev unit tests: standard format parsing, multi-interface
  sum, loopback exclusion, nonexistent PID error
- Update run_in_executor mocks to delegate to actual functions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d-self/net/dev

When container PID is unavailable (PID=0), fall back to SandboxKey +
setns() + /proc/thread-self/net/dev instead of skipping net stats.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
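
The fallback described in the commit above can be pictured as a small decision function. This is a sketch under stated assumptions, not the PR's code: `NetDevSource` and `choose_net_dev_source` are hypothetical names, and returning `None` when neither a PID nor a SandboxKey is available is an assumption about a case the commit message does not cover.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class NetDevSource:  # hypothetical type, for illustration only
    path: str                       # file to read the counters from
    needs_setns: bool               # whether the reading thread must enter a netns first
    netns_path: str | None = None   # namespace handle to enter (Docker's SandboxKey)


def choose_net_dev_source(
    container_pid: int, sandbox_key: str | None
) -> NetDevSource | None:
    """Pick where to read container net stats from."""
    if container_pid > 0:
        # Preferred path: the container's own /proc entry, no namespace switch.
        return NetDevSource(f"/proc/{container_pid}/net/dev", needs_setns=False)
    if sandbox_key:
        # Fallback: after setns() on the current thread, /proc/thread-self
        # reflects that thread's namespace, unlike /proc/self.
        return NetDevSource(
            "/proc/thread-self/net/dev", needs_setns=True, netns_path=sandbox_key
        )
    # Assumption: with neither a PID nor a SandboxKey there is nothing to read.
    return None


# Made-up SandboxKey value.
fallback = choose_net_dev_source(0, "/var/run/docker/netns/abc123")
```

Keeping the choice separate from the I/O makes the PID=0 branch easy to unit test without touching `/proc` or calling `setns()`.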
… type

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ainer PID

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@fregataa fregataa requested a review from seedspirit April 2, 2026 02:11
@fregataa fregataa requested a review from a team April 2, 2026 02:11
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Labels

comp:agent Related to Agent component size:L 100~500 LoC

3 participants