Random Freezes and Python Zombie Processes with cgroup v2 Setup #1264

@bahnwaerter

Description

When run with cgroups v2, the current version of BenchExec (3.35-dev, commit 5f507cc) freezes sporadically, and the freezes are difficult to reproduce. During a freeze, multiple Python zombie processes appear and the benchmark run makes no further progress. The cgroups v2 setup appears to be correct and the subprocess handling in BenchExec looks sound, so the underlying cause remains unclear.

Observations

  • BenchExec freezes sporadically and unpredictably, with no clear pattern and no reliable way to reproduce it. A freeze can occur after a few benchmark tasks or only after thousands.
  • When a freeze occurs, several Python zombie processes are observed, coinciding with the halt in progress:
    [...]
    
    115295 root      20   0       0      0      0 Z   0.0   0.0   0:00.00 [python3] <defunct>
    115392 root      20   0       0      0      0 Z   0.0   0.0   0:00.00 [python3] <defunct>
    115393 root      20   0       0      0      0 Z   0.0   0.0   0:00.00 [python3] <defunct>
    138277 root      20   0       0      0      0 Z   0.0   0.0   0:00.00 [python3] <defunct>
    
    [...]
    
  • BenchExec uses os.wait4(...) and similar calls for subprocess handling, and these appear to work correctly. It is therefore unclear how or why the zombie processes persist.
  • The cgroup v2 setup is properly configured and usable by BenchExec. The cleanup mechanisms (e.g., process termination in OOM scenarios) appear to operate as intended:
    [...]
    
    [244384.472902] Memory cgroup out of memory: Killed process 2316616 (java) total-vm:10681332kB, anon-rss:7773952kB, file-rss:18656kB, shmem-rss:0kB, UID:0 pgtables:15772kB oom_score_adj:0
    [244384.473313] Tasks in /lxc/802/ns/user.slice/user-0.slice/user@0.service/benchexec.slice/run-p115037-i6497282.scope/benchmark_0n5o4jto/delegate_b6gc73pn are going to be killed due to memory.oom.group set
    [244384.473333] Memory cgroup out of memory: Killed process 2316616 (java) total-vm:10681332kB, anon-rss:7773952kB, file-rss:18656kB, shmem-rss:0kB, UID:0 pgtables:15772kB oom_score_adj:0
    
    [...]
    
  • Running BenchExec with the --debug flag yields no additional warnings or errors.
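
To narrow down which process is failing to reap the zombies, something like the following could be run during a freeze. It is only a sketch, not part of BenchExec: the helper names parse_stat and find_zombies are made up for illustration, and it assumes /proc is readable inside the container. It reads /proc/<pid>/stat for every process and reports those in state Z together with their parent PID, since the parent is the process that would have to call wait4 for them.

```python
from pathlib import Path


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split at the LAST closing parenthesis.
    head, _, tail = text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = tail.split()
    return int(pid_str), comm, fields[0], int(fields[1])


def find_zombies(proc_root="/proc"):
    """Return a sorted list of (pid, comm, ppid) for all zombie processes."""
    zombies = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/self
        try:
            pid, comm, state, ppid = parse_stat((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # process vanished or stat was unreadable
        if state == "Z":
            zombies.append((pid, comm, ppid))
    return sorted(zombies)


if __name__ == "__main__":
    for pid, comm, ppid in find_zombies():
        print(f"zombie pid={pid} comm={comm} ppid={ppid}")
```

If all zombies share one parent PID, that parent (or the thread in it that is supposed to wait) would be the place to look next, for example with py-spy or gdb.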

Environment

  • BenchExec runs inside an LXC container (Debian Trixie) on a Proxmox 9 (Debian Trixie) host
  • Host has an AMD Ryzen Threadripper 3970X CPU with 128 GB of RAM allocated to the LXC container
  • Linux kernel version: 6.17.13-2-pve
  • cgroup v2 is enabled with nesting
  • systemd version: 257.9
  • Python version: 3.13.5

Cause

So far, it is unclear whether the underlying issue lies in the Linux kernel, LXC, or cgroup v2 layer, or even in the BenchExec implementation itself (e.g., due to a possible race condition in the subprocess handling or any other bugs).

Are there any known similar issues with BenchExec using cgroups v2 setups?
What additional debugging steps can be taken?
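
One debugging step that might help the next time a freeze occurs is to snapshot the state of the affected benchmark cgroup. The sketch below is an assumption about what could be useful, not something BenchExec provides: read_flat_keyed and snapshot_cgroup are hypothetical helpers, and the cgroup directory has to be passed in by hand. It reads the standard cgroup v2 files cgroup.procs, cgroup.events (populated, frozen) and memory.events (oom, oom_kill, ...), which would show whether the cgroup still contains processes, whether it is frozen, and whether an OOM kill was recorded.

```python
import sys
from pathlib import Path


def read_flat_keyed(path):
    """Parse a cgroup v2 flat-keyed file ("key value" per line) into a dict."""
    result = {}
    for line in Path(path).read_text().splitlines():
        key, _, value = line.partition(" ")
        if key:
            result[key] = int(value)
    return result


def snapshot_cgroup(cgroup_dir):
    """Collect PIDs and event counters of one cgroup v2 directory."""
    cgroup_dir = Path(cgroup_dir)
    snapshot = {}
    procs = cgroup_dir / "cgroup.procs"
    if procs.exists():
        # One PID per line; an empty file means no processes are left.
        snapshot["pids"] = [int(p) for p in procs.read_text().split()]
    for name in ("cgroup.events", "memory.events"):
        path = cgroup_dir / name
        if path.exists():
            snapshot[name] = read_flat_keyed(path)
    return snapshot


if __name__ == "__main__":
    # Usage: python3 snapshot.py <path to the cgroup directory>
    for key, value in snapshot_cgroup(sys.argv[1]).items():
        print(f"{key}: {value}")
```

Comparing the PIDs in cgroup.procs with the zombie PIDs might show whether the zombies are still accounted to the benchmark cgroup or have already left it.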
