
Expose cgroups v2 pseudo-filesystem inside container (Ubuntu Noble) #384

@lodener

Description


Current behavior

We noticed that after switching our cf-deployment to the new Noble stemcell (cgroups v2), containers created through garden-runc-release's default containerd mode no longer appear to expose the container's cgroups under /sys/fs/cgroup, as they were exposed with the Jammy stemcell (cgroups v1). The per-container cgroups still exist on the host, and /proc/${PID}/cgroup inside the container points at the host-side path.
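
A quick way to confirm this from inside a container is to read /proc/self/cgroup and list /sys/fs/cgroup; the CgroupPath class below is a hypothetical illustration written for this report, not something the buildpack ships:

import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: print this process's cgroup membership as the
// kernel reports it. On a cgroups v2 host the file holds one line like
// "0::/<path>", where <path> is relative to the host's cgroup root and
// cannot be resolved inside the container when /sys/fs/cgroup is empty.
public class CgroupPath {
    public static void main(String[] args) throws Exception {
        Files.readAllLines(Path.of("/proc/self/cgroup")).forEach(System.out::println);

        // Show whether any cgroup hierarchy is actually visible here.
        try (var entries = Files.list(Path.of("/sys/fs/cgroup"))) {
            System.out.println("entries under /sys/fs/cgroup: " + entries.count());
        }
    }
}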

This matters for the java-buildpack, for instance: the JVM inspects the cgroups to determine its maximum memory/CPU, and on Noble it now falls back to the diego-cell host's total memory/CPU values, because the limits assigned to the instance through cgroups are not visible inside the container.

Using a simple RuntimeInfo class (sketched below) to show the difference with the java buildpack (4.77.0) on an 8-core/64 GB machine:
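
The original RuntimeInfo class isn't reproduced in this report; a minimal sketch that would emit the fields shown below, using only standard JDK APIs, could look like this (the cast to com.sun.management.OperatingSystemMXBean is needed for the physical-memory getters):

import java.lang.management.ManagementFactory;

// Minimal sketch of a RuntimeInfo class matching the output below.
public class RuntimeInfo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("availableProcessors: " + rt.availableProcessors());
        System.out.println("freeMemory:  " + human(rt.freeMemory()));
        System.out.println("maxMemory:   " + human(rt.maxMemory()));
        System.out.println("totalMemory: " + human(rt.totalMemory()));

        System.out.println("Checking OperatingSystemMXBean");
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        System.out.println("OperatingSystemMXBean.getAvailableProcessors: " + os.getAvailableProcessors());
        System.out.println("OperatingSystemMXBean.getTotalPhysicalMemorySize: " + human(os.getTotalPhysicalMemorySize()));
        System.out.println("OperatingSystemMXBean.getFreePhysicalMemorySize: " + human(os.getFreePhysicalMemorySize()));
        System.out.println("OperatingSystemMXBean.getTotalSwapSpaceSize: " + human(os.getTotalSwapSpaceSize()));
        System.out.println("OperatingSystemMXBean.getFreeSwapSpaceSize: " + human(os.getFreeSwapSpaceSize()));
        System.out.printf("OperatingSystemMXBean.getSystemCpuLoad: %f%n", os.getSystemCpuLoad());
    }

    // Format a byte count as B/KB/MB/GB/TB with one decimal.
    private static String human(long bytes) {
        if (bytes < 1024) return bytes + " B";
        String[] units = {"KB", "MB", "GB", "TB"};
        int i = -1;
        double v = bytes;
        while (v >= 1024 && i < units.length - 1) { v /= 1024; i++; }
        return String.format("%.1f %s", v, units[i]);
    }
}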

Jammy

$ mount|grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755,uid=4294967294,gid=4294967294,inode64)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/misc type cgroup (ro,nosuid,nodev,noexec,relatime,misc)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
$ ls /sys/fs/cgroup/
blkio  cpu  cpuacct  cpu,cpuacct  cpuset  devices  freezer  hugetlb  memory  misc  net_cls  net_cls,net_prio  net_prio  perf_event  pids  rdma  systemd
$ /home/vcap/app/.java-buildpack/open_jdk_jre/bin/java RuntimeInfo
availableProcessors: 8
freeMemory:  15.9 MB
maxMemory:   255.3 MB
totalMemory: 17.4 MB
Checking OperatingSystemMXBean
OperatingSystemMXBean.getAvailableProcessors: 8
OperatingSystemMXBean.getTotalPhysicalMemorySize: 1.0 GB
OperatingSystemMXBean.getFreePhysicalMemorySize: 652.0 MB
OperatingSystemMXBean.getTotalSwapSpaceSize: 0 B
OperatingSystemMXBean.getFreeSwapSpaceSize: 0 B
OperatingSystemMXBean.getSystemCpuLoad: 0.000000

Noble

$ mount|grep cgroup
$ ls /sys/fs/cgroup/
$ /home/vcap/app/.java-buildpack/open_jdk_jre/bin/java RuntimeInfo
availableProcessors: 8
freeMemory:  1014.2 MB
maxMemory:   15.7 GB
totalMemory: 1.0 GB
Checking OperatingSystemMXBean
OperatingSystemMXBean.getAvailableProcessors: 8
OperatingSystemMXBean.getTotalPhysicalMemorySize: 62.8 GB
OperatingSystemMXBean.getFreePhysicalMemorySize: 10.0 GB
OperatingSystemMXBean.getTotalSwapSpaceSize: 62.8 GB
OperatingSystemMXBean.getFreeSwapSpaceSize: 62.3 GB
OperatingSystemMXBean.getSystemCpuLoad: 0.000000

Regarding CPU matching the host: "Note: In versions 18.0.2+, 17.0.5+ and 11.0.17+, OpenJDK will no longer take CPU shares settings into account for its calculation of available CPU cores. See JDK-8281181 for details."
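
(As an aside: on JDK 11 and later, running java -Xlog:os+container=trace -version inside the container logs which container limits the JVM detects at startup, which should confirm whether it is reading any cgroup data at all.)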

The containerd docs seem to recommend using the systemd cgroup driver (SystemdCgroup = true), but manually adding this to containerd.toml and restarting garden doesn't appear to change the behaviour.
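
For reference, the stanza the containerd docs show lives under the CRI plugin's runc runtime options; whether garden's containerd.toml honours the same section is an assumption on our side, which may explain why setting it had no effect:

# From the containerd CRI docs; garden-runc may configure its runtimes
# under a different plugin section, so this stanza may not apply as-is.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true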

First reported through Slack.

Desired behavior

Expose the cgroups pseudo-filesystem at /sys/fs/cgroup inside the container itself, ideally as an optional/configurable behaviour.
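
For comparison, with the unified hierarchy mounted inside the container one would expect mount output roughly like the following (illustrative, not captured from a real cell):

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)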

Affected Version

1.75.0
