cgroup: preserve cgroup2 superblock options on dump #3036

Open
rocker-zhang wants to merge 1 commit into checkpoint-restore:criu-dev from rocker-zhang:cgroup2-preserve-sb-opts

@rocker-zhang

Contributor

criu reconfigures the host cgroup2 mount during checkpoint, dropping its superblock options.

When collecting cgroups for dump, __new_open_cgroupfs() opens the unified cgroup2 hierarchy with fsopen("cgroup2") and then cr_fsconfig(FSCONFIG_CMD_CREATE) (or, without fsopen support, mount("none", ..., "cgroup2", 0, NULL)), neither of which carries any superblock option. cgroup2 has a single superblock shared by every mount, so this reconfigures the live host mount and drops options criu did not set, such as nsdelegate, memory_recursiveprot and memory_hugetlb_accounting. The detached-mount teardown only removes criu's own mount instance, so the host cgroup2 is left altered after a checkpoint (#3029).

The fix reads the options of the existing cgroup2 mount from /proc/self/mountinfo and replays the known superblock flags before FSCONFIG_CMD_CREATE (and passes them as mount data in the fsopen-less path), leaving the host mount intact.
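As a rough illustration of the read-and-filter step described above, the following is a minimal sketch and not the code from this PR. The helper names `cg2_super_opts` and `cg2_filter_sb_flags` are hypothetical, and the allowlist is assumed to hold only the three flags named in this description. It takes one mountinfo line, extracts the super options field, and keeps only the allowlisted flags as a comma-separated string of the kind that could be passed as mount data.

```c
#include <string.h>

/* Assumed allowlist: only the three flags named in the PR description. */
static const char *const cg2_sb_flags[] = {
	"nsdelegate",
	"memory_recursiveprot",
	"memory_hugetlb_accounting",
};

/*
 * Return the super options of a cgroup2 mountinfo line, i.e. the field
 * that follows " - cgroup2 <source> ", or NULL if the line is not cgroup2.
 */
static const char *cg2_super_opts(const char *line)
{
	const char *sep = strstr(line, " - cgroup2 ");
	const char *src_end;

	if (!sep)
		return NULL;
	/* Skip the separator and fstype, then the mount source field. */
	src_end = strchr(sep + strlen(" - cgroup2 "), ' ');
	return src_end ? src_end + 1 : NULL;
}

/*
 * Copy the allowlisted flags found in opts into out as a comma-separated
 * string. Returns the number of flags kept, or -1 if out is too small.
 */
static int cg2_filter_sb_flags(const char *opts, char *out, size_t outsz)
{
	size_t used = 0;
	int kept = 0;

	if (outsz == 0)
		return -1;
	out[0] = '\0';

	while (*opts) {
		size_t len = strcspn(opts, ",\n");
		size_t i;

		for (i = 0; i < sizeof(cg2_sb_flags) / sizeof(cg2_sb_flags[0]); i++) {
			const char *flag = cg2_sb_flags[i];

			if (strlen(flag) != len || strncmp(opts, flag, len))
				continue;
			/* Room for an optional comma, the flag and the NUL. */
			if (used + (kept ? 1 : 0) + len + 1 > outsz)
				return -1;
			if (kept)
				out[used++] = ',';
			memcpy(out + used, flag, len);
			used += len;
			out[used] = '\0';
			kept++;
			break;
		}
		opts += len;
		if (*opts == ',')
			opts++;
		else
			break; /* end of string or trailing newline */
	}
	return kept;
}
```

Options outside the allowlist (for example `rw` or `seclabel`) are deliberately ignored in this sketch, since they are not cgroup2 superblock flags that criu would need to replay.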

I reproduced this in a throwaway VM (its own kernel, so the mutated superblock is the VM's): the current sequence strips nsdelegate,memory_recursiveprot from an existing cgroup2 mount, and the replay preserves them. Builds on x86_64 and aarch64.

Two things I would like direction on. The fix replays an allowlist of known superblock flags; an alternative is to clone the existing mount with open_tree(OPEN_TREE_CLONE) and avoid reconfiguring the superblock at all, which is a larger change. Separately, preserving these options across checkpoint/restore (rather than only avoiding the host mutation) would need a new field in cg_controller_entry, which is an image-format change I have left out of this PR.

When collecting cgroups for dump, criu opens the unified cgroup2
hierarchy with fsopen("cgroup2") followed by FSCONFIG_CMD_CREATE (and,
without fsopen support, with mount("none", ..., "cgroup2", 0, NULL)).
Neither carries any superblock option. Because cgroup2 has a single
superblock shared by every mount, this reconfigures the live host mount
and drops options that criu did not set, such as nsdelegate,
memory_recursiveprot and memory_hugetlb_accounting. The detached-mount
teardown only removes criu's own mount instance and does not restore the
superblock, so the host cgroup2 mount is left altered after a checkpoint.

Read the options of the existing cgroup2 mount from /proc/self/mountinfo
and replay the known superblock flags before FSCONFIG_CMD_CREATE (and
pass them as mount data in the fsopen-less path), so the shared
superblock keeps its options and the host mount is left intact.

Addresses the host-side mutation reported in checkpoint-restore#3029. Preserving these
options across checkpoint/restore is a separate change.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rocker Zhang <zhang.rocker.liyuan@gmail.com>
@codecov-commenter


Codecov Report

❌ Patch coverage is 82.35294% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.10%. Comparing base (fc29bfe) to head (ba6eead).

Files with missing lines    Patch %    Lines
criu/cgroup.c               82.35%     6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           criu-dev    #3036      +/-   ##
============================================
+ Coverage     57.04%   57.10%   +0.06%     
============================================
  Files           154      154              
  Lines         40534    40566      +32     
  Branches       8882     8892      +10     
============================================
+ Hits          23123    23167      +44     
+ Misses        17057    17045      -12     
  Partials        354      354              

avagin requested a review from Snorch on June 10, 2026 14:47
@avagin

avagin commented Jun 10, 2026

Member

@Snorch could you please review this PR?

@Snorch

Snorch commented Jun 15, 2026

Member

The problem reproduces:

cat /proc/self/mountinfo  | grep cgroup2
48 46 0:30 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime shared:7 - cgroup2 cgroup2 rw,seclabel,nsdelegate,memory_recursiveprot,memory_hugetlb_accounting

python3 test/zdtm.py run -t zdtm/static/cgroup00 --ignore-taint

cat /proc/self/mountinfo  | grep cgroup2
48 46 0:30 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime shared:7 - cgroup2 cgroup2 rw,seclabel
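The before/after comparison above can also be checked mechanically. The following is a small sketch, not part of criu or its test suite. The helper names `opt_list_has` and `dropped_opts` are hypothetical. It reports which options are present in the first super options string but missing from the second.

```c
#include <string.h>

/* Return 1 if the comma-separated list opts contains the token tok[0..len). */
static int opt_list_has(const char *opts, const char *tok, size_t len)
{
	while (*opts) {
		size_t n = strcspn(opts, ",");

		if (n == len && !strncmp(opts, tok, len))
			return 1;
		opts += n;
		if (*opts == ',')
			opts++;
	}
	return 0;
}

/*
 * Write into out every option present in before but missing from after,
 * comma-separated. Returns the number of dropped options, or -1 if out
 * is too small.
 */
static int dropped_opts(const char *before, const char *after,
			char *out, size_t outsz)
{
	size_t used = 0;
	int dropped = 0;

	if (outsz == 0)
		return -1;
	out[0] = '\0';

	while (*before) {
		size_t n = strcspn(before, ",");

		if (n && !opt_list_has(after, before, n)) {
			/* Room for an optional comma, the option and the NUL. */
			if (used + (dropped ? 1 : 0) + n + 1 > outsz)
				return -1;
			if (dropped)
				out[used++] = ',';
			memcpy(out + used, before, n);
			used += n;
			out[used] = '\0';
			dropped++;
		}
		before += n;
		if (*before == ',')
			before++;
	}
	return dropped;
}
```

Fed the two super options fields shown above, the sketch reports the three options that disappeared after the test run: `nsdelegate`, `memory_recursiveprot` and `memory_hugetlb_accounting`.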

@Snorch

Snorch commented Jun 15, 2026

Member

Parsing mountinfo is generally slow (especially the host's mountinfo). Preferably we want to avoid doing this one more time (in an ideal world we would be switching to listmount+statmount everywhere).

What about an alternative approach:

int fd = open("/sys/fs/cgroup", O_PATH);
struct mnt_id_req req = { .size = sizeof(req), .mnt_fd = fd,
                          .param = STATMOUNT_MNT_OPTS };
statmount(&req, sm, bufsz, STATMOUNT_BY_FD);

We will have the sb opts in sm->mnt_opts (an offset into sm->str).

Note: we need to verify from the other statmount fields that it is a cgroup2 mount and not something else.

Yes, it only works when cgroup-v2 is mounted at /sys/fs/cgroup on the host (which is pretty much always); otherwise we have no option other than searching for it, which should probably be done via listmount/statmount.
