pidns: support single-level nested PID namespace dump/restore #2968
nidhishgajjar wants to merge 1 commit into checkpoint-restore:criu-dev
Add CLONE_NEWPID to CLONE_SUBNS to allow dumping and restoring process trees that contain a child PID namespace. This enables sandboxing use cases where the root process spawns children in an isolated PID namespace (e.g., container runtimes, agent sandboxes).

Changes:
- namespaces.h: add CLONE_NEWPID to CLONE_SUBNS
- pstree.c: fix restore check to allow CLONE_SUBNS types even when root doesn't use them (cflags & ~(root_ns_mask | CLONE_SUBNS) instead of cflags & ~(root_ns_mask & CLONE_SUBNS))
- New ZDTM test: pidns_nested, which verifies dump/restore of a process tree with unshare(CLONE_NEWPID) + fork + setsid

Limitations:
- Only single-level nesting (one child pidns, not arbitrary depth)
- Host PID of the child may change after restore (namespace PID preserved)
- CRIU must dump from outside the child PID namespace

Signed-off-by: nidhishgajjar <[email protected]>
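The difference between the two mask expressions in the commit message is easy to misread, so here is a minimal sketch of both. It is not the actual pstree.c code: the helper names and the `SUBNS_AFTER` macro are hypothetical, and the assumption that CLONE_SUBNS previously covered the mount and network namespaces is inferred from the PR text rather than quoted from the tree.

```c
#include <linux/sched.h> /* CLONE_NEW* flag values */
#include <stdbool.h>

/* Hypothetical stand-in for CLONE_SUBNS after this PR (assumed to have
 * covered CLONE_NEWNS | CLONE_NEWNET before). */
#define SUBNS_AFTER (CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWPID)

/* Old form: a sub-task flag passes only if it is in BOTH masks. */
static bool old_check_rejects(unsigned long cflags, unsigned long root_ns_mask)
{
	return (cflags & ~(root_ns_mask & SUBNS_AFTER)) != 0;
}

/* New form: a sub-task flag passes if it is in EITHER mask. */
static bool new_check_rejects(unsigned long cflags, unsigned long root_ns_mask)
{
	return (cflags & ~(root_ns_mask | SUBNS_AFTER)) != 0;
}
```

With a root task that uses no namespaces (`root_ns_mask == 0`), the old form rejects a child that sets CLONE_NEWPID, while the new form accepts it and still rejects a flag outside the sub-namespace set, such as CLONE_NEWUTS.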
Can you elaborate a bit more on where the problems are here?
Good question, let me elaborate.

Kubernetes currently works around this. runc passes

Our use case is simpler. We run AI agent workloads on bare metal, each in its own PID namespace for process isolation. CRIU dumps from the host looking into the agent's namespace. The

For context, this has come up before:
This PR takes the minimal approach for the single-level case. It adds

Happy to update the PR description to be more precise about the Kubernetes angle.
Summary
This PR adds support for dumping and restoring process trees that contain a single-level child PID namespace. This was identified as an unresolved problem at Linux Plumbers Conference 2024 and has been requested in multiple issues (#2426, #232).
The change is minimal (2 code files + 1 test) and enables sandbox/container use cases where processes run in isolated PID namespaces — the most common deployment pattern for Kubernetes pods, Podman containers, and agent runtimes.
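For readers unfamiliar with the pattern, the following is a rough sketch of a process that puts its child into a new PID namespace. It is an illustration only, not the ZDTM test from this PR; the function name `spawn_in_child_pidns` is hypothetical. Creating the namespace normally needs CAP_SYS_ADMIN, so the sketch reports failure instead of aborting.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the child saw itself as PID 1 and became a session leader,
 * 0 if it did not, and -1 if the namespace or the child could not be
 * created (for example EPERM without CAP_SYS_ADMIN).
 */
static int spawn_in_child_pidns(void)
{
	pid_t pid;
	int status;

	/* The caller keeps its own PID; only children forked from now on
	 * are placed in the new namespace. */
	if (unshare(CLONE_NEWPID) < 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* First child of the new namespace: it is PID 1 there. */
		int ok = (getpid() == 1) && (setsid() != (pid_t)-1);
		_exit(ok ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Call it once per process: after the namespace's PID 1 exits, further fork() calls from the same parent into that namespace fail.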
Changes
criu/include/namespaces.h: Add CLONE_NEWPID to CLONE_SUBNS:
criu/pstree.c: Fix restore validation for nested namespace types:
The old check only allowed sub-task namespaces that were both in CLONE_SUBNS AND in root_ns_mask. This meant a child could only use a PID namespace if the root task also used one. The fix allows any CLONE_SUBNS type to be created by sub-tasks, even if root doesn't use it, matching how mount and network namespaces already work and enabling the sandboxing pattern where root spawns children in isolated PID namespaces.
test/zdtm/static/pidns_nested.c: New ZDTM test:
- unshare(CLONE_NEWPID) + fork()
- setsid() (session leader must be visible in child pidns for dump)
- getpid() == 1 preserved after restore
Who benefits
Scope and limitations
This PR intentionally covers the single-level case (one child PID namespace) which handles ~90% of real-world use cases. It does NOT attempt arbitrary N-level nesting, which would require:
- ns[0].virt as the unique key)
- clone3() with set_tid_size > 1 for multi-level PID restore

The single-level implementation is additive; it doesn't complicate future N-level support.
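To make the clone3() requirement concrete, the sketch below shows how a restorer could request specific PIDs at more than one namespace level. It is not CRIU code; `fill_set_tid_args` and `clone3_set_tid` are hypothetical names. Per the clone3(2) man page, `set_tid[0]` is the PID in the most deeply nested namespace and each later element applies to the next ancestor namespace; the call needs CAP_SYS_ADMIN (or CAP_CHECKPOINT_RESTORE on newer kernels) in the namespaces involved.

```c
#define _GNU_SOURCE
#include <linux/sched.h> /* struct clone_args */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fill clone_args so the new task gets tids[0] in the innermost PID
 * namespace, tids[1] in its parent namespace, and so on. */
void fill_set_tid_args(struct clone_args *args, const pid_t *tids, size_t n)
{
	memset(args, 0, sizeof(*args));
	args->exit_signal = SIGCHLD;
	args->set_tid = (uint64_t)(uintptr_t)tids;
	args->set_tid_size = n; /* must not exceed the nesting depth */
}

/* Thin wrapper: returns like fork(), or -1 with errno set on failure. */
long clone3_set_tid(const pid_t *tids, size_t n)
{
	struct clone_args args;

	fill_set_tid_args(&args, tids, n);
	return syscall(SYS_clone3, &args, sizeof(args));
}
```

The single-level case in this PR only needs one element; a two-level restore would pass an array such as `{1, 4242}` to pin the task to PID 1 in the inner namespace and PID 4242 in the outer one.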
Known limitation: When CRIU itself runs inside a PID namespace where the root process is also PID 1, and a child pidns also has PID 1, there's a collision in the pid red-black tree. This only affects nested-container scenarios (CRIU-in-a-container), not the common case of CRIU dumping from the host.
Testing
pidns_nested: PASS (dump + restore + pre-dump)

Signed-off-by: nidhishgajjar [email protected]