pidns: support single-level nested PID namespace dump/restore #2968

Open
nidhishgajjar wants to merge 1 commit into checkpoint-restore:criu-dev from nidhishgajjar:pidns-nested-support

@nidhishgajjar

Summary

This PR adds support for dumping and restoring process trees that contain a single-level child PID namespace. This was identified as an unresolved problem at Linux Plumbers Conference 2024 and has been requested in multiple issues (#2426, #232).

The change is minimal (2 code files + 1 test) and enables sandbox/container use cases where processes run in isolated PID namespaces — the most common deployment pattern for Kubernetes pods, Podman containers, and agent runtimes.

Changes

criu/include/namespaces.h — Add CLONE_NEWPID to CLONE_SUBNS:

-#define CLONE_SUBNS (CLONE_NEWNS | CLONE_NEWNET)
+#define CLONE_SUBNS (CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWPID)

criu/pstree.c — Fix restore validation for nested namespace types:

-} else if (cflags & ~(root_ns_mask & CLONE_SUBNS)) {
+} else if (cflags & ~(root_ns_mask | CLONE_SUBNS)) {

The old check only allowed sub-task namespaces that were both in CLONE_SUBNS AND in root_ns_mask. This meant a child could only use a PID namespace if the root task also used one. The fix allows any CLONE_SUBNS type to be created by sub-tasks, even if root doesn't use it — matching how mount and network namespaces already work, and enabling the sandboxing pattern where root spawns children in isolated PID namespaces.
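
To make the difference between the two masks concrete, here is a minimal standalone sketch of both conditions. The helper names (`rejected_by_old_check`, `rejected_by_new_check`), the two `*_CLONE_SUBNS` macros, and the `demo_subns_checks` function are illustrative only and do not exist in CRIU; only the two bitmask expressions are taken from the diff above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdbool.h>

/* Illustrative copies of the macro before and after this PR. */
#define OLD_CLONE_SUBNS (CLONE_NEWNS | CLONE_NEWNET)
#define NEW_CLONE_SUBNS (CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWPID)

/* Old condition: a flag passes only if it is in BOTH root_ns_mask and subns. */
static bool rejected_by_old_check(unsigned long cflags,
				  unsigned long root_ns_mask,
				  unsigned long subns)
{
	return (cflags & ~(root_ns_mask & subns)) != 0;
}

/* New condition: a flag passes if it is in EITHER root_ns_mask or subns. */
static bool rejected_by_new_check(unsigned long cflags,
				  unsigned long root_ns_mask,
				  unsigned long subns)
{
	return (cflags & ~(root_ns_mask | subns)) != 0;
}

/* Hypothetical walk-through of the sandboxing case described above. */
void demo_subns_checks(void)
{
	/* Root uses no namespaces; a sub-task wants its own PID namespace. */
	assert(rejected_by_old_check(CLONE_NEWPID, 0, NEW_CLONE_SUBNS));
	assert(!rejected_by_new_check(CLONE_NEWPID, 0, NEW_CLONE_SUBNS));

	/* A type outside CLONE_SUBNS that root does not use is still refused. */
	assert(rejected_by_new_check(CLONE_NEWUSER, 0, NEW_CLONE_SUBNS));
}
```

One property follows directly from the new expression: a flag that is set in root_ns_mask also passes when it is not part of CLONE_SUBNS (for example CLONE_NEWUTS when root uses a UTS namespace), which the old expression rejected.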

test/zdtm/static/pidns_nested.c — New ZDTM test:

  • Parent calls unshare(CLONE_NEWPID) + fork()
  • Child is PID 1 in new namespace, calls setsid() (session leader must be visible in child pidns for dump)
  • Dump/restore with pre-dump
  • Verifies getpid() == 1 preserved after restore
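
The process setup that the test exercises can be sketched in plain C as follows. This is not the actual pidns_nested.c: it leaves out the ZDTM library calls and the dump/restore step, and the function names `child_is_pid1_in_new_pidns` and `example_run` are made up for illustration. It assumes the caller has CAP_SYS_ADMIN; without it, unshare() fails and the helper returns -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the forked child saw itself as PID 1 and setsid() succeeded,
 * 0 if it did not, and -1 if unshare(), fork() or waitpid() failed
 * (typically EPERM when the caller lacks CAP_SYS_ADMIN).
 */
int child_is_pid1_in_new_pidns(void)
{
	pid_t pid;
	int status;

	/* The caller stays in its own pidns; only its next children move. */
	if (unshare(CLONE_NEWPID) < 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* First child after unshare(): init of the new PID namespace. */
		int ok = (getpid() == 1) && (setsid() != (pid_t)-1);

		_exit(ok ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) != pid)
		return -1;

	return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

/* Hypothetical caller: a result of 0 would mean the child was not PID 1. */
void example_run(void)
{
	int r = child_is_pid1_in_new_pidns();

	assert(r == 1 || r == -1);
}
```

Once the child (PID 1 of the new namespace) has exited, further fork() calls from the same parent fail, so this helper is meant to be called once per process.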

Who benefits

  • Kubernetes: KEP-2008 (checkpoint/restore of pods) hits this limitation since pods run in PID namespaces. Currently requires workarounds to checkpoint from the host pidns.
  • Podman / CRI-O: Container checkpoint/restore. Same workaround pattern.
  • LXC / LXD: Issue #2426 ("Using CRIU with nested LXC containers") is from LXC users who can't checkpoint nested containers.
  • Agent runtimes / sandboxes: Any system that spawns processes in isolated PID namespaces for security (our use case: a bare-metal agent runtime with CRIU-based sleep/wake).

Scope and limitations

This PR intentionally covers only the single-level case (one child PID namespace), which by our estimate handles roughly 90% of real-world use cases. It does not attempt arbitrary N-level nesting, which would require:

  • Per-namespace PID trees (current tree uses ns[0].virt as the unique key)
  • Multi-level PID storage in the protobuf image format
  • clone3() with set_tid_size > 1 for multi-level PID restore
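
For reference, the last item can be sketched as below. This is a hedged illustration of the kernel interface, not code from CRIU: `fill_set_tid` and `clone3_with_tids` are hypothetical names. It assumes kernel headers that define the set_tid fields of struct clone_args (Linux 5.5 or later) and a caller holding CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE in the user namespaces that own the target PID namespaces.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/sched.h>	/* struct clone_args */
#include <signal.h>		/* SIGCHLD */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Prepare clone_args so the child gets tids[0] in the innermost PID
 * namespace, tids[1] in that namespace's parent, and so on outwards.
 */
void fill_set_tid(struct clone_args *args, const pid_t *tids, size_t levels)
{
	assert(tids != NULL && levels > 0);

	memset(args, 0, sizeof(*args));
	args->exit_signal = SIGCHLD;
	args->set_tid = (uint64_t)(uintptr_t)tids;
	args->set_tid_size = levels;
}

/*
 * Thin wrapper around the raw syscall. Like fork(), it returns 0 in the
 * child, the child's PID in the parent, and -1 with errno set on failure.
 */
long clone3_with_tids(const pid_t *tids, size_t levels)
{
	struct clone_args args;

	fill_set_tid(&args, tids, levels);
	return syscall(SYS_clone3, &args, sizeof(args));
}
```

A two-level restore would pass something like `{ 1, 4242 }` with levels set to 2 (4242 being an arbitrary example value): PID 1 inside the child namespace and the chosen PID in its parent namespace. Single-level restore only needs one entry.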

The single-level implementation is additive — it doesn't complicate future N-level support.

Known limitation: When CRIU itself runs inside a PID namespace where the root process is also PID 1, and a child pidns also has PID 1, there's a collision in the pid red-black tree. This only affects nested-container scenarios (CRIU-in-a-container), not the common case of CRIU dumping from the host.

Testing

  • ZDTM pidns_nested: PASS (dump + restore + pre-dump)
  • Existing tests (pid00, pidfd_child, pidfd_self, timens_nested, maps00, mprotect00, pipe00, sigpending): all PASS, no regressions
  • Real-world validation: tested with multi-process agent workloads (Node.js process trees) in sandboxed PID namespaces with checkpoint/restore cycles

Signed-off-by: nidhishgajjar [email protected]

Add CLONE_NEWPID to CLONE_SUBNS to allow dumping and restoring
process trees that contain a child PID namespace. This enables
sandboxing use cases where the root process spawns children in
an isolated PID namespace (e.g., container runtimes, agent
sandboxes).

Changes:
- namespaces.h: add CLONE_NEWPID to CLONE_SUBNS
- pstree.c: fix restore check to allow CLONE_SUBNS types even when
  root doesn't use them (cflags & ~(root_ns_mask | CLONE_SUBNS)
  instead of cflags & ~(root_ns_mask & CLONE_SUBNS))
- New ZDTM test: pidns_nested — verifies dump/restore of a process
  tree with unshare(CLONE_NEWPID) + fork + setsid

Limitations:
- Only single-level nesting (one child pidns, not arbitrary depth)
- Host PID of the child may change after restore (namespace PID preserved)
- CRIU must dump from outside the child PID namespace

Signed-off-by: nidhishgajjar <[email protected]>
@adrianreber
Member

Kubernetes — KEP-2008 (checkpoint/restore of pods) hits this limitation since pods run in PID namespaces. Currently requires workarounds to checkpoint from the host pidns.

Can you elaborate a bit more on where the problems are here?

@nidhishgajjar
Author

Good question, let me elaborate.

Kubernetes currently works around this. runc passes --external pid[<inode>]:extRootPidNS on dump and --inherit-fd on restore, so CRIU never actually dumps or restores the PID namespace; it just records a reference. The container runtime is responsible for recreating the namespace on restore. This works, but it has three costs: the runtime has to manage PID namespace recreation, PID preservation requires ns_last_pid manipulation, and you can't dump from outside the container's namespace hierarchy. Native support in CRIU would let container runtimes simplify that path, the same way mount and network namespaces are already handled natively via CLONE_SUBNS.

Our use case is simpler. We run AI agent workloads on bare metal, each in its own PID namespace for process isolation. CRIU dumps from the host looking into the agent's namespace. The --external pid[] workaround doesn't apply here since there's no container runtime managing namespace recreation.

For context, this has come up before:

This PR takes the minimal approach for the single-level case. It adds CLONE_NEWPID to CLONE_SUBNS and fixes the restore validation. It doesn't attempt the full N-level nesting which would need per-namespace PID trees and image format changes.

Happy to update the PR description to be more precise about the Kubernetes angle.
