Alexandre Fiori, March 2026
This document compares sdme's isolation model with Docker and Podman.
For sdme's security implementation details (capabilities, seccomp,
AppArmor, --hardened, --strict), see
architecture.md, Section 14.
sdme's security model is different from Docker and Podman by design. The full systemd init environment inside every container is not a compromise; it is the primary value proposition.
sdme containers are the Linux you know. They run systemd, journald, and
D-Bus. You manage services with systemctl, read logs with journalctl,
and configure the system with the tools you already use. There is no
container-specific runtime to learn, no custom logging driver, no
proprietary health check mechanism.
Even OCI applications imported with sdme fs import run on the rootfs of
your choice. You pick your Debian, your CentOS, your Fedora. The OCI app
is confined inside that environment as a systemd service
(sdme-oci-app.service), not as a standalone process.
sdme fs build lets you extend any base rootfs: install monitoring agents,
add custom services, configure systemd units. These services run alongside
the OCI workload but outside it, in the same container. Docker and Podman
have no equivalent: you either run one process per container or build
increasingly complex entrypoint scripts and agents around the container,
not in it. This sidecar functionality for your rootfs + OCI is key.
You still get the OCI packaging and distribution model. Pull images from any OCI registry, layer applications on base rootfs images. But at runtime, the container is managed by systemd, the most widely available service management framework on Linux.
--oci-pod confines the network namespace of the OCI application process
while keeping the container's systemd, journal, and D-Bus on their own
network. This gives you pod semantics for application networking without
sacrificing the container's management plane.
The trade-off is explicit: CAP_SYS_ADMIN and the @mount syscall group
cannot be dropped because systemd needs them. With --strict, this is
scoped to a user namespace; CAP_SYS_ADMIN does not grant host-level
privileges. Every other restriction matches or exceeds Docker defaults.
For environments that cannot accept CAP_SYS_ADMIN under any
circumstances, Docker or Podman is the right choice. For environments
that value operational familiarity, extensibility, and the full power
of systemd, sdme with --strict provides strong isolation while
preserving these benefits.
sdme, Docker, and Podman take fundamentally different approaches to container security.
sdme delegates baseline isolation to systemd-nspawn and provides opt-in
hardening layers: --hardened as a one-flag defense-in-depth bundle, plus
fine-grained controls for capabilities, seccomp, AppArmor, privilege
escalation, and read-only rootfs. It is designed for single-tenant machines
running full systemd inside containers. It requires root for all operations
and has no persistent daemon.
Docker applies defense-in-depth by default: reduced capabilities, a restrictive seccomp profile, a default AppArmor profile, and optional user namespace remapping. It is daemon-based (containerd) and designed for application containers that typically run a single process.
Podman provides similar defense-in-depth with rootless execution by default, SELinux integration on Fedora/RHEL, and a daemonless architecture. It is designed for OCI-compatible workflows and Docker CLI compatibility.
The key philosophical difference: Docker and Podman apply security layers by
default and require explicit opt-out. sdme provides the layers but requires
explicit opt-in (or --hardened for a sensible bundle). Out of the box,
sdme trusts its workloads; Docker and Podman do not.
Every container runtime uses Linux namespaces for isolation. The table below compares which namespaces each runtime enables and how.
+-----------------------+----------------------------+----------------------------+----------------------------+
| Namespace | sdme (nspawn) | Docker (runc, rootful) | Podman (crun, rootless) |
+-----------------------+----------------------------+----------------------------+----------------------------+
| PID | Always | Always | Always |
| IPC | Always | Always | Always |
| UTS | Always | Always | Always |
| Mount | Always | Always | Always |
| Network | Optional (host default) | Yes (bridge default) | Yes (slirp4netns/pasta) |
| User | Optional | Optional | Yes (default) |
| Cgroup | Partial (Delegate=yes) | Yes | Yes |
+-----------------------+----------------------------+----------------------------+----------------------------+
All three runtimes always create PID, IPC, UTS, and mount namespaces. In sdme, PID 1 inside the container is the container's systemd init, not the host's. Processes inside cannot see or signal host processes, IPC objects are isolated, the container has its own hostname, and the mount table is independent (built from an overlayfs mount on top of the rootfs).
sdme shares the host's network namespace by default for simplicity: no port
mapping, no bridge configuration, containers just work on the host's network
stack. This is equivalent to docker run --net=host.
Docker creates a private bridge network by default, providing network isolation out of the box. Podman rootless uses slirp4netns or pasta for unprivileged network namespace setup.
sdme provides network isolation via --private-network (or --hardened,
which enables it automatically). With --private-network, the container
gets its own network namespace with only a loopback interface. Connectivity
options (--network-veth, --network-bridge, --network-zone, --port)
build on top of that.
Without --userns, UID 0 inside the container is UID 0 on the host. A
container escape gives the attacker full root access. This is the default
for both sdme and rootful Docker.
With --userns (or --hardened), sdme passes --private-users=pick --private-users-ownership=auto to nspawn. Container root maps to a high
unprivileged UID on the host (524288+ range, deterministically hashed from
the machine name). An escape lands in an unprivileged context. On kernel
6.6+, overlayfs supports idmapped mounts, making this zero-overhead (files
stay UID 0 on disk). On older kernels, auto falls back to recursive chown
on first boot.
Podman rootless gets user namespace remapping by default; the entire container runtime runs as an unprivileged user.
sdme uses Delegate=yes in the systemd template unit. The container's
systemd gets its own cgroup subtree (machine.slice/sdme@<name>.service)
but can see the host cgroup hierarchy structure. Docker and Podman provide
full cgroup namespace isolation.
Capabilities determine what privileged operations a container's root user can perform.
Docker retains roughly 14 capabilities, the minimum needed for typical
application containers. Notably, CAP_SYS_ADMIN is excluded. Podman uses
the same default set as Docker.
Docker doesn't need CAP_SYS_ADMIN because Docker containers don't run a
full init system. This is a fundamental consequence of the different design:
sdme runs full systemd (requiring broad capabilities), while Docker runs
single-purpose application processes (requiring minimal capabilities).
sdme (via nspawn) retains 26 capabilities by default, including
CAP_SYS_ADMIN. See
architecture.md, Section 14 for the full
capability list and sdme's --drop-capability/--capability controls.
All three runtimes apply seccomp system call filters. The baseline restrictiveness differs because of their different design goals.
Docker's OCI default seccomp profile is more restrictive, blocking roughly
44 syscalls with a more conservative allowlist. Podman uses the same OCI
default profile. The most significant difference: Docker blocks mount()
and related syscalls, while nspawn must allow @mount because systemd
needs them during boot.
This means a compromised process inside sdme has access to more kernel surface than inside Docker. This is an inherent trade-off of running a full init system.
See architecture.md, Section 14 for
nspawn's baseline filter details and sdme's --system-call-filter controls.
Docker ships a default AppArmor profile (docker-default) that
restricts mount operations, /proc//sys writes, and cross-container
ptrace.
Podman has strong SELinux integration with svirt type enforcement
labels on Fedora/RHEL. AppArmor is used where available (Debian/Ubuntu).
sdme ships a default AppArmor profile (sdme-default) that is more
permissive than Docker's docker-default because sdme containers run a
full init system. Docker blocks mount operations entirely; sdme must allow
them for systemd to set up /proc, /sys, and tmpfs mounts during boot.
SELinux is not supported. sdme has no SELinux integration and does
not provide MAC confinement on SELinux-only systems (Fedora, RHEL).
During rootfs import (sdme fs import), security.selinux extended
attributes are explicitly skipped because they do not transfer
meaningfully between filesystems and would cause label conflicts on
the host. Docker and Podman provide MAC confinement out of the box on
both AppArmor and SELinux systems.
See architecture.md, Section 14 for the
sdme-default profile details and installation instructions.
Docker: no_new_privs enabled by default. Setuid binaries inside the
container cannot escalate privileges.
Podman: enabled by default.
sdme: off by default because interactive containers typically want
sudo/su to work. Enabled by --no-new-privileges, --hardened,
or --strict.
All three provide read-only rootfs as an opt-in flag (--read-only).
See architecture.md, Section 14 for
sdme's --no-new-privileges and --read-only implementation details.
By default, sdme containers share the host's network namespace. This is
equivalent to docker run --net=host. No network isolation exists: the
container can bind to any port, see all host interfaces, and communicate
with any network the host can reach.
This is the simplest mode and sufficient for most development use cases. The trade-off is explicit: zero network isolation in exchange for zero configuration.
--private-network gives the container its own network namespace with
only a loopback interface (no external connectivity). This is the foundation
for all other network options.
sdme create mybox --private-network
sdme create mybox --private-network --network-veth
sdme create mybox --private-network --network-zone=myzone
sdme create mybox --private-network --port=8080:80
When --private-network is active:
- The container gets
CAP_NET_ADMIN(safe, since it only affects the isolated namespace). - systemd-nspawn creates the network namespace and optionally sets up veth pairs, bridges, or zones.
- Port forwarding (
--port) maps host ports to container ports through nspawn's built-in NAT.
This is closest to Docker's default networking model.
--hardened enables --private-network automatically.
Pods give multiple containers a shared network namespace (see architecture.md, Section 10 for implementation details and lifecycle management).
Two mechanisms for joining a pod:
--pod (whole-container): The entire nspawn container runs in the pod's
network namespace via --network-namespace-path=. All processes, including
the container's systemd init, share the pod's network stack. This is the
general-purpose option.
--oci-pod (OCI app process only): The pod's netns is bind-mounted into
the container at /run/sdme/oci-pod-netns, and a systemd drop-in sets
NetworkNamespacePath= on the OCI app service. Only the application process
enters the pod's netns; the container's init and other services keep their
own network namespace. This is for OCI app containers that need pod
networking for their application but want systemd's own networking (e.g.
journal remote, D-Bus) to remain independent.
Comparison with Podman pods. Podman uses an "infra container" (a pause process) to hold the pod's network namespace. Podman pods support full external connectivity through slirp4netns/pasta or CNI/Netavark plugins. sdme pods are loopback-only by default with no built-in external connectivity mechanism.
Comparison with Docker Compose. Docker Compose creates shared bridge networks, not true pod semantics. Containers in a Compose service communicate via DNS names over a bridge, not via localhost. sdme pods are closer to Kubernetes pod semantics: shared localhost, shared ports.
The pod netns is created with only a loopback interface and no routes.
Containers in the pod can communicate via 127.0.0.1 but have no external
connectivity unless a veth or bridge is added to the netns externally.
When a container joins a pod with --pod, it does not use
--private-network. This means CAP_NET_ADMIN is not granted.
The container's root cannot add interfaces, modify routes, or change iptables
rules in the shared netns. CAP_SYS_ADMIN is present (required for systemd)
but the container's PID namespace prevents access to the host's network
namespace references through /proc.
sdme containers run a full systemd init system: PID 1 is systemd, with journald, logind, dbus-daemon, and any enabled services. A typical container has ~10 processes at idle. Docker containers typically run a single application process (PID 1).
More processes means more potential targets for exploitation, but also means
familiar operational tooling: systemctl, journalctl, loginctl.
Without --userns, UID 0 inside the container is UID 0 on the host. The
overlayfs upper layer files are owned by real host UIDs. With --userns,
container root maps to a high UID, so files are owned by that UID on the
host (or stay UID 0 on disk with idmapped mounts on kernel 6.6+).
Custom bind mounts (-b/--bind) expose host directories directly into
the container. A read-write bind mount gives container root full access to
those host files. Use :ro for read-only mounts when the container does
not need write access.
In default mode (host networking), the container shares the host's full network stack. A compromised container process can:
- Bind to any host port
- Connect to any host-accessible network
- Sniff traffic on host interfaces (via
CAP_NET_RAW) - Access host-local services on
127.0.0.1
With --private-network, --hardened, or --pod, the container is limited
to its own network namespace.
sdme has no persistent daemon. There is no equivalent of Docker's
containerd socket, which is a well-known privilege escalation vector
(access to the Docker socket is effectively root access). sdme communicates
with systemd over the system D-Bus, which is already present and secured
by its own policy.
sdme provides two convenience flags (--hardened and --strict) that
bundle multiple security layers. See
architecture.md, Section 14 for full
details on what each flag enables and its effects on host-rootfs
containers.
--hardened covers user namespace isolation, network isolation,
no_new_privs, and capability reduction. The remaining gaps versus
Docker/Podman defaults are:
- No default MAC confinement. Docker ships a default AppArmor profile.
sdme supports
--apparmor-profilebut--hardeneddoes not set one. - Less restrictive seccomp baseline. nspawn's allowlist permits
@mountand more syscall groups than Docker's OCI default profile. - More capabilities retained. Even after
--hardeneddrops 4 capabilities, 22 remain (includingCAP_SYS_ADMIN), compared to Docker's ~14.
These gaps are inherent to running a full init system inside the container.
For maximum restriction, use --strict.
+-----------------------------+---------------------------+---------------------------+
| Mechanism | sdme --strict | Docker default |
+-----------------------------+---------------------------+---------------------------+
| User namespace | Yes | Optional |
| Network namespace | Yes (loopback only) | Yes (bridge) |
| no_new_privs | Yes | Yes |
| Retained caps | ~14 (Docker - NET_RAW) | ~14 |
| Seccomp | nspawn baseline + 4 deny | OCI default (~44 blocked) |
| AppArmor | sdme-default | docker-default |
| CAP_SYS_ADMIN | Yes (for systemd) | No |
| Init in container | Full systemd | Single process |
+-----------------------------+---------------------------+---------------------------+
The one remaining difference is CAP_SYS_ADMIN and the @mount syscall
group, both required for the full init model. sdme's security philosophy
is that this is an acceptable trade-off for the operational benefits it
provides (see Section 1).
+-----------------------+----------------------------+----------------------------+----------------------------+
| Mechanism | sdme (nspawn) | Docker (runc, rootful) | Podman (crun, rootless) |
+-----------------------+----------------------------+----------------------------+----------------------------+
| PID namespace | Yes | Yes | Yes |
| IPC namespace | Yes | Yes | Yes |
| UTS namespace | Yes | Yes | Yes |
| Mount namespace | Yes | Yes | Yes |
| Network namespace | Optional (host default) | Yes (bridge default) | Yes (slirp4netns/pasta) |
| User namespace | Optional (--strict: yes) | Optional | Yes (default) |
| Cgroup namespace | Partial (Delegate=yes) | Yes | Yes |
| Capabilities | ~26 (--strict: ~15) | ~14, no SYS_ADMIN | Same as Docker |
| Seccomp | nspawn + optional filters | OCI default (~44 blocked) | Same as Docker |
| AppArmor | sdme-default (--strict) | Default profile | Where available |
| SELinux | None | svirt labels | Strong integration |
| no_new_privs | Optional (--strict: yes) | Yes (default) | Yes (default) |
| Read-only rootfs | Optional | Optional | Optional |
| Rootless | No (root-only) | Optional | Default |
| Daemon | None | containerd socket | None |
| Init in container | Full systemd (always) | Optional (--init) | Optional (--init) |
+-----------------------+----------------------------+----------------------------+----------------------------+
Each "Optional" cell means the feature is available but not on by default.
For sdme, --strict enables all security layers simultaneously. Individual
flags: --private-network or --hardened for network namespace, --userns
or --hardened for user namespace, --no-new-privileges or --hardened
for no_new_privs, --read-only for read-only rootfs, --apparmor-profile
for AppArmor, and --system-call-filter for additional seccomp rules.
sdme is appropriate when:
- You want a full systemd environment (service management, journald, cgroups).
- You want disposable containers that boot quickly, with no daemon.
- You are comfortable with root-level operation.
- You use
--strictfor Docker-equivalent security, or--hardenedfor defense-in-depth. - You want to extend containers with custom services alongside OCI workloads.
Docker or Podman is appropriate when:
- You need defense-in-depth out of the box for untrusted workloads.
- You need rootless execution (especially Podman).
- You need OCI-compatible image building and distribution workflows.
- You cannot accept
CAP_SYS_ADMINin any form. - You operate in a multi-tenant environment.
- Compliance requirements specify specific isolation standards.
Podman specifically when:
- Rootless is a hard requirement.
- SELinux integration is needed.
- You want a daemonless runtime with Docker-compatible CLI.
- You need Kubernetes-style pod semantics with full external connectivity.