Alexandre Fiori, February 2026
sdme is a container manager for Linux. It runs on top of systemd-nspawn and overlayfs, both already present on any modern systemd-based distribution.
No daemon, no runtime dependency beyond systemd itself. A single static binary that manages overlayfs layering and drives systemd over D-Bus.
The project started as an experiment inspired by virtme-ng: what if you could clone your running host system into an isolated container with a single command? Overlayfs makes this nearly free: mount the host rootfs as a read-only lower layer, give the container its own upper layer for writes, and you have a full-featured Linux environment that shares the host's binaries but can't damage the host's files. That was the seed.
From there it grew: importing rootfs from other distros, pulling OCI images from Docker Hub, building custom root filesystems, managing container lifecycle through D-Bus. Each piece turned out to be surprisingly tractable when you let Claude (and systemd) do the heavy lifting.
sdme is not an attempt to replace Podman, Docker, or any other container runtime. Podman in particular has excellent systemd integration through Quadlet (podman-systemd). It is today's mature, full-featured approach to systemd-native container management. sdme is a different thing entirely. It boots full systemd inside nspawn containers, manages overlayfs layering directly, and bridges the OCI ecosystem, all without a daemon and without pulling in a container runtime.
You get the operational model of systemd (journalctl, systemctl, resource limits) with the packaging model of OCI (Docker Hub, GHCR, Quay).
The name stands for Systemd Machine Editor, and its pronunciation is left as an exercise for the reader.
The foundational mode of sdme is what you get when you run sdme new
with no flags. It creates a container that is an overlayfs clone of your
running host system: same kernel, same binaries, same libraries, but
with its own writable layer so changes stay isolated.
+------------------------------------------------------+
|                      HOST SYSTEM                     |
|  +--------+  +---------+  +------------------+       |
|  | kernel |  | systemd |  | root filesystem  |       |
|  |        |  | D-Bus   |  | (/)              |       |
|  +--------+  +----+----+  +--------+---------+       |
|                   |                | overlayfs       |
|        +----------+----------------+ (lower)         |
|        |                           |                 |
|  +-----+----------------+  +-------+--------------+  |
|  | container A          |  | container B          |  |
|  |  +----------------+  |  |  +----------------+  |  |
|  |  | systemd        |  |  |  | systemd        |  |  |
|  |  | D-Bus          |  |  |  | D-Bus          |  |  |
|  |  +----------------+  |  |  +----------------+  |  |
|  |  upper/  work/       |  |  upper/  work/       |  |
|  |  merged/             |  |  merged/             |  |
|  +----------------------+  +----------------------+  |
+------------------------------------------------------+
Note: /etc/systemd/system and /var/log are opaque
by default -- the container sees empty directories
there, not the host's units and logs.
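The layering in the diagram boils down to a single overlay mount. A minimal sketch of how the mount options are assembled (the paths and the helper name are illustrative, not sdme's actual code):

```rust
// Sketch: build the option string for an overlay mount of the host rootfs.
// The host / is the read-only lower layer; the container gets its own
// upper (writes) and work (overlayfs scratch) directories.
fn overlay_options(lower: &str, upper: &str, work: &str) -> String {
    format!("lowerdir={lower},upperdir={upper},workdir={work}")
}

// The equivalent shell invocation would be roughly:
//   mount -t overlay overlay -o <options> .../containers/demo/merged
```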
Looks like this:
$ sudo sdme new
creating 'jerinhouma'
starting 'jerinhouma'
joining 'jerinhouma'
host rootfs container: joining as user 'fiorix' with opaque dirs /etc/systemd/system, /var/log
Connected to machine jerinhouma. Press ^] three times within 1s to exit session.
jerinhouma ~ $ systemctl status
● jerinhouma
State: running
Units: 185 loaded (incl. loaded aliases)
Jobs: 0 queued
Failed: 0 units
Since: Thu 2026-02-26 01:15:38 GMT; 7s ago
systemd: 257.9-0ubuntu2.1
...
Each container boots its own systemd instance inside the nspawn namespace. The host's systemd manages the container as a service unit; the container's systemd manages everything inside. Both talk to their own D-Bus, both write to their own journal, but the container's writes land on the overlayfs upper layer and never touch the host - well, except that they are stored on the host. This matters, and the details are explained in Section 14.
User identity preservation. When sdme is run via sudo and the
container is a host-rootfs clone (no -r flag), it reads $SUDO_USER
and joins the container as that user rather than root. This is a
convenience for landing on your own machine as your own user: your own
$HOME, dotfiles, shell, etc. The behaviour is controlled by the
join_as_sudo_user config setting (enabled by default) and can be
disabled with sdme config set join_as_sudo_user no.
Opaque directories are the key to making host clones usable.
Without them, the container would inherit the host's systemd units and
try to start all the same services. By default, sdme marks
/etc/systemd/system and /var/log as overlayfs-opaque (via the
trusted.overlay.opaque xattr), making the container see empty
directories there. The host's units and logs are hidden; the container
starts clean.
DNS requires special treatment because containers share the host's
network namespace by default. The host's systemd-resolved already
owns 127.0.0.53, so the container's copy would fail to bind. For
imported rootfs, sdme masks systemd-resolved directly in the rootfs
during sdme fs import. For host-rootfs containers (no -r), the mask
goes in the overlayfs upper layer during create. Either way, the mask
is a symlink to /dev/null in /etc/systemd/system/. This causes the
container's NSS resolve module to return UNAVAIL, falling through to
the dns module which reads /etc/resolv.conf. A placeholder regular
file is written there so systemd-nspawn --resolv-conf=auto can
populate it at boot.
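The masking step can be sketched as a pair of filesystem operations applied to a rootfs (or an overlayfs upper layer). This is illustrative, not sdme's actual code, and error handling is trimmed:

```rust
use std::fs;
use std::os::unix::fs::symlink;
use std::path::Path;

// Sketch of the DNS patching described above: mask systemd-resolved and
// leave a placeholder resolv.conf for nspawn's --resolv-conf=auto.
fn mask_resolved(root: &Path) -> std::io::Result<()> {
    let units = root.join("etc/systemd/system");
    fs::create_dir_all(&units)?;
    // Masking a unit = symlinking its name to /dev/null.
    symlink("/dev/null", units.join("systemd-resolved.service"))?;
    // Placeholder regular file; nspawn populates it at boot.
    fs::write(root.join("etc/resolv.conf"), "")?;
    Ok(())
}
```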
Everything sdme knows lives under /var/lib/sdme (the default
datadir, configurable via sdme config set datadir <path>):
/var/lib/sdme/
|-- state/
|   |-- container-a          # KEY=VALUE metadata
|   +-- container-b
|-- containers/
|   |-- container-a/
|   |   |-- upper/           # CoW writes
|   |   |-- work/            # overlayfs workdir
|   |   +-- merged/          # mount point
|   +-- container-b/
|       +-- ...
|-- fs/
|   |-- ubuntu/              # imported rootfs
|   |-- fedora/
|   |-- nginx/               # OCI app rootfs
|   +-- .ubuntu.meta         # distro metadata
|-- volumes/
|   +-- my-container/
|       |-- var-lib-mysql/   # OCI-declared volume data
|       +-- data/
+-- pods/
    +-- my-pod/
        +-- state            # KEY=VALUE pod state (CREATED=...)
Pod runtime state (volatile, recreated after reboot):
/run/sdme/pods/
+-- my-pod/
    +-- netns    # bind-mount of the pod's netns fd
State files are flat KEY=VALUE text files under state/. They
record everything about a container: name, rootfs, creation timestamp,
resource limits, network configuration, bind mounts, environment
variables, opaque directories, and pod membership (POD and/or
OCI_POD). The format is intentionally simple: readable with cat,
parseable with grep, editable with sed in an emergency. The State
type in the code uses a BTreeMap<String, String> for deterministic
key ordering.
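Parsing such a file is a one-liner per line. A simplified sketch (not sdme's actual parser) of reading the flat format into a BTreeMap:

```rust
use std::collections::BTreeMap;

// Sketch: parse a flat KEY=VALUE state file into a BTreeMap, which keeps
// keys in deterministic sorted order as the text describes.
fn parse_state(text: &str) -> BTreeMap<String, String> {
    text.lines()
        .filter_map(|line| line.split_once('='))
        .map(|(k, v)| (k.trim().to_string(), v.trim().to_string()))
        .collect()
}
```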
Transactional operations follow a staging pattern throughout sdme.
Rootfs imports write to a .{name}.importing staging directory, then
do an atomic rename() to the final path on success. If the import
fails or is interrupted, the staging directory is cleaned up, and if
it's left behind (power failure, OOM kill), the next import detects it
and offers to clean up with --force.
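The commit step of that pattern reduces to one rename. A sketch under the naming convention above (the helper name is illustrative; the extraction step is elided):

```rust
use std::fs;
use std::path::Path;

// Sketch of the staging pattern: content is written into
// `.{name}.importing`, then promoted with a single atomic rename.
fn commit_import(dir: &Path, name: &str) -> std::io::Result<()> {
    let staging = dir.join(format!(".{name}.importing"));
    let final_path = dir.join(name);
    // ... extraction writes into `staging` here ...
    fs::rename(&staging, &final_path) // atomic on the same filesystem
}
```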
Health detection in sdme ps checks that a container's expected
directories actually exist and that its rootfs (if specified) is
present. A container whose rootfs has been removed shows as broken.
OS detection reads /etc/os-release from the rootfs to show distro
names in the listing.
Conflict detection checks three places before accepting a name: the
sdme state directory, /var/lib/machines/ (systemd's own machine
directory), and the list of currently registered machines via D-Bus.
This prevents collisions with both sdme containers and any other nspawn
machines on the system.
create --> start --> join/exec --> stop --> rm
   |         |                      |
   |         |                      +-- graceful: KillMachine SIGRTMIN+4
   |         |                      +-- --term:   TerminateMachine
   |         |                      +-- --kill:   KillMachine SIGKILL
   |         |
   |         +-- install/update template unit
   |         +-- StartUnit (D-Bus)
   |         +-- wait for boot
   |
   +-- mkdir upper/ work/ merged/
   +-- mask systemd-resolved (host-rootfs only; imported rootfs patched at import)
   +-- write /etc/resolv.conf placeholder
   +-- set opaque dirs (xattr)
   +-- write state file
create builds the overlayfs directory structure, sets up DNS, applies opaque directories, validates the umask (a restrictive umask would make overlayfs files inaccessible to non-root services like dbus-daemon), and writes the state file. It does not start the container.
start installs (or updates) a systemd template unit and
per-container drop-in, then calls StartUnit over D-Bus. After the
unit starts, sdme waits for the container to reach the running state
by subscribing to machined D-Bus signals and polling the machine
state. The boot timeout defaults to 60 seconds and is configurable.
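The wait loop is a deadline-bounded poll. A simplified sketch (the real implementation also listens for machined D-Bus signals, which this omits):

```rust
use std::time::{Duration, Instant};

// Sketch: poll a readiness condition until it succeeds or a deadline
// passes, as when waiting for a machine to reach the "running" state.
fn wait_for(mut ready: impl FnMut() -> bool, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        if ready() {
            return true;
        }
        std::thread::sleep(Duration::from_millis(10));
    }
    false
}
```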
join and exec spawn machinectl shell as a child process and
forward its exit status. This was a deliberate choice: machinectl
handles the namespace entry, PAM session setup, and environment
correctly, and reimplementing that logic in Rust would buy nothing.
Spawning (rather than exec'ing) keeps sdme alive so it can inspect the
exit code and clean up on failure (particularly important for sdme new, which removes the container if the join fails). The balance struck
is: use D-Bus where it gives us programmatic control (start, stop,
status queries), shell out where the existing tool already does the job
well (interactive shell sessions, running commands).
stop has three tiers: graceful (default) sends SIGRTMIN+4 to the
container leader via KillMachine (90s timeout), --term calls
TerminateMachine which sends SIGTERM to the nspawn process (30s
timeout), and --kill sends SIGKILL to all processes via KillMachine
(15s timeout). Multiple containers can be stopped in one invocation.
rm stops the container if running, removes the state file, and
deletes the container's directories. The make_removable() helper
recursively fixes permissions before deletion, since containers can
create files owned by arbitrary UIDs with restrictive modes, and
remove_dir_all() would fail without this.
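The idea behind make_removable() can be sketched in a few lines. This is an illustrative simplification, not sdme's actual helper (it does not special-case symlinks or foreign-UID files the way the real code must):

```rust
use std::fs;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

// Sketch: chmod everything under `path` so remove_dir_all() cannot fail
// on unreadable (e.g. mode 0000) entries left behind by a container.
fn make_removable(path: &Path) -> std::io::Result<()> {
    fs::set_permissions(path, fs::Permissions::from_mode(0o700))?;
    if path.is_dir() {
        for entry in fs::read_dir(path)? {
            make_removable(&entry?.path())?;
        }
    }
    Ok(())
}
```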
Boot failure cleanup differs between sdme new and sdme start.
If sdme new fails to boot or join the container (or is interrupted
with Ctrl+C), it removes the just-created container entirely (the user
never asked for a stopped container, they asked for a running one). If
sdme start fails, it stops the container but preserves it on disk for
debugging.
When you don't specify a name, sdme generates one from a wordlist of 200 Tupi-Guarani words and variations. The choice is an easter egg, a nod to the indigenous languages of Brazil - our roots!
Name generation shuffles the wordlist (Fisher-Yates, seeded from
/dev/urandom), checks each candidate against the three-way conflict
detection (state files, /var/lib/machines/, registered machines), and
returns the first unused name. If all 200 base words are taken, it
falls back to vowel mutations: consonants stay fixed while vowels are
randomly substituted, producing names that sound like plausible
Tupi-Guarani words but don't appear in the original list. Up to 200
mutation attempts before giving up. Sorry folks, but I digress: inspiration from MF DOOM makes me handle these nerdy things just like chess.
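The shuffle and mutation steps can be sketched as follows. The helper names are illustrative, the wordlist here is a stand-in for sdme's 200 Tupi-Guarani words, and the conflict check is elided:

```rust
use std::fs::File;
use std::io::Read;

// Read n random bytes from /dev/urandom, as the seeding described above.
fn random_bytes(n: usize) -> Vec<u8> {
    let mut buf = vec![0u8; n];
    File::open("/dev/urandom").unwrap().read_exact(&mut buf).unwrap();
    buf
}

// Fisher-Yates: walk from the end, swapping each slot with a random
// earlier-or-equal index.
fn shuffle<T>(items: &mut [T]) {
    for i in (1..items.len()).rev() {
        let r = random_bytes(8);
        let j = (u64::from_le_bytes(r.try_into().unwrap()) as usize) % (i + 1);
        items.swap(i, j);
    }
}

// Fallback: keep consonants, randomly substitute vowels.
fn mutate_vowels(word: &str) -> String {
    const VOWELS: &[u8] = b"aeiou";
    word.bytes()
        .map(|c| {
            if VOWELS.contains(&c) {
                VOWELS[random_bytes(1)[0] as usize % VOWELS.len()] as char
            } else {
                c as char
            }
        })
        .collect()
}
```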
The fs subsystem manages the catalogue of root filesystems that
containers are built from. Each rootfs is a plain directory under
/var/lib/sdme/fs/ containing a complete Linux filesystem tree.
Containers reference them by name.
sdme fs import auto-detects the source type by probing in order:
- URL -- http:// or https:// prefix. Downloads the file, then extracts it as a tarball.
- OCI registry -- looks like a domain with a path (e.g. docker.io/ubuntu:24.04, quay.io/fedora/fedora). Pulled via the OCI Distribution Spec.
- Directory -- path is a directory. Copied with copy_tree() preserving ownership, permissions, xattrs, and special files.
- QCOW2 image -- magic bytes QFI\xfb at the start of the file. Mounted read-only via qemu-nbd, then copied with copy_tree().
- Raw disk image -- MBR/GPT signature or .raw/.img extension. Same qemu-nbd path as QCOW2.
- Tarball -- default fallback for any other file. Extracted with native Rust crates; compression is detected from magic bytes, not the file name.
Permissions and ownership must be preserved exactly. A rootfs
contains files owned by dozens of system UIDs (root, messagebus,
systemd-network, nobody, etc.) with specific modes. The copy_tree()
function uses lchown() for ownership, chmod() for permissions, and
utimensat() with nanosecond precision for timestamps. All operations
use l-prefixed variants (lstat, lchown, lgetxattr) to avoid following
symlinks.
Extended attributes carry security and filesystem metadata.
copy_xattrs() lists and copies all xattrs except security.selinux
(which doesn't transfer meaningfully between filesystems). The overlayfs
trusted.overlay.opaque xattr is preserved when present.
Special files (block devices, character devices, FIFOs, and Unix
sockets) are recreated with mknod() and mkfifo() using the original
mode and device numbers. This matters for rootfs that include
/dev/null, /dev/zero, and friends.
Compression auto-detection uses magic bytes rather than file extensions. The first few bytes of a file reveal its compression format:
| Magic bytes | Format |
|---|---|
| 1f 8b | gzip |
| BZh | bzip2 |
| fd 37 7a 58 5a 00 | xz |
| 28 b5 2f fd | zstd |
This means sdme fs import ubuntu rootfs.tar.zst works even if the
file is named rootfs.tar, because the content, not the name,
determines the decompressor.
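The sniffing logic maps directly to a slice-pattern match over the first few bytes. A sketch (the function name is illustrative; None stands for plain, uncompressed tar):

```rust
// Sketch of magic-byte detection for the formats in the table above.
fn detect_compression(buf: &[u8]) -> Option<&'static str> {
    match buf {
        [0x1f, 0x8b, ..] => Some("gzip"),
        [b'B', b'Z', b'h', ..] => Some("bzip2"),
        [0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00, ..] => Some("xz"),
        [0x28, 0xb5, 0x2f, 0xfd, ..] => Some("zstd"),
        _ => None, // assume uncompressed tar
    }
}
```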
Systemd detection runs after import. If the rootfs doesn't contain
systemd and dbus (both required for nspawn containers), sdme can install
them automatically: it detects the distro family from /etc/os-release
and runs the appropriate package manager (apt, dnf) in a chroot.
The --install-packages flag controls this: auto prompts
interactively, yes always installs, no refuses if systemd is
absent.
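The distro-family detection amounts to reading ID= and ID_LIKE= from /etc/os-release. A simplified sketch of the mapping (not sdme's actual matching rules):

```rust
// Sketch: pick a package manager from /etc/os-release contents by
// checking ID= and ID_LIKE= against known distro families.
fn package_manager(os_release: &str) -> Option<&'static str> {
    let field = |key: &str| {
        os_release
            .lines()
            .find_map(|l| l.strip_prefix(key))
            .map(|v| v.trim_matches('"').to_string())
    };
    let ids = format!(
        "{} {}",
        field("ID=").unwrap_or_default(),
        field("ID_LIKE=").unwrap_or_default()
    );
    if ids.split_whitespace().any(|w| ["debian", "ubuntu"].contains(&w)) {
        Some("apt")
    } else if ids.split_whitespace().any(|w| ["fedora", "rhel", "centos"].contains(&w)) {
        Some("dnf")
    } else {
        None
    }
}
```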
Rootfs patching runs after systemd detection. Even when systemd is
present, sdme patches the rootfs for nspawn compatibility:
systemd-resolved is masked (containers share the host's network
namespace and cannot bind 127.0.0.53), systemd-logind is unmasked if
masked (some OCI images like CentOS/AlmaLinux mask it, but
machinectl shell requires logind), and missing packages needed by
machinectl shell (e.g. util-linux and pam on RHEL-family, which
provide /etc/pam.d/login) are installed via chroot. These patches are
applied directly to the rootfs so all containers created from it inherit
a working environment without overlayfs-level fixups.
Staging areas and atomic operations ensure that a failed import
doesn't leave a half-written rootfs. The staging directory
.{name}.importing is renamed to the final location only on complete
success. If sdme finds a leftover staging directory on the next run, it
reports it and the --force flag cleans it up.
Cooperative Ctrl+C runs throughout the import pipeline. A global
INTERRUPTED flag is set by a POSIX signal handler (installed with
sigaction, deliberately without SA_RESTART so blocking reads return
EINTR). The import loop checks this flag between operations, allowing
clean cancellation of multi-gigabyte downloads and extractions.
sdme fs build takes a Dockerfile-like config and produces a new
rootfs:
FROM ubuntu
COPY ./my-app /opt/my-app
RUN apt-get update && apt-get install -y libssl3
RUN systemctl enable my-app.service
The build engine creates a staging container from the FROM rootfs, then processes operations sequentially. The key insight is that COPY and RUN have different requirements:
- COPY writes directly to the overlayfs upper layer while the container is stopped. This is a filesystem operation; no running container needed.
- RUN executes a command inside the container via machinectl shell. The container must be running.
The engine starts and stops the container as needed: if it encounters a RUN after a COPY, it starts the container; if it encounters a COPY after a RUN, it stops it first. This means a config with alternating COPY and RUN operations will start and stop the container multiple times, but in practice most configs group their COPYs at the top and RUNs at the bottom.
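The start/stop decisions form a tiny state machine over the operation list. A sketch (types and names are illustrative, not the build engine's actual code):

```rust
enum Op { Copy, Run }

// Sketch: walk the operations and emit "start"/"stop" transitions so that
// every COPY sees a stopped container and every RUN sees a running one.
fn transitions(ops: &[Op]) -> Vec<&'static str> {
    let mut events = Vec::new();
    let mut running = false;
    for op in ops {
        match op {
            Op::Run if !running => { events.push("start"); running = true; }
            Op::Copy if running => { events.push("stop"); running = false; }
            _ => {}
        }
        events.push(match op { Op::Copy => "COPY", Op::Run => "RUN" });
    }
    if running {
        events.push("stop"); // leave the staging container stopped at the end
    }
    events
}
```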
Path sanitisation rejects COPY destinations under directories that
systemd mounts tmpfs over at boot (/tmp, /run, /dev/shm). Files
written to the overlayfs upper layer in these locations would be hidden
by the tmpfs mount when the container starts, a silent data loss that's
hard to debug. Destinations under overlayfs-opaque directories are also
rejected. Errors include the config file path and line number for easy
debugging.
After all operations complete, the engine mounts the overlayfs manually (the container is stopped), copies the merged view to a staging rootfs directory, and does an atomic rename to the final location. The staging container is cleaned up regardless of success or failure.
The goal isn't to replace Docker or Podman. It's to give systemd-nspawn users a way to tap into the OCI ecosystem without leaving the systemd operational model.
OCI registry pulling implements the OCI Distribution Spec directly, no shelling out. The flow is: parse the image reference, probe the registry for auth requirements, obtain a bearer token if needed, fetch the manifest (resolving manifest lists by architecture), then download and extract layers in order.
Layer extraction handles OCI whiteout markers: .wh.<name> deletes a
file from the previous layer, .wh..wh..opq clears an entire
directory. Tar paths are sanitised (leading / is stripped and ..
components are rejected) to prevent path traversal escaping the
destination directory.
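The path sanitisation rule fits in a few lines. A sketch of the check described above (the function name is illustrative):

```rust
use std::path::{Component, Path, PathBuf};

// Sketch: strip any leading '/' from a tar entry path and reject paths
// containing '..' components, so no entry can escape the destination.
fn sanitize_tar_path(entry: &str) -> Option<PathBuf> {
    let trimmed = entry.trim_start_matches('/');
    let path = Path::new(trimmed);
    if path.components().any(|c| matches!(c, Component::ParentDir)) {
        return None;
    }
    Some(path.to_path_buf())
}
```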
When importing an OCI image, sdme classifies it as either a base OS image or an application image based on the image config:
| Classification | Criteria | Result |
|---|---|---|
| Base OS image | No entrypoint, shell default command, no ports | Extracted as a first-class sdme rootfs |
| Application image | Has entrypoint, non-shell command, or exposed ports | Placed under /oci/root inside a base rootfs |
Application images require a base rootfs specified with --base-fs:
sudo sdme fs import ubuntu docker.io/ubuntu:24.04 -v --install-packages=yes
sudo sdme fs import nginx docker.io/nginx --base-fs=ubuntu -v

Set a default to avoid repeating --base-fs:

sudo sdme config set default_base_fs ubuntu

The --oci-mode flag overrides auto-detection:
| Flag | Behavior |
|---|---|
| --oci-mode=auto | Auto-detect from image config (default) |
| --oci-mode=base | Force base OS mode |
| --oci-mode=app | Force application mode (requires --base-fs) |
Base OS import (debian, ubuntu, fedora) is straightforward: extract the rootfs and install systemd if missing. The result is a first-class sdme rootfs.
Application import produces what sdme calls a capsule: a copy of
a base OS rootfs with the OCI application rootfs placed under
/oci/root and a generated systemd service unit that chroots into it.
+------------------------------------------------------+
|                     HOST SYSTEM                      |
|         kernel . systemd . D-Bus . machined          |
|                           |                          |
|              sdme@name.service (nspawn)              |
|  +------------------------+-------------------+      |
|  |            CAPSULE (container)             |      |
|  |                                            |      |
|  |         systemd . D-Bus . journald         |      |
|  |                                            |      |
|  |  +--------------------------------------+  |      |
|  |  | sdme-oci-app.service                 |  |      |
|  |  | RootDirectory=/oci/root              |  |      |
|  |  |                                      |  |      |
|  |  |  +--------------------------------+  |  |      |
|  |  |  | OCI process                    |  |  |      |
|  |  |  | (nginx, mysql, ...)            |  |  |      |
|  |  |  +--------------------------------+  |  |      |
|  |  +--------------------------------------+  |      |
|  |                                            |      |
|  |  /oci/env     -- environment vars          |      |
|  |  /oci/ports   -- exposed ports             |      |
|  |  /oci/volumes -- declared volumes          |      |
|  +--------------------------------------------+      |
+------------------------------------------------------+
The generated sdme-oci-app.service unit uses
RootDirectory=/oci/root to chroot the process, MountAPIVFS=yes to
provide /proc, /sys, /dev, and EnvironmentFile=-/oci/env to
load the image's environment variables. The unit is enabled via symlink
in multi-user.target.wants/ so it starts automatically when the
container boots.
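Put together, the generated unit looks along these lines. This is a hand-written sketch, not the verbatim file sdme emits; the directives shown (RootDirectory=, MountAPIVFS=, EnvironmentFile=) are from the description above, while the Description= and the nginx ExecStart= are illustrative:

```ini
# sketch of a generated sdme-oci-app.service for an nginx image
[Unit]
Description=OCI application

[Service]
RootDirectory=/oci/root
MountAPIVFS=yes
EnvironmentFile=-/oci/env
ExecStart=/docker-entrypoint.sh nginx -g 'daemon off;'

[Install]
WantedBy=multi-user.target
```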
The capsule model means OCI applications get the full systemd
operational model for free: journalctl -u sdme-oci-app for logs,
systemctl restart for restarts, cgroup resource limits from the host.
The application doesn't know or care that it's inside an nspawn
container; it sees a chroot with API filesystems, exactly what it
expects.
Many OCI application images declare a non-root User in their config
(e.g. nginx runs as UID 101). systemd's User= directive resolves
usernames via NSS before entering the RootDirectory= chroot, so
users that only exist inside the OCI rootfs cause exit code 217/USER.
This is a known upstream limitation (see
systemd#12498,
systemd#19781).
sdme solves this with a generated static ELF binary (drop_privs,
under 1 KiB) that performs setgroups(0,NULL) then setgid then
setuid then chdir then execve, all via raw syscalls with no libc
or NSS dependency. The binary is written to /.sdme-drop-privs inside
the OCI root at import time (mode 0o111, owned by root). For non-root
users the generated unit becomes:
ExecStart=/.sdme-drop-privs <uid> <gid> <workdir> <entrypoint> [args...]

No User= or WorkingDirectory= directives are needed; the binary handles both. For root users, the standard User=root unit is generated unchanged.
The OCI User field is resolved at import time against etc/passwd
and etc/group inside the OCI rootfs, supporting named users, numeric
UIDs, explicit groups, and uid:gid pairs. Full details in
docs/hacks.md.
OCI images commonly symlink log files to /dev/stdout or
/dev/stderr (e.g. /var/log/nginx/error.log -> /dev/stderr -> /proc/self/fd/2). Under Docker, fds 1/2 are pipes, so open() on
/proc/self/fd/N succeeds. Under systemd, fds 1/2 are journal
sockets, and the kernel rejects open() on socket-backed
/proc/self/fd/N with ENXIO.
sdme solves this with a generated LD_PRELOAD shared library
(devfd_shim, approximately 4 KiB) that intercepts
open()/openat() at the libc symbol level. When the path matches
/dev/stdin, /dev/stdout, /dev/stderr, /dev/fd/{0,1,2}, or
/proc/self/fd/{0,1,2}, the interceptor returns dup(N) instead of
calling the real open. All other paths fall through to the real
syscall.
The shim is written to /.sdme-devfd-shim.so inside the OCI root at
import time, and the generated unit includes
Environment=LD_PRELOAD=/.sdme-devfd-shim.so. This applies to all OCI
containers (both root and non-root users). Full details in
docs/hacks.md.
OCI images declare exposed ports and volumes in their image config.
sdme reads these from /oci/ports and /oci/volumes in the rootfs
and auto-wires them at container creation time.
Port forwarding. When the container uses a private network namespace
(--private-network, --hardened, or --strict), sdme reads
/oci/ports and adds --port rules mapping each declared port to the
same host port. User --port flags take priority; matching ports are
skipped. On host-network containers, no forwarding is needed (services
bind directly to the host), so sdme prints an informational message
instead. Suppressed with --no-oci-ports.
Volume mounts. For each volume declared in /oci/volumes, sdme
creates a host-side directory at
{datadir}/volumes/{container}/{volume-name} and adds a bind mount
mapping it to /oci/root{volume-path} inside the container. Volume
data survives container removal (sdme prints the path but does not
delete it). User --bind flags take priority: if the user binds to the
same container path, the auto-mount is skipped. Suppressed with
--no-oci-volumes.
Remaining caveat. User-specified bind mounts (-b) and environment
variables (-e) on the container operate at the nspawn level, not
inside the RootDirectory=/oci/root chroot where the OCI application
runs. OCI-declared volumes and ports are handled correctly because sdme
maps volumes into the /oci/root subtree and ports operate at the
network namespace level.
OCI image environment variables are stored in /oci/env inside the
rootfs (loaded via EnvironmentFile=-/oci/env in the generated unit).
Runtime-only variables (e.g. MYSQL_ROOT_PASSWORD) must be added
manually before first boot:
echo 'MYSQL_ROOT_PASSWORD=secret' | sudo tee -a /var/lib/sdme/fs/mysql/oci/env

- One OCI service per container. Each rootfs generates a single sdme-oci-app.service.
- Environment variables need manual setup. Runtime-only variables must be added to /oci/env before first boot.
- No health checks. OCI HEALTHCHECK directives are ignored.
- No restart policy mapping. OCI restart policies don't map to systemd; the generated unit uses systemd defaults.
At this point this is all very exploratory. This journey is 1% complete.
By default, containers share the host's network namespace: same
interfaces, same addresses, same ports. This is the simplest mode and
what you get with sdme new out of the box.
For isolation, --private-network gives the container its own network
namespace with no connectivity. The remaining network flags build on
top of it and all imply --private-network automatically:
--private-network: isolated network namespace, no connectivity
--network-veth: creates a virtual ethernet link between host and container
--network-bridge <name>: connects the container's veth to a host bridge
--network-zone <name>: joins a named zone for inter-container networking
--port / -p <[PROTO:]HOST[:CONTAINER]>: forwards a port (TCP by default, repeatable)
These flags are available on both sdme create and sdme new. They
compose freely: you can combine --network-zone with --port, or use
--network-bridge with --network-veth. Port forwarding requires a
private network namespace because systemd-nspawn's --port flag only
works when host and container have separate network stacks.
Under the hood, each flag translates directly to a systemd-nspawn
argument. The network configuration is persisted in the container's
state file (PRIVATE_NETWORK, NETWORK_VETH, NETWORK_BRIDGE,
NETWORK_ZONE, PORTS) and written into the per-container nspawn
drop-in at start time.
Bridge and zone names are validated (alphanumeric, hyphens,
underscores). Port specs are validated for format
([PROTO:]HOST[:CONTAINER]) and range (1-65535).
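The port-spec grammar is small enough to sketch in full. An illustrative parser (not sdme's actual code) under the [PROTO:]HOST[:CONTAINER] format above, where the container port defaults to the host port:

```rust
// Sketch: parse and validate a [PROTO:]HOST[:CONTAINER] port spec.
// Returns (proto, host_port, container_port) or None if invalid.
fn parse_port(spec: &str) -> Option<(String, u16, u16)> {
    let mut parts: Vec<&str> = spec.split(':').collect();
    let proto = if matches!(parts.first(), Some(&"tcp") | Some(&"udp")) {
        parts.remove(0).to_string()
    } else {
        "tcp".to_string() // TCP by default, as documented
    };
    // u16 parsing enforces the upper bound; filter out port 0.
    let nums: Vec<u16> = parts
        .iter()
        .map(|p| p.parse::<u16>().ok().filter(|&n| n >= 1))
        .collect::<Option<Vec<_>>>()?;
    match nums[..] {
        [host] => Some((proto, host, host)),
        [host, container] => Some((proto, host, container)),
        _ => None,
    }
}
```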
Pods give multiple containers a shared network namespace so they can reach each other on localhost. The pattern is the same as Kubernetes pods: one network namespace, multiple containers.
There are two mechanisms for joining a pod, serving different use cases:
--pod (whole-container): The entire nspawn container runs in the
pod's network namespace. All processes (init, services, everything)
share the pod's network stack. This is the general-purpose option for
any container type.
--oci-pod (OCI app only): The pod's netns is bind-mounted into
the container and only the OCI app service process enters it via a
systemd NetworkNamespacePath= drop-in. The container's init and other
services remain in their own network namespace. This is for OCI app
containers that need pod networking for their application process but
want systemd's own networking to remain independent. Requires an OCI
app rootfs.
Both flags can be combined on the same container (e.g.
--pod=X --oci-pod=Y with different pods).
--pod (whole container in pod netns):
+------------------------------------------------------+
|                     HOST SYSTEM                      |
|         kernel . systemd . D-Bus . machined          |
|                                                      |
|    /run/sdme/pods/my-pod/netns (netns bind-mount)    |
|                      |                               |
|       +-- loopback only (127.0.0.1) --+              |
|       |                               |              |
|  +----+------------+         +--------+----------+   |
|  | container A     |         | container B       |   |
|  | (db :5432)      |         | (app :8080)       |   |
|  | nspawn --network|         | nspawn --network  |   |
|  | -namespace-     |         | -namespace-       |   |
|  | path=pod-netns  |         | path=pod-netns    |   |
|  +-----------------+         +-------------------+   |
+------------------------------------------------------+
Both containers see 127.0.0.1:5432 (db) and
127.0.0.1:8080 (app) because they share the netns.
sdme pod new <name> creates the pod:
- Validates the name (same rules as container names).
- Calls unshare(CLONE_NEWNET) to create a new network namespace.
- Brings up the loopback interface (ioctl SIOCSIFFLAGS IFF_UP).
- Bind-mounts /proc/self/ns/net to /run/sdme/pods/<name>/netns.
- Restores the original network namespace via setns().
- Writes a persistent state file at {datadir}/pods/<name>/state.
The runtime bind-mount under /run is volatile and disappears on
reboot. When a container referencing a pod is started,
ensure_runtime() checks whether the bind-mount still exists and
recreates the netns if needed.
sdme pod ls lists all pods with their active/inactive status.
sdme pod rm <name> unmounts the netns, removes the runtime directory,
and deletes the persistent state directory. Removal is refused if any
container still references the pod via POD or OCI_POD keys
(override with --force).
--pod=<name> on sdme create or sdme new:
- Works with any container type (host-rootfs or imported rootfs).
- Incompatible with --userns and --hardened: the kernel blocks setns(CLONE_NEWNET) from a child user namespace into the pod's netns (owned by the init userns). Use --oci-pod for hardened pods.
- Compatible with --private-network (without userns): the pod's netns already provides equivalent isolation (loopback only), so --private-network is automatically omitted from the nspawn invocation when a pod is used.
- Stores POD=<name> in the container's state file.
- At start time, the nspawn drop-in includes --network-namespace-path=/run/sdme/pods/<name>/netns, which makes the entire container (including init and all services) run inside the pod's network namespace.
--oci-pod=<name> on sdme create or sdme new:
- Requires an OCI app rootfs (sdme-oci-app.service must exist in the rootfs).
- Stores OCI_POD=<name> in the container's state file.
- At create time, writes a systemd drop-in inside the overlayfs upper layer at upper/etc/systemd/system/sdme-oci-app.service.d/oci-pod-netns.conf with NetworkNamespacePath=/run/sdme/oci-pod-netns.
- At start time, the nspawn drop-in includes --bind-ro=/run/sdme/pods/<name>/netns:/run/sdme/oci-pod-netns, which makes the pod's netns available inside the container. The inner drop-in makes only the OCI app service process enter the pod's netns.
sdme ps shows a POD column when any container has a --pod
assignment, an OCI-POD column when any container has an --oci-pod
assignment, a USERNS column when any container has --userns enabled,
and a BINDS column when any container has bind mounts configured.
The pod netns contains only a loopback interface with no routes. Containers in the pod can communicate via localhost but have no external connectivity unless a veth or bridge is added to the netns externally.
Inside the container, systemd-nspawn's default capability set drops
CAP_NET_ADMIN, so the container's root cannot add interfaces or
change routes in the shared netns. CAP_SYS_ADMIN is present
(required by systemd inside nspawn) but the container's PID namespace
prevents access to host process netns references. The OCI app process
itself (running as a non-root UID) has zero effective capabilities.
sdme exposes three cgroup-based resource controls:
| Flag | systemd property | Example |
|---|---|---|
| --memory <size> | MemoryMax= | --memory 2G |
| --cpus <count> | CPUQuota= | --cpus 0.5 (50%), --cpus 2 (200%) |
| --cpu-weight <1-10000> | CPUWeight= | --cpu-weight 100 |
These flags are available on sdme create, sdme new, and sdme set.
They are applied via a systemd drop-in file (limits.conf) installed
alongside the container's service unit. A daemon-reload is triggered
when the drop-in changes.
sdme set replaces all limits at once: flags not specified are removed.
If the container is running, sdme prints a note that a restart is needed
for the new limits to take effect.
Memory values accept K/M/G/T suffixes. CPU count is a positive float
where 1 means one full core and 0.5 means half a core (converted to
systemd's CPUQuota= percentage internally). CPU weight is an integer
from 1 to 10000, controlling relative scheduling priority when CPUs are
contended.
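Both conversions are mechanical. A sketch (illustrative helper names, not sdme's code) of the suffix parsing and the CPU-count-to-CPUQuota= mapping:

```rust
// Sketch: K/M/G/T memory suffixes to bytes (binary multiples assumed).
fn parse_memory(s: &str) -> Option<u64> {
    let (num, mult) = match s.as_bytes().last()? {
        b'K' => (&s[..s.len() - 1], 1u64 << 10),
        b'M' => (&s[..s.len() - 1], 1 << 20),
        b'G' => (&s[..s.len() - 1], 1 << 30),
        b'T' => (&s[..s.len() - 1], 1 << 40),
        _ => (s, 1), // no suffix: plain bytes
    };
    num.parse::<u64>().ok().map(|n| n * mult)
}

// Sketch: fractional CPU count to systemd's CPUQuota= percentage,
// where 1 core = 100%.
fn cpu_quota(cpus: f64) -> String {
    format!("CPUQuota={}%", (cpus * 100.0).round() as u64)
}
```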
Custom bind mounts and environment variables are set at creation time
with -b/--bind and -e/--env, available on both sdme create
and sdme new.
Bind mounts use the format HOST:CONTAINER[:ro]. Read-write is the
default; append :ro for read-only. Both paths must be absolute with
no .. components, and the host path must exist at creation time. At
start time, each bind becomes a --bind= or --bind-ro= argument to
systemd-nspawn.
Environment variables use the format KEY=VALUE. Keys must be
alphanumeric or underscore (no leading digit). At start time, each
variable becomes a --setenv=KEY=VALUE argument to systemd-nspawn.
Both are stored in the container's state file (BINDS= and ENVS=,
pipe-separated) and reconstituted into nspawn arguments on every start.
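The bind-spec validation rules above can be approximated in shell (a simplified sketch, not sdme's actual Rust implementation):

```sh
# Simplified sketch of the HOST:CONTAINER[:ro] validation rules
valid_bind() {
    case "$1" in
        *..*)            return 1 ;;  # no .. components in either path
        /*:/*:ro|/*:/*)  return 0 ;;  # both paths absolute, optional :ro suffix
        *)               return 1 ;;  # anything else (relative paths, bad format)
    esac
}
valid_bind /srv/data:/data:ro && echo "accepted"
valid_bind data:/data         || echo "rejected: host path not absolute"
```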
sdme stores its settings in a TOML file at ~/.config/sdme/sdmerc:
| Setting | Default | Description |
|---|---|---|
| `interactive` | `true` | Enable interactive prompts |
| `datadir` | `/var/lib/sdme` | Root directory for all container and rootfs data |
| `boot_timeout` | `60` | Seconds to wait for container boot before giving up |
| `join_as_sudo_user` | `true` | Join host-rootfs containers as `$SUDO_USER` instead of root |
| `host_rootfs_opaque_dirs` | `/etc/systemd/system,/var/log` | Default opaque dirs for host-rootfs containers (empty string disables) |
Settings are read with sdme config get and written with
sdme config set <key> <value>.
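A sdmerc populated with the defaults above would look like this (key names as documented; the TOML layout itself is assumed):

```toml
# ~/.config/sdme/sdmerc with the documented defaults spelled out
interactive = true
datadir = "/var/lib/sdme"
boot_timeout = 60
join_as_sudo_user = true
host_rootfs_opaque_dirs = "/etc/systemd/system,/var/log"
```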
Sudo interplay. Since sdme runs as root via sudo, the config file
lookup checks whether $SUDO_USER is set to a non-root user. If so,
and if that user's ~/.config/sdme/sdmerc exists, it is used instead
of root's config. This means the invoking user's preferences are
respected without requiring the config to live under /root. An
explicit -c/--config flag overrides all resolution logic.
Config files are written with mode 0600 and directories with mode
0700.
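The resolution order can be sketched as follows (an approximation of the behaviour described above, not sdme's actual code; the `-c/--config` override is omitted):

```sh
# Approximate sketch of sdme's config-file resolution under sudo
resolve_config() {
    if [ -n "$SUDO_USER" ] && [ "$SUDO_USER" != "root" ]; then
        home=$(getent passwd "$SUDO_USER" | cut -d: -f6)
        if [ -n "$home" ] && [ -f "$home/.config/sdme/sdmerc" ]; then
            echo "$home/.config/sdme/sdmerc"   # invoking user's config wins
            return
        fi
    fi
    echo "$HOME/.config/sdme/sdmerc"           # fall back to root's config
}
```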
This section documents sdme's security implementation: capabilities,
seccomp, AppArmor, the --hardened and --strict flags, and input
sanitization. For comparisons with Docker and Podman, see
docs/security.md.
systemd-nspawn retains 26 capabilities by default:
CAP_AUDIT_CONTROL CAP_AUDIT_WRITE CAP_CHOWN
CAP_DAC_OVERRIDE CAP_DAC_READ_SEARCH CAP_FOWNER
CAP_FSETID CAP_IPC_OWNER CAP_KILL
CAP_LEASE CAP_LINUX_IMMUTABLE CAP_MKNOD
CAP_NET_BIND_SERVICE CAP_NET_BROADCAST CAP_NET_RAW
CAP_SETFCAP CAP_SETGID CAP_SETPCAP
CAP_SETUID CAP_SYS_ADMIN CAP_SYS_BOOT
CAP_SYS_CHROOT CAP_SYS_NICE CAP_SYS_PTRACE
CAP_SYS_RESOURCE CAP_SYS_TTY_CONFIG
CAP_NET_ADMIN is added only when --private-network is active, since
it is safe to grant when the container has its own network namespace
(changes only affect the isolated namespace, not the host).
CAP_SYS_ADMIN is the most significant capability in this set. It is
required for systemd to function inside the container: mounting
filesystems, configuring cgroups, managing namespaces for its own
services. This cannot be dropped without breaking the
systemd-inside-nspawn model.
Notable exclusions: CAP_SYS_MODULE (no kernel module loading),
CAP_SYS_RAWIO (no raw I/O port access), CAP_SYS_TIME (no system
clock modification), CAP_BPF (no BPF program loading),
CAP_SYSLOG, and CAP_IPC_LOCK.
sdme provides fine-grained capability management:
- `--drop-capability CAP_X`: drop individual capabilities from nspawn's default set. Accepts names with or without the `CAP_` prefix.
- `--capability CAP_X`: add capabilities not in the default set (e.g. `CAP_NET_ADMIN` for containers with `--private-network`).
Both flags are repeatable and validated against a known set of Linux capabilities. Specifying the same capability in both is rejected as contradictory.
nspawn applies a built-in allowlist-based seccomp filter. Syscalls not
on the allowlist are blocked with EPERM (for known syscalls) or
ENOSYS (for unknown ones).
Allowed by default: @basic-io, @file-system, @io-event, @ipc,
@mount, @network-io, @process, @resources, @setuid, @signal,
@sync, @timer, and about 50 individual syscalls.
Blocked unconditionally: kexec_load, kexec_file_load,
perf_event_open, fanotify_init, open_by_handle_at, quotactl,
the @swap group, and the @cpu-emulation group.
Capability-gated: @clock requires CAP_SYS_TIME, @module requires
CAP_SYS_MODULE, @raw-io requires CAP_SYS_RAWIO. Since none of
these capabilities are in the default bounding set, these syscall groups
are effectively blocked.
--system-call-filter layers additional seccomp filters on top of
nspawn's baseline. It uses systemd's group syntax:
- `@group`: allow a syscall group
- `~@group`: deny a syscall group
sdme create mybox --system-call-filter ~@raw-io
sdme create mybox --system-call-filter ~@cpu-emulation
The flag is repeatable. Note that `~@mount` breaks systemd inside the
container; systemd's need to mount filesystems is the very reason nspawn
allows the group in the first place.
sdme ships a default AppArmor profile (sdme-default) designed for
systemd-nspawn system containers. The profile allows the operations
required for systemd boot (mount, pivot_root, signal, unix sockets)
while denying dangerous host-level access (raw device I/O, /proc
sysctl writes, kernel module paths). It is applied via
AppArmorProfile= in the systemd service unit drop-in.
The profile is automatically applied by --strict. It can also be used
standalone:
sdme create mybox --apparmor-profile sdme-default
To install the profile:
sdme config apparmor-profile > /etc/apparmor.d/sdme-default
apparmor_parser -r /etc/apparmor.d/sdme-default
The deb and rpm packages install and load the profile automatically.
SELinux is not supported. sdme has no SELinux integration. During
rootfs import (sdme fs import), security.selinux extended
attributes are explicitly skipped because they do not transfer
meaningfully between filesystems.
--no-new-privileges passes --no-new-privileges=yes to nspawn.
Off by default because interactive containers typically want sudo/su
to work; no_new_privs blocks privilege escalation via setuid binaries
and file capabilities. Enabled by --hardened and --strict.
--read-only makes the overlayfs merged view read-only.
Applications needing writable areas use bind mounts (-b).
--hardened is sdme's one-flag defense-in-depth bundle. It enables
multiple security layers at once:
- User namespace isolation (`--private-users=pick --private-users-ownership=auto`): container root maps to a high unprivileged UID on the host.
- Private network namespace (`--private-network`): the container gets its own network namespace with loopback only.
- `--no-new-privileges=yes`: blocks privilege escalation via setuid binaries and file capabilities.
- Capability drops: `CAP_SYS_PTRACE`, `CAP_NET_RAW`, `CAP_SYS_RAWIO`, and `CAP_SYS_BOOT`, reducing from 26 to 22 retained capabilities.
sdme create mybox --hardened
sdme new mybox --hardened
When cloning the host rootfs (no -r flag), --hardened has
several visible effects because the container inherits the host's
installed binaries and enabled services:
- No internet. `--private-network` gives the container only a loopback interface. Host services that assume network access (sshd, NTP, avahi, etc.) will fail or retry indefinitely.
- `sudo`/`su` silently fail. `--no-new-privileges` prevents setuid escalation. The binaries exist and appear normal, but the kernel blocks the privilege transition.
- `ping` and raw-socket tools fail. `CAP_NET_RAW` is dropped.
- `strace`/`gdb` fail. `CAP_SYS_PTRACE` is dropped.
sdme new prints notes about --private-network and
--no-new-privileges when cloning the host rootfs. For imported rootfs
(e.g. -r ubuntu), these effects are less surprising because the rootfs
was built for container use.
For a host-rootfs container where these side effects matter, apply individual flags instead:
sdme create mybox --userns --drop-capability CAP_SYS_RAWIO
Composable with fine-grained flags. --hardened sets a baseline
that individual flags can override or extend:
sdme create mybox --hardened --capability CAP_NET_RAW # re-enable a dropped cap
sdme create mybox --hardened --system-call-filter ~@raw-io # add seccomp filter
sdme create mybox --hardened --apparmor-profile myprofile # add MAC confinement
sdme create mybox --hardened --read-only # read-only rootfs
Configurable. The capabilities dropped by --hardened are
controlled by the hardened_drop_caps config key:
sdme config set hardened_drop_caps CAP_SYS_PTRACE,CAP_NET_RAW
--strict closes the gaps between --hardened and Docker/Podman
defaults. It implies --hardened and adds:
- Aggressive capability drops: retains only the ~14 capabilities Docker grants (AUDIT_WRITE, CHOWN, DAC_OVERRIDE, FOWNER, FSETID, KILL, MKNOD, NET_BIND_SERVICE, SETFCAP, SETGID, SETPCAP, SETUID, SYS_CHROOT) plus `CAP_SYS_ADMIN` (required for systemd init). Also drops `CAP_NET_RAW` (carried over from `--hardened`, stricter than Docker). Drops 27 capabilities total.
- Seccomp filters: denies the `@cpu-emulation`, `@debug`, `@obsolete`, and `@raw-io` syscall groups on top of nspawn's baseline filter.
- AppArmor profile: applies the `sdme-default` profile, which confines `/proc` and `/sys` writes and raw device access at the MAC level.
sdme create mybox --strict
sdme new mybox --strict
When cloning the host rootfs, --strict compounds the effects of
--hardened with additional restrictions:
- `systemd-networkd` and `NetworkManager` fail. `CAP_NET_ADMIN` is dropped.
- `systemd-timesyncd` cannot set the clock. `CAP_SYS_TIME` is dropped.
- Logging to `/dev/kmsg` is denied. `CAP_SYSLOG` is dropped.
- Nice/priority adjustments fail. `CAP_SYS_NICE` is dropped.
- The AppArmor profile must be installed first. The `sdme-default` profile is checked at start time; if it is not loaded, the container fails to start with instructions for installation.
For host-rootfs use, --hardened with selective additions is often
more practical than --strict:
sdme create mybox --hardened --system-call-filter ~@raw-io
--strict is best suited for imported rootfs images where the service
set is known and controlled.
Why CAP_SYS_ADMIN is retained. CAP_SYS_ADMIN is required for
systemd to function inside the container. It needs to mount filesystems,
configure cgroups, and manage namespaces for its own services. With user
namespace isolation (enabled by --strict), CAP_SYS_ADMIN is scoped
to the user namespace. It does not grant host-level SYS_ADMIN. A process
that escapes the container lands in an unprivileged context on the host.
Composable with fine-grained flags. Like --hardened, --strict
sets a baseline that individual flags can override:
sdme create mybox --strict --capability CAP_NET_RAW # re-enable a dropped cap
sdme create mybox --strict --read-only # add read-only rootfs
sdme create mybox --strict --apparmor-profile custom # use a custom profile
sdme runs as root and handles untrusted input: tarballs from the internet, OCI images from public registries, QCOW2 disk images from unknown sources. Several hardening measures are in place:
Path traversal prevention. OCI layer tar paths are sanitised before
extraction: .. components are rejected and leading / is stripped.
Whiteout marker handling verifies (via canonicalize()) that the target
path stays within the destination directory before deleting anything.
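The tar-path sanitisation can be approximated in shell (simplified; sdme implements this in Rust, and the real check also handles whiteout markers):

```sh
# Simplified sketch of OCI tar-path sanitisation: reject .., strip leading /
sanitise_path() {
    case "$1" in
        ..|../*|*/..|*/../*) echo "rejected" >&2; return 1 ;;
    esac
    printf '%s\n' "${1#/}"
}
sanitise_path /etc/passwd                # etc/passwd
sanitise_path ../../etc/shadow || true   # rejected
```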
Digest validation. OCI blob digests (sha256:abc123...) are
validated for safe characters (alphanumeric and hex only) and correct
length (64 chars for SHA-256, 128 for SHA-512) before being used to
construct filesystem paths. A malicious manifest cannot use the digest
field for directory traversal.
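The digest check can be sketched as follows (illustrative; the real validation lives in sdme's OCI code, and this version only accepts lowercase hex):

```sh
# Sketch of OCI digest validation: algorithm prefix plus fixed-length hex
valid_digest() {
    printf '%s\n' "$1" | grep -Eq '^sha256:[0-9a-f]{64}$|^sha512:[0-9a-f]{128}$'
}
h64=$(printf 'a%.0s' $(seq 64))          # 64 hex characters
valid_digest "sha256:$h64"    && echo "ok"
valid_digest "sha256:../evil" || echo "rejected"
```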
Download size cap. URL downloads are capped at 50 GiB
(MAX_DOWNLOAD_SIZE), checked during streaming. A malicious or
misbehaving server cannot fill the disk by sending an unbounded
response.
Rootfs name validation. The -r/--fs parameter (on create and
new) is validated with validate_name() (alphanumeric, hyphens, no
leading/trailing hyphens, no ..) before being used to construct
filesystem paths.
Opaque directory validation. Paths must be absolute, contain no
.. components, no empty strings, no duplicates. Normalised before
storage.
Permission hardening. Config files are written with mode 0o600,
config directories with 0o700. Overlayfs work directories get mode
0o700.
Umask enforcement. Container creation refuses to proceed if the
process umask strips read or execute from "other"
(umask & 005 != 0). A restrictive umask would make overlayfs
upper-layer files inaccessible to non-root services like dbus-daemon,
preventing container boot. This actually happened during development:
running sdme create after setting the umask to 007 caused sdme start to
misbehave mysteriously.
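The check is equivalent to this shell test (an illustrative sketch of the rule, not sdme's code):

```sh
# Sketch of the umask guard: refuse if 'other' loses read or execute
check_umask() {
    u=$(umask)                             # e.g. 0022 or 0007
    if [ $(( 0$u & 005 )) -ne 0 ]; then
        echo "refusing: umask $u strips r/x from 'other'" >&2
        return 1
    fi
}
( umask 022; check_umask && echo "umask ok" )
( umask 007; check_umask || echo "creation refused" )
```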
If you find a way to escape a container, traverse a path, or corrupt the host filesystem through sdme, please open an issue.
Multi-step operations in sdme are designed to fail cleanly rather than leave broken state behind.
Transactional imports use a staging directory
(.{name}.importing) that is atomically renamed on success. Partial
imports are either cleaned up immediately or detected and reported on
the next run.
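The pattern can be demonstrated with plain shell (a sketch of the idea; the staging-dir name follows the `.{name}.importing` convention above):

```sh
# Sketch of the transactional-import pattern: stage, then atomic rename
set -e
work=$(mktemp -d) && cd "$work"
mkdir payload && echo hello > payload/marker
tar -cf rootfs.tar -C payload .

stage=".ubuntu.importing"            # hidden staging dir, easy to detect later
mkdir "$stage"
tar -C "$stage" -xf rootfs.tar       # a failure here leaves only the staging dir
mv "$stage" ubuntu                   # rename(2) is atomic on the same filesystem
test -f ubuntu/marker && echo "import complete"
```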
Cooperative interrupt handling uses a global AtomicBool flag set
by a POSIX SIGINT handler. The handler is installed without
SA_RESTART, so blocking system calls (file reads, network I/O) return
EINTR immediately. Import loops, boot-wait loops, and build
operations check the flag between steps, allowing Ctrl+C to cancel
cleanly at any point. The handler also restores the default SIGINT
disposition after the first press, so a second Ctrl+C force-kills the
process. This covers cases where Rust's stdlib retries
poll()/connect() on EINTR, preventing cooperative cancellation
during blocked DNS resolution or TCP connection attempts.
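A shell analogue of the same pattern (sdme implements this in Rust with an AtomicBool and sigaction; the `kill` here merely simulates a Ctrl+C):

```sh
# Shell analogue of cooperative interrupt handling: the trap sets a flag
# and restores the default disposition, so a second Ctrl+C kills outright.
interrupted=0
trap 'interrupted=1; trap - INT' INT
for step in fetch extract register; do
    if [ "$interrupted" -eq 1 ]; then
        echo "cancelled before $step"
        break
    fi
    echo "running $step"
    if [ "$step" = "fetch" ]; then
        kill -INT $$                 # simulate Ctrl+C after the first step
    fi
done
```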
Boot failure cleanup differs by intent: sdme new removes the
container on boot or join failure (the user wanted a running container,
not a broken one), while sdme start (from previous sdme create)
preserves it for debugging.
Health checks in sdme ps detect containers with missing
directories or missing rootfs and report them as broken rather than
crashing or silently hiding them.
Build failure cleanup removes the staging container and any partial rootfs on error, regardless of which build step failed.
If you find a way to leave sdme's state inconsistent (a container that can't be listed, removed, or recovered), please open an issue.