How to bring up a forkd controller, what to watch in production, and how to recover from common failure modes.
- x86_64 Linux, kernel 5.10+ (5.20+ recommended for free RNG re-seed)
- KVM available (/dev/kvm present, your user in the kvm group)
- Firecracker binary v1.7+ on the $PATH
- cgroup v2 unified hierarchy (mount -t cgroup2)
- iproute2
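The prerequisites above can be checked before running the setup scripts. The following is a minimal sketch, not part of forkd: the function names kernel_at_least and preflight are hypothetical, it assumes the cgroup v2 hierarchy is mounted at /sys/fs/cgroup and that the Firecracker binary is named firecracker, and it does not check the Firecracker version.

```shell
# kernel_at_least MIN RELEASE
# Succeeds if RELEASE (e.g. "5.15.0-91-generic") is at least MIN (e.g. "5.10").
kernel_at_least() {
  local min_major=${1%%.*} min_minor=${1#*.}
  local rel=${2%%-*}                 # drop a "-91-generic" style suffix
  local major=${rel%%.*}
  local rest=${rel#*.}
  local minor=${rest%%.*}
  minor=${minor%%[!0-9]*}            # keep only the leading digits
  [ "$major" -gt "$min_major" ] ||
    { [ "$major" -eq "$min_major" ] && [ "$minor" -ge "$min_minor" ]; }
}

# preflight: prints one line per failed prerequisite, returns non-zero if any failed.
preflight() {
  local failed=0
  kernel_at_least 5.10 "$(uname -r)" ||
    { echo "kernel older than 5.10"; failed=1; }
  { [ -c /dev/kvm ] && [ -r /dev/kvm ] && [ -w /dev/kvm ]; } ||
    { echo "/dev/kvm missing or not accessible to this user"; failed=1; }
  command -v firecracker >/dev/null ||
    { echo "firecracker not on PATH"; failed=1; }
  # Assumption: the unified hierarchy is mounted at /sys/fs/cgroup.
  [ -f /sys/fs/cgroup/cgroup.controllers ] ||
    { echo "cgroup v2 not found at /sys/fs/cgroup"; failed=1; }
  command -v ip >/dev/null ||
    { echo "iproute2 (ip) not found"; failed=1; }
  return "$failed"
}
```

Run preflight as the user that will start the daemon; a zero exit status means every check passed.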
sudo bash scripts/setup-host.sh # KVM, Firecracker, KSM tuning
sudo bash scripts/netns-setup.sh 100 # provision N child netns
cargo build --release
sudo install -m 0755 target/release/forkd-controller /usr/local/bin/
sudo install -m 0755 target/release/forkd /usr/local/bin/
sudo install -m 0644 packaging/systemd/forkd-controller.service /etc/systemd/system/
sudo mkdir -p /etc/forkd /var/lib/forkd /var/log/forkd
sudo bash -c 'head -c 32 /dev/urandom | base64 > /etc/forkd/token'
sudo chmod 600 /etc/forkd/token
# Optional but recommended for any non-loopback bind: TLS.
# Drop a cert.pem + key.pem (Let's Encrypt or your internal CA) into
# /etc/forkd/tls/ and add --tls-cert / --tls-key to the systemd ExecStart.
sudo systemctl daemon-reload
sudo systemctl enable --now forkd-controller
Verify:
curl http://127.0.0.1:8889/healthz
# {"ok":true}
curl http://127.0.0.1:8889/metrics
# forkd_sandboxes_active 0
The daemon's token lives in /etc/forkd/token. Rotate by writing a
new value and restarting the daemon (systemctl restart forkd-controller).
Existing sandboxes survive the restart; in-flight HTTP requests do not.
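The rotation described above can be sketched as a small helper. This is an illustration under assumptions, not a script shipped with forkd: rotate_token is a hypothetical name, and the token format simply mirrors the head -c 32 /dev/urandom | base64 command used at install time. The new token is written to a temporary file next to the old one and renamed into place, so the daemon never reads a half-written file.

```shell
# rotate_token TOKEN_FILE
# Replaces TOKEN_FILE with a fresh random token, mode 0600.
rotate_token() {
  local token_file=$1 tmp
  # mktemp creates the file readable and writable by the owner only.
  tmp=$(mktemp "${token_file}.XXXXXX") || return 1
  head -c 32 /dev/urandom | base64 > "$tmp" || { rm -f "$tmp"; return 1; }
  chmod 600 "$tmp"
  # A rename within one directory replaces the old file atomically.
  mv -f "$tmp" "$token_file"
}

# Example (as root):
#   rotate_token /etc/forkd/token && systemctl restart forkd-controller
```

Clients holding the old token need the new value once the daemon has restarted.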
Scrape :8889/metrics from Prometheus. Names that should always
exist: forkd_snapshots_total, forkd_sandboxes_active,
forkd_build_info.
Suggested alerts:
- forkd_sandboxes_active > 80% of host vCPU count for 5 min
- absence of forkd_build_info for 1 min → daemon down
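Before wiring up alerts, it can help to confirm that the expected metric names are actually exported and to work out the numeric threshold for the first alert. The sketch below is an assumption-laden helper, not part of forkd: missing_metrics and sandbox_alert_threshold are hypothetical names, and the input is assumed to be the Prometheus text format served at /metrics.

```shell
# missing_metrics: reads a /metrics body on stdin and prints each
# always-present metric name that has no sample line.
missing_metrics() {
  local body name
  body=$(cat)
  for name in forkd_snapshots_total forkd_sandboxes_active forkd_build_info; do
    # A sample line starts with the name, followed by a space or '{' (labels).
    printf '%s\n' "$body" | grep -Eq "^${name}([ {]|\$)" || echo "$name"
  done
}

# sandbox_alert_threshold VCPUS: 80% of the vCPU count, rounded down.
sandbox_alert_threshold() {
  echo $(( $1 * 80 / 100 ))
}

# Example:
#   curl -s http://127.0.0.1:8889/metrics | missing_metrics
#   sandbox_alert_threshold "$(nproc)"
```

An empty result from missing_metrics means all three names are present; on a 20-vCPU host the threshold works out to 16 active sandboxes.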
/var/log/forkd/audit.log, one JSON object per line:
{"ts":"2026-05-12T07:12:34Z","method":"POST","path":"/v1/sandboxes","status":201,"latency_us":98342,"ua":"forkd-cli/0.1"}
Rotate with logrotate. Reopening the file on SIGHUP is not yet
implemented; for now, run systemctl restart forkd-controller after a rotate.
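Because each audit entry is a single JSON line, slow requests can be pulled out with standard text tools. The following is a rough sketch that assumes every line carries a "latency_us":<integer> field formatted exactly as in the sample above (no whitespace after the colon); slow_requests is a hypothetical helper name.

```shell
# slow_requests MIN_US: reads audit log lines on stdin and prints those
# whose latency_us is at least MIN_US microseconds.
slow_requests() {
  awk -v min="$1" '
    match($0, /"latency_us":[0-9]+/) {
      # Skip the 13 characters of the "latency_us": prefix to get the digits.
      value = substr($0, RSTART + 13, RLENGTH - 13)
      if (value + 0 >= min + 0) print
    }
  '
}

# Example: requests that took 50 ms or longer
#   slow_requests 50000 < /var/log/forkd/audit.log
```

A JSON-aware tool is more robust if the field layout ever changes; the pattern match here only keeps the example dependency-free.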
Another process is already on that port. Run ss -ltnp | grep 8889 to find
it. If it is a stale daemon, run pkill -f forkd-controller, then
systemctl start forkd-controller.
Usually one of:
- The snapshot was created against a different kernel than what's available on this host. Re-create with the current kernel.
- Out of disk on /tmp: each child needs ~5 MiB of work-dir space.
- Out of memory: the host has hit memory.max on /sys/fs/cgroup/forkd/ or has no free RAM. Tune memory_limit_mib per child.
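The disk and memory causes above can be checked quickly from a shell. This is a minimal sketch with hypothetical helper names (workdir_mib_needed, tmp_has_room, mem_headroom_bytes); it takes the ~5 MiB per child figure at face value and reads the standard cgroup v2 files memory.max and memory.current.

```shell
# workdir_mib_needed N: approximate work-dir space for N children, in MiB.
workdir_mib_needed() {
  echo $(( $1 * 5 ))
}

# tmp_has_room N [DIR]: succeeds if DIR (default /tmp) has room for N children.
tmp_has_room() {
  local need_kib avail_kib
  need_kib=$(( $1 * 5 * 1024 ))
  # df -Pk prints one line per filesystem; column 4 is available KiB.
  avail_kib=$(df -Pk "${2:-/tmp}" | awk 'NR == 2 { print $4 }')
  [ "$avail_kib" -ge "$need_kib" ]
}

# mem_headroom_bytes CGROUP_DIR: memory.max minus memory.current,
# or "unlimited" when no limit is set.
mem_headroom_bytes() {
  local max cur
  max=$(cat "$1/memory.max") || return 1
  cur=$(cat "$1/memory.current") || return 1
  if [ "$max" = "max" ]; then
    echo unlimited
  else
    echo $(( max - cur ))
  fi
}

# Example:
#   tmp_has_room 200 || echo "/tmp too small for 200 children"
#   mem_headroom_bytes /sys/fs/cgroup/forkd
```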
Network namespace mismatch. With per_child_netns: true, the agent is
reachable only from inside its netns; the daemon does the setns(2)
dance for you, but the host must have /var/run/netns/forkd-child-<i>
provisioned via scripts/netns-setup.sh N. Re-run that script if you've
restarted networking.
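A quick way to tell whether the namespaces are still provisioned is to count the forkd-child-* entries. The helper below is a hypothetical sketch (count_child_netns is not a forkd command); it only counts entries and makes no assumption about how the indices are numbered.

```shell
# count_child_netns [DIR]: number of forkd-child-* entries in DIR
# (default /var/run/netns).
count_child_netns() {
  local dir=${1:-/var/run/netns} n=0 entry
  for entry in "$dir"/forkd-child-*; do
    # An unmatched glob stays literal, so confirm the entry exists.
    if [ -e "$entry" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Example: re-provision if fewer than 100 namespaces remain
#   [ "$(count_child_netns)" -ge 100 ] || sudo bash scripts/netns-setup.sh 100
```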
The daemon checks /proc/<pid> for each registered sandbox's
Firecracker PID on startup. If the host rebooted (or the daemon's
state file outlived the FC processes), there's nothing to recover —
sandboxes don't survive host reboots. Create new ones from the
existing snapshots.
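The liveness check described above amounts to testing whether /proc/<pid> exists. The sketch below illustrates the idea in shell; it is not the daemon's code, and both the live_entries name and the "id pid" input format are hypothetical.

```shell
# live_entries: reads "ID PID" pairs on stdin and prints only those
# whose process still has a /proc entry.
live_entries() {
  local id pid
  while read -r id pid; do
    if [ -d "/proc/$pid" ]; then
      printf '%s %s\n' "$id" "$pid"
    fi
  done
}

# Example:
#   printf 'sb-1 4242\nsb-2 4343\n' | live_entries
```

Note that an existing /proc/<pid> only shows that some process holds that PID. After a reboot PIDs are reused, so comparing /proc/<pid>/comm against the expected binary name is a stricter test.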
On a 20-vCPU / 30 GiB host (the dev-bench config), forkd has been exercised at N=200 children sharing one snapshot in ~750 ms wall-clock. Sustained throughput depends on:
- KSM enabled and tuned (scripts/setup-host.sh sets sensible defaults)
- Snapshot's memory image size: smaller parents fork faster
- Whether you use per_child_netns (adds ~3 ms per child for the netns + agent probe round-trip)
For your own host, run bench/bench-spawn-100.sh and look at the
forkd line in bench/chart-spawn-100.png.
- Take the daemon down: systemctl stop forkd-controller. Live sandboxes get killed; persistent registry survives.
- cargo build --release && install -m 0755 target/release/forkd-controller /usr/local/bin/
- systemctl start forkd-controller. On startup, reconcile() prunes sandbox entries whose Firecracker PID is gone.
- Verify /version reports the new build.