This document describes how forkd implements fork-on-write for Firecracker microVMs and the constraints that drove the design.
flowchart LR
subgraph PARENT_LIFE["1. Parent lifecycle (once per parent image)"]
direction TB
boot["Vm::boot(BootConfig)<br/>Firecracker InstanceStart"]
warm["userspace warm-up<br/>Python imports, JIT, model load"]
pause["Vm::pause<br/>PATCH /vm Paused"]
snapshot["Vm::snapshot_to<br/>writes memory.bin + vmstate"]
boot --> warm --> pause --> snapshot
end
subgraph FORK["2. Fork-out (per cohort of N children)"]
direction TB
spawn["Snapshot::restore_many_with<br/>spawn N Firecracker procs in parallel"]
load["PUT /snapshot/load on each<br/>mmap(memory.bin, MAP_PRIVATE)"]
place["place each PID into<br/>/sys/fs/cgroup/forkd/child-i<br/>memory.max = quota"]
ns["each child runs inside<br/>netns forkd-child-i"]
spawn --> load --> place --> ns
end
subgraph RUN["3. Runtime per child"]
direction TB
cow["kernel CoWs diverged pages<br/>shared pages stay shared"]
agent["forkd-agent.py listens on :8888<br/>(ping / exec / eval over TCP)"]
cow --- agent
end
PARENT_LIFE --> FORK --> RUN
classDef phase fill:#ffffff,stroke:#52606d,color:#1f2933;
class PARENT_LIFE,FORK,RUN phase;
The kernel does the hard part (CoW page management). forkd's job is correctness, isolation, and orchestration.
forkd builds on Firecracker rather than a container runtime or gVisor because:
- Snapshot/restore is first-class. Firecracker's
/snapshot/loadwithMEMORY_LOAD_PRIVATEis the exact primitive we need. - KVM-backed. Each child gets a hardware isolation boundary, not a syscall filter or namespace.
- Small. ~5 MiB of resident memory per VM process before any guest state.
- Stable API. Rust ecosystem, well-trodden by AWS Lambda.
forkd uses upstream Firecracker — no vendored fork.
crates/
forkd-vmm Firecracker wrapper. BootConfig, Vm, Snapshot,
ForkOpts, cgroup helpers, network namespace
plumbing, raw HTTP/1.1 over Unix socket.
forkd-cli `forkd` binary. CLI surface: snapshot, fork,
run, exec, eval, ping.
forkd-controller `forkd-controller serve`. REST API, persistent
registry, audit log, /metrics, bearer-token
auth, graceful shutdown.
rootfs-init/
forkd-init.sh PID 1 inside the guest. Mounts pseudo-fs, fixes
DNS, launches the agent.
forkd-agent.py TCP server on :8888 (ping / exec / eval).
sdk/python/ E2B-compatible Python SDK.
Putting memory.bin on tmpfs invites OOM kill. Slow disk kills
restore latency. forkd writes the image to ext4 by default and relies
on the page cache; on hosts with hugepages provisioned (per
scripts/setup-host.sh) the kernel transparently backs hot pages
with 2 MiB pages.
Future: explicit memfd_create(MFD_HUGETLB)-backed memory for
high-N fork-out, where the savings on page-table size dominate.
Children boot with the parent's RNG state and the parent's tsc_offset.
Both are cryptographically broken if exposed externally.
- RNG: Linux 5.20+ exposes
vmgenid, a virtio-device-driven "generation counter" that the guest kernel watches; Firecracker bumps the counter on restore and the guest's CRNG re-seeds from /dev/hw_random automatically. forkd relies on this — no userspace daemon required. - TSC: Firecracker assigns a fresh
tsc_offseton each restore via its--rdtschandling. This is enabled by default.
All children inherit the parent snapshot's MAC and guest IP. Without network isolation they would collide on the host bridge.
forkd places each child in its own pre-provisioned network namespace
(forkd-child-1 … forkd-child-N). Each namespace has:
- An independent tap (same name, same IP — different network stack).
- A
vethpair into a sharedforkd-br0bridge for outbound NAT. - SNAT on egress so the bridge can reverse-route replies.
See scripts/netns-setup.sh for the exact iptables rules.
Children need a writable rootfs but should share the base image.
Today forkd uses Firecracker's built-in read-write attachment with overlayfs on the host. Each child gets a fresh upper-dir, lower dir is the shared rootfs. Writes are per-child; nothing persists post-exit.
Future: dm-thin for production density beyond a few hundred
concurrent children.
Default KSM (kernel same-page merging) is too lazy — minutes to
reach steady-state sharing. scripts/setup-host.sh tunes
pages_to_scan and sleep_millisecs for forkd's workload. With CoW
mmap, KSM is a backstop for divergent-but-similar pages; it doesn't
need to do the heavy lifting.
If the host hits memory pressure and the OOM killer takes the parent process, every child loses its backing pages.
forkd nudges each child's oom_score_adj up by +500 so the kernel
picks a runaway child first. With per-child memory.max set via
cgroup v2, runaway children are bounded before they push the host
into global pressure.
forkd creates one cgroup v2 leaf per child under
/sys/fs/cgroup/forkd/child-N/ and writes the Firecracker PID to
cgroup.procs. Today only memory.max is wired into ForkOpts;
cpu.max / io.max / pids.max land before 1.0.
Children must land on the same host as their parent (otherwise CoW becomes copy-everything across the wire). v0.1 is single-host only; multi-host scheduling is a v1.x problem and will require either a warm parent on each scheduling target or a fast snapshot-replication path.
The controller daemon optionally requires a bearer token
(--token-file) on every request except /healthz. The check uses
length-aware constant-time comparison to avoid trivial timing oracles.
Every request is appended to a JSON-Lines audit log
(/var/log/forkd/audit.log by default): RFC3339 timestamp, method,
path, status, latency in microseconds, user-agent. Log rotation is
out of process (logrotate, vector, the journal).
The sandbox-runtime space has been growing fast. forkd's contribution is the fork-from-warm primitive on a full Linux microVM, with an open-source operator surface (REST + auth + TLS + audit + metrics). This section sketches how that compares to the projects most worth benchmarking against.
CubeSandbox is the closest open-source project to forkd in primitive choice: RustVMM- based microVMs, KVM isolation, Apache 2.0. The published P95 cold- start is "<60 ms" with per-instance memory overhead below 5 MiB, which beats forkd's pure cold-boot path (forkd's snapshot fork wins on the fan-out workload because it skips guest userspace warm-up, not because the VM boots faster). CubeSandbox's roadmap mentions "event-level snapshot rollback" with "high-frequency snapshot rollback at millisecond granularity, enabling rapid fork-based exploration environments from any saved state" — when that lands, the two projects will overlap meaningfully. Until then forkd's distinct value is that fork-from-warm exists today.
Daytona is OCI-workspace oriented (Docker-compatible images, per-workspace kernel claim). They advertise "<90 ms spinning up... from code to execution" and a stateful-snapshot model for resume. There is no fork-from-warm primitive — each workspace is its own resource. License is AGPL-3.0, which is a meaningful difference for commercial users embedding the runtime in proprietary services. Daytona's polish at the workspace + agent-protocol layer is well ahead of forkd; the projects target different shapes of workload.
OpenSandbox is best thought of as an abstraction layer over Docker / Kubernetes / gVisor / Kata / Firecracker. It exposes a unified ingress gateway, per-sandbox egress policy, and multi-language SDKs (Python, Java, JS, .NET, Go). Apache 2.0, actively maintained. OpenSandbox does not itself implement fork-from-warm; if you want that on top of OpenSandbox, you'd plug a runtime that supports it. Conceptually forkd could be slotted in as such a runtime in a future integration.
BoxLite bills itself as
"the SQLite of sandbox" — an embeddable microVM runtime with no
daemon, designed to scale from a laptop to the cloud without
changing primitives. Each Box runs an OCI container inside a
hardware-isolated VM (KVM on Linux, Hypervisor.framework on macOS),
with seccomp on the host side and allow_net for egress policy per
Box. Apache 2.0, primarily Rust + TypeScript + Go.
The interesting contrast with forkd is the stateful Box model.
BoxLite Boxes persist packages and files across stop/restart cycles:
you apt install once and that survives the next boot. forkd's
parent snapshot does something superficially similar — the parent's
RAM survives — but the semantics differ:
- BoxLite optimises for one long-lived sandbox per workload: start it, fill it with state, suspend, resume later, never rebuild.
- forkd optimises for N short-lived children per parent: the parent is the template, every child gets a fresh CoW divergence, children die freely.
Both designs avoid re-paying setup cost. They're complementary, not
competing — BoxLite is the right call when each agent owns a
persistent workspace; forkd is the right call when you fan out from
a warmed template. BoxLite's cross-platform support (macOS via
Hypervisor.framework) is a real differentiator forkd doesn't try to
match — forkd is Linux+KVM only and unlikely to add macOS support
because the snapshot-fork primitive depends on the host kernel's
CoW semantics on mmap(MAP_PRIVATE).
E2B ships an open-source self-host
path (Apache 2.0) and a managed service. The OSS infra repo uses
Firecracker under the hood; specific spawn-time numbers are mostly
quoted from the managed product. There is no fork-from-warm primitive
in the open repo. forkd's Python SDK is E2B wire-compatible at the
Sandbox class level so existing E2B agents can switch by import
alone, with sandbox.eval(...) as the forkd-only extra that uses the
warmed-PID-1 interpreter.
Modal is the only production system known to expose a fork-from-warm primitive ("Modal Sandbox", proprietary). They are not open source; forkd is the open-source analogue of that primitive specifically. Pricing, scheduling, and the full developer-platform layer remain their differentiator.
These are runtimes, not full sandbox products. forkd builds on
Firecracker directly. Docker (runc) and gVisor (runsc) are in our
benchmark chart as honest reference points: they cold-boot every
sandbox and pay the import numpy cost N times for an N-sandbox
fan-out.
- Replace Modal or E2B as a SaaS.
- Beat function-level snapshot runtimes (single-vCPU, serial I/O) on raw spawn time; forkd targets full Linux microVMs with networking and multi-vCPU.
- Support arbitrary guest OSes. v0.1 targets Linux x86-64.
API versioning: the REST surface is at /v1. Breaking changes move
to /v2 and the previous major is supported for one minor release
after the new one ships. On-disk snapshot format is currently tied
to Firecracker's Full snapshot version; we do not promise forward
compatibility across forkd versions until 1.0.