
qemu: harden hibernate/checkpoint against rootfs corruption #238

Open
motatoes wants to merge 2 commits into main from
fix/savevm-corruption-guards

Conversation


@motatoes motatoes commented May 9, 2026

Summary

Four interlocked guards against a class of failure where savevm captures a qcow2 in an inconsistent state and the rootfs becomes unbootable on the next cold mount (EXT4 inode #2 metadata-checksum failure → kernel panic loop). Each guard addresses a different ring of the failure chain; together they cover both the trigger (the rootfs filling up) and the kill shot (savevm against an unresponsive guest).

The four fixes

1. Hard-fail Hibernate / CreateCheckpoint when guest agent is unresponsive (load-bearing)

prepareAgentForHibernate previously fell back from PrepareHibernate RPC → Exec(sync; …; kill -USR1 1), and silently returned even when both timed out. Callers then proceeded to qmp.SaveVM() against a guest with un-synced page cache and pending EXT4 journal entries — exactly the case that produces an unbootable qcow2.

Now both prepareAgentForHibernate and quiesceAndCloseAgent return error. doHibernate and CreateCheckpoint propagate ErrAgentUnresponsive and refuse to run savevm. The API returns a clean error to the caller; the rootfs stays intact.

internal/qemu/manager.go, internal/qemu/snapshot.go

2. Explicit qmp.Stop() before SaveVM, Cont() after

savevm internally pauses and resumes, but the explicit Stop closes a small race where in-flight virtio-blk writes can land in the qcow2 between the agent's sync and the start of savevm. Standard QEMU quiesce pattern. Cont() is unconditional in CreateCheckpoint (sandbox keeps running); deliberately omitted in doHibernate since it Quits next.

internal/qemu/snapshot.go, internal/qemu/manager.go

3. Bind-mount /var/cache/apt/archives onto workspace

apt commonly stages 1-3 GB of .deb files in /var/cache/apt/archives during installs. On a 4 GiB rootfs, this is a substantial chunk that's easy to redirect to the 20 GiB workspace. Helper setupAptCacheBindMount is invoked from:

  • golden-create (inline with the workspace mount block)
  • wake (after reinstallProxyCA)
  • migration (post-resume hooks)

Idempotent via mountpoint -q short-circuit. Lives in the kernel mount table only; does not modify guest /etc/fstab; re-applied on every resume. Failure is non-fatal (apt cache stays on rootfs, status quo).

internal/qemu/manager.go, internal/qemu/snapshot.go, internal/qemu/migration.go

4. Disk-pressure telemetry + refusal

Agent's Stats RPC now reports statvfs("/") and statvfs("/home/sandbox") in four new wire-compat fields (added with new field numbers per the proto stability contract; older agents return zero, treated as "unknown" → no gate, status quo).

Worker preflights destructive operations:

  • >=85% rootfs: log warning so it surfaces in the journal/telemetry.
  • >=95% rootfs: refuse with ErrRootfsCritical, including the actual percent and a hint to free space or kill+respawn from a checkpoint.

The check is best-effort: agent unreachable / older agent / Stats RPC failure all fall through to the pre-existing behavior, so this is backward compatible for long-lived sandboxes whose in-guest agent predates this PR.

proto/agent/agent.proto, proto/agent/agent.pb.go, proto/agent/agent_grpc.pb.go, internal/agent/stats.go, internal/qemu/manager.go

How they compose

heavy apt activity
    │
    ├─→ rootfs hits 85% ─────→ #4 logs warning on next destructive op
    ├─→ rootfs hits 95% ─────→ #4 refuses Hibernate/CreateCheckpoint
    ├─→ /var/cache/apt/archives → workspace via #3, so the 85/95
    │     thresholds are hit less often in practice
    └─→ if agent still wedges from disk pressure:
        Hibernate arrives
            └─→ #1 returns ErrAgentUnresponsive BEFORE savevm
                rootfs.qcow2 is untouched, no corruption
                customer sees clear error, kills + respawns

#1 alone is the hard interlock that prevents corruption even if all others fail. #2-#4 reduce how often the trigger fires.

Test plan

  • go build ./... passes for all touched packages.
  • go vet ./internal/qemu/... ./internal/agent/... clean.
  • Existing qemu unit tests pass (one pre-existing test skipped on macOS — needs qemu-img on PATH).
  • Verify on dev: simulate a sandbox with disk-full rootfs → confirm oc sandbox hibernate returns ErrRootfsCritical instead of triggering savevm.
  • Verify on dev: simulate a sandbox with hung agent (e.g., kill the agent process inside the guest) → confirm oc sandbox hibernate returns ErrAgentUnresponsive, rootfs.qcow2 mtime unchanged.
  • Verify on dev: spawn a fresh sandbox, run mountpoint /var/cache/apt/archives in the guest → confirm bind-mount is in place; df -h /var/cache/apt/archives shows the workspace disk.
  • Verify on dev: hibernate + wake a sandbox → confirm bind-mount is re-applied on the destination and mountpoint -q short-circuits subsequent runs.
  • Run a full apt-heavy build workload (the kind that previously consumed 3+ GB on rootfs) → confirm rootfs use% stays well under 85% with the apt-cache redirect in place.

Notes

  • The proto change is wire-compatible per the stability contract in proto/agent/agent.proto: adds new fields with new field numbers; existing field numbers are untouched.
  • setupAptCacheBindMount lives in manager.go next to reinstallProxyCA because they share a category (post-resume guest setup hooks).
  • I considered baking the bind-mount into the platform default.ext4 /etc/fstab instead. Worker-side injection ships faster, applies to all existing sandboxes on next wake, and is per-sandbox-gateable. Trade-off: doesn't survive an in-guest cold-reboot until next wake (rare in practice).

🤖 Generated with Claude Code

motatoes and others added 2 commits May 8, 2026 19:02
Four interlocked guards against the failure mode where savevm captures a
qcow2 in an inconsistent state and the rootfs becomes unbootable on next
cold-mount (EXT4 inode #2 metadata-checksum failure → kernel panic loop).

1. Hibernate/CreateCheckpoint hard-fail when the in-VM agent is
   unresponsive, instead of silently proceeding to savevm.
   prepareAgentForHibernate and quiesceAndCloseAgent now return error;
   doHibernate and CreateCheckpoint propagate it as ErrAgentUnresponsive.
   Without this, savevm against a guest with un-synced page cache and
   pending EXT4 journal entries leaves the qcow2 with broken directory
   metadata that can't be re-mounted.

2. Explicit qmp.Stop() before SaveVM, qmp.Cont() after. savevm internally
   pauses/resumes the VM, but the explicit Stop closes a small race where
   in-flight virtio-blk writes can land in the qcow2 between the agent's
   sync and the start of savevm. Standard QEMU quiesce pattern.

3. Bind-mount /var/cache/apt/archives onto /home/sandbox/.osb-apt-cache
   on every wake / migrate / golden-create. apt commonly stages 1-3 GB
   in this directory during installs; redirecting it to the workspace
   disk keeps the rootfs from filling up. Idempotent (mountpoint -q
   short-circuits); does not modify the guest's /etc/fstab; failure is
   non-fatal (apt-cache stays on rootfs as before).

4. Disk-pressure telemetry + refusal. Agent's Stats RPC reports
   statvfs("/") and statvfs("/home/sandbox") in four new wire-compat
   fields (older agents return zero, treated as "unknown"). Worker
   refuses Hibernate/CreateCheckpoint at >=95% rootfs use and logs a
   warning at >=85%, surfacing the failure mode early instead of letting
   the trigger condition produce a corrupted snapshot.

Fix #1 is the load-bearing interlock; #2-#4 reduce how often the trigger
fires. #1 alone would have made the recent corruption incident a
"sandbox stuck, killed and respawned" event instead of a data-loss event.

Tests pass on Linux; one pre-existing test that requires qemu-img on
PATH skipped on macOS dev machines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stress testing surfaced a false-positive in the previous commit: the
"client connection is closing" error (gRPC Canceled) is a transient that
fires when the agent's gRPC channel is mid-recycle, not a sign that the
agent is unresponsive. The original Fix #1 treated it as terminal and
refused hibernate against healthy sandboxes under heavy I/O.

This commit adopts the same redial-and-retry pattern already used in
SyncFS (manager.go:3501), Exec (1933), and patchGuestNetwork (1217):

  - PrepareHibernate RPC: on IsTransportError, Redial() and retry once.
  - Fallback Exec("sync; …; kill -USR1 1"): same retry pattern.
  - Only after both retries fail do we surface ErrAgentUnresponsive.

Persistent agent unresponsiveness (the original incident's failure mode)
still triggers the refusal — IsTransportError + Redial() + retry will
all fail when the agent is genuinely wedged for tens of seconds.

Also adds scripts/qemu-tests/40-corruption-guards.sh — stress-tests for
all four guards in this PR. Section 3 (apt-cache bind-mount) passes
8/8 across spawn + hibernate-wake. Section 2 (checkpoint+fork under
heavy I/O) is what surfaced this redial-retry bug; passes 6/6 with the
fix in place. Section 1 (refusal on dead agent) requires host SSH for
QEMU SIGSTOP (PID 1 inside the guest is SIGNAL_UNKILLABLE) and skips
cleanly when DEV_VM_HOST/SSH_KEY aren't set. Section 4 (refusal at
>=95% rootfs) needs a base image with the new agent that fills the
disk fields in StatsResponse — until that ships, sandboxes return
RootfsTotalBytes==0 and the gate falls through (backward compatible).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>