qemu: harden hibernate/checkpoint against rootfs corruption #238
Open
Four interlocked guards against the failure mode where savevm captures a qcow2 in an inconsistent state and the rootfs becomes unbootable on next cold-mount (EXT4 inode #2 metadata-checksum failure → kernel panic loop).

1. `Hibernate`/`CreateCheckpoint` hard-fail when the in-VM agent is unresponsive, instead of silently proceeding to savevm. `prepareAgentForHibernate` and `quiesceAndCloseAgent` now return `error`; `doHibernate` and `CreateCheckpoint` propagate it as `ErrAgentUnresponsive`. Without this, savevm against a guest with un-synced page cache and pending EXT4 journal entries leaves the qcow2 with broken directory metadata that can't be re-mounted.

2. Explicit `qmp.Stop()` before `SaveVM`, `qmp.Cont()` after. savevm internally pauses/resumes the VM, but the explicit Stop closes a small race where in-flight virtio-blk writes can land in the qcow2 between the agent's sync and the start of savevm. Standard QEMU quiesce pattern.

3. Bind-mount `/var/cache/apt/archives` onto `/home/sandbox/.osb-apt-cache` on every wake / migrate / golden-create. apt commonly stages 1-3 GB in this directory during installs; redirecting it to the workspace disk keeps the rootfs from filling up. Idempotent (`mountpoint -q` short-circuits); does not modify the guest's `/etc/fstab`; failure is non-fatal (the apt cache stays on rootfs as before).

4. Disk-pressure telemetry + refusal. The agent's `Stats` RPC reports `statvfs("/")` and `statvfs("/home/sandbox")` in four new wire-compat fields (older agents return zero, treated as "unknown"). The worker refuses Hibernate/CreateCheckpoint at >=95% rootfs use and logs a warning at >=85%, surfacing the failure mode early instead of letting the trigger condition produce a corrupted snapshot.

Fix #1 is the load-bearing interlock; #2-#4 reduce how often the trigger fires. #1 alone would have made the recent corruption incident a "sandbox stuck, killed and respawned" event instead of a data-loss event.
Tests pass on Linux; one pre-existing test that requires `qemu-img` on PATH is skipped on macOS dev machines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
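For reference, the numbers feeding guard #4 come from statvfs. A minimal, Linux-only sketch of the measurement side using the stdlib `syscall.Statfs` (the Go equivalent of `statvfs(3)`); the `diskUsage` helper name is hypothetical, not the agent's actual API:

```go
package main

import (
	"fmt"
	"syscall"
)

// diskUsage returns total and used bytes for the filesystem containing path,
// the kind of numbers the agent's Stats RPC gathers via statvfs("/") and
// statvfs("/home/sandbox").
func diskUsage(path string) (total, used uint64, err error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, 0, err
	}
	bsize := uint64(st.Bsize)
	total = st.Blocks * bsize
	// Bfree is blocks free for root; "used" here is total minus free.
	used = (st.Blocks - st.Bfree) * bsize
	return total, used, nil
}

func main() {
	total, used, err := diskUsage("/")
	if err != nil {
		panic(err)
	}
	fmt.Printf("rootfs: %d%% used (%d of %d bytes)\n", used*100/total, used, total)
}
```

On non-Linux platforms the `Statfs_t` field set differs, which is one reason the worker treats missing disk stats as "unknown" rather than an error.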
Stress testing surfaced a false positive in the previous commit: the "client connection is closing" error (gRPC Canceled) is a transient that fires when the agent's gRPC channel is mid-recycle, not a sign that the agent is unresponsive. The original Fix #1 treated it as terminal and refused hibernate against healthy sandboxes under heavy I/O.

This commit adopts the same redial-and-retry pattern already used in SyncFS (manager.go:3501), Exec (1933), and patchGuestNetwork (1217):

- PrepareHibernate RPC: on IsTransportError, Redial() and retry once.
- Fallback Exec("sync; …; kill -USR1 1"): same retry pattern.
- Only after both retries fail do we surface ErrAgentUnresponsive.

Persistent agent unresponsiveness (the original incident's failure mode) still triggers the refusal — IsTransportError + Redial() + retry will all fail when the agent is genuinely wedged for tens of seconds.

Also adds scripts/qemu-tests/40-corruption-guards.sh — stress tests for all four guards in this PR:

- Section 3 (apt-cache bind-mount) passes 8/8 across spawn + hibernate-wake.
- Section 2 (checkpoint + fork under heavy I/O) is what surfaced this redial-retry bug; passes 6/6 with the fix in place.
- Section 1 (refusal on dead agent) requires host SSH for QEMU SIGSTOP (PID 1 inside the guest is SIGNAL_UNKILLABLE) and skips cleanly when DEV_VM_HOST/SSH_KEY aren't set.
- Section 4 (refusal at >=95% rootfs) needs a base image with the new agent that fills the disk fields in StatsResponse — until that ships, sandboxes return RootfsTotalBytes==0 and the gate falls through (backward compatible).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Four interlocked guards against a class of failure where `savevm` captures a qcow2 in an inconsistent state and the rootfs becomes unbootable on next cold-mount (EXT4 inode #2 metadata-checksum failure → kernel panic loop). Each addresses a different ring of the failure chain; together they cover both the trigger (rootfs fills) and the killshot (savevm against an unresponsive guest).

The four fixes
1. Hard-fail `Hibernate`/`CreateCheckpoint` when the guest agent is unresponsive (load-bearing)

`prepareAgentForHibernate` previously fell back from the PrepareHibernate RPC → `Exec(sync; …; kill -USR1 1)`, and silently returned even when both timed out. Callers then proceeded to `qmp.SaveVM()` against a guest with un-synced page cache and pending EXT4 journal entries — exactly the case that produces an unbootable qcow2.

Now both `prepareAgentForHibernate` and `quiesceAndCloseAgent` return `error`. `doHibernate` and `CreateCheckpoint` propagate `ErrAgentUnresponsive` and refuse to run savevm. The API returns a clean error to the caller; the rootfs stays intact.

Files: `internal/qemu/manager.go`, `internal/qemu/snapshot.go`
2. Explicit `qmp.Stop()` before `SaveVM`, `Cont()` after

savevm internally pauses and resumes, but the explicit Stop closes a small race where in-flight virtio-blk writes can land in the qcow2 between the agent's `sync` and the start of savevm. Standard QEMU quiesce pattern. `Cont()` is unconditional in `CreateCheckpoint` (sandbox keeps running); deliberately omitted in `doHibernate` since it Quits next.

Files: `internal/qemu/snapshot.go`, `internal/qemu/manager.go`
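The Stop/SaveVM/Cont bracket can be sketched like this. The `qmpClient` interface is a hypothetical stand-in for the project's QMP wrapper, modeling only the three calls this PR orders:

```go
package main

import (
	"errors"
	"fmt"
)

// qmpClient models the three QMP calls whose ordering matters here.
type qmpClient interface {
	Stop() error
	SaveVM(tag string) error
	Cont() error
}

// checkpoint shows the explicit quiesce bracket: Stop before SaveVM so no
// in-flight virtio-blk writes land in the qcow2 mid-snapshot, and Cont
// deferred so the sandbox resumes even if SaveVM fails. (doHibernate omits
// the Cont on purpose, since it Quits the VM next.)
func checkpoint(q qmpClient, tag string) error {
	if err := q.Stop(); err != nil {
		return err
	}
	defer q.Cont() // unconditional resume for CreateCheckpoint
	return q.SaveVM(tag)
}

// orderRecorder is a fake QMP client that records call order.
type orderRecorder struct {
	calls    []string
	failSave bool
}

func (r *orderRecorder) Stop() error { r.calls = append(r.calls, "stop"); return nil }
func (r *orderRecorder) SaveVM(tag string) error {
	r.calls = append(r.calls, "savevm "+tag)
	if r.failSave {
		return errors.New("savevm failed")
	}
	return nil
}
func (r *orderRecorder) Cont() error { r.calls = append(r.calls, "cont"); return nil }

func main() {
	r := &orderRecorder{}
	_ = checkpoint(r, "ckpt-1")
	fmt.Println(r.calls) // [stop savevm ckpt-1 cont]

	// Even when SaveVM fails, the deferred Cont still resumes the VM.
	f := &orderRecorder{failSave: true}
	err := checkpoint(f, "ckpt-2")
	fmt.Println(err != nil, f.calls[len(f.calls)-1]) // true cont
}
```

Deferring `Cont` is the design choice worth noting: a failed savevm should leave the sandbox paused forever in no code path.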
3. Bind-mount `/var/cache/apt/archives` onto workspace

apt commonly stages 1-3 GB of `.deb` files in `/var/cache/apt/archives` during installs. On a 4 GiB rootfs, this is a substantial chunk that's easy to redirect to the 20 GiB workspace. Helper `setupAptCacheBindMount` is invoked from every wake / migrate / golden-create path (alongside `reinstallProxyCA`).

Idempotent via the `mountpoint -q` short-circuit. Lives in the kernel mount table only; does not modify guest `/etc/fstab`; re-applied on every resume. Failure is non-fatal (apt cache stays on rootfs, status quo).

Files: `internal/qemu/manager.go`, `internal/qemu/snapshot.go`, `internal/qemu/migration.go`
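The idempotence and non-fatality properties can be sketched as below. The `runner` injection and the mount direction (workspace directory as the bind source) are this sketch's assumptions; the real hook shells through the agent:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotMounted lets a fake runner signal a non-zero `mountpoint -q` exit.
var errNotMounted = errors.New("not a mountpoint")

// runner executes a shell command in the guest.
type runner func(cmd string) error

// setupAptCacheBindMount redirects apt's staging area onto the workspace
// disk. Idempotent (`mountpoint -q` short-circuits) and non-fatal on
// failure: the apt cache then simply stays on the rootfs, the status quo.
// Guest /etc/fstab is never touched; the mount lives only in the kernel
// mount table, so it must be re-applied on every resume.
// Returns whether a new mount was applied.
func setupAptCacheBindMount(run runner) bool {
	if run("mountpoint -q /var/cache/apt/archives") == nil {
		return false // already mounted: nothing to do
	}
	err := run("mkdir -p /home/sandbox/.osb-apt-cache && " +
		"mount --bind /home/sandbox/.osb-apt-cache /var/cache/apt/archives")
	return err == nil // failure is swallowed: non-fatal by design
}

func main() {
	mounted := false
	run := func(cmd string) error {
		if cmd == "mountpoint -q /var/cache/apt/archives" {
			if mounted {
				return nil
			}
			return errNotMounted
		}
		mounted = true // the mkdir+bind-mount command
		return nil
	}
	fmt.Println(setupAptCacheBindMount(run)) // true: first run mounts
	fmt.Println(setupAptCacheBindMount(run)) // false: second run short-circuits
}
```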
4. Disk-pressure telemetry + refusal

The agent's `Stats` RPC now reports `statvfs("/")` and `statvfs("/home/sandbox")` in four new wire-compat fields (added with new field numbers per the proto stability contract; older agents return zero, treated as "unknown" → no gate, status quo).

The worker preflights destructive operations: it logs a warning at >=85% rootfs use, and at >=95% it refuses Hibernate/CreateCheckpoint with `ErrRootfsCritical`, including the actual percent and a hint to free space or kill+respawn from a checkpoint.

The check is best-effort: agent unreachable / older agent / Stats RPC failure all fall through to the pre-existing behavior, so this is backward compatible for long-lived sandboxes whose in-guest agent predates this PR.
Files: `proto/agent/agent.proto`, `proto/agent/agent.pb.go`, `proto/agent/agent_grpc.pb.go`, `internal/agent/stats.go`, `internal/qemu/manager.go`

How they compose
#1 alone is the hard interlock that prevents corruption even if all others fail. #2-#4 reduce how often the trigger fires.
Test plan
- `go build ./...` passes for all touched packages.
- `go vet ./internal/qemu/... ./internal/agent/...` clean.
- Existing tests pass (one pre-existing test skipped without `qemu-img` on PATH).
- With the rootfs at critical use, `oc sandbox hibernate` returns `ErrRootfsCritical` instead of triggering savevm.
- With the guest agent wedged, `oc sandbox hibernate` returns `ErrAgentUnresponsive`; rootfs.qcow2 mtime unchanged.
- `mountpoint /var/cache/apt/archives` in the guest → confirm bind-mount is in place; `df -h /var/cache/apt/archives` shows the workspace disk.
- `mountpoint -q` short-circuits subsequent runs.

Notes
- `proto/agent/agent.proto`: adds new fields with new field numbers; existing field numbers are untouched.
- `setupAptCacheBindMount` lives in `manager.go` next to `reinstallProxyCA` because they share a category (post-resume guest setup hooks).
- The alternative was baking the mount into `default.ext4`'s `/etc/fstab` instead. Worker-side injection ships faster, applies to all existing sandboxes on next wake, and is per-sandbox-gateable. Trade-off: it doesn't survive an in-guest cold reboot until the next wake (rare in practice).

🤖 Generated with Claude Code