
qemu: harden hibernate/checkpoint against rootfs corruption #238

Open
motatoes wants to merge 2 commits into main from
fix/savevm-corruption-guards

Conversation


@motatoes motatoes commented May 9, 2026

Summary

Four interlocked guards against a class of failure where savevm captures a qcow2 in an inconsistent state and the rootfs becomes unbootable on the next cold mount (EXT4 inode #2 metadata-checksum failure → kernel panic loop). Each guard addresses a different ring of the failure chain; together they cover both the trigger (the rootfs filling up) and the kill shot (savevm against an unresponsive guest).

The four fixes

1. Hard-fail Hibernate / CreateCheckpoint when guest agent is unresponsive (load-bearing)

prepareAgentForHibernate previously fell back from PrepareHibernate RPC → Exec(sync; …; kill -USR1 1), and silently returned even when both timed out. Callers then proceeded to qmp.SaveVM() against a guest with un-synced page cache and pending EXT4 journal entries — exactly the case that produces an unbootable qcow2.

Now both prepareAgentForHibernate and quiesceAndCloseAgent return error. doHibernate and CreateCheckpoint propagate ErrAgentUnresponsive and refuse to run savevm. The API returns a clean error to the caller; the rootfs stays intact.

internal/qemu/manager.go, internal/qemu/snapshot.go

2. Explicit qmp.Stop() before SaveVM, Cont() after

savevm internally pauses and resumes, but the explicit Stop closes a small race where in-flight virtio-blk writes can land in the qcow2 between the agent's sync and the start of savevm. Standard QEMU quiesce pattern. Cont() is unconditional in CreateCheckpoint (sandbox keeps running); deliberately omitted in doHibernate since it Quits next.

internal/qemu/snapshot.go, internal/qemu/manager.go

3. Bind-mount /var/cache/apt/archives onto workspace

apt commonly stages 1-3 GB of .deb files in /var/cache/apt/archives during installs. On a 4 GiB rootfs, this is a substantial chunk that's easy to redirect to the 20 GiB workspace. Helper setupAptCacheBindMount is invoked from:

  • golden-create (inline with the workspace mount block)
  • wake (after reinstallProxyCA)
  • migration (post-resume hooks)

Idempotent via mountpoint -q short-circuit. Lives in the kernel mount table only; does not modify guest /etc/fstab; re-applied on every resume. Failure is non-fatal (apt cache stays on rootfs, status quo).

internal/qemu/manager.go, internal/qemu/snapshot.go, internal/qemu/migration.go

4. Disk-pressure telemetry + refusal

Agent's Stats RPC now reports statvfs("/") and statvfs("/home/sandbox") in four new wire-compat fields (added with new field numbers per the proto stability contract; older agents return zero, treated as "unknown" → no gate, status quo).

Worker preflights destructive operations:

  • >=85% rootfs: log warning so it surfaces in the journal/telemetry.
  • >=95% rootfs: refuse with ErrRootfsCritical, including the actual percent and a hint to free space or kill+respawn from a checkpoint.

The check is best-effort: agent unreachable / older agent / Stats RPC failure all fall through to the pre-existing behavior, so this is backward compatible for long-lived sandboxes whose in-guest agent predates this PR.

proto/agent/agent.proto, proto/agent/agent.pb.go, proto/agent/agent_grpc.pb.go, internal/agent/stats.go, internal/qemu/manager.go

How they compose

heavy apt activity
    │
    ├─→ rootfs hits 85% ─────→ #4 logs warning on next destructive op
    ├─→ rootfs hits 95% ─────→ #4 refuses Hibernate/CreateCheckpoint
    ├─→ /var/cache/apt/archives → workspace via #3, so the 85/95
    │     thresholds are hit less often in practice
    └─→ if agent still wedges from disk pressure:
        Hibernate arrives
            └─→ #1 returns ErrAgentUnresponsive BEFORE savevm
                rootfs.qcow2 is untouched, no corruption
                customer sees clear error, kills + respawns

#1 alone is the hard interlock that prevents corruption even if all others fail. #2-#4 reduce how often the trigger fires.

Test plan

  • go build ./... passes for all touched packages.
  • go vet ./internal/qemu/... ./internal/agent/... clean.
  • Existing qemu unit tests pass (one pre-existing test skipped on macOS — needs qemu-img on PATH).
  • Verify on dev: simulate a sandbox with disk-full rootfs → confirm oc sandbox hibernate returns ErrRootfsCritical instead of triggering savevm.
  • Verify on dev: simulate a sandbox with hung agent (e.g., kill the agent process inside the guest) → confirm oc sandbox hibernate returns ErrAgentUnresponsive, rootfs.qcow2 mtime unchanged.
  • Verify on dev: spawn a fresh sandbox, run mountpoint /var/cache/apt/archives in the guest → confirm bind-mount is in place; df -h /var/cache/apt/archives shows the workspace disk.
  • Verify on dev: hibernate + wake a sandbox → confirm bind-mount is re-applied on the destination and mountpoint -q short-circuits subsequent runs.
  • Run a full apt-heavy build workload (the kind that previously consumed 3+ GB on rootfs) → confirm rootfs use% stays well under 85% with the apt-cache redirect in place.

Notes

  • The proto change is wire-compatible per the stability contract in proto/agent/agent.proto: adds new fields with new field numbers; existing field numbers are untouched.
  • setupAptCacheBindMount lives in manager.go next to reinstallProxyCA because they share a category (post-resume guest setup hooks).
  • I considered baking the bind-mount into the platform default.ext4 /etc/fstab instead. Worker-side injection ships faster, applies to all existing sandboxes on next wake, and is per-sandbox-gateable. Trade-off: doesn't survive an in-guest cold-reboot until next wake (rare in practice).

🤖 Generated with Claude Code

motatoes and others added 2 commits May 8, 2026 19:02
Four interlocked guards against the failure mode where savevm captures a
qcow2 in an inconsistent state and the rootfs becomes unbootable on next
cold-mount (EXT4 inode #2 metadata-checksum failure → kernel panic loop).

1. Hibernate/CreateCheckpoint hard-fail when the in-VM agent is
   unresponsive, instead of silently proceeding to savevm.
   prepareAgentForHibernate and quiesceAndCloseAgent now return error;
   doHibernate and CreateCheckpoint propagate it as ErrAgentUnresponsive.
   Without this, savevm against a guest with un-synced page cache and
   pending EXT4 journal entries leaves the qcow2 with broken directory
   metadata that can't be re-mounted.

2. Explicit qmp.Stop() before SaveVM, qmp.Cont() after. savevm internally
   pauses/resumes the VM, but the explicit Stop closes a small race where
   in-flight virtio-blk writes can land in the qcow2 between the agent's
   sync and the start of savevm. Standard QEMU quiesce pattern.

3. Bind-mount /var/cache/apt/archives onto /home/sandbox/.osb-apt-cache
   on every wake / migrate / golden-create. apt commonly stages 1-3 GB
   in this directory during installs; redirecting it to the workspace
   disk keeps the rootfs from filling up. Idempotent (mountpoint -q
   short-circuits); does not modify the guest's /etc/fstab; failure is
   non-fatal (apt-cache stays on rootfs as before).

4. Disk-pressure telemetry + refusal. Agent's Stats RPC reports
   statvfs("/") and statvfs("/home/sandbox") in four new wire-compat
   fields (older agents return zero, treated as "unknown"). Worker
   refuses Hibernate/CreateCheckpoint at >=95% rootfs use and logs a
   warning at >=85%, surfacing the failure mode early instead of letting
   the trigger condition produce a corrupted snapshot.

Fix #1 is the load-bearing interlock; #2-#4 reduce how often the trigger
fires. #1 alone would have made the recent corruption incident a
"sandbox stuck, killed and respawned" event instead of a data-loss event.

Tests pass on Linux; one pre-existing test that requires qemu-img on
PATH skipped on macOS dev machines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stress testing surfaced a false-positive in the previous commit: the
"client connection is closing" error (gRPC Canceled) is a transient that
fires when the agent's gRPC channel is mid-recycle, not a sign that the
agent is unresponsive. The original Fix #1 treated it as terminal and
refused hibernate against healthy sandboxes under heavy I/O.

This commit adopts the same redial-and-retry pattern already used in
SyncFS (manager.go:3501), Exec (1933), and patchGuestNetwork (1217):

  - PrepareHibernate RPC: on IsTransportError, Redial() and retry once.
  - Fallback Exec("sync; …; kill -USR1 1"): same retry pattern.
  - Only after both retries fail do we surface ErrAgentUnresponsive.

Persistent agent unresponsiveness (the original incident's failure mode)
still triggers the refusal — IsTransportError + Redial() + retry will
all fail when the agent is genuinely wedged for tens of seconds.

Also adds scripts/qemu-tests/40-corruption-guards.sh — stress-tests for
all four guards in this PR. Section 3 (apt-cache bind-mount) passes
8/8 across spawn + hibernate-wake. Section 2 (checkpoint+fork under
heavy I/O) is what surfaced this redial-retry bug; passes 6/6 with the
fix in place. Section 1 (refusal on dead agent) requires host SSH for
QEMU SIGSTOP (PID 1 inside the guest is SIGNAL_UNKILLABLE) and skips
cleanly when DEV_VM_HOST/SSH_KEY aren't set. Section 4 (refusal at
>=95% rootfs) needs a base image with the new agent that fills the
disk fields in StatsResponse — until that ships, sandboxes return
RootfsTotalBytes==0 and the gate falls through (backward compatible).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>