
Add server hibernation #51

Open
jmunckhof wants to merge 4 commits into haydenbleasel:main from jmunckhof:feat/hibernate

@jmunckhof

Closes #50.

Adds a hibernate / wake lifecycle so an idle game server can stop paying compute. Snapshots the disk to Hetzner image storage, deletes the VM, and restores it on demand.

Summary

  • New desiredState: hibernated, observed states hibernating / hibernated / waking
  • Two new workflows: hibernateServer (drain agent → optional IP reserve → shutdown → snapshot → delete VM) and wakeServer (create VM from snapshot → wait for agent → drop snapshot)
  • New server actions: hibernate, wake, plus a hibernate-settings action to toggle "reserve IP across hibernate"
  • Settings panel: toggle for reserving the public IPv4 (off by default; ~€0.50/mo while hibernated)
  • Activity stream surfaces the new phases through emitActivity

Cost shape (rough)

  • Hibernated: ~€0.011/GB/mo snapshot storage, plus ~€0.50/mo if reserve-IP is on
  • Compared to a running CCX13 at ~€8.21/mo, that's roughly a 95%+ saving while idle
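
The saving depends on snapshot size and on whether the IP is reserved. A minimal sketch of the arithmetic, using only the rough prices quoted above (they are not authoritative Hetzner pricing) and a hypothetical 20 GB snapshot:

```typescript
// Rough figures taken from the PR description, not from a price list.
const SNAPSHOT_EUR_PER_GB_MONTH = 0.011;
const RESERVED_IPV4_EUR_MONTH = 0.5;
const CCX13_EUR_MONTH = 8.21;

// Monthly cost of a hibernated server for a given snapshot size.
function hibernatedMonthlyCost(snapshotGb: number, reserveIp: boolean): number {
  const storage = snapshotGb * SNAPSHOT_EUR_PER_GB_MONTH;
  return storage + (reserveIp ? RESERVED_IPV4_EUR_MONTH : 0);
}

// Fraction of the running cost saved while hibernated (0..1).
function idleSaving(snapshotGb: number, reserveIp: boolean): number {
  return 1 - hibernatedMonthlyCost(snapshotGb, reserveIp) / CCX13_EUR_MONTH;
}

// Hypothetical 20 GB snapshot.
console.log(idleSaving(20, false).toFixed(3)); // 0.973 (no reserved IP)
console.log(idleSaving(20, true).toFixed(3)); // 0.912 (reserved IP on)
```

Under these assumptions the 95%+ figure holds for small snapshots without a reserved IP; larger snapshots or the reserve-IP toggle bring it lower.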

Notes for reviewer

  • prisma/schema.prisma changed but I haven't generated the migration yet — happy to add one before this comes out of draft
  • Hibernate / wake actions atomically claim the state transition with a conditional updateMany so a double-click can't fan out two workflows; both roll back the DB if the workflow start itself throws
  • Snapshot polling has an explicit timeout; on timeout the workflow throws FatalError instead of falling through to VM deletion (so a stuck snapshot can never silently nuke the world)
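
The claim-then-rollback pattern from the second note can be sketched as follows. This is an illustration, not the PR's code: `claimTransition` is a hypothetical in-memory stand-in for a conditional `updateMany` (state in the where clause, changed-row count returned), and the real action would be async.

```typescript
type ObservedState = "running" | "hibernating" | "hibernated" | "waking";
type DesiredState = "running" | "hibernated";

interface ServerRow {
  id: string;
  observedState: ObservedState;
  desiredState: DesiredState;
}

// Stand-in for a conditional updateMany: only rows whose id AND current
// observedState match are updated. Returns the number of rows changed.
function claimTransition(
  rows: ServerRow[],
  id: string,
  from: ObservedState,
  to: ObservedState,
  desired: DesiredState,
): number {
  let count = 0;
  for (const row of rows) {
    if (row.id === id && row.observedState === from) {
      row.observedState = to;
      row.desiredState = desired;
      count++;
    }
  }
  return count;
}

// Claims running -> hibernating before starting the workflow. A second call
// finds no matching row and is rejected, so only one workflow is started.
function hibernateAction(
  rows: ServerRow[],
  id: string,
  startWorkflow: (serverId: string) => void,
): "started" | "rejected" {
  const claimed = claimTransition(rows, id, "running", "hibernating", "hibernated");
  if (claimed === 0) return "rejected";
  try {
    startWorkflow(id);
  } catch (error) {
    // Workflow never started: undo the claim so the server is not stuck.
    claimTransition(rows, id, "hibernating", "running", "running");
    throw error;
  }
  return "started";
}
```

Because the state check and the write happen in one conditional update, the second of two concurrent clicks sees zero rows changed and returns without starting anything.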

Out of scope

  • Auto-suspend after N minutes idle (sits on top of this primitive)
  • Cost preview in the UI
  • Cross-region restore (Hetzner snapshots are region-locked)

Test plan

  • Generate + apply Prisma migration
  • Hibernate a running server end-to-end, verify VM is gone and snapshot is retained
  • Wake from hibernation, verify the agent reconnects and the snapshot is cleaned up
  • Toggle reserve-IP on, hibernate, wake, confirm same IPv4 returns
  • Toggle reserve-IP off, hibernate, wake, confirm IP is allowed to change
  • Try hibernating an unprovisioned server / a server in hibernating state, confirm the action rejects cleanly
  • Force a snapshot timeout (mock or short deadline) and confirm the VM is NOT deleted

@vercel

vercel Bot commented May 8, 2026

@jmunckhof is attempting to deploy a commit to the Hayden Bleasel Team on Vercel.

A member of the Team first needs to authorize it.

@haydenbleasel

Owner

Nice work on this — the overall shape (atomic state claims, FatalError ordering so a stuck snapshot can't nuke the VM, idempotent step keys) is solid. A few things worth tackling before this comes out of draft:

Correctness

  • Snapshot is taken before the VM is confirmed off. stepShutdownHetzner returns immediately after firing the ACPI shutdown action, then stepCreateHibernationSnapshot runs against a still-draining VM. Hetzner recommends snapshotting an off VM — risk of an inconsistent image (i.e. silently corrupt game saves, which defeats the point). Suggest a poll loop on stepGetHetznerStatus until off before snapshotting, mirroring the existing snapshot-status loop.
  • Reserved IP is never released when the toggle is turned off. setHibernateSettings only flips the flag. If a user reserves, hibernates, wakes, then turns the toggle off, primaryIpv4Id stays set and Hetzner keeps charging for the reservation. stepReleaseReservedIpv4 only runs on teardown today. Either release in the settings action (when flipping true→false) or at the top of hibernateServer when the flag is off but a reservation exists.
  • Wake workflow plows on after a Hetzner-running timeout. If the VM never reaches running within MAX_HETZNER_WAIT_SECONDS, the loop falls through with runningIp = null and we still call stepMarkHetznerRunning and wait for an agent that will never connect. The symmetric path in hibernateServer throws FatalError on snapshot timeout — do the same here.
  • stepMarkAwake ignores intent. Unconditional updateMany({ where: { id: serverId } }) will clobber a Stop/Delete that landed during the reconnect wait and force observedState=running, phase=ready. Either gate on desiredState === "running" in the where clause or add an isCancelled check immediately before it.
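
The poll-until-off suggestion in the first point, combined with the fail-hard-on-timeout rule from the third, might look roughly like this. It is a sketch under assumptions: `FatalError` here is a local stand-in for the workflow library's class, and `waitForOff`, `getStatus`, and the injected `sleep` are hypothetical names rather than the PR's step functions.

```typescript
// Local stand-in for the workflow's non-retryable error type.
class FatalError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "FatalError";
  }
}

type VmStatus = "running" | "stopping" | "off";

// Polls getStatus until it reports "off". Once the attempt budget is spent it
// throws FatalError, so the caller never falls through to the snapshot (or
// delete) step with a VM in an unknown state.
async function waitForOff(
  getStatus: () => Promise<VmStatus>,
  maxAttempts: number,
  sleep: (ms: number) => Promise<void>,
  intervalMs = 5_000,
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if ((await getStatus()) === "off") return;
    await sleep(intervalMs);
  }
  throw new FatalError(`VM did not reach "off" after ${maxAttempts} polls`);
}
```

Injecting `sleep` keeps the loop testable with a no-op delay; the same shape would apply to the wake path's wait for the VM to reach running.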

Smaller stuff

  • stepCreateHibernationSnapshot short-circuits on any existing hibernationImageId, even if the prior snapshot ended up unavailable. Consider verifying status before reusing.
  • The // 422 = server already off comment in stepShutdownHetzner is probably wrong — Hetzner typically returns 409 Conflict for that case. Worth widening the accepted-status set or confirming against a real call.
  • setHibernateSettings does a layout-scoped revalidatePath on every toggle flick; consider scoping it to the server path.
  • Schema column alignment in Server is off vs. surrounding fields (cosmetic).

Things that read well

  • updateMany + state-in-where for the action claim, with DB rollback if start throws — exactly the right pattern for double-click safety.
  • FatalError on snapshot timeout before the delete-VM step.
  • Reserve-IP cost trade-off is surfaced in both the toggle copy and the confirm dialog (and the dialog adapts based on the flag).

Test plan looks comprehensive; only thing I'd add is a cheap unit test on the action-level guards (e.g. "rejects when observedState=hibernating") since those are easy to regress.
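
A cheap guard test of that kind could be table-driven. The sketch below assumes the guards can be factored into pure functions; `canHibernate` and `canWake` are hypothetical names, and the state list covers only the observed states named in this PR.

```typescript
type ObservedState = "running" | "hibernating" | "hibernated" | "waking";

type GuardResult = { ok: true } | { ok: false; reason: string };

// Hypothetical guard: only a running server may be hibernated.
function canHibernate(observedState: ObservedState): GuardResult {
  return observedState === "running"
    ? { ok: true }
    : { ok: false, reason: `cannot hibernate a server that is ${observedState}` };
}

// Hypothetical guard: only a fully hibernated server may be woken.
function canWake(observedState: ObservedState): GuardResult {
  return observedState === "hibernated"
    ? { ok: true }
    : { ok: false, reason: `cannot wake a server that is ${observedState}` };
}

// One row per observed state: [state, hibernate allowed, wake allowed].
const guardCases: Array<[ObservedState, boolean, boolean]> = [
  ["running", true, false],
  ["hibernating", false, false],
  ["hibernated", false, true],
  ["waking", false, false],
];

for (const [state, hibernateOk, wakeOk] of guardCases) {
  if (canHibernate(state).ok !== hibernateOk) throw new Error(`canHibernate(${state})`);
  if (canWake(state).ok !== wakeOk) throw new Error(`canWake(${state})`);
}
```

Adding a new observed state then forces a new table row, which is where regressions in these guards tend to show up.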

- Wait for VM off before snapshotting (avoids inconsistent image)
- Throw FatalError on wake VM-running timeout (mirrors snapshot path)
- Gate stepMarkAwake on desiredState=running so a concurrent
  Stop/Delete isn't clobbered
- Verify existing snapshot status before reusing in
  stepCreateHibernationSnapshot; discard and recreate if unavailable
- Pre-check VM status in stepShutdownHetzner instead of relying on a
  brittle 422-already-off heuristic
- Release reserved IP when reserveIpOnHibernate is toggled off (delete
  the IP if no VM, else flip auto_delete=true so it cleans up at next
  teardown)
- Scope hibernate-settings revalidatePath to the server page
- prisma format
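
The reserved-IP release rule in the list above (delete the IP if there is no VM, otherwise set auto_delete so teardown cleans it up) can be expressed as a small decision function. This is a sketch: `planIpRelease`, `IpReleaseAction`, and `hasVm` are hypothetical names; only `primaryIpv4Id` and `reserveIpOnHibernate` appear in the PR.

```typescript
type IpReleaseAction =
  | { kind: "none" }
  | { kind: "delete-ip"; ipId: string }
  | { kind: "set-auto-delete"; ipId: string };

interface ServerIpState {
  primaryIpv4Id: string | null; // reserved IP id, if one exists
  hasVm: boolean; // hypothetical flag: does a VM currently exist?
}

// Decides what to do with a reservation when reserveIpOnHibernate changes.
// Only a true -> false flip with an existing reservation triggers a release.
function planIpRelease(
  previousFlag: boolean,
  nextFlag: boolean,
  server: ServerIpState,
): IpReleaseAction {
  const turnedOff = previousFlag && !nextFlag;
  if (!turnedOff || server.primaryIpv4Id === null) return { kind: "none" };
  return server.hasVm
    ? { kind: "set-auto-delete", ipId: server.primaryIpv4Id }
    : { kind: "delete-ip", ipId: server.primaryIpv4Id };
}
```

Keeping the decision separate from the provider calls makes the true→false case easy to cover in a unit test.
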
@jmunckhof jmunckhof marked this pull request as ready for review May 10, 2026 07:38
jmunckhof added 2 commits May 11, 2026 14:10
Resolves conflicts by taking main's provider-pattern code. The old
hetzner-direct hibernate implementation is removed in this commit and
will be re-added on top using the new Provider abstraction.
Reimplements hibernation on top of the Provider interface that landed in
main (5c50664). Replaces direct Hetzner API calls with provider methods.

Provider interface gains shutdownServer(id) for graceful ACPI shutdown
before snapshotting (Hetzner warns that snapshots of running VMs may be
unreadable). Hetzner adapter implements via /servers/{id}/actions/shutdown
and treats 422 as already-off.

Server schema gains hibernationImageId + hibernatedAt; enums extended
with hibernated/hibernating/waking. Workflow steps use provider.createSnapshot,
getImage, deleteImage instead of raw API calls.

Teardown now also deletes any orphan hibernation snapshot so deleting a
hibernated server fully releases all resources.

IP-reservation is intentionally deferred to a follow-up PR.
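
For orientation, the provider-based hibernate ordering described in this commit might be shaped like the sketch below. The method names `shutdownServer`, `createSnapshot`, `getImage`, and `deleteImage` come from the commit message; their parameter and return types, the `deleteServer` method, and the `hibernate` and `fakeProvider` helpers are assumptions for illustration. A real workflow would also wait for the VM to be off and sleep between polls.

```typescript
type ImageStatus = "creating" | "available" | "unavailable";

interface Image {
  id: string;
  status: ImageStatus;
}

// Signatures are assumed; only the method names are taken from the commit.
interface Provider {
  shutdownServer(id: string): Promise<void>;
  createSnapshot(serverId: string): Promise<Image>;
  getImage(imageId: string): Promise<Image>;
  deleteImage(imageId: string): Promise<void>;
  deleteServer(id: string): Promise<void>; // hypothetical name
}

// Shutdown -> snapshot -> poll -> delete VM. The VM is deleted only once the
// snapshot is available; otherwise the function throws and leaves it in place.
async function hibernate(provider: Provider, serverId: string, maxPolls: number): Promise<string> {
  await provider.shutdownServer(serverId);
  let image = await provider.createSnapshot(serverId);
  for (let i = 0; image.status !== "available" && i < maxPolls; i++) {
    image = await provider.getImage(image.id);
  }
  if (image.status !== "available") {
    throw new Error(`snapshot ${image.id} not available; VM left in place`);
  }
  await provider.deleteServer(serverId);
  return image.id;
}

// In-memory fake that records calls and replays a scripted status sequence.
function fakeProvider(statuses: ImageStatus[]): Provider & { calls: string[] } {
  const calls: string[] = [];
  let polls = 0;
  return {
    calls,
    async shutdownServer(): Promise<void> { calls.push("shutdownServer"); },
    async createSnapshot(): Promise<Image> {
      calls.push("createSnapshot");
      return { id: "img-1", status: "creating" };
    },
    async getImage(imageId: string): Promise<Image> {
      calls.push("getImage");
      const status = statuses[Math.min(polls++, statuses.length - 1)] ?? "creating";
      return { id: imageId, status };
    },
    async deleteImage(): Promise<void> { calls.push("deleteImage"); },
    async deleteServer(): Promise<void> { calls.push("deleteServer"); },
  };
}
```

A fake like this lets the ordering be tested without touching a real cloud account, for example by asserting that `deleteServer` is never recorded when the snapshot stays in `creating`.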
@jmunckhof

Author

Updated this on top of the provider refactor that landed in main. Pulled out the IP-reservation toggle for now, since it relied on Hetzner-specific raw API calls that don't fit the new Provider abstraction cleanly. I'll do that as a follow-up PR.

Rest of the flow is the same: STOP → shutdown → snapshot → delete VM on hibernate, then create from snapshot → wait for agent → delete snapshot on wake. The snapshot is also cleaned up if you delete a hibernated server.

Development

Successfully merging this pull request may close these issues.

Hibernate idle servers to stop paying for compute when nobody's playing

2 participants