Close pending gap #365

Open
breardon2011 wants to merge 3 commits into main from fix/fail-pending
Conversation

@breardon2011

Contributor

Harden sandbox billing cleanup (cell + edge) for the billing cutover

Context

Compute billing is metered per sandbox: the cell keeps an open
sandbox_scale_events row while a sandbox is live, and the edge records one
usage_samples row per worker-emitted tick. After the cutover the edge becomes
the billing authority, so any path that leaves a sandbox billing or ticking
after it is actually gone is a correctness problem. This PR closes the remaining
such paths across the cell, edge, worker, and control plane.

Cell — close billing for sandboxes that never reach a clean terminal state

  • MarkOrphanedSandboxes now reaps pending as well as running. A create that
    never reached running on a worker that has since disappeared previously kept
    its scale_event open; it is now closed (and stopped is published to the edge)
    on the next maintenance pass. pending on a live worker is left alone, because
    it may still be mid-create.
  • New ReapStalePendingSessions: an age-based backstop for a create that hangs
    in pending on a worker that is still alive (so it is never an orphan).
    Anything pending for more than 15m is marked failed, its scale_event is
    closed, and stopped is published.
  • terminated is now classified as a terminal session status (consistent with
    the edge), so any writer of that status also stops billing.
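
The two cell-side reapers can be read as one decision per session. The sketch below is only an illustration of that decision, not the code in this PR: the Session struct, the Action values, and classify are hypothetical names, and only the 15m threshold and the status names come from the description above.

```go
package main

import "time"

// stalePendingAfter is the 15m threshold quoted in the PR description.
const stalePendingAfter = 15 * time.Minute

// Session is a hypothetical, simplified view of a cell session row.
type Session struct {
	Status      string        // e.g. "pending", "running", "failed", "terminated"
	WorkerAlive bool          // whether the owning worker is still registered
	PendingFor  time.Duration // how long the session has been pending
}

// Action is what a maintenance pass would do with a session (hypothetical).
type Action string

const (
	Leave            Action = "leave"
	ReapOrphan       Action = "reap-orphan"        // close scale_event, publish stopped
	ReapStalePending Action = "reap-stale-pending" // mark failed, close scale_event, publish stopped
)

// classify applies the orphan check first, then the age-based backstop.
func classify(s Session) Action {
	if s.Status != "pending" && s.Status != "running" {
		return Leave // terminal or otherwise out of scope for the reapers
	}
	if !s.WorkerAlive {
		return ReapOrphan // worker is gone: pending and running are both reaped
	}
	if s.Status == "pending" && s.PendingFor > stalePendingAfter {
		return ReapStalePending // live worker, but the create has hung
	}
	return Leave // live worker and still within the create window
}
```

For example, classify(Session{Status: "pending", WorkerAlive: true, PendingFor: 20 * time.Minute}) yields ReapStalePending, while the same session at 5 minutes is left alone.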

Edge — only count ticks from the sandbox's current worker

  • New worker-validation guard in events-ingest. Edge usage is counted per tick,
    so a worker that is no longer the sandbox's owner — a migration source, a
    deregistered worker, or an abandoned migration receiver — must not contribute
    usage. A usage_tick is dropped when the emitting worker is not the sandbox's
    current owner in sandboxes_index, or the sandbox is already terminal there.
    If the lookup itself fails, the tick is allowed (degraded mode; the guard
    never blocks billing). The cell side needs no equivalent, because its
    scale_events are keyed per sandbox, not per tick.
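
A minimal sketch of the guard's decision, assuming the index read can be modelled as a function that returns an owner and a terminal flag. IndexEntry, Lookup, acceptTick, and errIndexUnavailable are hypothetical names for illustration; the real events-ingest code and the sandboxes_index schema may differ, and how a sandbox missing from the index is treated is not stated here, so the sketch leaves that to the lookup.

```go
package main

import "errors"

// errIndexUnavailable stands in for any failure to read the index (hypothetical).
var errIndexUnavailable = errors.New("sandboxes_index unavailable")

// IndexEntry is a hypothetical, simplified sandboxes_index row.
type IndexEntry struct {
	WorkerID string // current owner of the sandbox
	Terminal bool   // sandbox is already in a terminal status
}

// Lookup is a hypothetical stand-in for a sandboxes_index read.
type Lookup func(sandboxID string) (IndexEntry, error)

// acceptTick reports whether a usage_tick from workerID should be counted.
func acceptTick(lookup Lookup, sandboxID, workerID string) bool {
	entry, err := lookup(sandboxID)
	if err != nil {
		return true // degraded: a failed lookup never blocks billing
	}
	if entry.Terminal {
		return false // sandbox is already terminal in the index
	}
	return entry.WorkerID == workerID // only the current owner contributes usage
}
```

Failing open on lookup errors trades a possible over-count during an index outage for never dropping legitimate usage, which matches the "never blocks billing" intent stated above.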

Worker — reap abandoned migration receivers

  • A migration receiver is started paused (-incoming); if its migration never
    completes it stays alive (so it would be ticked) without being a running
    sandbox here. The control plane aborts receivers only on an explicit migration
    failure; the worker now tears down any receiver still un-resumed past 10m via
    the ghost-reaper. A receiver is marked at registration, and the mark is
    cleared on successful completion.
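
The selection step of the stale-incoming reaper could look roughly like the following. This is a sketch under assumptions: Receiver, its fields, and staleReceivers are hypothetical, and the actual ghost-reaper presumably tracks the mark differently. Only the 10m threshold and the "still un-resumed" condition come from the description.

```go
package main

import "time"

// staleIncomingAfter is the 10m threshold quoted in the PR description.
const staleIncomingAfter = 10 * time.Minute

// Receiver is a hypothetical view of a paused migration receiver on a worker.
type Receiver struct {
	SandboxID string
	Resumed   bool          // set when the migration completes and the mark is cleared
	Waiting   time.Duration // time since the receiver was registered as incoming
}

// staleReceivers returns the sandbox IDs the reaper would tear down:
// receivers that are still un-resumed past the threshold.
func staleReceivers(rs []Receiver) []string {
	var stale []string
	for _, r := range rs {
		if !r.Resumed && r.Waiting > staleIncomingAfter {
			stale = append(stale, r.SandboxID)
		}
	}
	return stale
}
```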

Control plane — periodic reverse-reconcile

  • The reverse-reconcile (ask the worker what it actually hosts; close and publish
    stopped for anything the cell thinks is running that the worker does not have)
    previously ran only on worker rejoin. It now runs every 3m for all live
    workers, so a worker that is deleted and never rejoins no longer leaves stale
    running rows on the edge index with nothing to correct them.
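
At its core the reverse-reconcile is a set difference per worker: what the cell believes is running minus what the worker reports hosting. A minimal sketch, with staleRunning as a hypothetical helper name (the 3m scheduling, the worker query, and the close/publish steps are omitted):

```go
package main

import "sort"

// staleRunning returns the sandbox IDs the cell thinks are running on a worker
// that the worker does not actually host. Each returned ID would have its
// scale_event closed and a stopped event published.
func staleRunning(cellRunning, workerHosted []string) []string {
	hosted := make(map[string]struct{}, len(workerHosted))
	for _, id := range workerHosted {
		hosted[id] = struct{}{}
	}
	var stale []string
	for _, id := range cellRunning {
		if _, ok := hosted[id]; !ok {
			stale = append(stale, id)
		}
	}
	sort.Strings(stale) // deterministic order for logging and tests
	return stale
}
```

Sandboxes the worker hosts but the cell does not know about are deliberately ignored here; the description above only covers the cell-thinks-running direction.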

Coverage

| Way a sandbox could keep billing/ticking after it is gone | Closed by |
| --- | --- |
| Stuck pending on a dead worker | orphan reaper (extended to pending) |
| Stuck pending on a live worker (hung create) | stale-pending age reaper |
| Ticks from a migration-source / deregistered worker | edge tick guard |
| Abandoned paused migration receiver | worker stale-incoming reaper |
| Stale edge running row, no terminal event ever fired | periodic reverse-reconcile |

@breardon2011 breardon2011 marked this pull request as ready for review June 11, 2026 02:02