Close pending gap #365
Open
breardon2011 wants to merge 3 commits into
Harden sandbox billing cleanup (cell + edge) for the billing cutover
Context
Compute billing is metered per sandbox: the cell keeps an open
sandbox_scale_events row while a sandbox is live, and the edge records one usage_samples row per worker-emitted tick. After the cutover the edge becomes the billing authority, so any path that leaves a sandbox billing or ticking
after it is actually gone is a correctness problem. This PR closes the remaining
such paths across all three layers.
Cell — close billing for sandboxes that never reach a clean terminal state
MarkOrphanedSandboxes now reaps pending as well as running. A create that never reached running on a worker that has since disappeared previously kept its scale_event open; it is now closed (and stopped published to the edge) on the next maintenance pass. pending on a live worker is left alone, since it may still be mid-create.
ReapStalePendingSessions: an age-based backstop for a create that hangs in pending on a worker that is still alive (so it is never an orphan). Anything pending past 15m → failed, scale_event closed, stopped published.
terminated is now classified as a terminal session status (consistent with the edge), so any writer of that status also stops billing.
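To make the two cell-side selections concrete, here is a minimal sketch of the rules described above. It is not the PR's code: the `Session` type, the function names, and the exact contents of `TERMINAL_STATUSES` are assumptions for illustration; only the pending/running statuses, the 15m limit, and `terminated` being terminal come from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Set

PENDING_MAX_AGE = timedelta(minutes=15)  # age limit stated in the description
# Assumed set; the description only confirms that "terminated" is now terminal.
TERMINAL_STATUSES = frozenset({"stopped", "failed", "terminated"})


@dataclass
class Session:  # hypothetical shape of a cell session row
    sandbox_id: str
    status: str
    worker_id: str
    created_at: datetime


def is_terminal(status: str) -> bool:
    """A terminal status means billing for the sandbox must be closed."""
    return status in TERMINAL_STATUSES


def find_orphans(sessions: Iterable[Session], live_workers: Set[str]) -> List[Session]:
    """pending or running sessions whose worker is no longer live."""
    return [
        s for s in sessions
        if s.status in ("pending", "running") and s.worker_id not in live_workers
    ]


def find_stale_pending(
    sessions: Iterable[Session], live_workers: Set[str], now: datetime
) -> List[Session]:
    """pending sessions on a live worker that have exceeded the age limit."""
    return [
        s for s in sessions
        if s.status == "pending"
        and s.worker_id in live_workers
        and now - s.created_at > PENDING_MAX_AGE
    ]
```

In this sketch both selections would feed the same cleanup step (close the scale_event, publish stopped); a young pending session on a live worker matches neither and is left alone.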
Edge — only count ticks from the sandbox's current worker
events-ingest. Edge usage is counted per tick, so a worker that is no longer the sandbox's owner (a migration source, a
deregistered worker, or an abandoned migration receiver) must not contribute
usage. A usage_tick is dropped when the emitting worker is not the sandbox's
current owner in sandboxes_index, or the sandbox is already terminal there.
Lookup failure allows the tick (degraded, never blocks billing). The cell side
needs no equivalent: its scale_events are keyed per sandbox, not per tick.
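The gate above can be sketched as a single predicate. This is an illustration under assumptions, not the events-ingest code: `should_count_tick`, the `lookup` callable (standing in for a sandboxes_index query), and the terminal status set are hypothetical names.

```python
from typing import Callable, Tuple

# Assumed set of terminal statuses; the description does not list them in full.
TERMINAL_STATUSES = frozenset({"stopped", "failed", "terminated"})

# Hypothetical lookup: sandbox_id -> (current_owner_worker_id, status).
# It is expected to raise on any failure.
IndexLookup = Callable[[str], Tuple[str, str]]


def should_count_tick(sandbox_id: str, worker_id: str, lookup: IndexLookup) -> bool:
    """Return True if a usage_tick from worker_id should be recorded."""
    try:
        owner, status = lookup(sandbox_id)
    except Exception:
        # Degraded mode: a failed lookup never blocks billing.
        # In this sketch a missing index entry also lands here (assumption).
        return True
    if status in TERMINAL_STATUSES:
        return False  # sandbox is already terminal in the index
    return owner == worker_id  # only the current owner contributes usage
```

Failing open on lookup errors trades a small risk of over-counting for never losing legitimate usage while the index is unavailable, which matches the "degraded, never blocks billing" note.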
Worker — reap abandoned migration receivers
-incoming); if its migration never completes it stays alive (so it would be ticked) without being a running
sandbox here. The control plane aborts receivers only on an explicit migration
failure; the worker now tears down any receiver still un-resumed past 10m via
the ghost-reaper. Marked at registration, cleared on successful completion.
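A minimal sketch of the mark/clear/reap lifecycle described above, assuming a simple in-memory tracker. `ReceiverTracker`, its method names, and the `teardown` callback are hypothetical; only the 10m limit and the "marked at registration, cleared on completion" behaviour come from the description.

```python
import time
from typing import Callable, Dict, List

RECEIVER_MAX_AGE_S = 10 * 60  # 10m limit stated in the description


class ReceiverTracker:
    """Tracks migration receivers that have not yet been resumed."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._registered_at: Dict[str, float] = {}

    def register(self, receiver_id: str) -> None:
        # Marked at registration.
        self._registered_at[receiver_id] = self._clock()

    def complete(self, receiver_id: str) -> None:
        # Cleared on successful migration completion.
        self._registered_at.pop(receiver_id, None)

    def reap(self, teardown: Callable[[str], None]) -> List[str]:
        """Tear down every receiver still un-resumed past the age limit."""
        now = self._clock()
        expired = [
            rid for rid, t in self._registered_at.items()
            if now - t > RECEIVER_MAX_AGE_S
        ]
        for rid in expired:
            teardown(rid)
            del self._registered_at[rid]
        return expired
```

The injectable clock is a design choice of the sketch so the age check can be exercised without waiting; a periodic reaper would call `reap` with the worker's real teardown routine.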
Control plane — periodic reverse-reconcile
stopped for anything the cell thinks is running that the worker does not have) previously ran only on worker rejoin. It now runs every 3m for all live
workers, so a worker that is deleted and never rejoins no longer leaves stale
running rows on the edge index with nothing to correct them.
Coverage
pending on a dead worker
pending on a live worker (hung create)
running row, no terminal event ever fired
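As an illustration of the periodic reverse-reconcile described under Control plane, one pass might look like the sketch below. The function name, the dictionary shapes, and the `publish_stopped` callback are assumptions; the description only states that stopped is published for anything the cell thinks is running that the worker does not have, every 3m, for all live workers.

```python
from typing import Callable, Dict, List, Set

RECONCILE_INTERVAL_S = 3 * 60  # 3m period stated in the description


def reverse_reconcile(
    cell_running: Dict[str, Set[str]],   # worker_id -> sandboxes the cell believes running
    worker_actual: Dict[str, Set[str]],  # live worker_id -> sandboxes it actually has
    publish_stopped: Callable[[str], None],
) -> List[str]:
    """Publish stopped for sandboxes the cell has as running but the worker lacks."""
    corrected: List[str] = []
    for worker_id, actual in worker_actual.items():
        believed = cell_running.get(worker_id, set())
        # Sorted only to make the order of published events deterministic.
        for sandbox_id in sorted(believed - actual):
            publish_stopped(sandbox_id)
            corrected.append(sandbox_id)
    return corrected
```

Sandboxes the worker has but the cell does not know about are deliberately ignored here; the description covers only the cell-to-worker direction.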