feat(devshard): settle depleted escrow after in-flight requests drain #1314
Open
qdanik wants to merge 2 commits into
Summary
Defer escrow settlement until in-flight requests drain. When a devshard escrow is deactivated (low balance, nonce depletion, or rotation), settlement was scheduled immediately — even with requests still running against that escrow. This PR gates settlement on the escrow draining to zero active requests, so we never settle on-chain underneath live inference.
- Deactivation marks the escrow `settlement_pending`; the last request to finish triggers the settle, or it settles immediately if already idle.
- The marker is persisted (`settlement_pending` column + migration) and reconciled on startup, so a crash mid-drain never loses a settlement.

Scope: `devshard/cmd/devshardctl` only. No chain, api, or ml-node changes.

Problem → Solution
1. Settlement raced live requests

Before. `deactivateAndSettleDevshardByID` stopped new traffic and then called `scheduleAutoSettlement` unconditionally. If `activeRequests > 0` at that moment, settlement raced the in-flight requests that were still reserving and releasing tokens against the escrow being settled.

After. Deactivation blocks new reservations and marks the escrow `settlement_pending`; if requests are still in flight it returns early. The drain hook in `releaseRuntime` reads the decremented count and fires the settlement at exactly `remaining == 0`. Because `active=false` blocks new reservations, the count only drains downward, so that is the precise "last request finished" edge. `scheduleAutoSettlement` deduplicates, so a double-fire is harmless. If the escrow is already idle at deactivation time, it settles immediately, preserving prior behavior.

2. A restart mid-drain could drop the settlement
Before. The pending state lived only in memory, so a restart while requests were draining would lose the obligation to settle.
After. The marker is persisted to the gateway store. On startup, `reconcilePendingSettlements` settles escrows left pending by a pre-restart drain (after a restart nothing is in flight, so they settle right away). `clearSettlementPending` runs after a successful settle so a later reconcile never re-settles, and `upsertDevshardTx` preserves the existing marker so an unrelated upsert never silently clears a queued settlement.

Concurrency
The drain hook in `releaseRuntime` is lock-free. `settlementReason` is written before `settlementPending` (under `g.mu`) and read after the atomic `settlementPending.Load()` in the hook. The atomic Store→Load pair supplies the happens-before, so the reason is always consistent when the flag is observed. Verified clean under `go test -race`.

Risks & mitigations
`scheduleAutoSettlement` deduplication plus `clearSettlementPending` after success.

Test plan
`go test ./cmd/devshardctl/...` (and `-race`), all green:
- `TestEnqueueSettlementWaitsForActiveRequests`: pending while a request is in flight; exactly one settlement fires on drain; marker cleared and persisted.
- `TestEnqueueSettlementSettlesImmediatelyWhenDrained`: no active requests → settles immediately.
- `TestReconcilePendingSettlementsSettlesDrainedEscrow`: startup reconcile settles an escrow flagged before a restart.
- `TestReconcilePendingSettlementsSkipsActiveOrUnflagged`: reconcile is a no-op for active or unflagged escrows.
- `TestGatewayStoreSetDevshardSettlementPending`: persistence round-trip, survives an unrelated upsert, errors on unknown id.
- `go test -race` clean: confirms the lock-free drain hook's happens-before.
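
For reviewers who want the drain-gated path in miniature, the behavior exercised by the first two tests can be sketched roughly as below. This is an illustration under assumptions, not the code in this PR: the `escrow` type, its fields, and the `reserve`, `release`, `deactivate`, and `schedule` methods are hypothetical stand-ins for the gateway runtime, `releaseRuntime`, `deactivateAndSettleDevshardByID`, and `scheduleAutoSettlement`, and "settling" is reduced to recording the reason in a slice. Persistence and the startup reconcile are left out.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// escrow is a hypothetical stand-in for the gateway's per-escrow runtime state.
type escrow struct {
	mu                sync.Mutex
	active            atomic.Bool  // false blocks new reservations
	activeRequests    atomic.Int64 // in-flight requests against this escrow
	settlementPending atomic.Bool  // set while settlement waits on the drain
	settlementReason  string       // written under mu, before settlementPending is stored
	scheduled         atomic.Bool  // dedup guard: at most one settlement is scheduled
	settled           []string     // reasons that were "settled" (demo only)
}

func newEscrow() *escrow {
	e := &escrow{}
	e.active.Store(true)
	return e
}

// reserve admits a request only while the escrow is active.
func (e *escrow) reserve() bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if !e.active.Load() {
		return false
	}
	e.activeRequests.Add(1)
	return true
}

// deactivate blocks new reservations and marks the escrow pending.
// It settles right away only if nothing is in flight.
func (e *escrow) deactivate(reason string) {
	e.mu.Lock()
	e.active.Store(false)
	e.settlementReason = reason
	e.settlementPending.Store(true)
	e.mu.Unlock()
	if e.activeRequests.Load() == 0 {
		e.schedule(reason)
	}
}

// release is the lock-free drain hook: the request that brings the
// count to zero fires the pending settlement.
func (e *escrow) release() {
	remaining := e.activeRequests.Add(-1)
	if remaining == 0 && e.settlementPending.Load() {
		// The atomic Load above observed the Store made after the
		// reason was written, so this plain read is ordered after it.
		e.schedule(e.settlementReason)
	}
}

// schedule deduplicates: only the first caller records a settlement.
func (e *escrow) schedule(reason string) {
	if !e.scheduled.CompareAndSwap(false, true) {
		return
	}
	e.mu.Lock()
	e.settled = append(e.settled, reason)
	e.mu.Unlock()
	e.settlementPending.Store(false) // clear the marker after the settle
}

func main() {
	e := newEscrow()
	e.reserve()
	e.reserve()
	e.deactivate("low balance")
	fmt.Println(e.reserve(), len(e.settled)) // prints "false 0"
	e.release()
	e.release()
	fmt.Println(e.settled) // prints "[low balance]"
}
```

Go's atomic operations are sequentially consistent, so either the deactivation observes the already-decremented count or the last release observes the pending flag; the settlement cannot fall between the two checks. When both paths fire, the compare-and-swap guard in this sketch keeps it to a single settlement.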