
feat(devshard): settle depleted escrow after in-flight requests drain #1314

Open
qdanik wants to merge 2 commits into gonka-ai:dl/gateway-v2 from qdanik:qd/settlement-after-active-reuqest

Conversation

@qdanik qdanik commented Jun 6, 2026

Contributor

Summary

Defer escrow settlement until in-flight requests drain. When a devshard escrow is deactivated (low balance, nonce depletion, or rotation), settlement was scheduled immediately — even with requests still running against that escrow. This PR gates settlement on the escrow draining to zero active requests, so we never settle on-chain underneath live inference.

  • Drain-gated settlement — deactivation marks the escrow settlement_pending; the last request to finish triggers the settle, or it settles immediately if already idle.
  • Restart-safe — the pending marker is persisted (new settlement_pending column + migration) and reconciled on startup, so a crash mid-drain never loses a settlement.
  • No behavior change when idle — an escrow with no in-flight requests settles exactly as before.

Scope: devshard/cmd/devshardctl only. No chain, api, or ml-node changes.


Problem → Solution

1. Settlement raced live requests

Before. deactivateAndSettleDevshardByID stopped new traffic and then called scheduleAutoSettlement unconditionally. If activeRequests > 0 at that moment, settlement raced the in-flight requests that were still reserving and releasing tokens against the escrow being settled.

After. Deactivation blocks new reservations and marks the escrow settlement_pending; if requests are still in flight it returns early. The drain hook in releaseRuntime reads the decremented count and fires the settlement at exactly remaining == 0. Because active=false blocks new reservations, the count only drains downward, so that is the precise "last request finished" edge. scheduleAutoSettlement dedups, so a double-fire is harmless. If the escrow is already idle at deactivation time, it settles immediately — preserving prior behavior.
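
A minimal sketch of the drain-gated flow described above, not the PR's actual code. The `escrow` type and its `reserve`, `deactivate`, `release`, and `schedule` methods are hypothetical stand-ins for the gateway's runtime state, `deactivateAndSettleDevshardByID`, `releaseRuntime`, and `scheduleAutoSettlement`; the compare-and-swap guard is an assumed way to get the dedup behavior.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// escrow is a hypothetical stand-in for the gateway's per-escrow runtime state.
type escrow struct {
	mu                sync.Mutex
	active            bool
	activeRequests    atomic.Int64
	settlementPending atomic.Bool
	scheduled         atomic.Bool  // dedup guard (assumed mechanism)
	settled           atomic.Int64 // counts settlements fired, for illustration
}

// reserve admits a request only while the escrow is still active.
func (e *escrow) reserve() bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if !e.active {
		return false
	}
	e.activeRequests.Add(1)
	return true
}

// schedule fires the settlement at most once, so a double-fire is harmless.
func (e *escrow) schedule() {
	if e.scheduled.CompareAndSwap(false, true) {
		e.settled.Add(1)
	}
}

// deactivate blocks new reservations, marks the escrow pending, and settles
// right away only when nothing is in flight.
func (e *escrow) deactivate() {
	e.mu.Lock()
	e.active = false
	e.settlementPending.Store(true)
	e.mu.Unlock()
	if e.activeRequests.Load() == 0 {
		e.schedule()
	}
}

// release is the drain hook: the request that brings the count to zero
// triggers the settlement if one is pending.
func (e *escrow) release() {
	remaining := e.activeRequests.Add(-1)
	if remaining == 0 && e.settlementPending.Load() {
		e.schedule()
	}
}

func main() {
	e := &escrow{active: true}
	e.reserve()
	e.reserve()
	e.deactivate()
	fmt.Println("settled after deactivate:", e.settled.Load()) // 0
	e.release()
	e.release()
	fmt.Println("settled after drain:", e.settled.Load()) // 1
}
```

In this sketch both the deactivation path and the drain hook may observe the zero count at the same time, which is why `schedule` has to be idempotent.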

2. A restart mid-drain could drop the settlement

Before. The pending state lived only in memory, so a restart while requests were draining would lose the obligation to settle.

After. The marker is persisted to the gateway store. On startup reconcilePendingSettlements settles escrows left pending by a pre-restart drain (after a restart nothing is in flight, so they settle right away). clearSettlementPending runs after a successful settle so a later reconcile never re-settles, and upsertDevshardTx preserves the existing marker so an unrelated upsert never silently clears a queued settlement.
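
The persistence rules above can be sketched with an in-memory map standing in for the gateway store. Everything here is illustrative: `record`, `store`, `upsert`, `setPending`, and `reconcile` are hypothetical names, and only the `SettlementPending` field mirrors the new `settlement_pending` column. The real store, schema, and settle call are not shown in this description.

```go
package main

import "fmt"

// record is a hypothetical persisted row.
type record struct {
	ID                string
	Active            bool
	SettlementPending bool
}

// store is an in-memory stand-in for the gateway store.
type store map[string]*record

// upsert writes a row but keeps an existing pending marker, so an unrelated
// upsert cannot clear a queued settlement.
func (s store) upsert(r record) {
	if old, ok := s[r.ID]; ok {
		r.SettlementPending = old.SettlementPending
	}
	s[r.ID] = &r
}

// setPending sets or clears the marker and errors on an unknown id.
func (s store) setPending(id string, pending bool) error {
	r, ok := s[id]
	if !ok {
		return fmt.Errorf("unknown devshard %q", id)
	}
	r.SettlementPending = pending
	return nil
}

// reconcile settles every inactive, flagged escrow and clears the marker only
// after a successful settle. Active or unflagged escrows are skipped.
func (s store) reconcile(settle func(id string) error) []string {
	var settled []string
	for id, r := range s {
		if r.Active || !r.SettlementPending {
			continue
		}
		if err := settle(id); err != nil {
			continue // marker stays set, so the next startup retries
		}
		r.SettlementPending = false
		settled = append(settled, id)
	}
	return settled
}

func main() {
	s := store{}
	s.upsert(record{ID: "escrow-1"})
	_ = s.setPending("escrow-1", true)
	s.upsert(record{ID: "escrow-1"}) // unrelated upsert keeps the marker
	settled := s.reconcile(func(id string) error { return nil })
	fmt.Println(settled, s["escrow-1"].SettlementPending) // [escrow-1] false
}
```

Clearing the marker only on success means a failed settle is retried on the next reconcile rather than dropped, at the cost of relying on the settle path itself being safe to retry.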


Concurrency

The drain hook in releaseRuntime is lock-free. settlementReason is written before settlementPending (under g.mu) and read after the atomic settlementPending.Load() in the hook — the atomic Store→Load pair supplies the happens-before, so the reason is always consistent when the flag is observed. Verified clean under go test -race.
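
The publication pattern behind that argument can be sketched as follows. `pendingState`, `mark`, and `observe` are hypothetical names, and the early return in `mark` (the first reason wins) is an assumption of this sketch: it keeps the plain `reason` field from being rewritten while a lock-free reader may be looking at it.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// pendingState is a hypothetical reduction of the escrow fields involved.
type pendingState struct {
	mu      sync.Mutex  // serializes writers only
	reason  string      // plain field, published by the atomic flag
	pending atomic.Bool // set after reason is written
}

// mark writes the reason first and then publishes it with an atomic Store.
func (p *pendingState) mark(reason string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.pending.Load() {
		return // assumption: never rewrite reason once it has been published
	}
	p.reason = reason
	p.pending.Store(true)
}

// observe is the lock-free reader: it reads reason only after the atomic
// Load has seen the flag, so the earlier write is guaranteed to be visible.
func (p *pendingState) observe() (string, bool) {
	if !p.pending.Load() {
		return "", false
	}
	return p.reason, true
}

func main() {
	var p pendingState
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			if r, ok := p.observe(); ok {
				fmt.Println("reason:", r) // reason: low_balance
				return
			}
			runtime.Gosched()
		}
	}()
	p.mark("low_balance")
	wg.Wait()
}
```

Reversing the two writes in `mark`, or reading `reason` before the `Load` in `observe`, would turn this into a data race that `go test -race` is designed to report.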


Risks & mitigations

  • Settlement delayed indefinitely if an escrow never drains. Mitigated: deactivation blocks new reservations, so the active count can only decrease — the escrow drains as soon as its current requests finish.
  • Crash between marking pending and settling. Mitigated: the marker is persisted and reconciled on startup.
  • Double-settlement on a race or reconcile overlap. Mitigated: scheduleAutoSettlement deduplication plus clearSettlementPending after success.
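
For the double-settlement mitigation, one plausible shape for the dedup is an in-flight set keyed by escrow ID. This is a sketch under that assumption: the description does not say how `scheduleAutoSettlement` deduplicates, and `settler` and `schedule` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// settler tracks which escrows already have a settlement queued.
type settler struct {
	mu       sync.Mutex
	inFlight map[string]struct{}
}

func newSettler() *settler {
	return &settler{inFlight: make(map[string]struct{})}
}

// schedule reports whether this call queued the settlement; a duplicate call
// for the same escrow returns false and does nothing.
func (s *settler) schedule(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, dup := s.inFlight[id]; dup {
		return false
	}
	s.inFlight[id] = struct{}{}
	return true
}

func main() {
	s := newSettler()
	var queued atomic.Int64
	var wg sync.WaitGroup
	// Simulate the drain hook and a reconcile racing on the same escrow.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if s.schedule("escrow-1") {
				queued.Add(1)
			}
		}()
	}
	wg.Wait()
	fmt.Println("queued:", queued.Load()) // queued: 1
}
```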

Test plan

go test ./cmd/devshardctl/... (and -race) — all green:

  • TestEnqueueSettlementWaitsForActiveRequests — pending while a request is in flight; exactly one settlement fires on drain; marker cleared and persisted.
  • TestEnqueueSettlementSettlesImmediatelyWhenDrained — no active requests → settles immediately.
  • TestReconcilePendingSettlementsSettlesDrainedEscrow — startup reconcile settles an escrow flagged before a restart.
  • TestReconcilePendingSettlementsSkipsActiveOrUnflagged — reconcile is a no-op for active or unflagged escrows.
  • TestGatewayStoreSetDevshardSettlementPending — persistence round-trip, survives an unrelated upsert, errors on unknown id.
  • go test -race clean — confirms the lock-free drain hook's happens-before.
