fix(inference): enforce devshard kill switch on escrow create and settle#1279
fix(inference): enforce devshard kill switch on escrow create and settle#1279ouicate wants to merge 2 commits into
Conversation
The Guardian-gated MsgSetDevshardRequestsEnabled previously only wrote the DevshardRequestsEnabled flag to params; neither CreateDevshardEscrow nor SettleDevshardEscrow ever read it, so flipping the switch off did nothing at the chain level. Users could still lock funds and settlements still moved module balances during the exact incident the switch exists for. Both handlers now read DevshardEscrowParams.DevshardRequestsEnabled and reject early with "devshard requests are currently disabled" before any SendCoinsFromAccountToModule / payout / refund call. Default value (DefaultDevshardRequestsEnabled = true) is unchanged, so only an explicit Guardian "disable" now halts on-chain escrow create and settle. Tests: - TestCreateDevshardEscrow_RequestsDisabledBlocks - TestSettleDevshardEscrow_RequestsDisabledBlocks - TestCreateDevshardEscrow_ParamsOverrideDefaults updated to set the flag explicitly in its custom params struct
|
The main point that devshard is still for experiments and creation of escrows is whitelisted.
So preventing escrows from creation doesn't seem to be good, because enyway only core team now can create escrows, and the team can switch Even if preventing from devshard creation when Overall I think, it doesn't add significant value |
I want to separate the create-side argument (where I think the framing is off) from the settle-side argument (where you have a real point). 1. Today the chain reads the flag in zero handlers
2. The proposal that shipped this message calls it an "emergency switch"The v0.2.13 governance README, which introduced It also adds a guardian-controlled emergency switch for disabling devshard
inference requests.- Adds `MsgSetDevshardRequestsEnabled`, a guardian-signed transaction for
emergency disabling and re-enabling devshard inference requests.I also checked all 3. Allowlist ≠ kill switch(a) Default code path is open — empty allowlist means everyone passes: func (k Keeper) IsAllowedEscrowCreator(ctx context.Context, address string) bool {
ep := k.GetDevshardEscrowParams(ctx)
if len(ep.AllowedCreatorAddresses) == 0 {
return true
}
...
}So any deployment that hasn't run the v0.2.11/v0.2.13 upgrade in order are wide open. (b) Mainnet isn't "just the core team". v0.2.13 adds external operators on top of the v0.2.11 list — final mainnet list is 17 addresses, including: var devshardAllowedCreatorAddressesToAdd = []string{
// Gonka Labs
"gonka1r2s0rwgskp6y4ed7qr7d25qdwjwlvpp6demv90",
// Hyperfusion
"gonka1ls8wqecwj369du8s2t9a223xu9sgvmzlw2ye9c",
// 6Blocks
"gonka10wmset95nhgfjt4wklsyjqpx55m40zy3gha2pn",
// https://gonkabroker.com/,
"gonka17ld2g62230w0erzexefzw03sw0adtuchr425rp",
}Per-call exposure: v0.2.11 set These are two different controls: allowlist = authorization (who may call). 4. "Off-chain hosts already enforce it" — not reliably(a) go func() {
ticker := time.NewTicker(60 * time.Second)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
refresh()
}
}
}()So during an incident, hosts can keep serving for up to a minute after the flag is flipped. (b) dapi fails open on params query errors. params, err := d.queryClient.Params(ctx, &types.QueryParamsRequest{})
if err != nil {
logging.Error("Failed to get params", types.Validation, "error", err)
} else {
...
if d.availability != nil {
d.availability.Record(
params.Params.DevshardEscrowParams.DevshardRequestsEnabled,
...
)
}
}Startup value is hard-coded to (c) Even when the off-chain block engages, it covers a narrow set: func requestBlockedWhenUnavailable(req HostRequest) bool {
if req.Payload != nil {
return true
}
for _, diff := range req.Diffs {
for _, tx := range diff.Txs {
if tx.GetStartInference() != nil ||
tx.GetTimeoutInference() != nil ||
tx.GetValidation() != nil ||
tx.GetValidationVote() != nil {
return true
}
}
}
return false
}An honest host with the flag off still co-signs settlement flows. And a malicious host can simply ignore 5. The settle-side concern — you have a real pointYour strongest objection is the settle half, and I want to be honest about it. The unsettled-escrow pruner runs every block from const DevshardPruningThreshold = uint64(2)
...
func (k Keeper) distributeUnsettledEscrow(ctx context.Context, escrow types.DevshardEscrow) error {
...
share := escrow.Amount / uniqueCount Remover: func(ctx context.Context, key collections.Pair[uint64, uint64]) error {
...
escrow, found := k.GetDevshardEscrow(ctx, escrowID)
if found && !escrow.Settled {
if err := k.distributeUnsettledEscrow(ctx, escrow); err != nil {
...
}
}
if err := k.DevshardEscrows.Remove(ctx, escrowID); err != nil {So if a Guardian keeps the switch off for ≥2 epochs, the settle-side block converts every in-flight escrow into a 100%-to-validators payout at prune time — exactly the "kill switch strands user funds" failure mode. Your intuition on settle lands here. Proposal
Happy to push that change in this PR if @gmorgachev agree — the create-side gate is independently worth landing. |
Summary
This fixes a safety-control gap where the devshard kill switch (
DevshardRequestsEnabled) was never enforced on the two fund-moving handlers. The chain ships a dedicated param and a Guardian-gated message (MsgSetDevshardRequestsEnabled) whose entire purpose is to halt devshard during an incident, butSetDevshardRequestsEnabledonly wrote the param value — neitherCreateDevshardEscrownorSettleDevshardEscrowever read it. Both handlers read params for other fields (GetDevshardEscrowParams,MaxNonce) but skipped the flag, so flipping the switch off did nothing at the chain level: users could still lock funds viaSendCoinsFromAccountToModuleand settlements still executed and moved module funds. This is the same code path validators and escrow creators run in production, so the kill switch was non-functional for the exact emergency it exists to handle.Root Cause
SetDevshardRequestsEnabled(msg_server_set_devshard_requests_enabled.go) persistsparams.DevshardEscrowParams.DevshardRequestsEnabled = msg.Enabledand returns. No other on-chain handler consulted that value.CreateDevshardEscrowvalidated amount range, epoch caps, allowlist permission, and group membership, then locked funds — with no kill-switch check.SettleDevshardEscrowvalidated permission, the escrow, and the settlement proof, then paid validators and refunded the creator — also with no kill-switch check. The flag was read only off-chain (host runtime / dispatcher), and neitherattacks.mdnorproxy.mddocuments the on-chain flag as an "off-chain hint only", so there is no design basis for the chain ignoring it. The omission is an oversight, not an intentional access-control boundary.Fix
CreateDevshardEscrow: after loadingDevshardEscrowParams, return an error ("devshard requests are currently disabled") whenDevshardRequestsEnabledis false, before any funds are locked.SettleDevshardEscrow: after loadingDevshardEscrowParams(and the existing nil-guard), return the same error when the flag is false, before signature verification and before any payouts/refunds move module funds.DefaultDevshardRequestsEnabled = true) is unchanged, so normal operation is unaffected; only an explicit Guardian "disable" now actually halts on-chain escrow creation and settlement.Why This Closes The Vulnerability
The kill switch is a safety control, not an access control. Before this change, the only effect of disabling devshard was that off-chain hosts stopped serving, while the chain kept accepting fund-locking escrows and executing settlements — meaning during the exact incident an operator would flip the switch for (e.g. a suspected settlement exploit), funds could still be locked and drained from the module. After this change, both fund-moving entry points read the flag and reject early, so a disabled switch freezes the subsystem on-chain: no new funds are locked and no settlement can move module balances. Funds in existing escrows remain safely locked in place and become settleable again the moment the switch is re-enabled, so the freeze is temporary and governance-controlled rather than a permanent lock.
Test plan
go test ./x/inference/keeper/...ininference-chain(Linux + cgo toolchain).TestCreateDevshardEscrow_RequestsDisabledBlocks— with allowlist open butDevshardRequestsEnabled=false, creation is rejected with"devshard requests are currently disabled"and noSendCoinsFromAccountToModulemock is invoked (funds are never locked).TestSettleDevshardEscrow_RequestsDisabledBlocks— with a stored escrow, allowlist open, andDevshardRequestsEnabled=false, settlement is rejected with the same error before any bank transfer.TestCreateDevshardEscrow_ParamsOverrideDefaultsnow setsDevshardRequestsEnabled: truein its custom params struct.