Skip to content

fix(inference): enforce devshard kill switch on escrow create and settle#1279

Open
ouicate wants to merge 2 commits into
gonka-ai:mainfrom
ouicate:fix/devshard-kill-switch-enforcement
Open

fix(inference): enforce devshard kill switch on escrow create and settle#1279
ouicate wants to merge 2 commits into
gonka-ai:mainfrom
ouicate:fix/devshard-kill-switch-enforcement

Conversation

@ouicate

@ouicate ouicate commented May 30, 2026

Copy link
Copy Markdown

Summary

This fixes a safety-control gap where the devshard kill switch (DevshardRequestsEnabled) was never enforced on the two fund-moving handlers. The chain ships a dedicated param and a Guardian-gated message (MsgSetDevshardRequestsEnabled) whose entire purpose is to halt devshard during an incident, but SetDevshardRequestsEnabled only wrote the param value — neither CreateDevshardEscrow nor SettleDevshardEscrow ever read it. Both handlers read params for other fields (GetDevshardEscrowParams, MaxNonce) but skipped the flag, so flipping the switch off did nothing at the chain level: users could still lock funds via SendCoinsFromAccountToModule and settlements still executed and moved module funds. This is the same code path validators and escrow creators run in production, so the kill switch was non-functional for the exact emergency it exists to handle.

Root Cause

SetDevshardRequestsEnabled (msg_server_set_devshard_requests_enabled.go) persists params.DevshardEscrowParams.DevshardRequestsEnabled = msg.Enabled and returns. No other on-chain handler consulted that value. CreateDevshardEscrow validated amount range, epoch caps, allowlist permission, and group membership, then locked funds — with no kill-switch check. SettleDevshardEscrow validated permission, the escrow, and the settlement proof, then paid validators and refunded the creator — also with no kill-switch check. The flag was read only off-chain (host runtime / dispatcher), and neither attacks.md nor proxy.md documents the on-chain flag as an "off-chain hint only", so there is no design basis for the chain ignoring it. The omission is an oversight, not an intentional access-control boundary.

Fix

  • CreateDevshardEscrow: after loading DevshardEscrowParams, return an error ("devshard requests are currently disabled") when DevshardRequestsEnabled is false, before any funds are locked.
  • SettleDevshardEscrow: after loading DevshardEscrowParams (and the existing nil-guard), return the same error when the flag is false, before signature verification and before any payouts/refunds move module funds.
  • The default genesis value (DefaultDevshardRequestsEnabled = true) is unchanged, so normal operation is unaffected; only an explicit Guardian "disable" now actually halts on-chain escrow creation and settlement.

Why This Closes The Vulnerability

The kill switch is a safety control, not an access control. Before this change, the only effect of disabling devshard was that off-chain hosts stopped serving, while the chain kept accepting fund-locking escrows and executing settlements — meaning during the exact incident an operator would flip the switch for (e.g. a suspected settlement exploit), funds could still be locked and drained from the module. After this change, both fund-moving entry points read the flag and reject early, so a disabled switch freezes the subsystem on-chain: no new funds are locked and no settlement can move module balances. Funds in existing escrows remain safely locked in place and become settleable again the moment the switch is re-enabled, so the freeze is temporary and governance-controlled rather than a permanent lock.

Test plan

  • go test ./x/inference/keeper/... in inference-chain (Linux + cgo toolchain).
  • TestCreateDevshardEscrow_RequestsDisabledBlocks — with allowlist open but DevshardRequestsEnabled=false, creation is rejected with "devshard requests are currently disabled" and no SendCoinsFromAccountToModule mock is invoked (funds are never locked).
  • TestSettleDevshardEscrow_RequestsDisabledBlocks — with a stored escrow, allowlist open, and DevshardRequestsEnabled=false, settlement is rejected with the same error before any bank transfer.
  • Existing happy-path create/settle tests still pass (default params keep the switch on); TestCreateDevshardEscrow_ParamsOverrideDefaults now sets DevshardRequestsEnabled: true in its custom params struct.

The Guardian-gated MsgSetDevshardRequestsEnabled previously only wrote the
DevshardRequestsEnabled flag to params; neither CreateDevshardEscrow nor
SettleDevshardEscrow ever read it, so flipping the switch off did nothing
at the chain level. Users could still lock funds and settlements still
moved module balances during the exact incident the switch exists for.

Both handlers now read DevshardEscrowParams.DevshardRequestsEnabled and
reject early with "devshard requests are currently disabled" before any
SendCoinsFromAccountToModule / payout / refund call. Default value
(DefaultDevshardRequestsEnabled = true) is unchanged, so only an explicit
Guardian "disable" now halts on-chain escrow create and settle.

Tests:
- TestCreateDevshardEscrow_RequestsDisabledBlocks
- TestSettleDevshardEscrow_RequestsDisabledBlocks
- TestCreateDevshardEscrow_ParamsOverrideDefaults updated to set the
  flag explicitly in its custom params struct
@a-kuprin

a-kuprin commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

The main point that devshard is still for experiments and creation of escrows is whitelisted.

DevshardRequestsEnabled is controlled by guardians and currently it is centralization. But it is ok, because the main goal is not for production, but to have some switch to turn off inferences while devshard gateway experiments.

So preventing escrows from creation doesn't seem to be good, because enyway only core team now can create escrows, and the team can switch DevshardRequestsEnabled to false. So there is no clear vulnarability this pr fixes.

Even if preventing from devshard creation when DevshardRequestsEnabled = false has some logic, I think, preventing from settlement very unclean goal.

Overall I think, it doesn't add significant value

@ouicate

ouicate commented Jun 1, 2026

Copy link
Copy Markdown
Author

The main point that devshard is still for experiments and creation of escrows is whitelisted.

DevshardRequestsEnabled is controlled by guardians and currently it is centralization. But it is ok, because the main goal is not for production, but to have some switch to turn off inferences while devshard gateway experiments.

So preventing escrows from creation doesn't seem to be good, because enyway only core team now can create escrows, and the team can switch DevshardRequestsEnabled to false. So there is no clear vulnarability this pr fixes.

Even if preventing from devshard creation when DevshardRequestsEnabled = false has some logic, I think, preventing from settlement very unclean goal.

Overall I think, it doesn't add significant value

I want to separate the create-side argument (where I think the framing is off) from the settle-side argument (where you have a real point).

1. Today the chain reads the flag in zero handlers

SetDevshardRequestsEnabled writes the param and returns. I greppped inference-chain/ for DevshardRequestsEnabled: outside the generated .pb.go/.pulsar.go and the setter itself, nothing reads it before this PR. CreateDevshardEscrow and SettleDevshardEscrow load DevshardEscrowParams for MaxAmount/MaxEscrowsPerEpoch/MaxNonce but skip this one field. So the message is a write-only param toggle at chain level — regardless of who is allowed to call create.

2. The proposal that shipped this message calls it an "emergency switch"

The v0.2.13 governance README, which introduced MsgSetDevshardRequestsEnabled:

It also adds a guardian-controlled emergency switch for disabling devshard
inference requests.
- Adds `MsgSetDevshardRequestsEnabled`, a guardian-signed transaction for
  emergency disabling and re-enabling devshard inference requests.

I also checked all devshard/docs/ — none of them mention DevshardRequestsEnabled. So the proposal markets an "emergency switch" and no design doc says the chain intentionally ignores it.

3. Allowlist ≠ kill switch

(a) Default code path is open — empty allowlist means everyone passes:

func (k Keeper) IsAllowedEscrowCreator(ctx context.Context, address string) bool {
	ep := k.GetDevshardEscrowParams(ctx)
	if len(ep.AllowedCreatorAddresses) == 0 {
		return true
	}
	...
}

So any deployment that hasn't run the v0.2.11/v0.2.13 upgrade in order are wide open.

(b) Mainnet isn't "just the core team". v0.2.13 adds external operators on top of the v0.2.11 list — final mainnet list is 17 addresses, including:

var devshardAllowedCreatorAddressesToAdd = []string{
	// Gonka Labs
	"gonka1r2s0rwgskp6y4ed7qr7d25qdwjwlvpp6demv90",
	// Hyperfusion
	"gonka1ls8wqecwj369du8s2t9a223xu9sgvmzlw2ye9c",
	// 6Blocks
	"gonka10wmset95nhgfjt4wklsyjqpx55m40zy3gha2pn",
	//  https://gonkabroker.com/,
	"gonka17ld2g62230w0erzexefzw03sw0adtuchr425rp",
}

Per-call exposure: v0.2.11 set MaxAmount = 100 GNK, v0.2.13 raised MaxEscrowsPerEpoch to 500_000. Any one of those 17 keys (including 3rd-party brokers) being compromised, leaked, or buggy is precisely the scenario a guardian-flippable kill switch exists for.

These are two different controls: allowlist = authorization (who may call). DevshardRequestsEnabled = safety (can the subsystem be hot-stopped regardless of who may call). The chain currently enforces the first and ignores the second.

4. "Off-chain hosts already enforce it" — not reliably

(a) devshardd polls on a 60-second ticker, not block events:

	go func() {
		ticker := time.NewTicker(60 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				refresh()
			}
		}
	}()

So during an incident, hosts can keep serving for up to a minute after the flag is flipped.

(b) dapi fails open on params query errors. availability.Record(...) only fires in the else branch:

		params, err := d.queryClient.Params(ctx, &types.QueryParamsRequest{})
		if err != nil {
			logging.Error("Failed to get params", types.Validation, "error", err)
		} else {
			...
			if d.availability != nil {
				d.availability.Record(
					params.Params.DevshardEscrowParams.DevshardRequestsEnabled,
					...
				)
			}
		}

Startup value is hard-coded to enabled=true (NewAvailabilityTracker(true, 0, 0) in both decentralized-api/main.go:104 and cmd/devshardd/main.go:140). An RPC blip during an incident leaves hosts believing devshard is enabled.

(c) Even when the off-chain block engages, it covers a narrow set:

func requestBlockedWhenUnavailable(req HostRequest) bool {
	if req.Payload != nil {
		return true
	}
	for _, diff := range req.Diffs {
		for _, tx := range diff.Txs {
			if tx.GetStartInference() != nil ||
				tx.GetTimeoutInference() != nil ||
				tx.GetValidation() != nil ||
				tx.GetValidationVote() != nil {
				return true
			}
		}
	}
	return false
}

An honest host with the flag off still co-signs settlement flows. And a malicious host can simply ignore completionRequestsEnabled() — there is no chain check forcing them to honor it. For a control gating module-account flows, that's the wrong layer.

5. The settle-side concern — you have a real point

Your strongest objection is the settle half, and I want to be honest about it. The unsettled-escrow pruner runs every block from EndBlock (x/inference/keeper/pruning.go:52), with a 2-epoch threshold, and pays the full deposit to validators with no creator refund:

const DevshardPruningThreshold = uint64(2)
...
func (k Keeper) distributeUnsettledEscrow(ctx context.Context, escrow types.DevshardEscrow) error {
	...
	share := escrow.Amount / uniqueCount
		Remover: func(ctx context.Context, key collections.Pair[uint64, uint64]) error {
			...
			escrow, found := k.GetDevshardEscrow(ctx, escrowID)
			if found && !escrow.Settled {
				if err := k.distributeUnsettledEscrow(ctx, escrow); err != nil {
					...
				}
			}
			if err := k.DevshardEscrows.Remove(ctx, escrowID); err != nil {

So if a Guardian keeps the switch off for ≥2 epochs, the settle-side block converts every in-flight escrow into a 100%-to-validators payout at prune time — exactly the "kill switch strands user funds" failure mode. Your intuition on settle lands here.

Proposal

  • Keep the create-side check. That's the one that closes the "lock funds into a disabled subsystem" path on every deployment regardless of allowlist state, and matches the "emergency switch" semantics the proposal advertised.
  • Drop the settle-side check (or narrow it to refund-only / current-epoch-only) and document in the handler that settle is intentionally left open while disabled, so users can drain in-flight escrows before the 2-epoch pruner confiscates them.

Happy to push that change in this PR if @gmorgachev agree — the create-side gate is independently worth landing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants