devshard v2 (v0.2.13-devshard-v2) by a-kuprin · Pull Request #1289 · gonka-ai/gonka

a-kuprin · 2026-06-01T15:14:46Z

This PR prepares the devshard v2 release.

This is the first devshard-only upgrade, which operates independently of usual chain upgrades. Once approved, v2 will run in parallel with the existing v1 devshard runtime.

See the upgrade design doc and the versioned/ package for details.

Upgrade process

1. Release the devshardd binary as a Gonka release artifact
2. Submit a governance proposal to register the new supported version in DevshardEscrowParams.approved_versions (defining the name, binary download URL, and sha256 hash)
3. If the proposal is approved, versiond automatically downloads the binary and serves it under the /devshard/v2 prefix
4. Once /devshard/v2 is available, contributors can test it before gateways switch primary traffic to v2

No manual host steps are expected during this type of upgrade.
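The upgrade process registers a sha256 hash alongside the binary download URL. A minimal sketch of how a downloaded artifact could be checked against such a hash is shown below. It is an illustration only, not versiond's actual code: `verifySHA256` and the sample bytes are hypothetical, and the sketch assumes the approved hash is stored as a hex string.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// verifySHA256 reports whether data hashes to the expected hex-encoded
// sha256 digest. Hypothetical helper, not part of versiond.
func verifySHA256(data []byte, expectedHex string) (bool, error) {
	want, err := hex.DecodeString(strings.TrimSpace(expectedHex))
	if err != nil {
		return false, fmt.Errorf("decode expected hash: %w", err)
	}
	if len(want) != sha256.Size {
		return false, fmt.Errorf("expected a %d-byte digest, got %d bytes", sha256.Size, len(want))
	}
	got := sha256.Sum256(data)
	// Constant-time comparison; not strictly required for a public hash,
	// but it is a cheap habit when comparing digests.
	return subtle.ConstantTimeCompare(got[:], want) == 1, nil
}

func main() {
	// Stand-in for the downloaded devshardd binary.
	binary := []byte("pretend this is the downloaded devshardd binary")

	// Stand-in for the hash registered by the governance proposal.
	sum := sha256.Sum256(binary)
	approved := hex.EncodeToString(sum[:])

	ok, err := verifySHA256(binary, approved)
	fmt.Println(ok, err)
}
```

A binary that fails this kind of check would simply not be served; how versiond reacts in that case is not described in this PR.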

devshard

Prune old epoch storage on epoch changes, move SQLite/Postgres schema setup out of hot paths, and select exactly one storage backend per process
Remove the seed reveal round, seal completed inference stats, and prune payloads so long-running sessions do not keep all served inferences in RAM or state
Re-gossip stale MsgFinishInference transactions so the sequencer can pick them up from another host's mempool
Enforce the governance-controlled maximum nonce limit on hosts to reject invalid requests before settlement
Separate devshard runtime version from state-root protocol version and stamp protocol v2 at build time
Create sessions from on-chain escrow fee snapshots and runtime config instead of hardcoded values (with direct chain fallback until mainnet has the matching NodeManager runtime-config endpoint)
Store per-inference validation counters outside the state root in SQLite/Postgres and expose per-slot totals through devshard stats endpoints after inference pruning
Add internal devshard traces and metrics through OpenTelemetry and Prometheus
Return typed devshard errors for disabled, initializing, and non-retryable states instead of generic failures

decentralized-api

The changes in the decentralized-api/ module are fully backward compatible and do not need to be activated before the next mainnet release.

Serve chain-backed devshard runtime config through the NodeManager GetRuntimeConfig gRPC long-poll
Add dapi traces and metrics for public inference requests, event listening, validation, chain queries, transaction broadcasts, and ML node calls
Propagate trace context across executor forwarding, validation payload fetches, and ML node calls

inference-chain

The changes in the inference-chain/ module are wire-compatible and do not need to be activated before the next mainnet release.

Rename the version field to state_root_and_protocol_version in the devshard settlement message proto
Move devshard session timeouts, fees, validation rates, vote threshold factor, and grace periods to governance-controlled DevshardEscrowParams
Add create_devshard_fee and fee_per_nonce to DevshardEscrow to snapshot active fees at escrow creation

deploy

Add join-stack observability with Grafana, Jaeger, Prometheus, Loki, Promtail, and cAdvisor
Add dashboards for devshard sessions, chain health, query latency, storage, containers, and node health

Proposed Bounties

Bounty ID	Sum USDT	Bounty Explanation	GitHub ID
PR #1114, PR #1115	3000	Certik security audit fixes (GEB-62, GEB-59, GEB-60), reported in Issue #1109	@x0152
Issue #1135	30000	PoC Decode. So far, PoC validation has only covered the prefill step, but most of the real computation in inference happens during decode, which goes unverified. PoC-decode extends it to every decode step, so a node running a different/cheaper model gets caught. It closes the biggest open gap in the network's PoC validation mechanism. spec	Axel-t
PR #1035	100	fix(subnetctl): propagate fatal HTTP errors instead of waiting on timeout	@unameisfine
PR #1298	17000	Devshard 0.2.13 v2 - release implementation and management	@akup
PR #1046	4000	Observability implementation	@qdanik
PR #1046	2000	Observability implementation	@blizko
branch	7000	Emergency troubleshooting	@qdanik
--	3000	Gateway - implementation work	@qdanik
report	7000	Emergency troubleshooting - schema bomb and B200 investigation	kaitaku.ai
MiniMax, Additional benchm	10000	MiniMax integration + post-deploy bug-fixing + additional benchmarks + community FAQ	kaitaku.ai
Issue #1026	5000	VLM inference and validation in Gonka - testing VLM serving validation and adding the necessary tools/scripts (inference + validation for visual language models, threshold calibration across honest/fraud scenarios)	@fedor-konovalenko, MIL team
Issue #34	5000	TOPLOC as a validation mechanism. Evaluated using topic to reduce artifact size. The original paper reported near-100% accuracy, but only on small models (Llama-8B); Experiment results matched the paper for small models, while accuracy dropped on large models (235B).	@fedor-konovalenko, MIL team
docs#1093, docs#1134, docs#992, docs#1094	500	docs: restructure governance section and expand guidance; add MiniMax-M2.7 and Kimi K2.6 model licenses; update host hardware specifications	@Dolper

A participant restored to ACTIVE inherited the prior ConsecutiveInvalidInferences, so a single new failure could re-invalidate them immediately. Zero the counter when transitioning to INVALID and at every upcoming-to-effective promotion.

Replace the hardcoded keeper.DevshardMaxNonce constant with a governance parameter on DevshardEscrowParams. VerifyDevshardSettlement now receives the bound from params; the settle msg server reads it before verifying. The v0.2.13 upgrade handler raises MaxNonce to 1_000_000 and bundles the existing MaxEscrowsPerEpoch=500_000 bump into the same step.

…2.13 v0.2.12 added MsgRespondDealerComplaints to InferenceOperationKeyPerms but did not migrate existing cold-to-warm grants, leaving pre-v0.2.12 DAPIs unable to respond to dealer complaints. Walk authz grants, key each pair off its MsgStartInference grant, and add the missing authorization with the source grant's expiration. Idempotent.

# Devshard storage: Postgres backend + epoch pruning

Drop-in replacement for the unbounded single-file SQLite store on `main`. SQLite-only deployments need no config change; new binaries auto-migrate the legacy DB on first boot.

## Architecture

```
HostManager
  -> ManagedStorage          // 30s pruner, retain N=3 epochs
       -> SQLite             // PGHOST unset
       -> HybridStorage      // PGHOST set
            -> Postgres      // primary, sticky per-escrow
            -> SQLite        // local fallback while PG is down
```

Storage is partitioned by `epoch_id` (= `DevshardEscrow.epoch_index`):

- Postgres: `devshard_sessions`, `devshard_diffs`, `devshard_signatures` each `PARTITION BY RANGE (epoch_id)`. Partitions are created lazily; pruning is `DROP TABLE`.
- SQLite: one `epoch_<N>.db` per epoch plus a `_meta.db` routing index; pruning closes the pool and removes the file.
- Hybrid: per-escrow stickiness keeps a session on one backend.

`ManagedStorage` ticks every 30s, computes `cutoff = max_observed_epoch + 1 - retain`, and prunes everything older. An `EpochProvider` advances the cutoff on quiet hosts.

## Drop-in guarantees

- `PGHOST` unset -> SQLite-only, identical to before.
- `PGHOST` set -> hybrid mode, same env vars as `payloadstorage`.
- Legacy `/root/.dapi/data/devshard.db` is migrated to `/root/.dapi/data/devshard/` on first boot, then renamed `*.migrated.<unix>`. Idempotent across restarts.
- Per-host storage. No schema, proto, HTTP, or gossip changes.

## Tradeoffs

For simplicity, partitioning is by `epoch_id` only, not `(epoch_id, escrow_id)`. Loading a session reads its diffs from the shared epoch partition (indexed on `escrow_id`). The next step is per-escrow state snapshots (data + additions) so readers skip the diff replay.
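The pruning rule in the storage description, `cutoff = max_observed_epoch + 1 - retain`, can be illustrated with a small sketch. The function names `pruneCutoff` and `epochsToPrune` are hypothetical and not taken from the PR; the sketch assumes that "older" means an epoch strictly below the cutoff and guards against unsigned underflow on young chains.

```go
package main

import (
	"fmt"
	"sort"
)

// pruneCutoff returns the lowest epoch that must be kept. Epochs strictly
// below the returned value are eligible for pruning. Hypothetical helper.
func pruneCutoff(maxObservedEpoch, retain uint64) uint64 {
	if maxObservedEpoch+1 < retain {
		return 0 // fewer epochs exist than we want to retain: prune nothing
	}
	return maxObservedEpoch + 1 - retain
}

// epochsToPrune lists the stored epochs that fall below the cutoff,
// in ascending order.
func epochsToPrune(stored []uint64, maxObservedEpoch, retain uint64) []uint64 {
	cutoff := pruneCutoff(maxObservedEpoch, retain)
	var out []uint64
	for _, e := range stored {
		if e < cutoff {
			out = append(out, e)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	stored := []uint64{7, 10, 8, 6, 9}
	fmt.Println(pruneCutoff(10, 3))           // 8
	fmt.Println(epochsToPrune(stored, 10, 3)) // [6 7]
}
```

With `retain = 3` and a maximum observed epoch of 10, epochs 8, 9, and 10 survive, which matches the "retain N=3 epochs" note in the architecture diagram.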

The pruning test queried latestEpoch at the very end and asserted that its session partition existed. But the advance-epochs loop exits via waitForNextEpoch after the last write, so by the time the assertion runs the chain's current epoch has no devshard activity and therefore no partition. Capture the epochIndex of the last tick's escrow during the loop and assert against that partition instead.

Problem: API startup waited for devshard legacy migration and full session replay before starting the ML/admin servers. On large devshard state this delayed port 9100 by minutes even though most endpoints did not need recovered devshard sessions.

Solution: Gate devshard session routes with a 503 initializing response, run legacy migration in the background, then mark devshard ready and recover sessions asynchronously. Requests after migration still lazily recover a single escrow before serving it.

Flow:
startup -> register gated routes -> start servers -> migrate legacy DB -> mark ready -> background recovery
request -> ready? no -> 503 initializing
request -> ready? yes -> session cached? yes -> serve
request -> ready? yes -> session cached? no -> recover escrow -> serve
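The gating step in this flow can be sketched with a small HTTP middleware. This is a minimal illustration under assumptions, not the PR's implementation: `readyGate`, `okHandler`, `statusFor`, and the route path are hypothetical names, and the lazy per-escrow recovery is left out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// readyGate answers 503 until the background migration marks it ready.
// Hypothetical type, for illustration only.
type readyGate struct {
	ready atomic.Bool
}

// wrap returns a handler that rejects requests while devshard is initializing.
func (g *readyGate) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !g.ready.Load() {
			http.Error(w, "devshard initializing", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// okHandler stands in for a real devshard session route.
func okHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
}

// statusFor sends one in-memory request and returns the response status code.
func statusFor(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/devshard/session", nil))
	return rec.Code
}

func main() {
	gate := &readyGate{}
	h := gate.wrap(okHandler())

	fmt.Println(statusFor(h)) // 503 while the legacy migration is still running
	gate.ready.Store(true)    // the background migration would flip this flag
	fmt.Println(statusFor(h)) // 200 once devshard is marked ready
}
```

Because the servers start before the flag flips, unrelated endpoints become reachable immediately and only the gated routes return 503.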

a-kuprin · 2026-06-06T22:32:42Z

/run-integration

a-kuprin · 2026-06-07T06:30:12Z

/run-integration

Secure observability UI proxying with Jaeger basic auth and opt-in defaults. Disable Jaeger and Grafana public UI routes by default, require Jaeger basic auth and a strong Grafana admin password before the proxy will expose /jaeger/ or /grafana/, and document the setup in join config and observability docs.

a-kuprin · 2026-06-09T10:01:37Z

Added #1326, which fixes the issue that was found:

Hosts could diverge from the user on SealedAcc / post_state_root because sealing used a wall-clock grace gate outside the signed diff

* Move devshard inference sealing into deterministic state-machine auto-seal. Host-local wall-clock prune tiers made seal timing node-dependent and risked diverging state roots. Fold eligible inferences during diff apply using nonce and ConfirmedAt-derived state clock gates, and have the host emit payload-prune events only after the machine seals them.
* Add a short path for sealing: if an inference is validated or invalidated, seal it immediately without waiting for the grace period. As an additional check before sealing, the inference must have one of the following statuses: StatusFinished, StatusValidated, StatusInvalidated, StatusTimedOut.

---------

Co-authored-by: akup <ak@neonavigation.com>
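A rough sketch of a deterministic seal check along these lines follows. It is not the code from #1326. The status names come from the commit message; everything else is an assumption made for illustration: the `Inference` fields, `StatusPending`, the use of `0` for an undefined state clock, requiring both the nonce gate and the clock gate to pass, and applying the short path even when the clock is undefined.

```go
package main

import "fmt"

type Status int

const (
	StatusPending Status = iota // hypothetical non-sealable status
	StatusFinished
	StatusValidated
	StatusInvalidated
	StatusTimedOut
)

// Inference holds only the fields this sketch needs (assumed shape).
type Inference struct {
	Nonce       uint64
	ConfirmedAt int64 // unix seconds taken from signed state
	Status      Status
}

// canSeal depends only on values carried in the signed state, never on the
// host's wall clock, so every node reaches the same decision.
// stateClock == 0 is treated as "undefined" (assumption).
func canSeal(inf Inference, latestNonce uint64, stateClock int64, graceNonce uint64, graceTimeout int64) bool {
	switch inf.Status {
	case StatusValidated, StatusInvalidated:
		return true // short path: no grace period
	case StatusFinished, StatusTimedOut:
		// continue to the grace gates below
	default:
		return false
	}
	if stateClock == 0 {
		return false // no ConfirmedAt-derived clock yet: do not seal
	}
	nonceOK := latestNonce >= inf.Nonce+graceNonce
	clockOK := stateClock >= inf.ConfirmedAt+graceTimeout
	return nonceOK && clockOK // assumption: both gates must pass
}

func main() {
	const graceNonce, graceTimeout = 10, 60
	validated := Inference{Nonce: 1, ConfirmedAt: 100, Status: StatusValidated}
	finished := Inference{Nonce: 1, ConfirmedAt: 100, Status: StatusFinished}

	fmt.Println(canSeal(validated, 2, 0, graceNonce, graceTimeout))   // true: short path
	fmt.Println(canSeal(finished, 5, 200, graceNonce, graceTimeout))  // false: nonce gate not reached
	fmt.Println(canSeal(finished, 11, 160, graceNonce, graceTimeout)) // true: both gates passed
}
```

The point of the sketch is the input set: since nothing host-local enters the decision, two hosts applying the same diff cannot disagree on which inferences are sealed.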

0xMayoor · 2026-06-11T13:01:40Z

devshardAssignedUpperBoundForSlot (devshard_settlement.go) is documented as "the maximum number of inference IDs that could have been assigned to a slot" — an upper bound, 1 + (nonce-firstAssigned)/slotCount. but the settle handler uses it as the actual completed count: assignedToSlot, _ := devshardAssignedUpperBoundForSlot(msg.Nonce, ...) → AggregateDevshardHostStatsIntoCurrentEpochStats(participant, *hs, assignedToSlot), which credits completed = assignedPerSlot - missed straight into CurrentEpochStats.InferenceCount. so the credited inference count comes from the settlement nonce, not from work the hosts actually attested.

the nonce isn't bound to real work. in applyCore (devshard/state/machine.go) an empty diff (or MsgFinalizeRound) advances LatestNonce with no StartInference, and the per-nonce fee is only charged in the Active phase — so once you're in Finalizing/Settlement you can advance the nonce up to the max for free. the new host-side max-nonce limit caps the magnitude (~MaxNonce/groupSize per slot, ~1250 at the defaults) but doesn't change that the count is decoupled from work. hosts still sign those empty roots — the only acceptance checker withholds on a stale mempool, not on an inference-less diff — and HostStats.Missed/Cost stay 0 since nothing finished or timed out. so an all-zero HostStats settlement at a high nonce is a valid quorum-signed payload, and each occupied slot's participant gets credited ~1250 "completed".

that's the same counter the downtime punishment reads (accountsettle.go, total = InferenceCount + MissedRequests). a participant who's genuinely down — say 50 served / 50 missed, normally zeroed by MissedStatTest — can settle one max-nonce escrow, fabricate ~1250 completed, drop their apparent miss-rate under p0, and keep the full reward. the same counters also feed getDynamicP0, so a large zero-missed contribution pulls the network-wide baseline down and tightens p0 for everyone.

create/settle is permissionless by default (AllowedCreatorAddresses empty) and slots are sampled from the epoch group, so any active participant can land a slot — one is enough. i have a small go test that runs the real devshardAssignedUpperBoundForSlot → AggregateDevshardHostStatsIntoCurrentEpochStats → CheckAndPunishForDowntime path and shows that same 50/50 participant flip from reward 0 to full reward; happy to share.

not prescribing a fix since that's your design, but the root is using the nonce-derived upper bound as the actual completed count — binding the credit to signed per-slot completed work (or cross-checking against Cost/validations at settle) would close it.
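The arithmetic in this comment can be sketched as follows. This is an illustration only: `assignedUpperBound`, `creditedCompleted`, and `missRate` are hypothetical stand-ins rather than the chain's functions, the slot count and nonce are made-up numbers, and treating the miss rate as `missed / (InferenceCount + MissedRequests)` is an assumption based on the `total` expression quoted above.

```go
package main

import "fmt"

// assignedUpperBound mirrors the quoted formula
// 1 + (nonce-firstAssigned)/slotCount. Hypothetical stand-in.
func assignedUpperBound(nonce, firstAssigned, slotCount uint64) uint64 {
	if slotCount == 0 || nonce < firstAssigned {
		return 0
	}
	return 1 + (nonce-firstAssigned)/slotCount
}

// creditedCompleted mirrors "completed = assignedPerSlot - missed".
func creditedCompleted(assigned, missed uint64) uint64 {
	if missed > assigned {
		return 0
	}
	return assigned - missed
}

// missRate is an assumed definition: missed / (InferenceCount + MissedRequests).
func missRate(inferenceCount, missed uint64) float64 {
	total := inferenceCount + missed
	if total == 0 {
		return 0
	}
	return float64(missed) / float64(total)
}

func main() {
	// Illustrative numbers: 16 slots, slot first assigned at nonce 3,
	// settlement at nonce 1000, HostStats.Missed = 0.
	assigned := assignedUpperBound(1000, 3, 16)
	fmt.Println(assigned, creditedCompleted(assigned, 0)) // 63 63

	// The 50 served / 50 missed participant from the comment,
	// before and after roughly 1250 credited completions.
	fmt.Printf("%.3f\n", missRate(50, 50))      // 0.500
	fmt.Printf("%.3f\n", missRate(50+1250, 50)) // 0.037
}
```

The credited count is a pure function of the settlement nonce, which is why an all-zero HostStats at a high nonce still moves the ratio.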

0xMayoor · 2026-06-11T13:08:18Z

two more verification gaps in the v2 runtime this PR ships — both the same "sibling verifies, twin doesn't" shape, and i've got fixes open against main for each:

fetchSignature (devshard/user/session.go) stores the bytes a host returns from GET /signatures keyed by slot, with only a slot-ownership check and no RecoverAddress — so a host can hand back arbitrary bytes that then get counted toward quorum. its sibling processResponse recovers and matches the address before storing. fix: #1311.

HandleGossipTxs (devshard/transport/server.go) forwards gossiped txs into the mempool after only a group-membership check, with no per-tx proposer-sig verification — so a group member can inject forged txs the host then trusts (e.g. a forged validation vote that suppresses the host's own validation via the mempool oracle). its sibling HandleGossipNonce does RecoverAddress + slot match before storing. fix: #1312.

both are still present on devshard-0.2.13-v2 at the current head — flagging here since they ride along in the code under review.

a-kuprin · 2026-06-11T15:34:03Z

@0xMayoor

I've seen both, and they are candidates for the next release in one or two weeks. We just need to keep the scope of this release finite.

blizko · 2026-06-12T13:44:22Z

@@ -4,7 +4,7 @@ go 1.24.2

 replace (
 	cosmossdk.io/store => github.com/gonka-ai/cosmos-sdk/store v1.1.2-ps1
-	github.com/cosmos/cosmos-sdk => github.com/gonka-ai/cosmos-sdk v0.53.3-ps17
+	github.com/cosmos/cosmos-sdk => github.com/gonka-ai/cosmos-sdk v0.53.3-ps17-observability


Are we planning to point this dependency at a stable version instead of a feature branch?

blizko · 2026-06-12T13:44:48Z

@@ -788,8 +790,8 @@ github.com/golangci/revgrep v0.5.3 h1:3tL7c1XBMtWHHqVpS5ChmiAAoe4PF/d5+ULzV9sLAz
 github.com/golangci/revgrep v0.5.3/go.mod h1:U4R/s9dlXZsg8uJmaR1GrloUr14D7qDl8gi2iPXJH8k=
 github.com/golangci/unconvert v0.0.0-20240309020433-c5143eacb3ed h1:IURFTjxeTfNFP0hTEi1YKjB/ub8zkpaOqFFMApi2EAs=
 github.com/golangci/unconvert v0.0.0-20240309020433-c5143eacb3ed/go.mod h1:XLXN8bNw4CGRPaqgl3bv/lhz7bsGPh4/xSaMTbo2vkQ=
-github.com/gonka-ai/cosmos-sdk v0.53.3-ps17 h1:xw8ssDJDfl+/TnD9QMq/EZGzjnoh+6cvROqZE/MwNzU=
-github.com/gonka-ai/cosmos-sdk v0.53.3-ps17/go.mod h1:90S054hIbadFB1MlXVZVC5w0QbKfd1P4b79zT+vvJxw=
+github.com/gonka-ai/cosmos-sdk v0.53.3-ps17-observability h1:vWph4b1Xzvwj9jV3BVD6RXQLqRmCsGNyPAxePlFIU0Q=


Are we planning to point this dependency at a stable version instead of a feature branch?

a-kuprin · 2026-06-12

stable version, not a feature branch.
Do you have any concerns about this?

a-kuprin · 2026-06-12T14:18:59Z

@0xMayoor

> so the credited inference count comes from the settlement nonce, not from work the hosts actually attested.

Basically, in devshard nonceId == inferenceId, but you are right that there are service nonces, such as the one carrying MsgFinalizeRound.
devshard is designed to serve a large number of inferences, so this doesn't break the stats.

But again, you are right that we should add `- 1`.

0xMayoor · 2026-06-12T17:01:58Z

yeah fair @a-kuprin, the active-phase fee bounds it so it's not free like i implied, my bad.
the gap's bigger than -1 though: once finalizing starts the nonce keeps advancing with no fee till LatestNonce >= FinalizeNonce + len(Group), so it's the whole finalize window, not one service nonce.
and that count lands in CurrentEpochStats.InferenceCount which feeds the downtime punishment denom and the dynamicP0 baseline, so it shifts the miss-rate test a bit, not just a display stat.
might be small in normal runs, you'd know better. figured worth subtracting the window, not just 1.

akup and others added 30 commits February 25, 2026 18:19

Run one block and exit

1879119

Add docker-build workflow (from test-782-783)

f71fb96

Co-authored-by: Cursor <cursoragent@cursor.com>

replace binary

726250f

Fixing go.mod

3a8dbcd

Build with custom cosmos sdk

c68b287

chore(upgrade): scaffold v0.2.13 handler and governance artifacts

fabde5b

Sets DevshardEscrowParams.MaxEscrowsPerEpoch to 500_000.

feat(api): default node manager gRPC port to 9400

562d4c9

Skip startup only when the port is set negative; treat 0 as unset and fall back to 9400. Wire the same default into the join compose file via NODE_MANAGER_GRPC_PORT so devshard reaches the API without manual config.
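The port rule described in this commit can be sketched as a small resolver. The function and constant names are hypothetical; only the three cases (negative disables, 0 falls back to 9400, anything else is used as configured) come from the commit message.

```go
package main

import "fmt"

// defaultNodeManagerGRPCPort is the fallback named in the commit message.
const defaultNodeManagerGRPCPort = 9400

// resolveGRPCPort maps a configured value to the port to listen on.
// enabled is false only when the configured port is negative.
// Hypothetical helper, for illustration only.
func resolveGRPCPort(configured int) (port int, enabled bool) {
	switch {
	case configured < 0:
		return 0, false // explicitly disabled: skip startup
	case configured == 0:
		return defaultNodeManagerGRPCPort, true // unset: use the default
	default:
		return configured, true
	}
}

func main() {
	for _, c := range []int{-1, 0, 9500} {
		port, enabled := resolveGRPCPort(c)
		fmt.Printf("configured=%d enabled=%t port=%d\n", c, enabled, port)
	}
}
```

Treating 0 as unset matters because an integer config field that was never written reads as 0, and that case should not silently disable the server.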

Fix: non-eligible models confirmation poc

5b54ef3

refactoring

f90f9e7

corner case

c4e8edd

chore(upgrade): register v0.2.13 upgrade handler

28614f7

Wire CreateUpgradeHandler with InferenceKeeper and AuthzKeeper so the chain runs the v0.2.13 migrations at the upgrade height. No module ConsensusVersion bump: the handler edits existing collections, no inference store schema change.

fix(poc): skip confirmation penalties without evidence

e59a07e

Desctiption

109eba7

refactor(inference): reuse confirmation weight coefficients

80c0902

fix(upgrade): skip confirmation PoC for the rest of v0.2.13 upgrade e…

4b2cd6a

…poch Reuses the v0.2.10 grace-epoch primitive with UpgradeProtectionWindow=3000.

Minor fixes

bd4b0b4

Add ldflgs back

b27618d

Snapshot for devshards (#1149)

b3accc9

* devshard snapshots for hosts
* devshards recoversessions parallel workers
* devshard host snapshot on settlement

---------

Co-authored-by: David and Daniil Liberman <da@liberman.net>

recover duration

975647c

verions on conflicts

f25426e

check before create escrow

048535a

stats

b2a6714

Set version in CICD

e2e319f

Simplify

cc86a87

Fix in-test devshardd version

491e500

akup and others added 4 commits June 7, 2026 13:30

Fix flacky test.

781c523

If the genesis node in the test gets <= 2 slots, we cannot correctly observe validated inferences before settlement.

Update proxy/nginx.unified.conf.template

29edf76

Co-authored-by: Daniil Yankouski <yankouski.daniil@gmail.com> Signed-off-by: a-kuprin <instig@mail.ru>

Update proxy/nginx.unified.conf.template

a35139e

Co-authored-by: Daniil Yankouski <yankouski.daniil@gmail.com> Signed-off-by: a-kuprin <instig@mail.ru>

blizko reviewed Jun 8, 2026

Comment thread deploy/join/docker-compose.yml Outdated

a-kuprin and others added 3 commits June 8, 2026 18:49

Merge pull request #1320 from a-kuprin/devshard-0.2.13-v2-obs-protection

1e73426

Secure observability UI proxying with Jaeger basic auth and opt-in de…

CONFIG_instrumentation__prometheus set by env var

9df75e6

inference timeout

78991b6

a-kuprin and others added 4 commits June 9, 2026 21:16

Fixing flaky test

8b9e0fe

Cleanup, as gateway already has more compex solution, and there was d…

a3cb052

…ead code

Quote in entrypoint.sh that broked start

1d180f9

a-kuprin and others added 5 commits June 11, 2026 21:34

Using min confirmedAt instead of max (debugging) (#1330)

7af5384

* Parameters naming and inferenceSealGraceNonce, inferenceSealGraceTimeout moved to EscrowStart
* Don't seal inferences when stateClock is undefined (no confirmedAt value in latest inferences)

Remove ,string from the two uint32 grace tags

09d7f0c

AutoSealEveryNNonces is also passed as governance parameter.

aa17f7a

It is set in the escrow start message and cannot be changed during the escrow session. The default is 150. It is required for the e2e testermint test to pass; that test checks that auto-sealing works.

Fixing flaky tests

84a2783

NodeDisableInferenceTests adjust inference window

cd1172d

blizko reviewed Jun 12, 2026

Comment thread devshard/devshardctl Outdated

blizko reviewed Jun 12, 2026

Add devshardctl, devshardd binaries to gitignore

77220ba

a-kuprin wants to merge 135 commits into main from devshard-0.2.13-v2

10 participants
