devshard v2 (v0.2.13-devshard-v2) #1289

Open
a-kuprin wants to merge 135 commits into main from devshard-0.2.13-v2
a-kuprin commented Jun 1, 2026

Collaborator

This PR prepares the devshard v2 release.

This is the first devshard-only upgrade, which operates independently of usual chain upgrades. Once approved, v2 will run in parallel with the existing v1 devshard runtime.

See the upgrade design doc and the versioned/ package for details.

Upgrade process

  • Release the devshardd binary as a Gonka release artifact
  • Submit a governance proposal to register the new supported version in DevshardEscrowParams.approved_versions (defining the name, binary download URL, and sha256 hash)
  • If the proposal is approved, versiond automatically downloads the binary and serves it under the /devshard/v2 prefix
  • Once /devshard/v2 is available, contributors can test it before gateways switch primary traffic to v2

No manual host steps are expected during this type of upgrade.
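
The governance proposal carries a sha256 hash for the binary, so a downloader can check the artifact before serving it. The following is a minimal sketch of that check, not versiond's actual code: `verifyBinary` and the sample payload are hypothetical names introduced for illustration.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// verifyBinary reports whether data hashes to the expected hex-encoded
// SHA-256 digest. Hypothetical helper, not taken from the repository.
func verifyBinary(data []byte, expectedHex string) (bool, error) {
	want, err := hex.DecodeString(strings.TrimSpace(expectedHex))
	if err != nil {
		return false, fmt.Errorf("decode expected hash: %w", err)
	}
	if len(want) != sha256.Size {
		return false, fmt.Errorf("expected hash has %d bytes, want %d", len(want), sha256.Size)
	}
	got := sha256.Sum256(data)
	// Constant-time comparison is a habit, not a requirement for public hashes.
	return subtle.ConstantTimeCompare(got[:], want) == 1, nil
}

func main() {
	payload := []byte("devshardd-binary-bytes") // stand-in for the downloaded artifact
	sum := sha256.Sum256(payload)
	ok, err := verifyBinary(payload, hex.EncodeToString(sum[:]))
	fmt.Println(ok, err)
}
```

A real downloader would presumably hash the file as a stream (for example with `io.Copy` into `sha256.New()`) rather than hold the whole binary in memory.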

devshard

  • Prune old epoch storage on epoch changes, move SQLite/Postgres schema setup out of hot paths, and select exactly one storage backend per process
  • Remove the seed reveal round, seal completed inference stats, and prune payloads so long-running sessions do not keep all served inferences in RAM or state
  • Re-gossip stale MsgFinishInference transactions so the sequencer can pick them up from another host's mempool
  • Enforce the governance-controlled maximum nonce limit on hosts to reject invalid requests before settlement
  • Separate devshard runtime version from state-root protocol version and stamp protocol v2 at build time
  • Create sessions from on-chain escrow fee snapshots and runtime config instead of hardcoded values (with direct chain fallback until mainnet has the matching NodeManager runtime-config endpoint)
  • Store per-inference validation counters outside the state root in SQLite/Postgres and expose per-slot totals through devshard stats endpoints after inference pruning
  • Add internal devshard traces and metrics through OpenTelemetry and Prometheus
  • Return typed devshard errors for disabled, initializing, and non-retryable states instead of generic failures

decentralized-api

The changes in the decentralized-api/ module are fully backward compatible and do not need to be activated before the next mainnet release.

  • Serve chain-backed devshard runtime config through the NodeManager GetRuntimeConfig gRPC long-poll
  • Add dapi traces and metrics for public inference requests, event listening, validation, chain queries, transaction broadcasts, and ML node calls
  • Propagate trace context across executor forwarding, validation payload fetches, and ML node calls

inference-chain

The changes in the inference-chain/ module are wire-compatible and do not need to be activated before the next mainnet release.

  • Rename the version field to state_root_and_protocol_version in the devshard settlement message proto
  • Move devshard session timeouts, fees, validation rates, vote threshold factor, and grace periods to governance-controlled DevshardEscrowParams
  • Add create_devshard_fee and fee_per_nonce to DevshardEscrow to snapshot active fees at escrow creation

deploy

  • Add join-stack observability with Grafana, Jaeger, Prometheus, Loki, Promtail, and cAdvisor
  • Add dashboards for devshard sessions, chain health, query latency, storage, containers, and node health

Proposed Bounties

| Bounty ID | Sum USDT | Bounty Explanation | GitHub ID |
|---|---|---|---|
| PR #1114, PR #1115 | 3000 | Certik security audit fixes (GEB-62, GEB-59, GEB-60), reported in Issue #1109 | @x0152 |
| Issue #1135 | 30000 | PoC Decode. So far, PoC validation has only covered the prefill step, but most of the real computation in inference happens during decode, which goes unverified. PoC-decode extends it to every decode step, so a node running a different/cheaper model gets caught. It closes the biggest open gap in the network's PoC validation mechanism. spec | Axel-t |
| PR #1035 | 100 | fix(subnetctl): propagate fatal HTTP errors instead of waiting on timeout | @unameisfine |
| PR #1298 | 17000 | Devshard 0.2.13 v2 - release implementation and management | @akup |
| PR #1046 | 4000 | Observability implementation | @qdanik |
| PR #1046 | 2000 | Observability implementation | @blizko |
| branch | 7000 | Emergency troubleshooting | @qdanik |
| -- | 3000 | Gateway - implementation work | @qdanik |
| report | 7000 | Emergency troubleshooting - schema bomb and B200 investigation | kaitaku.ai |
| MiniMax, Additional benchm | 10000 | MiniMax integration + post-deploy bug-fixing + additional benchmarks + community FAQ | kaitaku.ai |
| Issue #1026 | 5000 | VLM inference and validation in Gonka - testing VLM serving validation and adding the necessary tools/scripts (inference + validation for visual language models, threshold calibration across honest/fraud scenarios) | @fedor-konovalenko, MIL team |
| Issue #34 | 5000 | TOPLOC as a validation mechanism. Evaluated using topic to reduce artifact size. The original paper reported near-100% accuracy, but only on small models (Llama-8B); Experiment results matched the paper for small models, while accuracy dropped on large models (235B). | @fedor-konovalenko, MIL team |
| docs#1093, docs#1134, docs#992, docs#1094 | 500 | docs: restructure governance section and expand guidance; add MiniMax-M2.7 and Kimi K2.6 model licenses; update host hardware specifications | @Dolper |

akup and others added 30 commits February 25, 2026 18:19
Co-authored-by: Cursor <cursoragent@cursor.com>
Sets DevshardEscrowParams.MaxEscrowsPerEpoch to 500_000.
Skip startup only when the port is set negative; treat 0 as unset and
fall back to 9400. Wire the same default into the join compose file via
NODE_MANAGER_GRPC_PORT so devshard reaches the API without manual config.
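
The port rule in this commit message (negative disables startup, 0 means unset and falls back to 9400) can be sketched as below. `resolveGRPCPort` and the constant name are hypothetical; only the three cases and the 9400 default come from the message.

```go
package main

import "fmt"

// defaultNodeManagerGRPCPort is the fallback named in the commit message.
const defaultNodeManagerGRPCPort = 9400

// resolveGRPCPort maps a configured value to the effective port.
// Hypothetical helper: negative disables the server, 0 is treated as unset.
func resolveGRPCPort(configured int) (port int, enabled bool) {
	switch {
	case configured < 0:
		return 0, false
	case configured == 0:
		return defaultNodeManagerGRPCPort, true
	default:
		return configured, true
	}
}

func main() {
	for _, p := range []int{-1, 0, 9500} {
		port, enabled := resolveGRPCPort(p)
		fmt.Printf("configured=%d port=%d enabled=%v\n", p, port, enabled)
	}
}
```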
A participant restored to ACTIVE inherited the prior ConsecutiveInvalidInferences,
so a single new failure could re-invalidate them immediately. Zero the counter
when transitioning to INVALID and at every upcoming-to-effective promotion.
Replace the hardcoded keeper.DevshardMaxNonce constant with a governance
parameter on DevshardEscrowParams. VerifyDevshardSettlement now receives
the bound from params; the settle msg server reads it before verifying.
The v0.2.13 upgrade handler raises MaxNonce to 1_000_000 and bundles the
existing MaxEscrowsPerEpoch=500_000 bump into the same step.
…2.13

v0.2.12 added MsgRespondDealerComplaints to InferenceOperationKeyPerms
but did not migrate existing cold-to-warm grants, leaving pre-v0.2.12
DAPIs unable to respond to dealer complaints. Walk authz grants, key
each pair off its MsgStartInference grant, and add the missing
authorization with the source grant's expiration. Idempotent.
Wire CreateUpgradeHandler with InferenceKeeper and AuthzKeeper so the
chain runs the v0.2.13 migrations at the upgrade height. No module
ConsensusVersion bump: the handler edits existing collections, no
inference store schema change.
# Devshard storage: Postgres backend + epoch pruning

Drop-in replacement for the unbounded single-file SQLite store on `main`.
SQLite-only deployments need no config change; new binaries auto-migrate
the legacy DB on first boot.

## Architecture

```
HostManager
  -> ManagedStorage           // 30s pruner, retain N=3 epochs
       -> SQLite              // PGHOST unset
       -> HybridStorage       // PGHOST set
            -> Postgres       // primary, sticky per-escrow
            -> SQLite         // local fallback while PG is down
```

Storage is partitioned by `epoch_id` (= `DevshardEscrow.epoch_index`):

- Postgres: `devshard_sessions`, `devshard_diffs`, `devshard_signatures`
  each `PARTITION BY RANGE (epoch_id)`. Partitions are created lazily;
  pruning is `DROP TABLE`.
- SQLite: one `epoch_<N>.db` per epoch plus a `_meta.db` routing index;
  pruning closes the pool and removes the file.
- Hybrid: per-escrow stickiness keeps a session on one backend.

`ManagedStorage` ticks every 30s, computes
`cutoff = max_observed_epoch + 1 - retain`, and prunes everything older.
An `EpochProvider` advances the cutoff on quiet hosts.
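
As a rough illustration of the cutoff rule above, the sketch below computes `cutoff = max_observed_epoch + 1 - retain` and selects the epochs that fall below it. The function names are hypothetical and the underflow guard is an assumption about how small epoch numbers should behave.

```go
package main

import (
	"fmt"
	"sort"
)

// pruneCutoff returns the oldest epoch that is still retained.
// Epochs strictly below the cutoff are eligible for pruning.
func pruneCutoff(maxObserved, retain uint64) uint64 {
	if maxObserved+1 < retain {
		return 0 // assumption: nothing to prune until enough epochs exist
	}
	return maxObserved + 1 - retain
}

// epochsToPrune lists stored epochs older than the cutoff, in ascending order.
func epochsToPrune(stored []uint64, maxObserved, retain uint64) []uint64 {
	cutoff := pruneCutoff(maxObserved, retain)
	var out []uint64
	for _, e := range stored {
		if e < cutoff {
			out = append(out, e)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	stored := []uint64{5, 6, 7, 8, 9, 10}
	fmt.Println("cutoff:", pruneCutoff(10, 3))
	fmt.Println("prune:", epochsToPrune(stored, 10, 3))
}
```

With `retain = 3` and a maximum observed epoch of 10, the cutoff is 8, so epochs 8, 9, and 10 are kept and everything older is dropped.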

## Drop-in guarantees

- `PGHOST` unset -> SQLite-only, identical to before.
- `PGHOST` set -> hybrid mode, same env vars as `payloadstorage`.
- Legacy `/root/.dapi/data/devshard.db` is migrated to
  `/root/.dapi/data/devshard/` on first boot, then renamed
  `*.migrated.<unix>`. Idempotent across restarts.
- Per-host storage. No schema, proto, HTTP, or gossip changes.

## Tradeoffs

For simplicity, partitioning is by `epoch_id` only, not
`(epoch_id, escrow_id)`. Loading a session reads its diffs from the
shared epoch partition (indexed on `escrow_id`). The next step is per-escrow state snapshots (data +
additions) so readers skip the diff replay.
…poch

Reuses the v0.2.10 grace-epoch primitive with UpgradeProtectionWindow=3000.
The pruning test queried latestEpoch at the very end and asserted that
its session partition existed. But the advance-epochs loop exits via
waitForNextEpoch after the last write, so by the time the assertion
runs the chain's current epoch has no devshard activity and therefore
no partition. Capture the epochIndex of the last tick's escrow during
the loop and assert against that partition instead.
Problem:
API startup waited for devshard legacy migration and full session replay before
starting the ML/admin servers. On large devshard state this delayed port 9100 by
minutes even though most endpoints did not need recovered devshard sessions.

Solution:
Gate devshard session routes with a 503 initializing response, run legacy
migration in the background, then mark devshard ready and recover sessions
asynchronously. Requests after migration still lazily recover a single escrow
before serving it.

Flow:
startup -> register gated routes -> start servers
        -> migrate legacy DB -> mark ready -> background recovery

request -> ready? no -> 503 initializing
request -> ready? yes -> session cached? yes -> serve
request -> ready? yes -> session cached? no -> recover escrow -> serve
* devshard snapshots for hosts

* devshards recoversessions parallel workers

* devshard host snapshot on settlement

---------

Co-authored-by: David and Daniil Liberman <da@liberman.net>
@a-kuprin

a-kuprin commented Jun 6, 2026

Collaborator Author

/run-integration

@a-kuprin

a-kuprin commented Jun 7, 2026

Collaborator Author

/run-integration

akup and others added 4 commits June 7, 2026 13:30
If the genesis node in the test gets <= 2 slots, we cannot observe validated inferences correctly before settlement
Co-authored-by: Daniil Yankouski <yankouski.daniil@gmail.com>
Signed-off-by: a-kuprin <instig@mail.ru>
Co-authored-by: Daniil Yankouski <yankouski.daniil@gmail.com>
Signed-off-by: a-kuprin <instig@mail.ru>
…faults.

Disable Jaeger and Grafana public UI routes by default, require Jaeger
basic auth and a strong Grafana admin password before the proxy will expose
/jaeger/ or /grafana/, and document the setup in join config and observability docs.
Comment thread deploy/join/docker-compose.yml Outdated
@a-kuprin

a-kuprin commented Jun 9, 2026

Collaborator Author

Added #1326, which fixes the following issue:

Hosts could diverge from the user on SealedAcc / post_state_root because sealing used a wall-clock grace gate outside the signed diff.

a-kuprin and others added 4 commits June 9, 2026 21:16
* Move devshard inference sealing into deterministic state-machine auto-seal.

Host-local wall-clock prune tiers made seal timing node-dependent and risked diverging state roots. Fold eligible inferences during diff apply using nonce and ConfirmedAt-derived state clock gates, and have the host emit payload-prune events only after the machine seals them.

* Added a short path for sealing inferences:
if an inference is validated or invalidated, seal it immediately instead of waiting for the grace period.

Added a check that an inference has one of the following statuses before it is sealed:
StatusFinished, StatusValidated, StatusInvalidated, StatusTimedOut
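
A deterministic seal check along these lines might look like the sketch below. Everything except the four status names is an assumption: the `Inference` fields, the `StatusStarted` value, the grace parameters, and in particular the choice to require both the nonce gate and the state-clock gate are illustrative, not the repository's logic.

```go
package main

import "fmt"

type Status int

const (
	StatusStarted Status = iota // hypothetical non-terminal status
	StatusFinished
	StatusValidated
	StatusInvalidated
	StatusTimedOut
)

// Inference holds only the fields this sketch needs (assumed layout).
type Inference struct {
	Status      Status
	Nonce       uint64 // nonce at which the inference was started
	ConfirmedAt int64  // state-derived timestamp, unix seconds
}

// canSeal uses only values carried in signed state, so every node that
// applies the same diffs reaches the same decision.
func canSeal(inf Inference, latestNonce uint64, stateClock int64, graceNonce uint64, graceTimeout int64) bool {
	switch inf.Status {
	case StatusValidated, StatusInvalidated:
		return true // short path: no grace period needed
	case StatusFinished, StatusTimedOut:
		// continue to the grace gates below
	default:
		return false
	}
	nonceOK := latestNonce >= inf.Nonce+graceNonce
	clockOK := stateClock >= inf.ConfirmedAt+graceTimeout
	return nonceOK && clockOK // assumption: both gates must pass
}

func main() {
	inf := Inference{Status: StatusFinished, Nonce: 10, ConfirmedAt: 1000}
	fmt.Println(canSeal(inf, 100, 2000, 150, 60)) // nonce gate not reached yet
	fmt.Println(canSeal(inf, 160, 1060, 150, 60)) // both gates passed
}
```

The point of the design is that no host-local wall clock appears in the inputs, which is what keeps state roots from diverging.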

---------

Co-authored-by: akup <ak@neonavigation.com>
@0xMayoor

Contributor

devshardAssignedUpperBoundForSlot (devshard_settlement.go) is documented as "the maximum number of inference IDs that could have been assigned to a slot", i.e. an upper bound, 1 + (nonce-firstAssigned)/slotCount. but the settle handler uses it as the actual completed count: assignedToSlot, _ := devshardAssignedUpperBoundForSlot(msg.Nonce, ...) -> AggregateDevshardHostStatsIntoCurrentEpochStats(participant, *hs, assignedToSlot), which credits completed = assignedPerSlot - missed straight into CurrentEpochStats.InferenceCount. so the credited inference count comes from the settlement nonce, not from work the hosts actually attested.

the nonce isn't bound to real work. in applyCore (devshard/state/machine.go) an empty diff (or MsgFinalizeRound) advances LatestNonce with no StartInference, and the per-nonce fee is only charged in the Active phase — so once you're in Finalizing/Settlement you can advance the nonce up to the max for free. the new host-side max-nonce limit caps the magnitude (~MaxNonce/groupSize per slot, ~1250 at the defaults) but doesn't change that the count is decoupled from work. hosts still sign those empty roots — the only acceptance checker withholds on a stale mempool, not on an inference-less diff — and HostStats.Missed/Cost stay 0 since nothing finished or timed out. so an all-zero HostStats settlement at a high nonce is a valid quorum-signed payload, and each occupied slot's participant gets credited ~1250 "completed".

that's the same counter the downtime punishment reads (accountsettle.go, total = InferenceCount + MissedRequests). a participant who's genuinely down — say 50 served / 50 missed, normally zeroed by MissedStatTest — can settle one max-nonce escrow, fabricate ~1250 completed, drop their apparent miss-rate under p0, and keep the full reward. the same counters also feed getDynamicP0, so a large zero-missed contribution pulls the network-wide baseline down and tightens p0 for everyone.

create/settle is permissionless by default (AllowedCreatorAddresses empty) and slots are sampled from the epoch group, so any active participant can land a slot; one is enough. i have a small go test that runs the real devshardAssignedUpperBoundForSlot -> AggregateDevshardHostStatsIntoCurrentEpochStats -> CheckAndPunishForDowntime path and shows that same 50/50 participant flip from reward 0 to full reward; happy to share.

not prescribing a fix since that's your design, but the root is using the nonce-derived upper bound as the actual completed count — binding the credit to signed per-slot completed work (or cross-checking against Cost/validations at settle) would close it.
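
To make the arithmetic in this report concrete, here is a small sketch of the quoted bound and its effect on a naive miss rate. It is not the reviewer's test: the function names are hypothetical, `missed / (completed + missed)` is an assumed stand-in for the real statistical test, and the slot count of 800 is chosen only so the bound lands on the roughly 1250 figure mentioned above.

```go
package main

import "fmt"

// assignedUpperBound mirrors the documented formula
// 1 + (nonce-firstAssigned)/slotCount using integer division.
func assignedUpperBound(nonce, firstAssigned, slotCount uint64) uint64 {
	if slotCount == 0 || nonce < firstAssigned {
		return 0
	}
	return 1 + (nonce-firstAssigned)/slotCount
}

// missRate is a simplified stand-in for the downtime check:
// missed / (completed + missed).
func missRate(completed, missed uint64) float64 {
	total := completed + missed
	if total == 0 {
		return 0
	}
	return float64(missed) / float64(total)
}

func main() {
	// Illustrative values only: nonce at a 1_000_000 cap, 800 slots.
	credited := assignedUpperBound(1_000_000, 1, 800)
	fmt.Println("credited per slot:", credited)

	fmt.Printf("honest miss rate:   %.3f\n", missRate(50, 50))
	fmt.Printf("inflated miss rate: %.3f\n", missRate(50+credited, 50))
}
```

Under these assumed numbers a participant with 50 served and 50 missed goes from a miss rate of 0.5 to under 0.04 once the nonce-derived credit is added.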

@0xMayoor

Contributor

two more verification gaps in the v2 runtime this PR ships — both the same "sibling verifies, twin doesn't" shape, and i've got fixes open against main for each:

fetchSignature (devshard/user/session.go) stores the bytes a host returns from GET /signatures keyed by slot, with only a slot-ownership check and no RecoverAddress — so a host can hand back arbitrary bytes that then get counted toward quorum. its sibling processResponse recovers and matches the address before storing. fix: #1311.

HandleGossipTxs (devshard/transport/server.go) forwards gossiped txs into the mempool after only a group-membership check, with no per-tx proposer-sig verification — so a group member can inject forged txs the host then trusts (e.g. a forged validation vote that suppresses the host's own validation via the mempool oracle). its sibling HandleGossipNonce does RecoverAddress + slot match before storing. fix: #1312.

both are still present on devshard-0.2.13-v2 at the current head — flagging here since they ride along in the code under review.
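
The "recover, match, then store" pattern described for the verifying siblings could be sketched as follows. This is not the code from #1311 or #1312: `RecoverFunc` stands in for the repository's RecoverAddress, whose real signature is not shown here, and `SignatureStore` is a hypothetical type.

```go
package main

import (
	"errors"
	"fmt"
)

// RecoverFunc is a stand-in for the repository's RecoverAddress:
// it returns the signer address for a digest and signature.
type RecoverFunc func(digest, sig []byte) (string, error)

// SignatureStore keeps one signature per slot, but only after the
// recovered signer matches the slot owner.
type SignatureStore struct {
	recoverAddr RecoverFunc
	slotOwner   map[int]string
	sigs        map[int][]byte
}

// Put verifies before storing, so unverified bytes never count toward quorum.
func (s *SignatureStore) Put(slot int, digest, sig []byte) error {
	owner, ok := s.slotOwner[slot]
	if !ok {
		return errors.New("unknown slot")
	}
	addr, err := s.recoverAddr(digest, sig)
	if err != nil {
		return fmt.Errorf("recover signer: %w", err)
	}
	if addr != owner {
		return fmt.Errorf("slot %d: signer %q is not the slot owner", slot, addr)
	}
	s.sigs[slot] = append([]byte(nil), sig...) // copy to avoid aliasing caller memory
	return nil
}

func main() {
	// Fake recoverer for the demo: treats the signature bytes as the address.
	fake := func(_, sig []byte) (string, error) { return string(sig), nil }
	s := &SignatureStore{recoverAddr: fake, slotOwner: map[int]string{0: "alice"}, sigs: map[int][]byte{}}
	fmt.Println(s.Put(0, []byte("root"), []byte("alice")))
	fmt.Println(s.Put(0, []byte("root"), []byte("mallory")))
}
```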

@a-kuprin

Collaborator Author

@0xMayoor

I've seen both, and they are candidates for the next release in 1 or 2 weeks. We just need to keep the scope of this release finite.

a-kuprin and others added 5 commits June 11, 2026 21:34
* Renamed parameters; moved inferenceSealGraceNonce and inferenceSealGraceTimeout to EscrowStart
* Don't seal inferences when stateClock is undefined (no confirmedAt value in latest inferences)
It is set in the escrow start message and is unchangeable during the escrow session
Default is 150.
It is required for the e2e testermint test to pass. That test checks that autodealing works
Comment thread devshard/devshardctl Outdated
Comment thread inference-chain/go.mod
@@ -4,7 +4,7 @@ go 1.24.2

replace (
cosmossdk.io/store => github.com/gonka-ai/cosmos-sdk/store v1.1.2-ps1
-github.com/cosmos/cosmos-sdk => github.com/gonka-ai/cosmos-sdk v0.53.3-ps17
+github.com/cosmos/cosmos-sdk => github.com/gonka-ai/cosmos-sdk v0.53.3-ps17-observability

Collaborator

Are we planning to pin this dependency to a stable version instead of a feature branch?

Comment thread inference-chain/go.sum
@@ -788,8 +790,8 @@ github.com/golangci/revgrep v0.5.3 h1:3tL7c1XBMtWHHqVpS5ChmiAAoe4PF/d5+ULzV9sLAz
github.com/golangci/revgrep v0.5.3/go.mod h1:U4R/s9dlXZsg8uJmaR1GrloUr14D7qDl8gi2iPXJH8k=
github.com/golangci/unconvert v0.0.0-20240309020433-c5143eacb3ed h1:IURFTjxeTfNFP0hTEi1YKjB/ub8zkpaOqFFMApi2EAs=
github.com/golangci/unconvert v0.0.0-20240309020433-c5143eacb3ed/go.mod h1:XLXN8bNw4CGRPaqgl3bv/lhz7bsGPh4/xSaMTbo2vkQ=
-github.com/gonka-ai/cosmos-sdk v0.53.3-ps17 h1:xw8ssDJDfl+/TnD9QMq/EZGzjnoh+6cvROqZE/MwNzU=
-github.com/gonka-ai/cosmos-sdk v0.53.3-ps17/go.mod h1:90S054hIbadFB1MlXVZVC5w0QbKfd1P4b79zT+vvJxw=
+github.com/gonka-ai/cosmos-sdk v0.53.3-ps17-observability h1:vWph4b1Xzvwj9jV3BVD6RXQLqRmCsGNyPAxePlFIU0Q=

Collaborator

Are we planning to pin this dependency to a stable version instead of a feature branch?

Collaborator Author

It will be a stable version, not a feature branch.
Do you have any concerns about this?

@a-kuprin

a-kuprin commented Jun 12, 2026

Collaborator Author

@0xMayoor

> so the credited inference count comes from the settlement nonce, not from work the hosts actually attested.

Basically, in devshard nonceId == inferenceId, but you are right that there are service nonces, such as the one carrying MsgFinalizeRound.
devshard is designed to serve a lot of inferences, so this doesn't break the stats.

But again, you are right that we should add - 1.

@0xMayoor

Copy link
Copy Markdown
Contributor

yeah fair @a-kuprin , the active-phase fee bounds it so it's not free like i implied, my bad.
the gap's bigger than -1 though — once finalizing starts the nonce keeps advancing with no fee till
LatestNonce >= FinalizeNonce +len(Group), so it's the whole finalize window not one service nonce.
and that count lands in CurrentEpochStats.InferenceCount which feeds the downtime punishment denom and the dynamicP0 baseline, so it shifts the miss-rate test a bit, not just a display stat.
might be small in normal runs, you'd know better — figured worth subtracting the window not just 1.

10 participants