devshard v2 (v0.2.13-devshard-v2) #1289
Co-authored-by: Cursor <cursoragent@cursor.com>
Sets DevshardEscrowParams.MaxEscrowsPerEpoch to 500_000.
Skip startup only when the port is set negative; treat 0 as unset and fall back to 9400. Wire the same default into the join compose file via NODE_MANAGER_GRPC_PORT so devshard reaches the API without manual config.
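The port rule in this commit can be sketched as a small helper. This is a minimal illustration, not the code from the PR: `resolveGRPCPort` and `defaultNodeManagerGRPCPort` are hypothetical names, and only the three behaviors (negative disables, 0 falls back to 9400, anything else is used as-is) come from the description.

```go
package main

import "fmt"

// defaultNodeManagerGRPCPort is the fallback named in the commit message.
const defaultNodeManagerGRPCPort = 9400

// resolveGRPCPort is a hypothetical helper. It returns the port to listen on
// and whether the server should start at all.
func resolveGRPCPort(configured int) (port int, enabled bool) {
	switch {
	case configured < 0:
		// Negative means "explicitly disabled": skip startup.
		return 0, false
	case configured == 0:
		// Zero means "unset": fall back to the default.
		return defaultNodeManagerGRPCPort, true
	default:
		return configured, true
	}
}

func main() {
	for _, c := range []int{-1, 0, 9500} {
		port, enabled := resolveGRPCPort(c)
		fmt.Printf("configured=%d port=%d enabled=%t\n", c, port, enabled)
	}
}
```

Treating 0 as unset matters because 0 is the zero value of an integer config field, so an operator who never set the port would otherwise disable the server by accident.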
A participant restored to ACTIVE inherited the prior ConsecutiveInvalidInferences, so a single new failure could re-invalidate them immediately. Zero the counter when transitioning to INVALID and at every upcoming-to-effective promotion.
Replace the hardcoded keeper.DevshardMaxNonce constant with a governance parameter on DevshardEscrowParams. VerifyDevshardSettlement now receives the bound from params; the settle msg server reads it before verifying. The v0.2.13 upgrade handler raises MaxNonce to 1_000_000 and bundles the existing MaxEscrowsPerEpoch=500_000 bump into the same step.
…2.13 v0.2.12 added MsgRespondDealerComplaints to InferenceOperationKeyPerms but did not migrate existing cold-to-warm grants, leaving pre-v0.2.12 DAPIs unable to respond to dealer complaints. Walk authz grants, key each pair off its MsgStartInference grant, and add the missing authorization with the source grant's expiration. Idempotent.
Wire CreateUpgradeHandler with InferenceKeeper and AuthzKeeper so the chain runs the v0.2.13 migrations at the upgrade height. No module ConsensusVersion bump: the handler edits existing collections, no inference store schema change.
# Devshard storage: Postgres backend + epoch pruning
Drop-in replacement for the unbounded single-file SQLite store on `main`.
SQLite-only deployments need no config change; new binaries auto-migrate
the legacy DB on first boot.
## Architecture
```
HostManager
  -> ManagedStorage          // 30s pruner, retain N=3 epochs
       -> SQLite             // PGHOST unset
       -> HybridStorage      // PGHOST set
            -> Postgres      // primary, sticky per-escrow
            -> SQLite        // local fallback while PG is down
```
Storage is partitioned by `epoch_id` (= `DevshardEscrow.epoch_index`):
- Postgres: `devshard_sessions`, `devshard_diffs`, `devshard_signatures`,
  each `PARTITION BY RANGE (epoch_id)`. Partitions are created lazily;
  pruning is `DROP TABLE`.
- SQLite: one `epoch_<N>.db` per epoch plus a `_meta.db` routing index;
  pruning closes the pool and removes the file.
- Hybrid: per-escrow stickiness keeps a session on one backend.
`ManagedStorage` ticks every 30s, computes
`cutoff = max_observed_epoch + 1 - retain`, and prunes everything older.
An `EpochProvider` advances the cutoff on quiet hosts.
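As a worked example of the cutoff formula, here is a minimal sketch. The function names are hypothetical, and the clamp to 0 for young chains is an assumption added so the unsigned arithmetic cannot underflow; only `cutoff = max_observed_epoch + 1 - retain` and "prune everything older" come from the text.

```go
package main

import "fmt"

// pruneCutoff returns the oldest epoch that must be kept.
// Epochs strictly below the cutoff are eligible for pruning.
func pruneCutoff(maxObservedEpoch, retain uint64) uint64 {
	// Assumption: clamp to 0 when fewer than `retain` epochs exist,
	// so the subtraction cannot wrap around.
	if maxObservedEpoch+1 <= retain {
		return 0
	}
	return maxObservedEpoch + 1 - retain
}

// epochsToPrune filters the stored epochs down to those older than the cutoff.
func epochsToPrune(stored []uint64, cutoff uint64) []uint64 {
	var out []uint64
	for _, e := range stored {
		if e < cutoff {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	cutoff := pruneCutoff(10, 3)
	fmt.Println("cutoff:", cutoff)
	fmt.Println("prune:", epochsToPrune([]uint64{6, 7, 8, 9, 10}, cutoff))
}
```

With `max_observed_epoch = 10` and `retain = 3`, the cutoff is 8, so epochs 8, 9, and 10 stay and epochs 6 and 7 are dropped.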
## Drop-in guarantees
- `PGHOST` unset -> SQLite-only, identical to before.
- `PGHOST` set -> hybrid mode, same env vars as `payloadstorage`.
- Legacy `/root/.dapi/data/devshard.db` is migrated to
`/root/.dapi/data/devshard/` on first boot, then renamed
`*.migrated.<unix>`. Idempotent across restarts.
- Per-host storage. No schema, proto, HTTP, or gossip changes.
## Tradeoffs
For simplicity, partitioning is by `epoch_id` only, not
`(epoch_id, escrow_id)`. Loading a session reads its diffs from the
shared epoch partition (indexed on `escrow_id`). The next step is per-escrow state snapshots (data +
additions) so readers skip the diff replay.
…poch Reuses the v0.2.10 grace-epoch primitive with UpgradeProtectionWindow=3000.
The pruning test queried latestEpoch at the very end and asserted that its session partition existed. But the advance-epochs loop exits via waitForNextEpoch after the last write, so by the time the assertion runs the chain's current epoch has no devshard activity and therefore no partition. Capture the epochIndex of the last tick's escrow during the loop and assert against that partition instead.
Problem:
API startup waited for devshard legacy migration and full session replay before
starting the ML/admin servers. On large devshard state this delayed port 9100 by
minutes even though most endpoints did not need recovered devshard sessions.
Solution:
Gate devshard session routes with a 503 initializing response, run legacy
migration in the background, then mark devshard ready and recover sessions
asynchronously. Requests after migration still lazily recover a single escrow
before serving it.
Flow:
startup -> register gated routes -> start servers
-> migrate legacy DB -> mark ready -> background recovery
request -> ready? no -> 503 initializing
request -> ready? yes -> session cached? yes -> serve
request -> ready? yes -> session cached? no -> recover escrow -> serve
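The gating step of this flow could look roughly like the sketch below. It covers only the "ready? no -> 503 initializing" branch; lazy per-escrow recovery is not shown. The names `readyGate`, `wrap`, `okHandler`, and the route path are hypothetical and not taken from the PR.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// readyGate guards devshard session routes until initialization finishes.
type readyGate struct {
	ready atomic.Bool
}

// wrap returns a handler that answers 503 until the gate is marked ready.
func (g *readyGate) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !g.ready.Load() {
			http.Error(w, "devshard initializing", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// okHandler stands in for a real session route.
func okHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
}

// statusFor issues one in-memory request and returns the response code.
func statusFor(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/devshard/session", nil))
	return rec.Code
}

func main() {
	gate := &readyGate{}
	h := gate.wrap(okHandler())
	fmt.Println("before migration:", statusFor(h))
	gate.ready.Store(true) // in the described flow, set after legacy migration
	fmt.Println("after migration:", statusFor(h))
}
```

Because the routes are registered before the servers start, non-devshard endpoints come up immediately while devshard callers get an explicit, retryable status instead of a connection error.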
* devshard snapshots for hosts
* devshards recoversessions parallel workers
* devshard host snapshot on settlement
---------
Co-authored-by: David and Daniil Liberman <da@liberman.net>
/run-integration
If the genesis node in the test gets <= 2 slots, we cannot observe validated inferences correctly before settlement.
Co-authored-by: Daniil Yankouski <yankouski.daniil@gmail.com> Signed-off-by: a-kuprin <instig@mail.ru>
…faults. Disable Jaeger and Grafana public UI routes by default, require Jaeger basic auth and a strong Grafana admin password before the proxy will expose /jaeger/ or /grafana/, and document the setup in join config and observability docs.
Secure observability UI proxying with Jaeger basic auth and opt-in de…
Added #1326, which fixes the issue found: hosts could diverge from the user on SealedAcc / post_state_root because sealing used a wall-clock grace gate outside the signed diff.
* Move devshard inference sealing into deterministic state-machine auto-seal. Host-local wall-clock prune tiers made seal timing node-dependent and risked diverging state roots. Fold eligible inferences during diff apply using nonce and ConfirmedAt-derived state clock gates, and have the host emit payload-prune events only after the machine seals them.
* Add a short path for sealing: if an inference is validated or invalidated, seal it immediately instead of waiting for the grace period. As an additional check, an inference is sealed only if it has one of the following statuses: StatusFinished, StatusValidated, StatusInvalidated, StatusTimedOut.
---------
Co-authored-by: akup <ak@neonavigation.com>
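A rough sketch of the seal decision described in this commit follows. It is illustrative only: the type and field names, the `StatusStarted` placeholder, and the exact gate arithmetic are assumptions. In particular, requiring both the nonce gate and the state-clock gate to pass is a guess; the commit message only says both are used. What the sketch does preserve is that every input comes from replicated state rather than a host's wall clock.

```go
package main

import "fmt"

type Status int

const (
	StatusStarted Status = iota // hypothetical non-terminal status
	StatusFinished
	StatusValidated
	StatusInvalidated
	StatusTimedOut
)

// Inference holds the fields this sketch needs (names are assumed).
type Inference struct {
	Status      Status
	StartNonce  uint64 // nonce at which the inference entered the state
	ConfirmedAt int64  // timestamp carried in the signed diff
}

// SealParams holds the grace gates (names are assumed).
type SealParams struct {
	GraceNonce   uint64
	GraceTimeout int64
}

// shouldSeal decides from deterministic inputs only: the current nonce and a
// state clock derived from ConfirmedAt values, never time.Now().
func shouldSeal(inf Inference, nonce uint64, stateClock int64, p SealParams) bool {
	switch inf.Status {
	case StatusValidated, StatusInvalidated:
		return true // short path: no grace period
	case StatusFinished, StatusTimedOut:
		// terminal but unvalidated: continue to the grace gates below
	default:
		return false // not in a sealable status
	}
	if nonce < inf.StartNonce {
		return false
	}
	// Assumption: both gates must pass.
	return nonce-inf.StartNonce >= p.GraceNonce &&
		stateClock-inf.ConfirmedAt >= p.GraceTimeout
}

func main() {
	p := SealParams{GraceNonce: 10, GraceTimeout: 60}
	inf := Inference{Status: StatusFinished, StartNonce: 5, ConfirmedAt: 100}
	fmt.Println(shouldSeal(inf, 10, 200, p))
	fmt.Println(shouldSeal(inf, 15, 160, p))
}
```

Since two hosts applying the same diff see the same nonce and the same state clock, they reach the same seal decision and therefore the same state root.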
the nonce isn't bound to real work. in that's the same counter the downtime punishment reads ( create/settle is permissionless by default ( not prescribing a fix since that's your design, but the root is using the nonce-derived upper bound as the actual completed count — binding the credit to signed per-slot completed work (or cross-checking against |
two more verification gaps in the v2 runtime this PR ships — both the same "sibling verifies, twin doesn't" shape, and i've got fixes open against main for each:
both are still present on |
I've seen both, and they are candidates for the next release in 1 or 2 weeks. We just need to keep this release finite in scope.
* Parameter naming; inferenceSealGraceNonce and inferenceSealGraceTimeout moved to EscrowStart
* Don't seal inferences when stateClock is undefined (no confirmedAt value in latest inferences)
It is set in the escrow start message and is unchangeable during the escrow session. Default is 150. It is required for the e2e testermint test to pass; that test checks that autodealing works.
@@ -4,7 +4,7 @@ go 1.24.2

 replace (
 	cosmossdk.io/store => github.com/gonka-ai/cosmos-sdk/store v1.1.2-ps1
-	github.com/cosmos/cosmos-sdk => github.com/gonka-ai/cosmos-sdk v0.53.3-ps17
+	github.com/cosmos/cosmos-sdk => github.com/gonka-ai/cosmos-sdk v0.53.3-ps17-observability
Are we planning to make this include as a stable version, instead of a feature branch?
@@ -788,8 +790,8 @@ github.com/golangci/revgrep v0.5.3 h1:3tL7c1XBMtWHHqVpS5ChmiAAoe4PF/d5+ULzV9sLAz
 github.com/golangci/revgrep v0.5.3/go.mod h1:U4R/s9dlXZsg8uJmaR1GrloUr14D7qDl8gi2iPXJH8k=
 github.com/golangci/unconvert v0.0.0-20240309020433-c5143eacb3ed h1:IURFTjxeTfNFP0hTEi1YKjB/ub8zkpaOqFFMApi2EAs=
 github.com/golangci/unconvert v0.0.0-20240309020433-c5143eacb3ed/go.mod h1:XLXN8bNw4CGRPaqgl3bv/lhz7bsGPh4/xSaMTbo2vkQ=
-github.com/gonka-ai/cosmos-sdk v0.53.3-ps17 h1:xw8ssDJDfl+/TnD9QMq/EZGzjnoh+6cvROqZE/MwNzU=
-github.com/gonka-ai/cosmos-sdk v0.53.3-ps17/go.mod h1:90S054hIbadFB1MlXVZVC5w0QbKfd1P4b79zT+vvJxw=
+github.com/gonka-ai/cosmos-sdk v0.53.3-ps17-observability h1:vWph4b1Xzvwj9jV3BVD6RXQLqRmCsGNyPAxePlFIU0Q=
stable version, not a feature branch.
Do you have any concerns on this?
Basically in devshard But again you are right that we should add |
yeah fair @a-kuprin, the active-phase fee bounds it so it's not free like i implied, my bad.
This PR prepares the devshard v2 release.
This is the first devshard-only upgrade, which operates independently of usual chain upgrades. Once approved, v2 will run in parallel with the existing v1 devshard runtime.
See the upgrade design doc and the versioned/ package for details.
Upgrade process
- `devshardd` binary as a Gonka release artifact
- `DevshardEscrowParams.approved_versions` (defining the name, binary download URL, and sha256 hash)
- `versiond` automatically downloads the binary and serves it under the `/devshard/v2` prefix
- Once `/devshard/v2` is available, contributors can test it before gateways switch primary traffic to v2

No manual host steps are expected during this type of upgrade.
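The hash check implied by an approved-version entry can be sketched as follows. This is not the `versiond` implementation: the `ApprovedVersion` struct, its field names, and the example URL are hypothetical stand-ins for the name, binary download URL, and sha256 hash mentioned above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// ApprovedVersion mirrors the three fields named in the description
// (field names are assumed).
type ApprovedVersion struct {
	Name      string
	BinaryURL string
	SHA256    string // hex-encoded digest of the binary
}

// hashHex returns the hex-encoded SHA-256 digest of data.
func hashHex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// verifyBinary rejects a downloaded binary whose digest does not match
// the approved entry.
func verifyBinary(data []byte, v ApprovedVersion) error {
	if got := hashHex(data); !strings.EqualFold(got, v.SHA256) {
		return fmt.Errorf("sha256 mismatch for %s: got %s, want %s", v.Name, got, v.SHA256)
	}
	return nil
}

func main() {
	binary := []byte("pretend devshardd binary")
	v := ApprovedVersion{
		Name:      "v2",
		BinaryURL: "https://example.invalid/devshardd", // placeholder URL
		SHA256:    hashHex(binary),
	}
	fmt.Println(verifyBinary(binary, v))
	fmt.Println(verifyBinary([]byte("tampered"), v) != nil)
}
```

Pinning the digest in governance-approved params means a host only runs a binary whose bytes match what was approved, regardless of what the download URL serves later.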
devshard
- `MsgFinishInference` transactions so the sequencer can pick them up from another host's mempool

decentralized-api
The changes in the `decentralized-api/` module are fully backward compatible and do not need to be activated before the next mainnet release.
- `GetRuntimeConfig` gRPC long-poll

inference-chain
The changes in the `inference-chain/` module are wire-compatible and do not need to be activated before the next mainnet release.
- `version` field to `state_root_and_protocol_version` in the devshard settlement message proto
- `DevshardEscrowParams`
- `create_devshard_fee` and `fee_per_nonce` to `DevshardEscrow` to snapshot active fees at escrow creation

deploy
Proposed Bounties