chore(upgrade): clean leftover state in v0.2.14#1287

Open
Ryanchen911 wants to merge 6 commits into
gonka-ai:upgrade-v0.2.14from
Ryanchen911:ryan/1223-v0.2.14-state-cleanup

Conversation

@Ryanchen911 Ryanchen911 commented Jun 1, 2026

Summary

  • clean leftover inference module state during the v0.2.14 upgrade
  • re-run legacy epoch-group, top miner, training, and PoC v2 cleanup paths idempotently
  • include the missing TrainingTaskKvRecordKeyPrefix in training cleanup

Closes #1223

Tests

  • go test -C /Users/chenjunying/gonka/inference-chain ./app/upgrades/v0_2_14
  • go test -C /Users/chenjunying/gonka/inference-chain ./app/upgrades/v0_2_12 ./x/inference/keeper

Identified leftovers

  • Legacy EpochGroupValidationsMap: replaced by per-inference EpochGroupValidationEntry in v0.2.11; this cleanup migrates any remaining current/previous epoch entries and clears the old aggregate map.
  • TopMiners: cleared in v0.2.12 and no longer used by live paths.
  • Training state: training task state is removed; cleanup now also includes the previously omitted TrainingTaskKvRecordKeyPrefix.
  • Legacy PoC v2 prefixes: replaced by model-aware prefixes 58/59/60 in v0.2.12; old raw prefixes are cleared idempotently.
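A minimal sketch of what "cleared idempotently" means for a prefix cleanup, using an in-memory map as a stand-in for the module KV store. The `memStore` type, the `clearPrefix` helper, and the prefix byte `0x10` are hypothetical and are not taken from this PR's code; the point is only that a second run over an already-clean store deletes nothing and leaves live keys alone.

```go
package main

import (
	"bytes"
	"fmt"
)

// memStore is a hypothetical stand-in for a module KV store.
type memStore map[string][]byte

// clearPrefix deletes every key that starts with prefix and returns how many
// keys it removed. Running it again on the same store returns 0.
func clearPrefix(s memStore, prefix []byte) int {
	// Collect first, then delete, so the sketch does not rely on
	// mutation-during-iteration semantics.
	var doomed []string
	for k := range s {
		if bytes.HasPrefix([]byte(k), prefix) {
			doomed = append(doomed, k)
		}
	}
	for _, k := range doomed {
		delete(s, k)
	}
	return len(doomed)
}

func main() {
	legacy := []byte{0x10} // made-up prefix byte, for illustration only
	s := memStore{
		"\x10a": []byte("legacy"),
		"\x10b": []byte("legacy"),
		"\x3aa": []byte("live"),
	}
	fmt.Println(clearPrefix(s, legacy)) // first run removes both legacy keys
	fmt.Println(clearPrefix(s, legacy)) // second run is a no-op
	fmt.Println(len(s))                 // the live key is untouched
}
```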

Copilot AI review requested due to automatic review settings June 1, 2026 03:56
@Ryanchen911 Ryanchen911 force-pushed the ryan/1223-v0.2.14-state-cleanup branch from 81a3dbc to b51d8b1 Compare June 1, 2026 04:04
@tcharchian tcharchian added this to the v0.2.14 milestone Jun 1, 2026
@tcharchian tcharchian requested a review from patimen June 1, 2026 21:58
@tcharchian tcharchian linked an issue Jun 1, 2026 that may be closed by this pull request
@patimen

patimen commented Jun 2, 2026

Collaborator

/run-integration

@gmorgachev

Contributor

@Ryanchen911

I think this task should include an analysis of how much state and state history each prefix uses; then we can check what can be removed. The state is still quite large, and we need to understand why.

Adds an offline `inferenced state-stats` command that reports per-store and
per-inference-prefix committed state size, with legacy prefixes flagged as
cleanup candidates. Backed by a StatePrefixCatalog single-source-of-truth that
maps every inference prefix to a readable name.

Addresses the review request to analyze state size by prefix before deciding
what to remove (issue gonka-ai#1223).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Ryanchen911

Author

@gmorgachev good point — agreed that we should drive removal from measured per-prefix size, not just remove the prefixes we already know are dead.

I added an offline analysis command for exactly this: inferenced state-stats (see docs/state-stats.md).

What it does:

  • opens a stopped node's application.db (or a restored snapshot), loads the latest committed height (or --height), and iterates every module KV store;
  • prints a per-store size summary (keys / key bytes / value bytes / total), so we can see which module dominates;
  • for the inference module, attributes every key to a named prefix via a new types.StatePrefixCatalog (single source of truth mapping each prefix in keys.go to a readable label) and flags legacy prefixes — the cleanup candidates;
  • --legacy-only and --top N to focus the view.
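The per-prefix attribution described in the list above can be sketched roughly as follows. This is not the PR's implementation: the `prefixInfo` type, the longest-prefix matching rule, and the prefix bytes are assumptions made for illustration, and a plain map stands in for the store iterator. Only the `<unmatched:0xNN>` label format is borrowed from the command's output.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// prefixInfo is a hypothetical catalog entry; the real StatePrefixCatalog may differ.
type prefixInfo struct {
	Name   string
	Prefix []byte
	Legacy bool
}

// stats accumulates the three raw columns of the report.
type stats struct {
	Keys, KeyBytes, ValueBytes int64
}

// attribute returns the name of the longest catalog prefix that matches key,
// or an "<unmatched:0xNN>" label built from the key's first byte.
func attribute(catalog []prefixInfo, key []byte) string {
	best := -1
	for i, p := range catalog {
		if bytes.HasPrefix(key, p.Prefix) &&
			(best < 0 || len(p.Prefix) > len(catalog[best].Prefix)) {
			best = i
		}
	}
	if best >= 0 {
		return catalog[best].Name
	}
	if len(key) == 0 {
		return "<unmatched:empty>"
	}
	return fmt.Sprintf("<unmatched:0x%02x>", key[0])
}

// tally walks every key/value pair once and groups sizes by attributed name.
func tally(catalog []prefixInfo, kv map[string][]byte) map[string]*stats {
	out := make(map[string]*stats)
	for k, v := range kv {
		name := attribute(catalog, []byte(k))
		s, ok := out[name]
		if !ok {
			s = &stats{}
			out[name] = s
		}
		s.Keys++
		s.KeyBytes += int64(len(k))
		s.ValueBytes += int64(len(v))
	}
	return out
}

func main() {
	catalog := []prefixInfo{
		{Name: "TopMiner", Prefix: []byte{0x07}, Legacy: true}, // made-up byte
		{Name: "Participants", Prefix: []byte{0x01}},           // made-up byte
	}
	legacy := make(map[string]bool)
	for _, p := range catalog {
		legacy[p.Name] = p.Legacy
	}
	kv := map[string][]byte{
		"\x07alice": []byte("old-record"),
		"\x01alice": []byte("p"),
		"\x01bob":   []byte("pp"),
		"\x48x":     []byte("?"),
	}
	result := tally(catalog, kv)
	names := make([]string, 0, len(result))
	for n := range result {
		names = append(names, n)
	}
	// Largest total first, ties broken by name for stable output.
	sort.Slice(names, func(i, j int) bool {
		a, b := result[names[i]], result[names[j]]
		ta, tb := a.KeyBytes+a.ValueBytes, b.KeyBytes+b.ValueBytes
		if ta != tb {
			return ta > tb
		}
		return names[i] < names[j]
	})
	for _, n := range names {
		s := result[n]
		fmt.Printf("%-18s keys=%d keyBytes=%d valueBytes=%d legacy=%v\n",
			n, s.Keys, s.KeyBytes, s.ValueBytes, legacy[n])
	}
}
```

Keeping the catalog as data rather than a switch statement is what lets one table drive both the readable labels and the legacy flag.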

So the workflow to answer "why is state big / what else can we drop":

  1. run state-stats on a current mainnet snapshot → see the biggest prefixes;
  2. anything large + legacy is already removed by this PR's v0.2.14 cleanup (EpochGroupValidations aggregate map, TopMiner, training state, legacy PoC v2);
  3. anything large + non-legacy that looks prunable becomes a follow-up cleanup task, decided from the numbers.

I don't have a mainnet DB locally, so I can't paste the actual breakdown here. If someone with access to a node/snapshot can run inferenced state-stats --home <stopped-node-home> and drop the output here, we can decide on scope: keep this PR as the known-legacy cleanup + analysis tooling, and open a separate issue for any newly-identified large prefixes.

@patimen

patimen commented Jun 9, 2026

Collaborator

/run-integration

@patimen

patimen commented Jun 9, 2026

Collaborator

@Ryanchen911 To run that, we'd need the new binary running on mainnet, wouldn't we? Couldn't we use inferenced export instead? Or is there a problem with that? It seems to have issues locally...

@Ryanchen911

Author

@patimen no mainnet deployment needed — state-stats is an offline, read-only command. It opens the DB exclusively, so you just run it once against a snapshot or a copy of a node's data dir (node stopped).

I think inferenced export won't answer Gleb's question, unfortunately:
export only emits the logical genesis each module's ExportGenesis writes (params, participants, models, bridge…). The leftover/index/cache prefixes we actually want to measure — TopMiner, training state, legacy PoC v2, the various indexes — are not in the export at all, so they'd be invisible.

If running the branch binary against a snapshot is too much friction for this PR, I think we can split it: merge the known-legacy cleanup now, and track the per-prefix size analysis (Gleb's ask) as a separate task where ops can run state-stats on a snapshot whenever convenient. Either way works.

@patimen

patimen commented Jun 12, 2026

Collaborator

I have run this on mainnet. There is no way we will be able to do this kind of large-scale pruning in an upgrade handler. Deletion is not cheap in Cosmos 0.53.3; by my estimate it will take a very long time (10-30 minutes) for each major component we need to clear.
We will need gradual pruning, either with our existing pruner (pruner.go) or something else, to remove state. For reference, here is what is taking up our state as of height 4520003:

Store Size Breakdown

| Store | Keys | Key Bytes | Value Bytes | Total Size |
| --- | ---: | ---: | ---: | ---: |
| inference | 78,179,591 | 8,174,257,889 | 17,398,241,842 | 23.8 GiB |
| bls | 42,230 | 1,698,869 | 623,477,291 | 596.2 MiB |
| acc | 2,275,754 | 47,806,043 | 121,778,453 | 161.7 MiB |
| staking | 54,829 | 878,207 | 100,549,500 | 96.7 MiB |
| bank | 2,251,747 | 64,189,672 | 1,218,968 | 62.4 MiB |
| authz | 188,724 | 15,835,938 | 26,864,448 | 40.7 MiB |
| group | 303,560 | 11,812,693 | 11,910,734 | 22.6 MiB |
| streamvesting | 2,169 | 45,529 | 5,471,088 | 5.3 MiB |
| wasm | 20,995 | 1,298,958 | 1,293,529 | 2.5 MiB |
| feegrant | 12,971 | 745,881 | 1,134,762 | 1.8 MiB |
| slashing | 6,263 | 144,245 | 473,258 | 603.0 KiB |
| distribution | 13,176 | 395,506 | 111,319 | 494.9 KiB |
| gov | 98 | 1,281 | 320,722 | 314.5 KiB |
| ibc | 3,098 | 178,254 | 106,744 | 278.3 KiB |
| collateral | 3,720 | 106,730 | 8,249 | 112.3 KiB |
| evidence | 45 | 1,485 | 5,175 | 6.5 KiB |
| genesistransfer | 38 | 1,349 | 4,777 | 6.0 KiB |
| capability | 11 | 245 | 893 | 1.1 KiB |
| transfer | 13 | 480 | 402 | 882 B |
| upgrade | 48 | 495 | 300 | 795 B |
| icahost | 4 | 238 | 79 | 317 B |
| mint | 2 | 2 | 149 | 151 B |
| consensus | 1 | 9 | 49 | 58 B |
| restrictions | 2 | 38 | 5 | 43 B |
| crisis | 1 | 1 | 14 | 15 B |
| bookkeeper | 1 | 12 | 0 | 12 B |
| icacontroller | 1 | 6 | 2 | 8 B |
| params | 0 | 0 | 0 | 0 B |
| nft | 0 | 0 | 0 | 0 B |
| circuit | 0 | 0 | 0 | 0 B |
| feeibc | 0 | 0 | 0 | 0 B |
| TOTAL | 83,359,092 | 8,319,400,055 | 18,292,972,752 | 24.8 GiB |
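
The Total Size column appears to be key bytes plus value bytes, rendered in binary units with one decimal place. A small sketch that reproduces a few rows under that assumption; `humanSize` is a hypothetical helper and not necessarily how state-stats formats its output:

```go
package main

import "fmt"

// humanSize renders a byte count in binary units (KiB, MiB, GiB, TiB)
// with one decimal place; counts below 1024 are printed as plain bytes.
func humanSize(b int64) string {
	const unit = 1024
	if b < unit {
		return fmt.Sprintf("%d B", b)
	}
	units := []string{"KiB", "MiB", "GiB", "TiB"}
	v := float64(b) / unit
	i := 0
	for v >= unit && i < len(units)-1 {
		v /= unit
		i++
	}
	return fmt.Sprintf("%.1f %s", v, units[i])
}

func main() {
	// Key bytes + value bytes, taken from the store breakdown rows.
	fmt.Println(humanSize(8174257889 + 17398241842)) // inference: 23.8 GiB
	fmt.Println(humanSize(1698869 + 623477291))      // bls: 596.2 MiB
	fmt.Println(humanSize(144245 + 473258))          // slashing: 603.0 KiB
	fmt.Println(humanSize(480 + 402))                // transfer: 882 B
}
```

By the same arithmetic, the inference store's 25,572,499,731 bytes account for roughly 96% of the 26,612,372,807-byte total.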

Inference Prefix Breakdown

| Prefix | Keys | Key Bytes | Value Bytes | Total Size |
| --- | ---: | ---: | ---: | ---: |
| PoCBatch | 14,774,601 | 975,123,666 | 5,740,211,563 | 6.3 GiB |
| StatsDevelopersByInferenceAndModel | 14,367,831 | 2,134,050,738 | 2,210,474,926 | 4.0 GiB |
| InferenceValidationDetails | 14,957,767 | 1,450,903,399 | 2,874,536,849 | 4.0 GiB |
| StatsDevelopersByTime | 10,167,512 | 1,739,727,404 | 1,457,006,992 | 3.0 GiB |
| StatsDevelopersByInference | 10,167,512 | 1,159,096,368 | 1,526,209,652 | 2.5 GiB |
| PoCValidation | 12,749,190 | 637,459,500 | 1,627,479,855 | 2.1 GiB |
| StatsDevelopersByEpoch | 1,606 | 123,618 | 991,613,039 | 945.8 MiB |
| Inferences | 638,653 | 56,840,117 | 872,791,024 | 886.6 MiB |
| PoCValidationV2 | 177,533 | 15,064,581 | 24,128,878 | 37.4 MiB |
| EpochGroupData | 1,132 | 32,622 | 29,671,320 | 28.3 MiB |
| PreservedNodesSnapshot | 293 | 9,929 | 22,246,426 | 21.2 MiB |
| RandomSeed | 67,802 | 1,966,258 | 12,177,327 | 13.5 MiB |
| EpochPerformanceSummary | 74,123 | 2,223,690 | 4,330,514 | 6.3 MiB |
| `<unmatched:0x48>` | 4,393 | 311,903 | 1,630,109 | 1.9 MiB |
| Participants | 6,867 | 144,207 | 1,274,964 | 1.4 MiB |
| MLNodeWeightDistribution | 6,814 | 418,296 | 946,122 | 1.3 MiB |
| PoCV2StoreCommit | 6,814 | 418,296 | 863,658 | 1.2 MiB |
| ExcludedParticipants | 3,728 | 108,112 | 293,006 | 391.7 KiB |
| DevshardEscrows | 264 | 2,376 | 235,290 | 232.1 KiB |
| BridgeTransactionValidators | 1,365 | 120,120 | 0 | 117.3 KiB |
| ConfirmationPoCEvents | 602 | 10,234 | 49,379 | 58.2 KiB |
| PoCDelegation | 227 | 16,122 | 27,018 | 42.1 KiB |
| ParticipantAllowList | 1,793 | 37,653 | 0 | 36.8 KiB |
| InferencesToPrune | 293 | 28,421 | 0 | 27.8 KiB |
| BridgeMintRefunds | 86 | 5,590 | 13,526 | 18.7 KiB |
| DelegationSnapshot | 1 | 1 | 11,650 | 11.4 KiB |
| BridgeTransactions | 32 | 1,376 | 7,888 | 9.0 KiB |
| Epochs | 293 | 2,637 | 2,073 | 4.6 KiB |
| DevshardEscrowsByEpoch | 264 | 4,488 | 0 | 4.4 KiB |
| Params | 1 | 11 | 3,441 | 3.4 KiB |
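
One possible shape for the gradual pruning mentioned above is a bounded batch of deletions per block. The sketch below is hypothetical and is not the logic in pruner.go: `memStore`, `pruneBatch`, the prefix byte, and the per-block limit are all made up for illustration, and a real implementation would stop iterating after `limit` keys instead of scanning the whole store.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// memStore is a hypothetical stand-in for a module KV store.
type memStore map[string][]byte

// pruneBatch deletes at most limit keys under prefix, in sorted key order,
// and reports how many it removed and whether the prefix is now empty.
func pruneBatch(s memStore, prefix []byte, limit int) (deleted int, done bool) {
	if limit < 0 {
		limit = 0
	}
	var matches []string
	for k := range s {
		if bytes.HasPrefix([]byte(k), prefix) {
			matches = append(matches, k)
		}
	}
	// Sorted order keeps the choice of keys deterministic across runs.
	sort.Strings(matches)
	total := len(matches)
	if total > limit {
		matches = matches[:limit]
	}
	for _, k := range matches {
		delete(s, k)
	}
	return len(matches), len(matches) == total
}

func main() {
	s := memStore{}
	for i := 0; i < 10; i++ {
		s[string([]byte{0x10, byte(i)})] = []byte("legacy") // made-up prefix byte
	}
	s["\x20live"] = []byte("keep")

	// Each loop iteration plays the role of one block.
	for block := 1; ; block++ {
		n, done := pruneBatch(s, []byte{0x10}, 4)
		fmt.Printf("block %d: deleted %d, done=%v\n", block, n, done)
		if done {
			break
		}
	}
}
```

As a rough scale check using the table: at an assumed budget of 10,000 deletions per block, the 14,774,601 PoCBatch keys would take about 1,478 blocks to clear.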

Development

Successfully merging this pull request may close these issues.

[P1] Clean up the state

4 participants