Summary
We got this error when running a scheduled upgrade for a node. Our CosmosFullNode CRD contains the following in spec.chain.versions:
```yaml
versions:
  # Genesis version
  - height: 0
    image: ghcr.io/mantra-chain/mantrachain:v1.0.3
  - height: 3103100
    image: ghcr.io/mantra-chain/mantrachain:v2.0.3
  - height: 3833000
    image: ghcr.io/mantra-chain/mantrachain:v3.0.3
  - height: 4428500
    image: ghcr.io/mantra-chain/mantrachain:v4.0.0
```

Then, during the upgrade, a small portion of the nodes got this error:
```
panic: error loading last version: failed to load latest version: version of store icahost mismatch root store's version; expected 4428499 got 0
```
After some investigation, we found that the node was deleted before it could produce the upgrade-info.json file. upgrade-info.json is dumped when v3.0.3 runs the preblock of height 4428500, which is implemented here. Later, most chains' upgrade binaries read upgrade-info.json to configure SetStoreLoader, as in noble's implementation.
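For context, this is roughly what that wiring looks like in a typical Cosmos SDK app constructor. This is a sketch of the common pattern, not mantrachain's or noble's exact code; upgradeName and the added store key are illustrative, and the import paths (upgradetypes from cosmossdk.io/x/upgrade/types, storetypes from cosmossdk.io/store/types, and the ibc-go icahost types package) vary by SDK version:

```go
// Common Cosmos SDK upgrade store-loader pattern (illustrative excerpt
// from an app constructor).
upgradeInfo, err := app.UpgradeKeeper.ReadUpgradeInfoFromDisk()
if err != nil {
	panic(err)
}
if upgradeInfo.Name == upgradeName && !app.UpgradeKeeper.IsSkipHeight(upgradeInfo.Height) {
	storeUpgrades := storetypes.StoreUpgrades{
		Added: []string{icahosttypes.StoreKey}, // stores introduced by the upgrade
	}
	// If upgrade-info.json is missing on disk, this loader is never
	// installed, the new store is opened at version 0, and the node panics
	// with the "version of store icahost mismatch" error shown above.
	app.SetStoreLoader(upgradetypes.UpgradeStoreLoader(upgradeInfo.Height, &storeUpgrades))
}
```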
Bug in cosmos-operator
There is a cache controller that runs every 5s and grabs the node's latest height via RPC. CosmosFullNodeReconciler later consumes this cache to update the status of the CosmosFullNode CRD every 60s. In a rare case, the following sequence occurs:

1. The node has just completed the upgrade_height-1 commit, and the cache controller happens to grab exactly this height from the pod.
2. Coincidentally, CosmosFullNodeReconciler starts its next reconcile, which updates the node's status to the latest cached height.
3. The change to the CRD's status immediately triggers another reconcile.
4. In this second reconcile, pod_control deletes the pod because the pod's sync height in the status is equal to crd.Spec.ChainSpec.Versions.UpgradeHeight.

As a result, the pod is deleted before the old binary can run the preblock of UpgradeHeight and produce the upgrade-info.json file, which the next version of the node needs in order to start up correctly.
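To illustrate the timing hazard, here is a minimal runnable sketch, assuming the operator recreates a pod once its recorded height reaches the configured upgrade height. The function name and check are hypothetical stand-ins; the real pod_control logic is more involved:

```go
package main

import "fmt"

// shouldRecreateWithNewImage is a hypothetical stand-in for the operator's
// version check, not its actual code.
func shouldRecreateWithNewImage(statusHeight, upgradeHeight uint64) bool {
	return statusHeight >= upgradeHeight
}

func main() {
	const upgradeHeight = 4428500

	// The old binary must still attempt block 4428500: its preblock is what
	// dumps upgrade-info.json before halting. If the reconciler acts on a
	// status captured right at this boundary, the pod is deleted before the
	// file exists.
	fmt.Println(shouldRecreateWithNewImage(upgradeHeight, upgradeHeight)) // true: deleted too early
}
```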
In our case, only a small portion of the nodes hit this error because it requires all of the above events to line up with this specific timing, leading to the premature deletion of the pod.
Solution
I propose to add a default 5s delay in updateStatus only if version.UpgradeHeight == status.Height for that pod. This will help ensure that the pod has enough time to produce the upgrade-info.json file before any deletion occurs.
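A minimal sketch of the proposed guard, assuming updateStatus is the point where the cached height is written into the CRD status. The package, type, and function names are illustrative, not the operator's actual API:

```go
package fullnode

import "time"

// Status mirrors the per-pod sync height stored on the CRD (illustrative).
type Status struct {
	Height uint64
}

const upgradeStatusDelay = 5 * time.Second

// updateStatus sketches the proposed change: hold back the status write
// only when the reported height has reached the configured upgrade height.
func updateStatus(s *Status, height, upgradeHeight uint64) {
	if height == upgradeHeight {
		// Give the old binary time to run the upgrade-height preblock and
		// dump upgrade-info.json before the reconciler reacts to the new
		// height and deletes the pod.
		time.Sleep(upgradeStatusDelay)
	}
	s.Height = height
}
```

Alternatively, a similar effect could likely be achieved without blocking the reconcile loop by requeueing with controller-runtime's ctrl.Result{RequeueAfter: ...}, if that fits the operator's design better.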