Critical Bug: pod could potentially be deleted before it can produce upgrade-info.json for upgrade #490

@allthatjazzleo

Summary

We hit an error while running a scheduled upgrade for a node.

Our CosmosFullNode CRD has the following in spec.chain.versions:

    versions:
      # Genesis version
      - height: 0
        image: ghcr.io/mantra-chain/mantrachain:v1.0.3
      - height: 3103100
        image: ghcr.io/mantra-chain/mantrachain:v2.0.3
      - height: 3833000
        image: ghcr.io/mantra-chain/mantrachain:v3.0.3
      - height: 4428500
        image: ghcr.io/mantra-chain/mantrachain:v4.0.0

Then, during the upgrade, a small portion of the nodes failed with this panic:

panic: error loading last version: failed to load latest version: version of store icahost mismatch root store's version; expected 4428499 got 0 

After some investigation, we found that the pod was deleted before the node could produce the upgrade-info.json file. This file is dumped when v3.0.3 tries to run the preblock of 4428500, which is implemented here. Later, the upgrade binary of most chains reads upgrade-info.json for SetStoreLoader, as in noble's implementation.

Bug in cosmos-operator

A cache controller runs every 5s and fetches the latest height of each node via RPC. The CosmosFullNodeReconciler later consumes this value to update the status of the CosmosFullNode CRD every 60s. In rare cases, the node has just committed upgrade_height-1 and the cache controller picks up that height from the pod. Coincidentally, the CosmosFullNodeReconciler then starts its next reconcile and writes the latest height into the status. This change to the CRD's status immediately triggers another reconcile. In that second reconcile, pod_control deletes the pod because the sync height in the pod's status equals crd.Spec.ChainSpec.Versions.UpgradeHeight.

As a result, the pod is deleted before the old binary can run the preblock of UpgradeHeight and produce the upgrade-info.json file. The next version of the node needs that file to start up correctly.
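The core of the problem can be sketched in a few lines of Go. This is only an illustration of the trigger described above, not the operator's actual code: `Version`, `PodState`, and `shouldReplacePod` are hypothetical names, and the shortened `mantrachain:v3.0.3` image string is a placeholder. The point is that the decision depends only on height and image, while the state of upgrade-info.json on disk is invisible to it.

```go
package main

import "fmt"

// Version is a simplified, hypothetical stand-in for one entry of
// spec.chain.versions.
type Version struct {
	UpgradeHeight uint64
	Image         string
}

// PodState is a hypothetical view of a pod. StatusHeight and RunningImage
// are what the reconciler can see; UpgradeInfoWritten is what it cannot see.
type PodState struct {
	StatusHeight       uint64
	RunningImage       string
	UpgradeInfoWritten bool
}

// shouldReplacePod mirrors the trigger described in this issue: the pod is
// replaced once the height in status equals the version's UpgradeHeight and
// the pod still runs a different image. UpgradeInfoWritten is never consulted.
func shouldReplacePod(v Version, p PodState) bool {
	return p.StatusHeight == v.UpgradeHeight && p.RunningImage != v.Image
}

func main() {
	upgrade := Version{UpgradeHeight: 4428500, Image: "ghcr.io/mantra-chain/mantrachain:v4.0.0"}

	// The old binary has not yet run the preblock, so no upgrade-info.json.
	pod := PodState{
		StatusHeight:       4428500,
		RunningImage:       "mantrachain:v3.0.3",
		UpgradeInfoWritten: false,
	}

	fmt.Println("replace pod:", shouldReplacePod(upgrade, pod))
	fmt.Println("upgrade-info.json present:", pod.UpgradeInfoWritten)
}
```

With these inputs the pod is marked for replacement even though the file is missing, which is the state that leads to the panic on the new binary.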

In our case, only a small portion of the nodes got this error because it required a specific timing of all above events to occur, leading to the premature deletion of the pod.

Solution

I propose adding a default 5s delay in updateStatus, applied only if version.UpgradeHeight == status.Height for that pod. This gives the pod enough time to produce the upgrade-info.json file before any deletion occurs.
