Summary
We got this error when running a scheduled upgrade for a node. Our CosmosFullNode CRD contains the following in spec.chain.versions:
```yaml
versions:
  # Genesis version
  - height: 0
    image: ghcr.io/mantra-chain/mantrachain:v1.0.3
  - height: 3103100
    image: ghcr.io/mantra-chain/mantrachain:v2.0.3
  - height: 3833000
    image: ghcr.io/mantra-chain/mantrachain:v3.0.3
  - height: 4428500
    image: ghcr.io/mantra-chain/mantrachain:v4.0.0
```

Then, during the upgrade, a small portion of the nodes got this error:
```
panic: error loading last version: failed to load latest version: version of store icahost mismatch root store's version; expected 4428499 got 0
```
After some investigation, we found that the node was deleted before it could produce the upgrade-info.json file. upgrade-info.json is dumped when v3.0.3 runs the preblock of height 4428500, which is implemented here. Later, most chains' upgrade binaries read upgrade-info.json to configure SetStoreLoader, as in noble's implementation.
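For context, this is roughly what that wiring looks like in a typical Cosmos SDK app constructor. This is a sketch of the common pattern, not mantrachain's or noble's exact code; upgradeName and the added store key are illustrative, and the import paths (upgradetypes from cosmossdk.io/x/upgrade/types, storetypes from cosmossdk.io/store/types, and the ibc-go icahost types package) vary by SDK version:

```go
// Common Cosmos SDK upgrade store-loader pattern (illustrative excerpt
// from an app constructor).
upgradeInfo, err := app.UpgradeKeeper.ReadUpgradeInfoFromDisk()
if err != nil {
	panic(err)
}
if upgradeInfo.Name == upgradeName && !app.UpgradeKeeper.IsSkipHeight(upgradeInfo.Height) {
	storeUpgrades := storetypes.StoreUpgrades{
		Added: []string{icahosttypes.StoreKey}, // stores introduced by the upgrade
	}
	// If upgrade-info.json is missing on disk, this loader is never
	// installed, the new store is opened at version 0, and the node panics
	// with the "version of store icahost mismatch" error shown above.
	app.SetStoreLoader(upgradetypes.UpgradeStoreLoader(upgradeInfo.Height, &storeUpgrades))
}
```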
Bug in cosmos-operator
There is a cache controller that runs every 5s and grabs the node's latest height via RPC. CosmosFullNodeReconciler later consumes this cache to update the status of the CosmosFullNode CRD every 60s. In a rare case, the following sequence occurs:

1. The node has just completed the upgrade_height-1 commit, and the cache controller happens to grab exactly this height from the pod.
2. Coincidentally, CosmosFullNodeReconciler starts its next reconcile, which updates the node's status to the latest cached height.
3. The change to the CRD's status immediately triggers another reconcile.
4. In this second reconcile, pod_control deletes the pod because the pod's sync height in the status is equal to crd.Spec.ChainSpec.Versions.UpgradeHeight.

As a result, the pod is deleted before the old binary can run the preblock of UpgradeHeight and produce the upgrade-info.json file, which the next version of the node needs in order to start up correctly.
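To illustrate the timing hazard, here is a minimal runnable sketch, assuming the operator recreates a pod once its recorded height reaches the configured upgrade height. The function name and check are hypothetical stand-ins; the real pod_control logic is more involved:

```go
package main

import "fmt"

// shouldRecreateWithNewImage is a hypothetical stand-in for the operator's
// version check, not its actual code.
func shouldRecreateWithNewImage(statusHeight, upgradeHeight uint64) bool {
	return statusHeight >= upgradeHeight
}

func main() {
	const upgradeHeight = 4428500

	// The old binary must still attempt block 4428500: its preblock is what
	// dumps upgrade-info.json before halting. If the reconciler acts on a
	// status captured right at this boundary, the pod is deleted before the
	// file exists.
	fmt.Println(shouldRecreateWithNewImage(upgradeHeight, upgradeHeight)) // true: deleted too early
}
```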
In our case, only a small portion of the nodes hit this error because it requires all of the above events to line up with this specific timing, leading to the premature deletion of the pod.
Solution
I propose to add a default 5s delay in updateStatus only if version.UpgradeHeight == status.Height for that pod. This will help ensure that the pod has enough time to produce the upgrade-info.json file before any deletion occurs.
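A minimal sketch of the proposed guard, assuming updateStatus is the point where the cached height is written into the CRD status. The package, type, and function names are illustrative, not the operator's actual API:

```go
package fullnode

import "time"

// Status mirrors the per-pod sync height stored on the CRD (illustrative).
type Status struct {
	Height uint64
}

const upgradeStatusDelay = 5 * time.Second

// updateStatus sketches the proposed change: hold back the status write
// only when the reported height has reached the configured upgrade height.
func updateStatus(s *Status, height, upgradeHeight uint64) {
	if height == upgradeHeight {
		// Give the old binary time to run the upgrade-height preblock and
		// dump upgrade-info.json before the reconciler reacts to the new
		// height and deletes the pod.
		time.Sleep(upgradeStatusDelay)
	}
	s.Height = height
}
```

Alternatively, a similar effect could likely be achieved without blocking the reconcile loop by requeueing with controller-runtime's ctrl.Result{RequeueAfter: ...}, if that fits the operator's design better.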