
RKE2 datastore bootstrap extract may fail due to copying in-use etcd db files #9427

@brandond


Environmental Info:
RKE2 Version: n/a

Node(s) CPU architecture, OS, and Version:
n/a

Cluster Configuration:
n/a

Describe the bug:

When RKE2 starts up, it copies the etcd DB files and uses the copy to start a temporary single-node etcd cluster with TLS disabled. It does this to extract the bootstrap data, which is needed to talk to the 'normal' etcd that runs with TLS enabled.

On K3s this works fine: etcd runs in the main k3s process, so it can never be running if k3s is not running. On RKE2, however, the etcd pod may still be running, which means the copied etcd db files may contain unsynced data. That causes the temporary etcd startup to fail:

Nov  7 21:56:46 vraldap1359644 rke2[3335886]: {"msg":"opened backend db","path":"/var/lib/rancher/rke2/server/db/etcd-tmp/member/snap/db","took":"1.963400582s"}
Nov  7 21:56:48 vraldap1359644 rke2[3335886]: {"msg":"recovered v2 store from snapshot","snapshot-index":113352078,"snapshot-size":"15 kB"}
Nov  7 21:56:48 vraldap1359644 rke2[3335886]: {"msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":113352078,"snapshot-file-path":"/var/lib/rancher/rke2/server/db/etcd-tmp/member/snap/0000000006c19d8e.snap.db","error":"snap: snapshot file doesn't exist"}   
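
For reading the last log line: the missing file name appears to be the snapshot index written as 16 zero-padded hex digits, followed by `.snap.db`. The short sketch below shows that mapping using the values from the log. The naming rule is inferred from this log output rather than taken from etcd's source, and the helper names are hypothetical.

```python
def snapshot_db_name(snapshot_index: int) -> str:
    """Render a snapshot index the way the file name in the log appears to be built.

    Assumption: 16 zero-padded lowercase hex digits plus ".snap.db",
    inferred from the log lines above.
    """
    return f"{snapshot_index:016x}.snap.db"


def snapshot_index_from_name(file_name: str) -> int:
    """Recover the decimal snapshot index from a '<hex>.snap.db' file name."""
    hex_part = file_name.split(".", 1)[0]
    return int(hex_part, 16)


if __name__ == "__main__":
    # Values copied from the log lines in this issue.
    print(snapshot_db_name(113352078))
    print(snapshot_index_from_name("0000000006c19d8e.snap.db"))
```

So the index reported by "recovered v2 store from snapshot" and the file reported as missing refer to the same snapshot.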

Steps To Reproduce:
This could happen any time rke2 is restarted, but it seems to be more likely to reproduce during cluster upgrades.
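
When trying to reproduce this, it can help to confirm that the etcd db file really was in use at the moment rke2 restarted. The sketch below is a rough diagnostic, not RKE2's own logic and not a proposed fix. It assumes Linux, and it assumes that the running etcd (through its bbolt backend) holds an exclusive `flock` on the db file. The function name is hypothetical, and the demo uses a temporary file rather than a real etcd data directory.

```python
import fcntl
import os
import tempfile


def is_locked_by_another_opener(path: str) -> bool:
    """Return True if a shared flock on `path` cannot be taken right now.

    Assumption: a live etcd holds an exclusive flock on its db file, so a
    failed non-blocking shared lock suggests the file is still in use.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)


def demo() -> tuple:
    """Simulate an 'etcd' holder on a temp file and probe before, during, after."""
    with tempfile.NamedTemporaryFile() as tmp:
        before = is_locked_by_another_opener(tmp.name)

        # Stand-in for the running etcd: a second open that takes an exclusive lock.
        holder = os.open(tmp.name, os.O_RDWR)
        fcntl.flock(holder, fcntl.LOCK_EX)
        during = is_locked_by_another_opener(tmp.name)

        os.close(holder)  # closing the descriptor releases its lock
        after = is_locked_by_another_opener(tmp.name)
        return (before, during, after)


if __name__ == "__main__":
    print(demo())
```

On a real node the probe would be pointed at the live db file, presumably the `etcd` sibling of the `etcd-tmp` path in the log above (an assumption). The probe briefly holds a shared lock itself, so it is best used only for manual debugging.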

Expected behavior:
RKE2 is consistently able to start up without errors from temporary etcd.

Actual behavior:
RKE2 sometimes fails to start with missing snapshot errors.

Additional context / logs:
SURE-11067
