
Commit 85d69cc

fix: add v1.4.1 known issue: upgrade stuck in "Waiting Reboot" state (#715)
* fix: add v1.4.1 known issue: upgrade stuck in "Waiting Reboot" state

  Related issue: harvester/harvester#7457

  Signed-off-by: Tim Serong <[email protected]>
  Co-authored-by: Jillian Maroket <[email protected]>
1 parent b1ae2f9 commit 85d69cc

File tree

2 files changed: +326 -0 lines changed

docs/upgrade/v1-4-0-to-v1-4-1.md

+163
@@ -14,6 +14,28 @@ An **Upgrade** button appears on the **Dashboard** screen whenever a new Harvest

For air-gapped environments, see [Prepare an air-gapped upgrade](./automatic.md#prepare-an-air-gapped-upgrade).

:::info important

Check the disk usage of the operating system images on each node before starting the upgrade. To do this, access the node via SSH and run the command `du -sh /run/initramfs/cos-state/cOS/*`.

Example:

```
# du -sh /run/initramfs/cos-state/cOS/*
1.7G    /run/initramfs/cos-state/cOS/active.img
3.1G    /run/initramfs/cos-state/cOS/passive.img
```

If `passive.img` (which represents the previously installed Harvester v1.4.0 image) consumes 3.1G of disk space, run the following commands using the root account:

```
# mount -o remount,rw /run/initramfs/cos-state
# fallocate --dig-holes /run/initramfs/cos-state/cOS/passive.img
# mount -o remount,ro /run/initramfs/cos-state
```

`passive.img` is converted to a sparse file, which should then consume only 1.7G of disk space (the same as `active.img`). This ensures that each node has enough free space, preventing the upgrade process from becoming [stuck in the "Waiting Reboot" state](#3-upgrade-is-stuck-in-the-waiting-reboot-state).
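
To verify the result, you can compare the file's apparent size with the blocks it actually occupies (a quick check assuming GNU coreutils `du`; the figures shown are illustrative):

```
# du -h --apparent-size /run/initramfs/cos-state/cOS/passive.img
3.0G    /run/initramfs/cos-state/cOS/passive.img
# du -h /run/initramfs/cos-state/cOS/passive.img
1.7G    /run/initramfs/cos-state/cOS/passive.img
```

The apparent size is unchanged, but the actual disk usage drops once the holes are dug.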

:::

### Update Harvester UI Extension on Rancher v2.10.1

@@ -144,3 +166,144 @@ You can perform any of the following workarounds:

![Upgrade with another default storage class workaround](/img/v1.4/upgrade/upgrade-with-another-default-storage-class-workaround.png)

For more information, see [Issue #7375](https://github.com/harvester/harvester/issues/7375).

### 3. Upgrade is stuck in the "Waiting Reboot" state

The upgrade process may become stuck in the "Waiting Reboot" state after the Harvester v1.4.1 image is installed on a node and a reboot is initiated. At this point, the upgrade controller checks whether the node is running the Harvester v1.4.1 operating system.
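
To see what the controller is waiting for, you can inspect the Upgrade custom resource from a machine with cluster access (a diagnostic sketch; the CRD name and namespace below are the ones Harvester normally uses, and `<upgrade-name>` is a placeholder):

```
# kubectl -n harvester-system get upgrades.harvesterhci.io
# kubectl -n harvester-system describe upgrades.harvesterhci.io <upgrade-name>
```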

If the Harvester v1.4.1 image (hereafter referred to as `active.img`) fails to boot for any reason, the node automatically restarts in fallback mode and boots the previously installed Harvester v1.4.0 image (hereafter referred to as `passive.img`). The upgrade controller is unable to detect the expected operating system, so the upgrade remains stuck until an administrator fixes the problem with `active.img`.

`active.img` can become corrupted and unbootable because of insufficient disk space in the COS_STATE partition during the upgrade. This occurs if Harvester v1.4.0 was originally installed on the node and the system was configured to use a separate data disk. The issue does not occur in the following situations:

- The system has a single disk that is shared by the operating system and data.
- An earlier Harvester version was originally installed and then later upgraded to v1.4.0.

To check if the issue exists in your environment, perform the following steps:

1. Access the node via SSH and log in using the root account.

1. Run the commands `cat /proc/cmdline` and `head -n1 /etc/harvester-release.yaml`.

   Example:

   ```
   # cat /proc/cmdline
   BOOT_IMAGE=(loop0)/boot/vmlinuz console=tty1 root=LABEL=COS_STATE cos-img/filename=/cOS/passive.img panic=0 net.ifnames=1 rd.cos.oemlabel=COS_OEM rd.cos.mount=LABEL=COS_OEM:/oem rd.cos.mount=LABEL=COS_PERSISTENT:/usr/local rd.cos.oemtimeout=120 audit=1 audit_backlog_limit=8192 intel_iommu=on amd_iommu=on iommu=pt multipath=off upgrade_failure

   # head -n1 /etc/harvester-release.yaml
   harvester: v1.4.0
   ```

   The presence of `cos-img/filename=/cOS/passive.img` and `upgrade_failure` in the output indicates that the system booted into fallback mode. The Harvester version in `/etc/harvester-release.yaml` confirms that the system is currently using the v1.4.0 image.

1. Check if `active.img` is corrupted by running the command `fsck.ext2 -nf /run/initramfs/cos-state/cOS/active.img`.

   Example:

   ```
   # fsck.ext2 -nf /run/initramfs/cos-state/cOS/active.img
   e2fsck 1.46.4 (18-Aug-2021)
   Pass 1: Checking inodes, blocks, and sizes
   Pass 2: Checking directory structure

   [...a list of various errors may appear here...]

   e2fsck: aborted

   COS_ACTIVE: ********** WARNING: Filesystem still has errors **********
   ```

1. Check the partition sizes by running the command `lsblk -o NAME,LABEL,SIZE`.

   Example:

   ```
   # lsblk -o NAME,LABEL,SIZE
   NAME   LABEL            SIZE
   loop0  COS_ACTIVE         3G
   sr0                     1024M
   vda                      250G
   ├─vda1 COS_GRUB           64M
   ├─vda2 COS_OEM            64M
   ├─vda3 COS_RECOVERY        4G
   ├─vda4 COS_STATE           8G
   └─vda5 COS_PERSISTENT   237.9G
   vdb    HARV_LH_DEFAULT   128G
   ```

   The output in the example shows a COS_STATE partition that is 8G in size. In this specific case, which involves an unsuccessful upgrade attempt and a corrupted `active.img`, the partition likely did not have enough free space for the upgrade to succeed.
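
   The arithmetic is tight even in the normal case: a non-sparse `passive.img` (3.1G), `active.img` (1.7G), and the roughly 3G image written during the upgrade together nearly exhaust the 8G partition. You can check how much space is actually left (the `Avail` column) with:

   ```
   # df -h /run/initramfs/cos-state
   ```

   If substantially less than about 3G is available, a retried upgrade is likely to fail the same way until space is freed.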

To fix the issue, perform the following steps:
1. If your cluster has two or more nodes, access the remaining nodes via SSH and check the disk usage of `active.img` and `passive.img`.
239+
```
240+
# du -sh /run/initramfs/cos-state/cOS/*
241+
1.7G /run/initramfs/cos-state/cOS/active.img
242+
3.1G /run/initramfs/cos-state/cOS/passive.img
243+
```
244+
If `passive.img` consumes 3.1G of disk space, run the following commands using the root account:
245+
```
246+
# mount -o remount,rw /run/initramfs/cos-state
247+
# fallocate --dig-holes /run/initramfs/cos-state/cOS/passive.img
248+
# mount -o remount,ro /run/initramfs/cos-state
249+
```
250+
`passive.img` is converted to a sparse file, which should only consume 1.7G of disk space (the same as `active.img`). This ensures that the other nodes have enough free space, preventing the upgrade process from becoming stuck again.

1. Access the stuck node via SSH, and then run the following commands using the root account:

   ```
   # mount -o remount,rw /run/initramfs/cos-state
   # cp /run/initramfs/cos-state/cOS/passive.img \
        /run/initramfs/cos-state/cOS/active.img
   # tune2fs -L COS_ACTIVE /run/initramfs/cos-state/cOS/active.img
   # mount -o remount,ro /run/initramfs/cos-state
   ```

   The existing (clean) `passive.img` is copied over the corrupted `active.img`, and the filesystem label is set back to `COS_ACTIVE`.
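
   Optionally, you can rerun the check from the diagnosis steps to confirm that the copied image is clean; it should now complete without reporting errors:

   ```
   # fsck.ext2 -nf /run/initramfs/cos-state/cOS/active.img
   ```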

1. Reboot the stuck node, and then select the first entry ("Harvester v1.4.1") on the GRUB boot screen.

   The GRUB boot screen initially highlights "Harvester v1.4.1 (fallback)" by default. Despite the displayed version, the system boots into Harvester v1.4.0, because `active.img` now contains a copy of the v1.4.0 image.

1. Copy `rootfs.squashfs` from the Harvester v1.4.1 ISO to a convenient location on the stuck node.

   The ISO can be mounted either on the stuck node or on another system. You can copy the file using the `scp` command, as in the sketch below.
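
   A minimal sketch of one way to do this from another machine (the ISO filename, node address, and destination path are placeholders; use `find` if the file is not at the ISO root):

   ```
   # mount -o loop harvester-v1.4.1-amd64.iso /mnt
   # find /mnt -name rootfs.squashfs
   # scp /mnt/rootfs.squashfs rancher@<stuck-node>:/tmp/rootfs.squashfs
   # umount /mnt
   ```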

1. Access the stuck node via SSH, and then run the following commands using the root account:

   ```
   # mkdir /tmp/manual-os-upgrade
   # mkdir /tmp/manual-os-upgrade/config
   # mkdir /tmp/manual-os-upgrade/rootfs
   # mount -o loop rootfs.squashfs /tmp/manual-os-upgrade/rootfs
   # cat > /tmp/manual-os-upgrade/config/config.yaml <<EOF
   upgrade:
     system:
       size: 3072
   EOF
   # elemental upgrade \
       --logfile /tmp/manual-os-upgrade/upgrade.log \
       --directory /tmp/manual-os-upgrade/rootfs \
       --config-dir /tmp/manual-os-upgrade/config \
       --debug
   ```

   :::note

   You must replace `rootfs.squashfs` in the `mount` command (the fourth command above) with the actual path of the copied `rootfs.squashfs` file.

   :::

   A new (clean) `active.img` is generated based on the root image from the Harvester v1.4.1 ISO. The `size: 3072` setting (in MiB) matches the 3G size of the COS_ACTIVE loop device shown in the earlier `lsblk` output.

   If any errors occur, save a copy of `/tmp/manual-os-upgrade/upgrade.log`.
1. Run the following commands:
298+
```
299+
# umount /tmp/manual-os-upgrade/rootfs
300+
# reboot
301+
```
302+
The node should boot successfully into Harvester v1.4.1, and the upgrade should proceed as expected.
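
   To confirm the node is on the repaired image, you can repeat the earlier checks: the kernel command line should now reference `active.img` with no `upgrade_failure` flag, and the release file should report v1.4.1.

   ```
   # grep -o 'cos-img/filename=[^ ]*' /proc/cmdline
   cos-img/filename=/cOS/active.img
   # head -n1 /etc/harvester-release.yaml
   harvester: v1.4.1
   ```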

Related issues:

- [[BUG] Stuck upgrade from 1.4.0 to 1.4.1](https://github.com/harvester/harvester/issues/7457)
- [[BUG] discrepancy in default OS partition sizes when using separate data disk](https://github.com/harvester/harvester/issues/7493)
- [[BUG] after initial installation, passive.img uses 3.1G of disk space, vs. active.img which only uses 1.7G](https://github.com/harvester/harvester/issues/7518)

versioned_docs/version-v1.4/upgrade/v1-4-0-to-v1-4-1.md

+163

(The changes to this file are identical to those shown above for docs/upgrade/v1-4-0-to-v1-4-1.md.)
