
Commit 85d69cc

fix: add v1.4.1 known issue: upgrade stuck in "Waiting Reboot" state (#715)
* fix: add v1.4.1 known issue: upgrade stuck in "Waiting Reboot" state

  Related issue: harvester/harvester#7457

  Signed-off-by: Tim Serong <[email protected]>
  Co-authored-by: Jillian Maroket <[email protected]>
1 parent b1ae2f9 commit 85d69cc

File tree

2 files changed: +326 -0 lines changed

docs/upgrade/v1-4-0-to-v1-4-1.md

+163
@@ -14,6 +14,28 @@ An **Upgrade** button appears on the **Dashboard** screen whenever a new Harvest

For air-gapped environments, see [Prepare an air-gapped upgrade](./automatic.md#prepare-an-air-gapped-upgrade).

:::info important

Check the disk usage of the operating system images on each node before starting the upgrade. To do this, access the node via SSH and run the command `du -sh /run/initramfs/cos-state/cOS/*`.

Example:

```
# du -sh /run/initramfs/cos-state/cOS/*
1.7G    /run/initramfs/cos-state/cOS/active.img
3.1G    /run/initramfs/cos-state/cOS/passive.img
```

If `passive.img` (which represents the previously installed Harvester v1.4.0 image) consumes 3.1G of disk space, run the following commands using the root account:

```
# mount -o remount,rw /run/initramfs/cos-state
# fallocate --dig-holes /run/initramfs/cos-state/cOS/passive.img
# mount -o remount,ro /run/initramfs/cos-state
```

`passive.img` is converted to a sparse file, which should then consume only 1.7G of disk space (the same as `active.img`). This ensures that each node has enough free space, preventing the upgrade process from becoming [stuck in the "Waiting Reboot" state](#3-upgrade-is-stuck-in-the-waiting-reboot-state).
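
To verify the result, you can compare the file's apparent size with the blocks it actually occupies (a quick check assuming GNU coreutils `du`; the figures shown are illustrative):

```
# du -h --apparent-size /run/initramfs/cos-state/cOS/passive.img
3.0G    /run/initramfs/cos-state/cOS/passive.img
# du -h /run/initramfs/cos-state/cOS/passive.img
1.7G    /run/initramfs/cos-state/cOS/passive.img
```

The apparent size is unchanged, but the actual disk usage drops once the holes are dug.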

:::

### Update Harvester UI Extension on Rancher v2.10.1

@@ -144,3 +166,144 @@ You can perform any of the following workarounds:

![Upgrade with another default storage class workaround](/img/v1.4/upgrade/upgrade-with-another-default-storage-class-workaround.png)

For more information, see [Issue #7375](https://github.com/harvester/harvester/issues/7375).

### 3. Upgrade is stuck in the "Waiting Reboot" state

The upgrade process may become stuck in the "Waiting Reboot" state after the Harvester v1.4.1 image is installed on a node and a reboot is initiated. At this point, the upgrade controller checks whether the node is running the Harvester v1.4.1 operating system.
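
To see what the controller is waiting for, you can inspect the Upgrade custom resource from a machine with cluster access (a diagnostic sketch; the CRD name and namespace below are the ones Harvester normally uses, and `<upgrade-name>` is a placeholder):

```
# kubectl -n harvester-system get upgrades.harvesterhci.io
# kubectl -n harvester-system describe upgrades.harvesterhci.io <upgrade-name>
```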

If the Harvester v1.4.1 image (hereafter referred to as `active.img`) fails to boot for any reason, the node automatically restarts in fallback mode and boots the previously installed Harvester v1.4.0 image (hereafter referred to as `passive.img`). The upgrade controller is unable to detect the expected operating system, so the upgrade remains stuck until an administrator fixes the problem with `active.img`.

`active.img` can become corrupted and unbootable because of insufficient disk space in the COS_STATE partition during the upgrade. This occurs if Harvester v1.4.0 was originally installed on the node and the system was configured to use a separate data disk. The issue does not occur in the following situations:

- The system has a single disk that is shared by the operating system and data.
- An earlier Harvester version was originally installed and then later upgraded to v1.4.0.

To check if the issue exists in your environment, perform the following steps:

1. Access the node via SSH and log in using the root account.

1. Run the commands `cat /proc/cmdline` and `head -n1 /etc/harvester-release.yaml`.

   Example:

   ```
   # cat /proc/cmdline
   BOOT_IMAGE=(loop0)/boot/vmlinuz console=tty1 root=LABEL=COS_STATE cos-img/filename=/cOS/passive.img panic=0 net.ifnames=1 rd.cos.oemlabel=COS_OEM rd.cos.mount=LABEL=COS_OEM:/oem rd.cos.mount=LABEL=COS_PERSISTENT:/usr/local rd.cos.oemtimeout=120 audit=1 audit_backlog_limit=8192 intel_iommu=on amd_iommu=on iommu=pt multipath=off upgrade_failure

   # head -n1 /etc/harvester-release.yaml
   harvester: v1.4.0
   ```

   The presence of `cos-img/filename=/cOS/passive.img` and `upgrade_failure` in the output indicates that the system booted into fallback mode. The Harvester version in `/etc/harvester-release.yaml` confirms that the system is currently using the v1.4.0 image.

1. Check if `active.img` is corrupted by running the command `fsck.ext2 -nf /run/initramfs/cos-state/cOS/active.img`.

   Example:

   ```
   # fsck.ext2 -nf /run/initramfs/cos-state/cOS/active.img
   e2fsck 1.46.4 (18-Aug-2021)
   Pass 1: Checking inodes, blocks, and sizes
   Pass 2: Checking directory structure

   [...a list of various errors may appear here...]

   e2fsck: aborted

   COS_ACTIVE: ********** WARNING: Filesystem still has errors **********
   ```

1. Check the partition sizes by running the command `lsblk -o NAME,LABEL,SIZE`.

   Example:

   ```
   # lsblk -o NAME,LABEL,SIZE
   NAME   LABEL            SIZE
   loop0  COS_ACTIVE         3G
   sr0                     1024M
   vda                      250G
   ├─vda1 COS_GRUB           64M
   ├─vda2 COS_OEM            64M
   ├─vda3 COS_RECOVERY        4G
   ├─vda4 COS_STATE           8G
   └─vda5 COS_PERSISTENT   237.9G
   vdb    HARV_LH_DEFAULT   128G
   ```

   The output in the example shows a COS_STATE partition that is 8G in size. In this specific case, which involves an unsuccessful upgrade attempt and a corrupted `active.img`, the partition likely did not have enough free space for the upgrade to succeed.
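
   The arithmetic is tight even in the normal case: a non-sparse `passive.img` (3.1G), `active.img` (1.7G), and the roughly 3G image written during the upgrade together nearly exhaust the 8G partition. You can check how much space is actually left (the `Avail` column) with:

   ```
   # df -h /run/initramfs/cos-state
   ```

   If substantially less than about 3G is available, a retried upgrade is likely to fail the same way until space is freed.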

To fix the issue, perform the following steps:
1. If your cluster has two or more nodes, access the remaining nodes via SSH and check the disk usage of `active.img` and `passive.img`.
239+
```
240+
# du -sh /run/initramfs/cos-state/cOS/*
241+
1.7G /run/initramfs/cos-state/cOS/active.img
242+
3.1G /run/initramfs/cos-state/cOS/passive.img
243+
```
244+
If `passive.img` consumes 3.1G of disk space, run the following commands using the root account:
245+
```
246+
# mount -o remount,rw /run/initramfs/cos-state
247+
# fallocate --dig-holes /run/initramfs/cos-state/cOS/passive.img
248+
# mount -o remount,ro /run/initramfs/cos-state
249+
```
250+
`passive.img` is converted to a sparse file, which should only consume 1.7G of disk space (the same as `active.img`). This ensures that the other nodes have enough free space, preventing the upgrade process from becoming stuck again.

1. Access the stuck node via SSH, and then run the following commands using the root account:

   ```
   # mount -o remount,rw /run/initramfs/cos-state
   # cp /run/initramfs/cos-state/cOS/passive.img \
        /run/initramfs/cos-state/cOS/active.img
   # tune2fs -L COS_ACTIVE /run/initramfs/cos-state/cOS/active.img
   # mount -o remount,ro /run/initramfs/cos-state
   ```

   The existing (clean) `passive.img` is copied over the corrupted `active.img`, and the filesystem label is set back to `COS_ACTIVE`.
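
   Optionally, you can rerun the check from the diagnosis steps to confirm that the copied image is clean; it should now complete without reporting errors:

   ```
   # fsck.ext2 -nf /run/initramfs/cos-state/cOS/active.img
   ```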

1. Reboot the stuck node, and then select the first entry ("Harvester v1.4.1") on the GRUB boot screen.

   The GRUB boot screen initially highlights "Harvester v1.4.1 (fallback)" by default. Despite the displayed version, the system boots into Harvester v1.4.0, because `active.img` now contains a copy of the v1.4.0 image.

1. Copy `rootfs.squashfs` from the Harvester v1.4.1 ISO to a convenient location on the stuck node.

   The ISO can be mounted either on the stuck node or on another system. You can copy the file using the `scp` command, as in the sketch below.
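
   A minimal sketch of one way to do this from another machine (the ISO filename, node address, and destination path are placeholders; use `find` if the file is not at the ISO root):

   ```
   # mount -o loop harvester-v1.4.1-amd64.iso /mnt
   # find /mnt -name rootfs.squashfs
   # scp /mnt/rootfs.squashfs rancher@<stuck-node>:/tmp/rootfs.squashfs
   # umount /mnt
   ```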

1. Access the stuck node via SSH, and then run the following commands using the root account:

   ```
   # mkdir /tmp/manual-os-upgrade
   # mkdir /tmp/manual-os-upgrade/config
   # mkdir /tmp/manual-os-upgrade/rootfs
   # mount -o loop rootfs.squashfs /tmp/manual-os-upgrade/rootfs
   # cat > /tmp/manual-os-upgrade/config/config.yaml <<EOF
   upgrade:
     system:
       size: 3072
   EOF
   # elemental upgrade \
       --logfile /tmp/manual-os-upgrade/upgrade.log \
       --directory /tmp/manual-os-upgrade/rootfs \
       --config-dir /tmp/manual-os-upgrade/config \
       --debug
   ```

   :::note

   You must replace `rootfs.squashfs` in the `mount` command (the fourth command above) with the actual path of the copied `rootfs.squashfs` file.

   :::

   A new (clean) `active.img` is generated based on the root image from the Harvester v1.4.1 ISO. The `size: 3072` setting (in MiB) matches the 3G size of the COS_ACTIVE loop device shown in the earlier `lsblk` output.

   If any errors occur, save a copy of `/tmp/manual-os-upgrade/upgrade.log`.
1. Run the following commands:
298+
```
299+
# umount /tmp/manual-os-upgrade/rootfs
300+
# reboot
301+
```
302+
The node should boot successfully into Harvester v1.4.1, and the upgrade should proceed as expected.
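
   To confirm the node is on the repaired image, you can repeat the earlier checks: the kernel command line should now reference `active.img` with no `upgrade_failure` flag, and the release file should report v1.4.1.

   ```
   # grep -o 'cos-img/filename=[^ ]*' /proc/cmdline
   cos-img/filename=/cOS/active.img
   # head -n1 /etc/harvester-release.yaml
   harvester: v1.4.1
   ```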

Related issues:

- [[BUG] Stuck upgrade from 1.4.0 to 1.4.1](https://github.com/harvester/harvester/issues/7457)
- [[BUG] discrepancy in default OS partition sizes when using separate data disk](https://github.com/harvester/harvester/issues/7493)
- [[BUG] after initial installation, passive.img uses 3.1G of disk space, vs. active.img which only uses 1.7G](https://github.com/harvester/harvester/issues/7518)

versioned_docs/version-v1.4/upgrade/v1-4-0-to-v1-4-1.md

+163

(The changes to this file are identical to those shown above for docs/upgrade/v1-4-0-to-v1-4-1.md.)
