Adding a disk to an instance changed the boot order #5112

Open
@citrus-it

I added a disk to an instance that had been running in the colo for a while, and it failed to boot afterwards, dropping to the UEFI shell. I've replicated this with a fresh instance, and the rest of this note is from that replication case.

One notable thing about the VM that I originally saw the problem with is that its two disks were in slots 1 and 2, with nothing present in slot 0. This is likely because it was created before the fix for #5067 was merged.

To replicate the failure, I created a new disk from an image, and then two additional blank ones. By attaching them to a new instance in the right order, then detaching a blank disk again, I was able to end up with an instance in the same configuration, with the boot disk in slot 1 and slot 0 being empty.

                 name                | slot
-------------------------------------+-------
  test-omnios-bloody-20240215-e87155 |    1
  blank2                             |    2

I then booted this instance, which was successful, and mounted the EFI System Partition (ESP) to fish out the NvVars file, which is where the UEFI bootrom stores its persistent variables. Decoding this shows that the bootrom has enumerated all of the possible boot devices, assigned them numbers and configured an initial boot order:

Variable        Value                    Notes
--------        -----                    ------
Boot0000        UIApp
Boot0001        UEFI                    <-- slot 1
Boot0002        UEFI 2                  <-- slot 2
Boot0003        UEFI Non-Block Device   <-- slot 8 (cidata volume)
Boot0004        UEFI PXE v4
Boot0005        EFI Internal Shell
BootOrder       0, 1, 2, 3, 4, 5
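
For readers who want to repeat the decoding step: the UEFI specification defines the `BootOrder` variable's data as an array of 16-bit values, each naming a `Boot####` variable. The following is a minimal sketch that decodes such a value once its raw bytes have been extracted. It does not parse the NvVars container itself, and the `raw` value below is a hypothetical example constructed to match the order shown above, not bytes taken from this instance.

```python
import struct


def decode_boot_order(raw: bytes) -> list[str]:
    """Decode raw BootOrder data into a list of Boot#### variable names."""
    if len(raw) % 2 != 0:
        raise ValueError("BootOrder data must be a whole number of 16-bit entries")
    count = len(raw) // 2
    # Each entry is an unsigned little-endian 16-bit option number.
    numbers = struct.unpack(f"<{count}H", raw)
    # Boot#### names use four uppercase hexadecimal digits.
    return [f"Boot{number:04X}" for number in numbers]


# Hypothetical raw data corresponding to the order 0, 1, 2, 3, 4, 5.
raw = bytes.fromhex("000001000200030004000500")
print(decode_boot_order(raw))
```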

So far so good. I rebooted the instance a couple of times to confirm that it booted normally, and that these variables didn't change.

I then shut down the instance and attached a new blank disk to it. This disk was 128G in size and used a 4096-byte sector size. After this, the database showed that the new disk had been placed in slot 0. This mirrors what happened with the previously failed instance.

                 name                | slot
-------------------------------------+-------
  test-omnios-bloody-20240215-e87155 |    1
  blank4096                          |    0
  blank2                             |    2

On booting the instance back up, it dropped to the EFI shell after failing to boot from Boot0003 and via PXE:

BdsDxe: failed to load Boot0003 "UEFI Non-Block Boot Device" from PciRoot(0x0)/Pci(0x18,0x0): Not Found

>>Start PXE over IPv4.
  PXE-E16: No valid offer received.
BdsDxe: failed to load Boot0004 "UEFI PXEv4 (MAC:A84025FDD042)" from PciRoot(0x0)/Pci(0x9,0x0)/MAC(A84025FDD042,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0): Not Found
BdsDxe: loading Boot0005 "EFI Internal Shell" from Fv(7CB8BDC9-F8EB-4F34-AAEA-3EE4AF6516A1)/FvFile(7C04A583-9E3E-4F1C-AD65-E05268D0B4D1)
BdsDxe: starting Boot0005 "EFI Internal Shell" from Fv(7CB8BDC9-F8EB-4F34-AAEA-3EE4AF6516A1)/FvFile(7C04A583-9E3E-4F1C-AD65-E05268D0B4D1)
UEFI Interactive Shell v2.2
EDK II
UEFI v2.70 (EDK II, 0x00010000)
Shell>

Using the EFI shell to look at the persistent variables now showed something interesting:

Boot0000        UIApp
Boot0001        UEFI                    <-- slot 1
Boot0002        UEFI 2                  <-- slot 2
Boot0003        UEFI Non-Block Device   <-- slot 8 (cidata volume)
Boot0004        UEFI PXE v4
Boot0005        EFI Internal Shell
Boot0006        UEFI 3                  <-- slot 0 (newly added drive)
BootOrder       0, 3, 4, 5, 1, 2, 6

The new disk has been enumerated and added as Boot0006, which is not a surprise, but the boot order has been changed so that all three NVMe disks are now at the end. This explains why the instance attempted to boot from Boot0003, which is the cidata volume, and failed, then tried PXE boot and finally dropped to the EFI shell.
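
The difference between "a new entry was appended" and "the existing entries were reshuffled" can be checked mechanically. The sketch below uses the two boot orders recorded in this note; the function name `order_preserved` is made up for illustration and is not part of any bootrom or control plane code.

```python
def order_preserved(before: list[int], after: list[int]) -> bool:
    """Return True if entries present in both orders keep their relative order."""
    common = set(before) & set(after)
    kept_before = [entry for entry in before if entry in common]
    kept_after = [entry for entry in after if entry in common]
    return kept_before == kept_after


before = [0, 1, 2, 3, 4, 5]        # BootOrder before the disk was attached
after = [0, 3, 4, 5, 1, 2, 6]      # BootOrder after the disk was attached
appended = [0, 1, 2, 3, 4, 5, 6]   # what a plain append would have looked like

print(order_preserved(before, after))
print(order_preserved(before, appended))
```

With the orders above, the first check reports that the relative order was not preserved, while the plain-append case passes.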

The bootrom's debug output from this boot also shows this same strange boot order:

[Bds]=============Begin Load Options Dumping ...=============
  Driver Options:
  SysPrep Options:
  Boot Options:
    Boot0000: UiApp              0x0109
    Boot0003: UEFI Non-Block Boot Device                 0x0001
    Boot0004: UEFI PXEv4 (MAC:A84025FAF1FF)              0x0001
    Boot0005: EFI Internal Shell                 0x0001
    Boot0001: UEFI               0x0001
    Boot0002: UEFI  2            0x0001
    Boot0006: UEFI  3            0x0001
  PlatformRecovery Options:
    PlatformRecovery0000: Default PlatformRecovery               0x0001
[Bds]=============End Load Options Dumping=============
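
The trailing hex value on each line of this dump is the load option's attribute word. As I understand the UEFI specification, bit 0x0001 marks an option as active, 0x0008 as hidden, and the category field (mask 0x1F00) set to 0x0100 marks an application that is not part of normal boot processing. On that reading, Boot0000 (0x0109) is skipped during a normal boot even though it is first in the order, which would make Boot0003 the first entry actually attempted. The sketch below parses the dump lines under those assumptions; the regular expression is fitted to the lines shown here and is not a general parser for EDK II debug output.

```python
import re

# Attribute bits, as I understand them from the UEFI specification.
LOAD_OPTION_ACTIVE = 0x0001
LOAD_OPTION_HIDDEN = 0x0008
LOAD_OPTION_CATEGORY = 0x1F00
LOAD_OPTION_CATEGORY_APP = 0x0100

# Matches lines such as "    Boot0001: UEFI               0x0001".
OPTION_LINE = re.compile(r"^\s*Boot([0-9A-Fa-f]{4}):\s+(.*?)\s+0x([0-9A-Fa-f]+)\s*$")


def parse_boot_options(dump: str) -> list[dict]:
    """Extract boot options, in dump order, with decoded attribute flags."""
    options = []
    for line in dump.splitlines():
        match = OPTION_LINE.match(line)
        if match is None:
            continue  # headers, PlatformRecovery entries, and so on
        attrs = int(match.group(3), 16)
        options.append({
            "number": int(match.group(1), 16),
            "description": match.group(2),
            "active": bool(attrs & LOAD_OPTION_ACTIVE),
            "hidden": bool(attrs & LOAD_OPTION_HIDDEN),
            "app": (attrs & LOAD_OPTION_CATEGORY) == LOAD_OPTION_CATEGORY_APP,
        })
    return options


def boot_candidates(options: list[dict]) -> list[int]:
    """Option numbers that would be tried in order during a normal boot."""
    return [o["number"] for o in options if o["active"] and not o["app"]]


DUMP = """\
  Boot Options:
    Boot0000: UiApp              0x0109
    Boot0003: UEFI Non-Block Boot Device                 0x0001
    Boot0004: UEFI PXEv4 (MAC:A84025FAF1FF)              0x0001
    Boot0005: EFI Internal Shell                 0x0001
    Boot0001: UEFI               0x0001
    Boot0002: UEFI  2            0x0001
    Boot0006: UEFI  3            0x0001
  PlatformRecovery Options:
    PlatformRecovery0000: Default PlatformRecovery               0x0001
"""

print(boot_candidates(parse_boot_options(DUMP)))
```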

To replicate this, I faithfully reproduced what happened in the colo. Not all of the steps here may be necessary to trigger the problem, and more experimentation is needed.

Labels: customer (for any bug reports or feature requests tied to customer requests), known issue (to include in customer documentation and training)
