
Netcup Root Server (KVM): UEFI install ok, first boot reports “system disk not found” #12505

@peterlobster

Description


I’m trying to install Talos Linux v1.12.0 on a Netcup “Root Server” (KVM-based virtual server with dedicated resources). Netcup Root Servers are effectively KVM VMs (not bare metal), managed via their SCP panel with remote console and virtual media. (netcup)

What we’re seeing

We have two related boot-path problems:

  1. Legacy BIOS / CSM path appears blocked in this environment
  • When the server is configured for BIOS boot, Talos ISOs are not detected as bootable and the console reports no bootable disk (even when the ISO/DVD is attached via the provider panel).
  • We tried multiple Talos factory ISO variants (metal, nocloud, auto BIOS/UEFI, dual boot). Same outcome.
  2. UEFI path boots the ISO and installs, but the first boot from disk fails
  • In UEFI mode, the Talos ISO boots and we can apply machine config in maintenance mode.

  • Talos writes to disk successfully (we can reboot back into the ISO later and Talos detects a pre-existing install on disk, consistent with Talos ISO behavior documented by Sidero). (Talos)

  • However, after detaching the ISO and booting from disk, the node fails during what looks like a first-boot / post-install stage with messages indicating it cannot find the system disk (or that the system disk is missing / not found / not declared). At that point it does not boot into a healthy Talos runtime.

    Observed failure signature on “bad” boots:

    [talos] task haltIfInstalled (1/1): Talos is already installed to disk but booted from another media and talos.halt_if_installed kernel parameter is set. Please reboot from the disk.

The key Talos-facing issue is that the install appears to complete and persist on disk, but the installed system cannot reliably boot and/or cannot reliably locate the system disk on first boot in this UEFI environment.

  • Talos uses systemd-boot for UEFI systems and GRUB for legacy BIOS on x86_64. (Talos)
    We might be landing in an edge case where the BIOS bootloader path isn’t reachable (provider limitation), and the UEFI bootloader path works for installation but fails on first boot due to firmware quirks or disk identification instability.
  • The Root Server's UEFI implementation may be incomplete or may not persist EFI variables reliably across full power cycles (provider-controlled firmware).
  • Netcup states TPM is not supported (it's not required for our goal, but it’s a clue about virtual firmware feature completeness). (netcup)

Additional context: Talos Factory extensions

We generated Talos factory images with a minimal set of system extensions, including qemu-guest-agent (and a small number of others). Since the kernel already supports virtio and other typical KVM devices, we wouldn't expect these extensions to affect the bootloader or system-disk discovery path, but we mention it here for completeness.
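
For reference, the Image Factory schematic we used had roughly this shape (the list below is illustrative rather than our exact schematic; qemu-guest-agent is the only extension we are confident is relevant to this report):

customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/qemu-guest-agent
      # plus a small number of other official extensions (omitted here)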

Reproduction steps (as close as we can make them)

  1. Provision a Netcup Root Server (KVM VM) and attach Talos factory ISO via provider panel.

  2. Boot in UEFI mode (BIOS mode fails to detect Talos ISO as bootable in this environment).

  3. Apply machine config (examples below).

  4. Talos installs to disk and reboots.

  5. Detach ISO and boot from disk.

  6. Observe console output on first boot after install (or after ISO detach):

    [talos] task haltIfInstalled (1/1): Talos is already installed to disk but booted from another media and talos.halt_if_installed kernel parameter is set. Please reboot from the disk.

We repeated all of the following multiple times:

  • ACPI shutdown + start cycles
  • full drive wipes at the provider level
  • switching between UEFI and BIOS multiple times
  • testing multiple ISO variants (metal, nocloud, auto BIOS/UEFI, dual boot)
  • attempting a QCOW2 image via the provider’s “custom image upload” (it did not boot)

We have a thorough reproduction of our setup in this repo: https://github.com/peterlobster/talos-linux-root-server-boot-repro

What we expected

  • UEFI install should result in a consistently bootable disk.
  • On first boot, Talos should come up cleanly (DHCP is fine; static config is also acceptable if required).
  • We should be able to reach the API and continue cluster bootstrap.

Config snippets from repro (sanitized)

We’re using generated configs with the install disk set explicitly.

The configs are stored in the repro GitHub repo: https://github.com/peterlobster/talos-linux-root-server-boot-repro

Control plane config key details

cluster:
  clusterName: Lab
  controlPlane:
    endpoint: https://<CONTROL_PLANE_PUBLIC_IP>:6443

machine:
  install:
    disk: /dev/vda
    image: ghcr.io/siderolabs/installer:v1.12.0
    wipe: false

These fields appear in the controlplane.yaml.

Worker config key details

machine:
  install:
    disk: /dev/vda
    image: ghcr.io/siderolabs/installer:v1.12.0
    wipe: false

These fields appear in the worker.yaml.

Kubernetes version (expected)
Our configs reference kubelet image v1.35.0.
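
For completeness, the relevant field in the generated configs is machine.kubelet.image (this should have no bearing on the boot problem, but it pins down the version above):

machine:
  kubelet:
    image: ghcr.io/siderolabs/kubelet:v1.35.0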

Questions we ran into along the way

  1. Disk selection stability: could /dev/vda be unstable across reboots on some KVM setups? Would you recommend using a /dev/disk/by-id/... path (or another Talos-supported selector) specifically for hosted KVM environments to avoid “system disk not found” after reboot? (A sketch of what we would try is after this list.)

  2. UEFI firmware quirks: are there known issues with Talos on UEFI implementations that do not fully persist EFI variables (NVRAM) across power cycles? If so, does Talos always install a fallback boot path that doesn’t depend on persisted EFI vars?

  3. SecureBoot / key enrollment confusion: we have seen messages that look like secure boot assets are being copied or referenced even when secure boot isn’t supported/enabled on the platform. Is there a known failure mode if the firmware doesn’t support expected secure boot enrollment flows? (We did test multiple ISO variants, and can confirm exactly which ones if needed.) (Talos)

  4. Recommended logging path: since the system fails very early (before the Talos API is reliably available), what’s the best way to capture boot logs you’d find useful? We can provide:

    • full console output screenshots / transcriptions
    • serial console logs if you recommend specific kernel args for Talos factory images
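
Regarding question 1: if a selector is the recommended approach, we would replace machine.install.disk with machine.install.diskSelector, roughly as sketched below. The values are placeholders only; we have not yet confirmed which property (size, bus path, etc.) is stable across reboots on Netcup’s KVM setup.

machine:
  install:
    diskSelector:
      size: '>= 100GB'   # placeholder; would be set to match the actual provisioned disk
    image: ghcr.io/siderolabs/installer:v1.12.0
    wipe: false

If a /dev/disk/by-id/... path is preferred instead of a selector, we can test that as well.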

Logs

At the moment, the main logs we have are from the provider VNC console during first boot from disk.


Console output (first boot after install):

(console screenshots omitted; the key line is transcribed below)
[talos] task haltIfInstalled (1/1): Talos is already installed to disk but booted from another media and talos.halt_if_installed kernel parameter is set. Please reboot from the disk.

If you tell us the best way to enable early boot logging (serial console, kernel args, etc.), we can re-run the install and attach the requested logs.
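
Our current plan (unverified) would be to add console arguments via machine.install.extraKernelArgs, along these lines; ttyS0 is an assumption, since we don’t yet know whether Netcup exposes a virtual serial port on these VMs:

machine:
  install:
    extraKernelArgs:
      - console=ttyS0,115200n8   # assumption: provider exposes a serial port as ttyS0
      - console=tty0             # keep output on the VGA/VNC console as well

As far as we understand, these are applied to the on-disk boot entry by the installer, so they should cover the failing first boot from disk. Happy to use different arguments if you recommend them.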

Environment

  • Talos version: v1.12.0 (GitHub)
  • Kubernetes version: expected v1.35.0 (per kubelet image in configs)
  • Platform: Netcup Root Server (KVM-based VM with dedicated resources; provider-controlled UEFI/BIOS settings + virtual media) (netcup)
  • Firmware: Tried both BIOS and UEFI modes (BIOS path does not boot Talos ISOs in this environment; UEFI path installs but first boot fails)
  • Notes: Provider (Netcup) indicates TPM not supported (netcup)

Thanks again for any guidance. If there’s a known best practice for “provider UEFI that might not persist NVRAM” + Talos bootloader expectations, we’d love to align with it.
