grubcmdline notifies reboot handler unnecessarily  #36

Open
@DavidFair

Description

When configuring machines with the grubcmdline role, the check on the kernel command line is not idempotent, so the same machine reboots on every run.

I suspect this is either

  • related to crashkernel appearing multiple times in the args (which we should fix in the caller, but which the role does not guard against), or
  • resume=UUID gets packed into the cmdline multiple times

The role should either fail with a fatal: message or not notify the reboot handler in these cases.
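
To help tell the two suspicions apart, here is a minimal Python sketch that counts how often each argument key appears on a command line. It is not part of the role; the function name `duplicated_keys` is hypothetical, and the input is the /proc/cmdline output captured further down in this issue.

```python
from collections import Counter


def duplicated_keys(cmdline: str) -> dict[str, int]:
    """Return each argument key that appears more than once, with its count."""
    # Arguments are whitespace-separated; the key is the text before the first '='.
    keys = [arg.split("=", 1)[0] for arg in cmdline.split()]
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}


# Command line copied from the "After reboot" output in this issue.
CMDLINE = (
    "BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-503.15.1.el9_5.x86_64 "
    "root=UUID=6caabe9f-f426-4105-a38e-52e98545a68a ro "
    "crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M "
    "resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4 "
    "rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto"
)

print(duplicated_keys(CMDLINE))
# {'crashkernel': 3, 'rhgb': 2, 'quiet': 2}
```

On this sample, crashkernel, rhgb and quiet are repeated while resume appears only once, which points more towards the first suspicion. A guard along these lines could back either a fatal: message or a decision to skip the reboot notification.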

Logs

TASK [stackhpc.linux.grubcmdline : Notify reboot handler] **********************
Friday 20 December 2024  11:22:41 +0000 (0:00:00.049)       0:00:08.921 ******* 
changed: [hv.example.com] => 
  msg: Old command line was BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-503.15.1.el9_5.x86_64 root=UUID=6caabe9f-f426-4105-a38e-52e98545a68a ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4 rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto
....
TASK [stackhpc.linux.grubcmdline : Display GRUB_CMDLINE_LINUX_DEFAULT] *********
Friday 20 December 2024  11:22:42 +0000 (0:00:00.039)       0:00:09.543 ******* 
ok: [hv.example.com] => 
  grub_cmdline_linux_default: crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4 rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto
...
TASK [stackhpc.linux.grubcmdline : Display GRUB_CMDLINE_LINUX] *****************
Friday 20 December 2024  11:22:42 +0000 (0:00:00.038)       0:00:09.582 ******* 
ok: [hv.example.com] => 
  grub_cmdline_linux: crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4 rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto
.....
.....
  
  
TASK [stackhpc.linux.grubcmdline : Display newly computed GRUB_CMDLINE_LINUX_DEFAULT] ***
Friday 20 December 2024  11:22:42 +0000 (0:00:00.046)       0:00:09.816 ******* 
ok: [hv.example.com] => 
  grub_cmdline_linux_new:
  - crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M
  - resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4
  - rhgb
  - quiet
  - crashkernel=auto
  - rhgb
  - quiet
  - crashkernel=auto

After reboot:

cat /proc/cmdline 
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-503.15.1.el9_5.x86_64 root=UUID=6caabe9f-f426-4105-a38e-52e98545a68a ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4 rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto

Calling Module

  vars:
    kernel_cmdline_args:
      - "rhgb"
      - "quiet"
      - "crashkernel=auto"
    kernel_cmdline_args_remove:
      - hugepage
      - selinux
  roles:
    - name: stackhpc.linux.grubcmdline
      vars:
        kernel_cmdline: "{{ kernel_cmdline_args }}"
        kernel_cmdline_remove: "{{ kernel_cmdline_args_remove }}"

Own thoughts:

  • Changing the default parameters in /etc/grub2.cfg or through grubby args works around the existing tooling built into the OS and grub:

Adding args should be done by templating a file into /etc/grub.d for each override (see the RH Grubby Docs and the StackOverflow example), such as /etc/grub.d/60-stackhpc-grubcmdline-quiet, and then simply invoking grubby or grub-mkconfig without args.

Removing entries would still edit /etc/default/grub, but ansible.builtin.lineinfile should make the "surgical cuts" instead, to minimise changes to the file rather than trying to generate a "new" cmdline using Jinja templates (which could generate the wrong output and result in an unbootable machine).
Alternatively, a grub script can be templated into /etc/grub.d which iterates over the GRUB_CMDLINE_LINUX* variables and appends every argument back except the matching one, using the for ... in and if statements built into grub script. This would be more correct, but I don't have an adaptable example to hand.
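
The "append every argument back except the matching one" idea can be sketched as follows. This is Python used only to illustrate the loop, not grub script and not an adaptable drop-in. The function name `strip_args`, the prefix-matching rule, and the sample input are all assumptions made for the example.

```python
def strip_args(cmdline: str, remove: list[str]) -> str:
    """Rebuild a command line, dropping arguments that match a removal entry."""
    kept = []
    for arg in cmdline.split():
        # Assumption: an entry such as "selinux" matches any argument that
        # starts with it, e.g. "selinux=0". The role may match differently.
        if any(arg.startswith(entry) for entry in remove):
            continue
        kept.append(arg)
    return " ".join(kept)


# Made-up input; the removal list is the one from "Calling Module".
print(strip_args("rhgb quiet selinux=0 hugepagesz=1G", ["hugepage", "selinux"]))
# rhgb quiet
```

Arguments that do not match any removal entry pass through untouched and in their original order, which is what keeps the change to the existing command line minimal.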

The biggest advantages are:

  • Prevents the role fighting upstream changes if they change the default cmdline (e.g. in-place OS upgrades)
  • Would avoid the problems with the UUID
  • Makes it easy to diagnose whether a problem comes from a distro or kernel flag change: mv /etc/grub.d/*stackhpc-grubcmdline /etc/grub.d.bak && grubby ...
  • Idempotent by default: template through Ansible, then use a handler to re-run grubby and notify reboot if any template / lineinfile task has changed

If not the above changes:

  • I think old_cmdline != kernel_cmdline | select() | sort | list is potentially brittle, as there will be more problems like this in the future. Instead, it could be something akin to the following pseudocode:
changed_when: not(all(grub_cmdline_linux_new in kernel_cmdline)) or any(grub_cmdline_linux_remove in kernel_cmdline)

This would only check for keywords the user has explicitly set in the role, rather than being affected by parameters (such as root=UUID=abc...) which they have not specified at all. However, this still leaves the potential problems around directly generating /etc/default/grub.
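
A minimal Python sketch of that check, to show the intended behaviour rather than the role's actual implementation. The function name `needs_reboot` is hypothetical, and treating removal entries as prefixes is an assumption. The command line is the one from the logs above, and the two lists come from "Calling Module".

```python
def needs_reboot(current_cmdline: str, wanted: list[str], remove: list[str]) -> bool:
    """Report a change only for arguments the user explicitly asked about."""
    current = current_cmdline.split()
    # Wanted arguments that are not yet on the running command line.
    missing = [arg for arg in wanted if arg not in current]
    # Assumption: a removal entry matches any argument that starts with it.
    lingering = [arg for arg in current
                 if any(arg.startswith(entry) for entry in remove)]
    return bool(missing or lingering)


CMDLINE = (
    "BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-503.15.1.el9_5.x86_64 "
    "root=UUID=6caabe9f-f426-4105-a38e-52e98545a68a ro "
    "crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M "
    "resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4 "
    "rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto"
)

print(needs_reboot(CMDLINE, ["rhgb", "quiet", "crashkernel=auto"], ["hugepage", "selinux"]))
# False
```

With this comparison, the duplicated arguments and the root=UUID / resume=UUID entries do not trigger a change on their own, because none of them is something the caller asked to add or remove.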
