Description
Whilst configuring machines with the grubcmdline role, the check for the command line is not idempotent, so the same machine reboots on every run.
I suspect this is either:
- related to crashkernel appearing multiple times in the args (which we should fix in the caller, but which the role does not guard against), or
- resume=UUID getting packed into the cmdline multiple times
The role should either throw a fatal: message, or not notify the reboot handler in these cases.
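As a rough illustration of the kind of guard meant here, the following Python sketch flags arguments that are repeated verbatim and keys that are set to more than one value. It is not the role's code: the function names are hypothetical, and the sample list is the newly computed argument list from the logs in this report.

```python
from collections import Counter


def duplicated_args(args):
    """Return arguments that occur more than once, in first-seen order."""
    counts = Counter(args)
    return [arg for arg in dict.fromkeys(args) if counts[arg] > 1]


def conflicting_keys(args):
    """Return keys (the text before the first '=') set to more than one value."""
    values = {}
    for arg in args:
        key, _, value = arg.partition("=")
        values.setdefault(key, set()).add(value)
    return sorted(key for key, vals in values.items() if len(vals) > 1)


# Argument list as reported by the role in the logs of this issue.
cmdline = [
    "crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M",
    "resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4",
    "rhgb",
    "quiet",
    "crashkernel=auto",
    "rhgb",
    "quiet",
    "crashkernel=auto",
]

print(duplicated_args(cmdline))
print(conflicting_keys(cmdline))
```

For this input the first call reports rhgb, quiet and crashkernel=auto as duplicated, and the second reports crashkernel as having conflicting values. Either result could be grounds for failing the play instead of notifying the reboot handler.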
Logs
TASK [stackhpc.linux.grubcmdline : Notify reboot handler] **********************
Friday 20 December 2024 11:22:41 +0000 (0:00:00.049) 0:00:08.921 *******
changed: [hv.example.com] =>
msg: Old command line was BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-503.15.1.el9_5.x86_64 root=UUID=6caabe9f-f426-4105-a38e-52e98545a68a ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=
c38017eb-0531-46fa-bf97-a9265a9caed4 rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto
....
TASK [stackhpc.linux.grubcmdline : Display GRUB_CMDLINE_LINUX_DEFAULT] *********
Friday 20 December 2024 11:22:42 +0000 (0:00:00.039) 0:00:09.543 *******
ok: [hv.example.com] =>
grub_cmdline_linux_default: crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4 rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto
...
TASK [stackhpc.linux.grubcmdline : Display GRUB_CMDLINE_LINUX] *****************
Friday 20 December 2024 11:22:42 +0000 (0:00:00.038) 0:00:09.582 *******
ok: [hv.example.com] =>
grub_cmdline_linux: crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4 rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto
.....
.....
TASK [stackhpc.linux.grubcmdline : Display newly computed GRUB_CMDLINE_LINUX_DEFAULT] ***
Friday 20 December 2024 11:22:42 +0000 (0:00:00.046) 0:00:09.816 *******
ok: [hv.example.com] =>
grub_cmdline_linux_new:
- crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M
- resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4
- rhgb
- quiet
- crashkernel=auto
- rhgb
- quiet
- crashkernel=auto
After reboot:
cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-503.15.1.el9_5.x86_64 root=UUID=6caabe9f-f426-4105-a38e-52e98545a68a ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4 rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto
Calling Module
vars:
  kernel_cmdline_args:
    - "rhgb"
    - "quiet"
    - "crashkernel=auto"
  kernel_cmdline_args_remove:
    - hugepage
    - selinux
roles:
  - name: stackhpc.linux.grubcmdline
    vars:
      kernel_cmdline: "{{ kernel_cmdline_args }}"
      kernel_cmdline_remove: "{{ kernel_cmdline_args_remove }}"
Own thoughts:
- Changing the default parameters in /etc/grub2.cfg or via grubby args works around the existing tooling built into the OS and grub:
  - Adding args should be done by templating a file into /etc/grub.d for each override (RH Grubby Docs and StackOverflow example), such as /etc/grub.d/60-stackhpc-grubcmdline-quiet, then simply invoking grubby or grub-mkconfig without args.
  - Removing entries would still edit /etc/default/grub, but ansible.builtin.lineinfile should make the "surgical cuts" to minimise the changes to the file, rather than trying to generate a "new" cmdline using Jinja templates (which could generate a wrong one and result in an unbootable machine).
  - Alternatively, a grub script can be templated out to /etc/grub.d which iterates over the GRUB_CMDLINE_LINUX* variables and appends every argument back except the matching one, using the for in and if statements built into grub script. This would be more correct, but I don't have an adaptable example to hand.
The biggest advantages are:
- Prevents the role fighting upstream changes if they change the default cmdline (e.g. in-place OS upgrades)
- Would avoid the problems with the UUID
- Makes it easy to diagnose whether a problem comes from a distro or kernel flag change: mv /etc/grub.d/*stackhpc-grubcmdline* /etc/grub.d.bak && grubby ...
- Idempotent by default: template through Ansible, then use a handler to re-run grubby and notify reboot if any template/lineinfile task has changed
If not the above changes:
- I think old_cmdline != kernel_cmdline | select() | sort | list is potentially brittle, as there will be more problems like this in the future. Instead it could be something akin to the following pseudocode:
  changed_when: not(all(grub_cmdline_linux_new in kernel_cmdline)) or any(grub_cmdline_linux_remove in kernel_cmdline)
  This would only check for keywords the user has explicitly set in the role, rather than being affected by parameters (such as root=UUID=abc...) which they have not specified at all. However, this still leaves the potential problems around directly generating /etc/default/grub.
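To make the proposed check concrete, here is a minimal Python sketch of the logic, not of the role's actual Jinja. The function name needs_reboot and its parameters are hypothetical, and it assumes that removal entries such as hugepage and selinux are matched as prefixes of an argument. The sample command line is the /proc/cmdline output shown above.

```python
def needs_reboot(current_args, wanted, remove_prefixes):
    """Return True only if a requested arg is missing or a removed arg lingers.

    Arguments the user never mentioned (root=, resume=, BOOT_IMAGE=, ...)
    are ignored, so they cannot trigger a reboot.
    """
    missing = [arg for arg in wanted if arg not in current_args]
    # Assumption: each removal entry is a prefix, e.g. "selinux" matches "selinux=0".
    lingering = [
        arg for arg in current_args if arg.startswith(tuple(remove_prefixes))
    ]
    return bool(missing or lingering)


# /proc/cmdline after reboot, as reported in this issue.
proc_cmdline = (
    "BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-503.15.1.el9_5.x86_64 "
    "root=UUID=6caabe9f-f426-4105-a38e-52e98545a68a ro "
    "crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M "
    "resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4 "
    "rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto"
).split()

wanted = ["rhgb", "quiet", "crashkernel=auto"]
remove = ["hugepage", "selinux"]

print(needs_reboot(proc_cmdline, wanted, remove))
```

With the values from this report the check evaluates to False, so the reboot handler would not be notified again. Duplicated arguments no longer matter because only membership is tested.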