Skip to content
This repository was archived by the owner on Apr 16, 2026. It is now read-only.

HOTFIX 2025-08-15#99

Closed
dhilst wants to merge 1 commit intoansible-roles-migration-part-1from
hotfix-20250814
Closed

HOTFIX 2025-08-15#99
dhilst wants to merge 1 commit intoansible-roles-migration-part-1from
hotfix-20250814

Conversation

@dhilst
Copy link
Copy Markdown
Collaborator

@dhilst dhilst commented Aug 15, 2025

Fixes for the deployment issues captured on 2025-08-13

  • Update to Rocky Linux 9.6
  • Use the kernel version from the answer file in all places where it is required
  • Update OFED installation and version in the answer file and repos.conf file
  • Update the image generation to install the proper kernel
  • Improve exit code checking safety
  • Dump cluster state before starting the installation (useful for troubleshooting & pinpointing causes)

@dhilst dhilst changed the title WIP HOTFIX HOTFIX 2025-08-15 Aug 15, 2025
@dhilst dhilst marked this pull request as ready for review August 15, 2025 10:33
Comment thread src/services/ansible/roles/base.cpp Outdated
Comment on lines +67 to +68
"chkconfig",
"initscripts",
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now this seems OK. But this is only a requirement for xCAT.

Comment thread src/services/runner.cpp Outdated
namespace cloyster::services::runner {

int shell(std::string_view cmd) { return shellfmt("{}", cmd); }
void shell(std::string_view cmd) { shellfmt("{}", cmd); }
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here was to return the execution code, since it's always an integer it could have some usage. If we change to path to thrown an exception it may start to be a problem if we use the return code in other place of the code and I think we use it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have two variants, unsafe does not check the exit code, but has [[nodiscard]] so that the compiler enforces that we check it, and the safe version aborts if the exit code is non-zero. We need that because some commands are intended to fail, like when I check if the image exists or if a package is installed the exit code is non-zero

Comment thread src/services/shell.cpp Outdated
"mtu {} ipv4.method manual ipv4.address {}/{} "
"ipv4.dns \"{}\" "
// "ipv4.gateway {} ipv4.dns \"{}\" "
// @FIXME: This will break Confluent, is it required by xCAT?
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the IPv6 disablement was a leftover.

Comment thread src/services/xcat.cpp
void XCAT::installPackages()
{
auto osservice = cloyster::Singleton<IOSService>::get();
osservice->install("initscripts");
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should stay here.

The thing is... it should be one of the firsts. Because xCAT and only xCAT requires that. Modern software don't care for initscripts at all. For instance, Confluent does not need it.

Copy link
Copy Markdown
Collaborator Author

@dhilst dhilst Aug 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was moved to base.cpp, it is in among the first packages installed, that's why I removed this line here

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was moved to base.cpp, it is in among the first packages installed, that's why I removed this line here

Yes, I got it. But this package is only needed due to xCAT. I'm not sure if we must ship it by default.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to address this during the Confluent integration

https://github.com/viniciusferrao/cloysterhpc/pull/101/files#r2282713249

Comment thread test/sample/answerfile/rocky9-base.ini Outdated
kernel=5.14.0-427.13.1.el9_4.x86_64
version=9.6
# kernel=5.14.0-427.13.1.el9_4.x86_64
kernel=5.14.0-503.21.1.el9_5.x86_64
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future we should have a compatibility bitmap to map the supported versions.

@dhilst dhilst force-pushed the hotfix-20250814 branch 4 times, most recently from 96b0b5e to c0cf12a Compare August 18, 2025 11:58
@sonarqubecloud
Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
5.8% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@dhilst dhilst changed the base branch from ansible-roles-migration-part-1 to master September 23, 2025 20:20
@dhilst dhilst changed the base branch from master to ansible-roles-migration-part-1 September 23, 2025 20:21
- Fix exit code unhandling bug
- Get kernel version from answerfile
- Fix OFED version in repos.conf
- Fix image generation with custom kernel
- Fix spack idempotency
- Fix diskImage use before set
- Run clang-format
- Enable keepcache in dnf
- Fix from where the kernel version is fetched
- Use running kernel when kernel option is ommited in the answerfile
- Fix slurld error in the computing node
- Comment kernel in rocky9-base.ini
- Fix makedns command & restart chronyd
@sonarqubecloud
Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
5.8% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@dhilst
Copy link
Copy Markdown
Collaborator Author

dhilst commented Sep 23, 2025

Merged at #102

@dhilst dhilst closed this Sep 23, 2025
@dhilst dhilst deleted the hotfix-20250814 branch September 23, 2025 20:39
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants