Introduce fhpc_namespace extra variable with the name of containers namespace.
Add bash-completion script for firehpc command (#12).
Add firehpc list command to list clusters present in state directory (#16).
Add firehpc restore command to restore a cluster after restart or IP addresses change (#11→#31).
Save cluster settings on deployment so they can be reused automatically in subsequent runs of firehpc conf and firehpc restore (#7).
Report cluster settings in firehpc status.
Introduce firehpc update command to change cluster settings.
Introduce firehpc bootstrap command to create deployment environments.
Add firehpc {conf,deploy} --ansible-opts option to append additional options to ansible-playbook command (#44).
Integrate management of virtual environments to support multiple versions of Ansible depending on targeted OS (#24).
Add PIP requirements files to populate ansible-latest and ansible-2.16 deployment environments.
load:
Submit jobs randomly in existing QOS and partitions.
Submit jobs of various sizes, with a power of 2 number (1, 2, 4, 8…) of cores or nodes, up to the full size of the cluster. A number of nodes is selected when Slurm SelectType plugin is linear, a number of cores is selected otherwise. Small jobs are submitted more often than big jobs.
Select job partition randomly, weighted by the number of resources in each partition, to favor the largest partitions.
Make some (about 1/10th) submitted jobs randomly fail (#9).
Submit jobs with random durations and timelimit with low probability for jobs to reach their timelimit (#10).
Support Slurm configuration without accounting service and QOS.
Reduce load by a factor outside of business hours to simulate humans submitting fewer jobs when not at work (#29).
Add --time-off-factor option to control by how much the load is divided outside of business hours.
Request GPUs allocations on partitions with gpu GRES.
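The job-shaping rules listed above for firehpc load can be illustrated with a short sketch. This is not FireHPC's actual code: the function names, the 1/size weighting, the handling of cluster sizes that are not a power of 2 and the sample partitions are all assumptions made for illustration.

```python
import random


def power_of_two_sizes(total):
    """Return the powers of 2 (1, 2, 4, ...) that do not exceed total."""
    sizes = []
    size = 1
    while size <= total:
        sizes.append(size)
        size *= 2
    return sizes


def pick_job_size(total, rng):
    """Pick a job size in cores or nodes, favoring small jobs.

    Assumption: each size is weighted by 1/size. The real weighting
    used by firehpc load may differ.
    """
    sizes = power_of_two_sizes(total)
    weights = [1 / size for size in sizes]
    return rng.choices(sizes, weights=weights, k=1)[0]


def pick_partition(partitions, rng):
    """Pick a partition name, weighted by its number of resources."""
    names = list(partitions)
    weights = [partitions[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def should_fail(rng, ratio=0.1):
    """Decide whether a submitted job should fail (about 1 in 10)."""
    return rng.random() < ratio


# Hypothetical cluster: partition name -> number of resources.
example_partitions = {"normal": 64, "gpu": 8}
example_rng = random.Random(42)
example_job = {
    "partition": pick_partition(example_partitions, example_rng),
    "size": pick_job_size(64, example_rng),
    "fail": should_fail(example_rng),
}
```

Passing an explicit random.Random instance keeps the draws reproducible, which makes this kind of load generator easier to test.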
conf:
Add possibility to define additional QOS and alternative partitions in Slurm.
Add support for RHEL9 and compatible distributions.
Add possibility to define custom site file name in nginx role.
Introduce metrics role to deploy prometheus, alloy and grafana.
Declare nodes in Slurm configuration with their socket/cores/memory configuration extracted from RacksDB.
Add params key in slurm_partitions parameter to give possibility to set any arbitrary Slurm partition configuration parameter in inventory.
Support Slurm native authentication in alternative to munge (#22).
Add possibility to disable deployment of SlurmDBD accounting service (#20).
Enable SlurmDBD regular archive and purge mechanism to avoid MariaDB database growing too much (#28). This can be disabled with slurm_with_db_archive: false in custom configuration.
Add restore playbook to update hosts file, restart Slurm services and resume unavailable nodes.
Add mariadb and dependencies tags on mariadb role in slurm role dependencies.
Support possibility to change priorities of hpck.it and Rackslab development repositories derivatives with common_hpckit_priorities and common_devs_priorities.
Add slurm_restd_port variable in inventory to control slurmrestd TCP/IP listening port.
Support all Slurm-web to slurmrestd JWT authentication modes.
Add sysctl fs.inotify.max_user_instances value increase recommendation in README.md to avoid weird issue when launching many containers.
Mention Metrics stack and Slurm-web optional features in README.md with URL to access Grafana and Slurm-web interfaces.
Explain in README.md Ansible core 2.16 requirement for both rocky8 and debian13 clusters with a method to install this version from PyPI repository.
Mention firehpc list command in manpage.
Mention firehpc load command in manpage.
Mention firehpc restore command in manpage.
Mention firehpc bootstrap command in manpage.
Mention cluster settings and firehpc update command in manpage.
Update README.md to mention bootstrap step in usage guide.
Mention --ansible-opts option in manpage.
pkgs: Introduce tests extra package with dependencies required to run tests.
Changed
Replace fhpc-emulate-slurm-usage command by firehpc load (#13).
Transform fhpc_nodes dictionary values from list of nodes to list of dictionaries to group nodes by type in RacksDB.
firehpc ssh <cluster> now connects to admin host by default (#8).
Rename file images.yml to os/db.yml and name of deployment environment associated to all supported OS.
Replace section [images] with [os] in system configuration, with new db and requirements parameters.
load:
Change pending jobs limit formula to avoid number of jobs growing as fast as the number of nodes.
Consider running jobs in addition to pending jobs when computing the number of new jobs to submit, in order to significantly reduce load on clusters out of working hours.
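As a rough illustration of the two load changes above combined with the --time-off-factor option, the number of new jobs to submit could be computed as sketched below. The formula, the business hours (Monday to Friday, 9:00 to 18:00), the default factor and every name are assumptions, not the actual firehpc load implementation.

```python
from datetime import datetime


def in_business_hours(now, start_hour=9, end_hour=18):
    """Assumption: business hours are Monday to Friday, 9:00 to 18:00."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def jobs_to_submit(limit, pending, running, now, time_off_factor=5):
    """Return the number of new jobs to submit.

    The target is divided by time_off_factor outside of business hours
    (hypothetical default of 5). Both pending and running jobs count
    against the target, so a busy cluster receives fewer new jobs.
    """
    target = limit if in_business_hours(now) else limit // time_off_factor
    return max(0, target - pending - running)


# Wednesday morning versus Saturday morning, with the same queue state.
weekday_jobs = jobs_to_submit(100, 20, 30, datetime(2024, 1, 3, 10))
weekend_jobs = jobs_to_submit(100, 20, 30, datetime(2024, 1, 6, 10))
```

Counting running jobs in addition to pending ones is what keeps the queue from being refilled as soon as jobs start, which matters most when the target is already reduced outside of working hours.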
conf:
Install socat package on all nodes in common role.
Use packages list instead of loop to install MariaDB packages.
Enable config_overrides slurmd parameter in Slurm configuration to skip the check that compute nodes sockets/cores/memory match the configuration.
Move maxtime and state Slurm partitions parameters in params sub-dictionary.
Rename node key to nodes in slurm_partitions.
Change default Slurm authentication plugin from munge to slurm. This can be changed by setting slurm_with_munge: true in Ansible inventory.
Launch slurmrestd with unprivileged system user when JWT authentication is enabled.
Adapt slurm role to support Slurm upstream packages on Debian.
docs:
Explain in manpage that ssh command connects to admin container by default.
Update documentation of --db, --schema, --custom and --slurm-emulator options of conf and restore commands with their new semantics regarding management of cluster settings.
Use standard tomllib instead of tomlib external library in documentation Makefile.
Fixed
core:
Properly handle DBus error when getting containers addresses.
Potential key conflict in dictionary of SSH clients when multiple users connect to the same host with Paramiko library.
Set jobs time limit to partition time limit when set to avoid jobs that exceed partition time limit.
Remove cluster state on cluster clean.
Check Ansible playbook return code and stop execution on failure.
lib: Fix firehpc-storage-wrapper start failure due to already existing cluster and home directories.
load:
Order of partition/qos variables in job submission informational message.
Support of Slurm 24.05 sacctmgr show qos --json format to retrieve the list of defined QOS.
Redirect jobs output to /dev/null to avoid filling filesystems with tons of inodes (#27).
conf:
Install mpi packages in parallel instead of sequential loop.
Configure system locale to en_US.UTF-8 on rocky8.
Add SLURMRESTD_SECURITY=disable_user_check environment variable in slurmrestd service to allow running as slurm user.
Containers namespace missing in Slurm-web gateway [ui] > host.
Force creation of CA and LDAP certificates to override possibly existing certificates during bootstrap.
Ignore cluster creation error in slurmdbd, as it is now automatically created when slurmctld registers to accounting service.
Support Rackslab development repository derivatives on RHEL.
Add admin hostname with namespace in addition to just the admin hostname in Slurm-web nginx site server names.
Replace embedded templates by string concatenations.