v1.2.0

@rezib rezib released this 18 Sep 18:39

Added

  • images:
    • Add rocky9
    • Add debian13
    • Add debian14 (#42).
  • Introduce fhpc_namespace extra variable with the name of containers namespace.
  • Add bash-completion script for firehpc command (#12).
  • Add firehpc list command to list clusters present in state directory (#16).
  • Add firehpc restore command to restore a cluster after restart or IP addresses change (#11, #31).
  • Save cluster settings on deployment so they can be reused automatically in subsequent runs of firehpc conf and firehpc restore (#7).
  • Report cluster settings in firehpc status.
  • Introduce firehpc update command to change cluster settings.
  • Introduce firehpc bootstrap command to create deployment environments.
  • Add firehpc {conf,deploy} --ansible-opts option to append additional options to ansible-playbook command (#44).
  • Integrate management of virtual environments to support multiple versions of Ansible depending on targeted OS (#24).
  • Add PIP requirements files to populate ansible-latest and ansible-2.16 deployment environments.
  • load:
    • Submit jobs randomly in existing QOS and partitions.
    • Submit jobs of various sizes, with a power of 2 number (1, 2, 4, 8…) of cores or nodes, up to the full size of the cluster. A number of nodes is selected when Slurm SelectType plugin is linear; a number of cores is selected otherwise. Small jobs are submitted more often than big jobs.
    • Select job partition randomly weighted by their number of resources to favor largest partitions.
    • Make some (about 1/10th) submitted jobs randomly fail (#9).
    • Submit jobs with random durations and timelimit with low probability for jobs to reach their timelimit (#10).
    • Support Slurm configuration without accounting service and QOS.
    • Reduce load by a factor outside of business hours to simulate humans submitting fewer jobs when not at work (#29).
    • Add --time-off-factor option to control by how much the load is divided outside of business hours.
    • Request GPUs allocations on partitions with gpu GRES.
  • conf:
    • Add possibility to define additional QOS and alternative partitions in Slurm.
    • Add support for RHEL9 and compatible distributions.
    • Add possibility to define custom site file name in nginx role.
    • Introduce metrics role to deploy prometheus, alloy and grafana.
    • Declare nodes in Slurm configuration with their socket/cores/memory configuration extracted from RacksDB.
    • Add params key in slurm_partitions parameter to give possibility to set any arbitrary Slurm partition configuration parameter in inventory.
    • Support Slurm native authentication as an alternative to munge (#22).
    • Add possibility to disable deployment of SlurmDBD accounting service (#20).
    • Enable SlurmDBD regular archive and purge mechanism to avoid MariaDB database growing too much (#28). This can be disabled with slurm_with_db_archive: false in custom configuration.
    • Add restore playbook to update hosts file, restart Slurm services and resume unavailable nodes.
    • Add mariadb and dependencies tags on mariadb role in slurm role dependencies.
    • Support changing the priorities of hpck.it and Rackslab development repository derivatives with common_hpckit_priorities and common_devs_priorities.
    • Add slurm_restd_port variable in inventory to control slurmrestd TCP/IP listening port.
    • Support all Slurm-web to slurmrestd JWT authentication modes.
    • Support gpu GRES in Slurm configuration (#39).
  • docs:
    • Add recommendation in README.md to increase the sysctl fs.inotify.max_user_instances value, to avoid issues when launching many containers.
    • Mention Metrics stack and Slurm-web optional features in README.md with URL to access Grafana and Slurm-web interfaces.
    • Explain in README.md Ansible core 2.16 requirement for both rocky8 and debian13 clusters with a method to install this version from PyPI repository.
    • Mention firehpc list command in manpage.
    • Mention firehpc load command in manpage.
    • Mention firehpc restore command in manpage.
    • Mention firehpc bootstrap command in manpage.
    • Mention cluster settings and firehpc update command in manpage.
    • Update README.md to mention bootstrap step in usage guide.
    • Mention --ansible-opts option in manpage.
  • pkgs: Introduce tests extra package with dependencies required to run tests.
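
The job selection rules listed under load above can be illustrated with a minimal Python sketch. This is not FireHPC's actual code: the function names, the halving weights used to favor small jobs, the choice to cap the last size at the partition size when it is not a power of 2, and the example partition names are all assumptions made for illustration.

```python
import random


def power_of_two_sizes(max_size):
    """Return candidate job sizes 1, 2, 4, ... capped at max_size.

    Assumption: when max_size is not a power of 2, the full size is
    still offered as the largest candidate.
    """
    sizes = []
    size = 1
    while size < max_size:
        sizes.append(size)
        size *= 2
    sizes.append(max_size)
    return sizes


def pick_job(partitions, rng, fail_ratio=0.1):
    """Pick a partition, a size and a failure flag for one job.

    partitions maps a partition name to its number of resources
    (nodes or cores). All names and weights here are hypothetical.
    """
    names = list(partitions)
    # Larger partitions are proportionally more likely to be selected.
    partition = rng.choices(names, weights=[partitions[n] for n in names])[0]
    sizes = power_of_two_sizes(partitions[partition])
    # Assumed weighting: each size step is half as likely as the previous one.
    weights = [1 / (2**i) for i in range(len(sizes))]
    size = rng.choices(sizes, weights=weights)[0]
    # About fail_ratio of the jobs are flagged to fail.
    return {"partition": partition, "size": size, "fail": rng.random() < fail_ratio}


if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(5):
        print(pick_job({"normal": 8, "gpu": 2}, rng))
```

Passing a seeded random.Random instance keeps the sketch reproducible; a real load generator would more likely use an unseeded generator.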

Changed

  • Replace fhpc-emulate-slurm-usage command by firehpc load (#13).
  • Transform fhpc_nodes dictionary values from list of nodes to list of dictionaries to group nodes by type in RacksDB.
  • firehpc ssh <cluster> now connects to admin host by default (#8).
  • Rename file images.yml to os/db.yml and name of deployment environment associated to all supported OS.
  • Replace section [images] with [os] in system configuration with new db and requirements parameters.
  • load:
    • Change pending jobs limit formula to avoid number of jobs growing as fast as the number of nodes.
    • Consider running jobs in addition to pending jobs when computing the number of new jobs to submit, in order to significantly reduce load on clusters out of working hours.
  • conf:
    • Install socat package on all nodes in common role.
    • Use packages list instead of loop to install MariaDB packages.
    • Enable config_overrides slurmd parameter in Slurm configuration to avoid compute nodes sockets/cores/memory matching configuration check.
    • Move maxtime and state Slurm partitions parameters in params sub-dictionary.
    • Rename slurm_partitions > node key to nodes.
    • Change default Slurm authentication plugin from munge to slurm. This can be changed by setting slurm_with_munge: true in Ansible inventory.
    • Launch slurmrestd with unprivileged system user when JWT authentication is enabled.
    • Adapt slurm role to support Slurm upstream packages on Debian.
  • docs:
    • Explain in manpage that ssh command considers admin container by default.
    • Update documentation of --db, --schema, --custom and --slurm-emulator options of conf and restore commands with their new semantics regarding management of cluster settings.
    • Use standard tomllib instead of tomlib external library in documentation Makefile.
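
The change to the load submission logic (counting running jobs as well as pending jobs) can be sketched as follows. The actual formula used by firehpc load is not given in these notes, so the function name, the subtraction-based formula, the integer division and the default factor value are all assumptions.

```python
def jobs_to_submit(limit, pending, running, business_hours, time_off_factor=5):
    """Return how many new jobs to submit in one iteration.

    Hypothetical formula: the limit is divided by time_off_factor outside
    of business hours, then both pending and running jobs are subtracted.
    """
    if not business_hours:
        limit = limit // time_off_factor
    return max(0, limit - (pending + running))


if __name__ == "__main__":
    # During business hours the full limit applies.
    print(jobs_to_submit(20, pending=5, running=5, business_hours=True))
    # Outside business hours the reduced limit is already exceeded.
    print(jobs_to_submit(20, pending=5, running=5, business_hours=False))
```

Under this assumed formula, jobs that are already running count against the limit, so a cluster that is still busy from daytime submissions receives no new jobs outside of working hours.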

Fixed

  • core:
    • Properly handle DBus error when getting containers addresses.
    • Potential key conflict in dictionary of SSH clients when multiple users connect to the same host with Paramiko library.
    • Set jobs time limit to partition time limit when set to avoid jobs that exceed partition time limit.
    • Remove cluster state on cluster clean.
    • Check Ansible playbook return code and stop execution on failure.
  • lib: Fix firehpc-storage-wrapper start failure due to already existing cluster and home directories.
  • load:
    • Order of partition/qos variables in job submission informational message.
    • Support of Slurm 24.05 sacctmgr show qos --json format to retrieve the list of defined QOS.
    • Redirect jobs output to /dev/null to avoid filling filesystems with tons of inodes (#27).
  • conf:
    • Install mpi packages in parallel instead of sequential loop.
    • Configure system locale to en_US.UTF-8 on rocky8.
    • Add SLURMRESTD_SECURITY=disable_user_check environment variable in slurmrestd service to allow running as slurm user.
    • Containers namespace missing in Slurm-web gateway [ui] > host.
    • Force creation of CA and LDAP certificates to override possibly existing certificates during bootstrap.
    • Ignore cluster creation error in slurmdbd, as it is now automatically created when slurmctld registers to accounting service.
    • Support Rackslab development repository derivatives on RHEL.
    • Add admin hostname with namespace in addition to just the admin hostname in Slurm-web nginx site server names.
    • Replace embedded templates by string concatenations.
  • docs: Various formatting errors in manpage.
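
The SSH clients key conflict mentioned under core above is the kind of bug that appears when a cache of connections is indexed by host name only. A minimal sketch of one way to avoid it, assuming a cache keyed by a (user, host) pair, is shown below. The ClientCache class and its factory argument are hypothetical and do not reflect FireHPC's actual implementation.

```python
class ClientCache:
    """Cache of SSH clients keyed by (user, host) instead of host alone.

    Hypothetical helper: factory is any callable building a client for a
    given user and host, such as one wrapping paramiko.SSHClient.
    """

    def __init__(self, factory):
        self._factory = factory
        self._clients = {}

    def get(self, user, host):
        # Using both user and host in the key prevents two users connecting
        # to the same host from sharing or overwriting each other's client.
        key = (user, host)
        if key not in self._clients:
            self._clients[key] = self._factory(user, host)
        return self._clients[key]


if __name__ == "__main__":
    # A plain object stands in for a real SSH client in this sketch.
    cache = ClientCache(lambda user, host: object())
    root_client = cache.get("root", "admin")
    user_client = cache.get("alice", "admin")
    print(root_client is user_client)
    print(cache.get("root", "admin") is root_client)
```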

Removed

  • conf: Drop DSA SSH host keys.
  • docs: Remove fhpc-emulate-slurm-usage manpage.