Merging v1.157 #3

Merged
merged 345 commits into from
Feb 25, 2025
feab4cf
Merge branch 'main' into feat/compute-script
bertiethorpe Nov 15, 2024
100632f
bump
wtripp180901 Nov 18, 2024
882ed7e
Don't fail cluster cleanup when prefix not found
bertiethorpe Nov 18, 2024
4710699
Update nightly-cleanup.yml
bertiethorpe Nov 18, 2024
8b6bf97
Update nightly-cleanup.yml
bertiethorpe Nov 18, 2024
35d4100
Update nightly-cleanup.yml
bertiethorpe Nov 18, 2024
127316b
Update nightly-cleanup.yml
bertiethorpe Nov 18, 2024
b35925d
Update nightly-cleanup.yml
bertiethorpe Nov 18, 2024
8aea048
Update nightly-cleanup.yml
bertiethorpe Nov 18, 2024
a40b77c
flatten multiline list of clusters
bertiethorpe Nov 18, 2024
9001626
Update nightly-cleanup.yml
bertiethorpe Nov 18, 2024
f949553
Update delete-cluster.py to allow --force flag
bertiethorpe Nov 18, 2024
b971e7f
Merge pull request #480 from stackhpc/ci/nightly-cleanup-patch
bertiethorpe Nov 18, 2024
0c17410
k9s tags and variable renames
wtripp180901 Nov 19, 2024
92f5115
Merge pull request #441 from stackhpc/feature/k3s-ansible-init
wtripp180901 Nov 19, 2024
e44e084
Make block device detection work on ESXi (#481)
Nov 19, 2024
1e1779c
Merge branch 'main' into feat/compute-script
bertiethorpe Nov 19, 2024
53e43c2
Fix adhoc/rebuild wait_for_connection race condition
bertiethorpe Nov 20, 2024
f23be23
Merge pull request #483 from stackhpc/fix/adhoc-rebuild
bertiethorpe Nov 20, 2024
903e22c
Merge branch 'main' into feat/compute-script
bertiethorpe Nov 20, 2024
a32e309
remove gres.conf - no-op
bertiethorpe Nov 20, 2024
a1f71b6
remove or hardcode some vars, make resolv_conf block conditional
bertiethorpe Nov 20, 2024
61392ed
move EESSI CVMFS install and config to nfs export
bertiethorpe Nov 20, 2024
51b02d3
move manila mount share to nfs export
bertiethorpe Nov 20, 2024
134515d
Pause CI testing for branch feat/compute-script
bertiethorpe Nov 20, 2024
40d9e1f
replaces system repos with ark repos during ci
wtripp180901 Nov 22, 2024
9ef7d69
now uses lookup instead of packer args
wtripp180901 Nov 25, 2024
a6e1243
only applies to RL9 for now
wtripp180901 Nov 25, 2024
3e80268
set up rocky-latest-test builds and ci
bertiethorpe Nov 25, 2024
151746c
bump images
bertiethorpe Nov 25, 2024
9c3301c
CI_CLOUD PR label override for trivy scan
bertiethorpe Nov 25, 2024
b2b2160
bump images
bertiethorpe Nov 25, 2024
0da074b
bump containers.podman collection version
bertiethorpe Nov 25, 2024
5ae1888
bump images
bertiethorpe Nov 25, 2024
b4d2d19
debug site.yml
bertiethorpe Nov 26, 2024
88e23de
mysql latest
bertiethorpe Nov 26, 2024
1eeef37
Bump openhpc role for slurm restart, templating and nodes in multiple…
sjpb Nov 26, 2024
6671d69
bump mysql
bertiethorpe Nov 26, 2024
f66feb9
simplify slurm-init file injection loop
bertiethorpe Nov 27, 2024
6a8266c
clear podman temp files on startup
bertiethorpe Nov 27, 2024
33ffa65
bump new images
bertiethorpe Nov 27, 2024
f4c5cfe
stop using rocky-latest-test images in CI
bertiethorpe Nov 28, 2024
d7a8dd2
low verbosity CI site.yml
bertiethorpe Nov 28, 2024
6faf919
refactored ark role, disabled repos at end of build and modified site…
wtripp180901 Nov 29, 2024
0bc473c
fixed ood install with disbaled repos + fixed ark CRB typo
wtripp180901 Dec 3, 2024
364ec79
fixed eessi install and slurm not loading appliances_mode
wtripp180901 Dec 3, 2024
b0558b9
variables renames + more ansible facts in dnf_repos
wtripp180901 Dec 3, 2024
3131bd6
bump images
wtripp180901 Dec 3, 2024
1be9c6b
added review comment
wtripp180901 Dec 4, 2024
b7670e9
moved config into builder and .stackhpc
wtripp180901 Dec 4, 2024
3de36cf
pull
wtripp180901 Dec 4, 2024
2230bb8
overriding openhpc extra repos in common
wtripp180901 Dec 4, 2024
4de581c
Use rocky 9.4 release train snapshots for builds (#486)
wtripp180901 Dec 4, 2024
9723782
testing builds with leafcloud pulp
wtripp180901 Dec 6, 2024
127b792
pulp integration
wtripp180901 Dec 6, 2024
5b60770
merge conflicts
wtripp180901 Dec 6, 2024
0d8a440
typos
wtripp180901 Dec 6, 2024
90a33fa
missed merge conflict
wtripp180901 Dec 6, 2024
eaa3680
moved pulp port into url
wtripp180901 Dec 6, 2024
9a75656
fixed port not getting added in adhoc
wtripp180901 Dec 6, 2024
741872a
bump
wtripp180901 Dec 6, 2024
39cf556
cleaned up disabling repos + now optional
wtripp180901 Dec 6, 2024
25644c3
typo
wtripp180901 Dec 9, 2024
fef3d56
repos now timestamped + synced at bootstrap
wtripp180901 Dec 11, 2024
1c4a511
refactored pulp_site list
wtripp180901 Dec 11, 2024
558874b
Added extra package installs to bootstrap
wtripp180901 Dec 12, 2024
187bc40
added pulp sync adhoc and temporarily moved out of ci
wtripp180901 Dec 12, 2024
580b0b3
fixed disabling for ci
wtripp180901 Dec 12, 2024
2ed6674
made dnf epel repo more configurable
wtripp180901 Dec 12, 2024
efd2883
Add role to install NVIDIA DOCA on top of an existing "fat" image (#492)
sjpb Dec 12, 2024
d12083a
moved repo enable/disable into fatimage
wtripp180901 Dec 12, 2024
59dd169
merge conflicts
wtripp180901 Dec 12, 2024
07dc9b7
fixed disable repos task
wtripp180901 Dec 12, 2024
3088f83
reverted disable repos task
wtripp180901 Dec 12, 2024
c74360b
fatimage with test latest (REVERT LATER)
wtripp180901 Dec 12, 2024
67ce24b
refactored pulp deploy and added pulp docs
wtripp180901 Dec 12, 2024
c433605
testing image using site pulp
wtripp180901 Dec 12, 2024
bda3f7e
Pointed dnf repos back at ark for now + refactor
wtripp180901 Dec 13, 2024
17d7924
fix doca cleanup deleteing /tmp/ (#494)
sjpb Dec 13, 2024
d6eabe6
unused var
wtripp180901 Dec 13, 2024
4a3074b
prototype script - hostvars no-op
bertiethorpe Dec 13, 2024
5a082e7
Fix nightly images getting timestamp/git hash (#493)
sjpb Dec 13, 2024
91fe707
Update nightlybuild.yml
sjpb Dec 13, 2024
64ddf19
Merge pull request #495 from stackhpc/fix/nightly-img-version-v2
wtripp180901 Dec 13, 2024
e3ce492
use k3s_server metadata for server_ip
bertiethorpe Dec 13, 2024
f0e48b9
pulp sync now mirrors upstream subpaths
wtripp180901 Dec 13, 2024
309bd0b
removed intermediate var
wtripp180901 Dec 13, 2024
a2a705c
Merge branch 'main' into feat/pulp-builds
wtripp180901 Dec 13, 2024
9065bb6
bumped repo timestamps to latest
wtripp180901 Dec 13, 2024
7d7bc73
bump images
wtripp180901 Dec 13, 2024
cc81aef
bump
wtripp180901 Dec 13, 2024
75961b4
Merge branch 'feat/pulp-builds' into update/latest-timestamps
wtripp180901 Dec 13, 2024
f343f67
moved to later in build/site and moved groups
wtripp180901 Dec 13, 2024
d5e8d9a
merge conflicts
wtripp180901 Dec 13, 2024
07ed822
compute init node condition based off metadata
bertiethorpe Dec 13, 2024
a43a5f9
fail gracefully when NFS server not up
bertiethorpe Dec 13, 2024
76f292e
rejoin node to cluster
bertiethorpe Dec 13, 2024
a5cbc58
Merge branch 'main' into feat/compute-script-sb
sjpb Dec 13, 2024
1a400db
ok: Skipping compute initialization as metadata compute_groups is empty
sjpb Dec 13, 2024
c9ebd48
compute-init stage 1 working
sjpb Dec 13, 2024
3a583a9
load hostvars
sjpb Dec 13, 2024
8bb90b4
simplify compute-init file copy
sjpb Dec 13, 2024
7babc21
move compute_init tasks to right place and document
sjpb Dec 14, 2024
cb21e9c
leave compute-init turned on in everything template
sjpb Dec 14, 2024
53a7dc4
get resolv_conf, etc_hosts and stackhpc.openhpc working
sjpb Dec 14, 2024
1f45851
doc problems with templating out hostvars
sjpb Dec 14, 2024
c162e18
Refactored common repolist
wtripp180901 Dec 16, 2024
bda3f0d
Code review doc/comment suggestions
wtripp180901 Dec 16, 2024
6551c33
Merge branch 'main' into update/openhpc-v0.27.9
bertiethorpe Dec 16, 2024
bc5e26e
docs/groups corrections
wtripp180901 Dec 16, 2024
18b220e
moved defaults to CI and updated docs
wtripp180901 Dec 16, 2024
34fee1c
updated docs
wtripp180901 Dec 16, 2024
83161c7
Merge branch 'feat/pulp-builds' into feat/extra-packages
wtripp180901 Dec 16, 2024
9c41725
bump images
wtripp180901 Dec 16, 2024
a435292
bump image
wtripp180901 Dec 16, 2024
d8097c5
Merge branch 'feat/pulp-builds' into feat/extra-packages
wtripp180901 Dec 16, 2024
30a278e
moved to extras
wtripp180901 Dec 16, 2024
4ed69f9
Merge branch 'feat/extra-packages' of github.com:stackhpc/ansible-slu…
wtripp180901 Dec 16, 2024
6c74a1e
repos now controlled by groups + possible during configure + guarded …
wtripp180901 Dec 16, 2024
2357a73
typo
wtripp180901 Dec 16, 2024
bf6f368
bump
wtripp180901 Dec 16, 2024
c6a6bf3
re-enable CI on compute-init script branch
sjpb Dec 17, 2024
5455eec
doc compute_init/export.yml ordering
sjpb Dec 17, 2024
36cf771
change name for compute-init enablement
sjpb Dec 17, 2024
e0f9003
merge conflicts
wtripp180901 Dec 17, 2024
d77be65
Merge branch 'feat/pulp-builds' into feat/extra-packages
wtripp180901 Dec 17, 2024
5e7f809
move most compute-init docs to the role readme
sjpb Dec 17, 2024
11580b3
Remove use of FIPs for leafcloud packer builds (#498)
sjpb Dec 17, 2024
1ba41d8
bump
wtripp180901 Dec 17, 2024
a868642
bump CI image
sjpb Dec 17, 2024
4b0e36d
now performs update in fatimage
wtripp180901 Dec 17, 2024
eeb8838
merge conflicts
wtripp180901 Dec 17, 2024
32cc0c6
Merge pull request #488 from stackhpc/update/openhpc-v0.27.9
bertiethorpe Dec 17, 2024
bc36b78
testing enabling release train for 8.10
wtripp180901 Dec 17, 2024
a9e53ba
Temporarily (?) building from rocky 8 genericcloud + update in fatimage
wtripp180901 Dec 17, 2024
47b7bb3
bump
wtripp180901 Dec 17, 2024
7fe3ca5
docs suggestions
wtripp180901 Dec 17, 2024
1faf4e5
stopped openhpc overwriting epel 8
wtripp180901 Dec 17, 2024
a3e1258
Merge branch 'main' into feat/pulp-builds
wtripp180901 Dec 17, 2024
f032ed9
Merge pull request #490 from stackhpc/feat/pulp-builds
wtripp180901 Dec 17, 2024
b6ae045
Merge branch 'main' into feat/compute-script-sb
sjpb Dec 17, 2024
6ce4953
fixed broken powertools repo
wtripp180901 Dec 18, 2024
a99d8be
Merge branch 'feat/rocky8-release-train' of github.com:stackhpc/ansib…
wtripp180901 Dec 18, 2024
29a1579
bump
wtripp180901 Dec 18, 2024
5effb3f
Merge branch 'main' into update/latest-timestamps
wtripp180901 Dec 18, 2024
8a754d0
merging 9.5 podman fixes
wtripp180901 Dec 18, 2024
ee4ab93
bump
wtripp180901 Dec 18, 2024
82ef12b
support nfs for compute-init
sjpb Dec 18, 2024
9049d30
fix compute-init README typos
sjpb Dec 18, 2024
79f52f9
fix typo in resolv_conf metadata
sjpb Dec 18, 2024
5438150
added 9.5 ark snapshots + bumped genericcloud
wtripp180901 Dec 18, 2024
e7c96ad
bump
wtripp180901 Dec 18, 2024
6140caa
Merge branch 'update/latest-timestamps' into feat/rocky-9.5-release-t…
wtripp180901 Dec 18, 2024
4f81b89
fix nfs and make openhpc fully-capable in compute-init
sjpb Dec 18, 2024
9859d54
make compute-init safe for rerunning ansible-init
sjpb Dec 18, 2024
e0d0c06
support manila in compute-init
sjpb Dec 18, 2024
68bec3e
test manila if running on arcus
sjpb Dec 18, 2024
14e7dc6
support basic_users in compute-init
sjpb Dec 18, 2024
a2418ef
support eessi in compute-init
sjpb Dec 18, 2024
cdbf005
change metadat from k3s_server to control_address
sjpb Dec 18, 2024
3b9eb46
fixup resolv_conf support in cloud-init
sjpb Dec 18, 2024
15ed0a3
Bump RL9.4 repo timestamps to latest snapshots (#497)
wtripp180901 Dec 18, 2024
fed2d6e
Pin nvidia-driver and cuda packages to working packages (#496)
sjpb Dec 19, 2024
722a0c1
moved pulp subpaths into common structure
wtripp180901 Dec 19, 2024
d1f3c69
typos
wtripp180901 Dec 19, 2024
357f7e2
docs suggestions
wtripp180901 Dec 19, 2024
7f84fed
merge conflicts
wtripp180901 Dec 19, 2024
17499e7
dnf_repos urls fully overridable again
wtripp180901 Dec 19, 2024
1e2e6d8
bump
wtripp180901 Dec 19, 2024
6a8ecda
variable renames from review
wtripp180901 Dec 19, 2024
ef33eef
updated docs
wtripp180901 Dec 19, 2024
a3be506
missed variable rename
wtripp180901 Dec 20, 2024
8059d24
Merge pull request #499 from stackhpc/feat/extra-packages
bertiethorpe Dec 20, 2024
a2dde14
Merge branch 'main' into feat/compute-script-sb
bertiethorpe Dec 20, 2024
533d7c5
integrated with refactor
wtripp180901 Dec 20, 2024
484c54a
integrated refactor
wtripp180901 Dec 20, 2024
a0ba5f1
bump fatimage
bertiethorpe Dec 20, 2024
3d91816
Merge pull request #500 from stackhpc/feat/compute-script-sb
bertiethorpe Dec 20, 2024
ada3dc9
Review linting changes
wtripp180901 Jan 2, 2025
20cc4ab
Merge pull request #507 from stackhpc/refactor/pulp-urls
wtripp180901 Jan 2, 2025
a769015
Add note about login node reboot (#510)
sd109 Jan 2, 2025
84aacac
merge conflicts
wtripp180901 Jan 2, 2025
8c98e16
merge conflicts
wtripp180901 Jan 2, 2025
77cfc70
merge with rocky 8 support
wtripp180901 Jan 2, 2025
36ca0d5
bump
wtripp180901 Jan 2, 2025
5fddb85
bump
wtripp180901 Jan 2, 2025
9c4d576
Merge branch 'feat/rocky8-release-train' into feat/rocky-9.5-release-…
wtripp180901 Jan 2, 2025
1b94cfc
Merge pull request #501 from stackhpc/feat/rocky8-release-train
wtripp180901 Jan 2, 2025
3327898
Merge branch 'main' into feat/rocky-9.5-release-train
wtripp180901 Jan 2, 2025
fd6eb4f
Merge pull request #503 from stackhpc/feat/rocky-9.5-release-train
wtripp180901 Jan 2, 2025
9a07ff4
Stop Lustre deleting rdma packages + add to extrabuild test (#502)
wtripp180901 Jan 3, 2025
8c979cd
Fix python/ansible/pulp squeezer versions for RL8 deploy hosts (#516)
sjpb Jan 3, 2025
510cfd0
extend cookiecutter terraform config for compute init script
bertiethorpe Jan 6, 2025
a08f984
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 6, 2025
4def5ba
Add Release Train OpenHPC repos (#515)
wtripp180901 Jan 7, 2025
8290a31
define default compute init flags
bertiethorpe Jan 7, 2025
b820632
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 7, 2025
354ce1e
add CI tests for compute node rebuilds
bertiethorpe Jan 7, 2025
b903cdd
document metadata toggle flags and CI workflow
bertiethorpe Jan 7, 2025
50fc320
Update ceph to use ark packages and move RL9 to ceph reef (#519)
wtripp180901 Jan 8, 2025
781c2d4
Add more information re. configuring production sites (#508)
sjpb Jan 8, 2025
a1e5bd7
Reworked persist_hostkeys role to use common set of persistent keys f…
wtripp180901 Jan 8, 2025
fa028f9
removed unnescessary caas config
wtripp180901 Jan 8, 2025
001c459
updated docs
wtripp180901 Jan 8, 2025
2bea51c
review suggestions
bertiethorpe Jan 8, 2025
def6bc3
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 8, 2025
dc58a25
Change defaults so a cookiecutter environment is fully functional (#473)
wtripp180901 Jan 8, 2025
038ddf7
add delay for ansible-init to finish
bertiethorpe Jan 9, 2025
ed810f2
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 9, 2025
69d9cd8
Fix epel not using Ark repos for RL8 (#526)
wtripp180901 Jan 9, 2025
6929272
fix volume_backed_instances not working for compute nodes (#527)
sjpb Jan 9, 2025
4652c34
typo
wtripp180901 Jan 9, 2025
f021167
comment update
wtripp180901 Jan 9, 2025
895f302
Merge branch 'main' into feat/hostkey-secrets
wtripp180901 Jan 9, 2025
08eff97
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 9, 2025
7057c50
remove delay in compute node rebuild ci
bertiethorpe Jan 9, 2025
b93e3c7
Merge pull request #525 from stackhpc/feat/hostkey-secrets
wtripp180901 Jan 9, 2025
6d992bf
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 9, 2025
3faa813
fix compute init metadata flags
bertiethorpe Jan 9, 2025
a7876a6
Support additional volumes on compute nodes (#528)
sjpb Jan 9, 2025
bc16dba
bump image
bertiethorpe Jan 9, 2025
68561b4
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 9, 2025
2903223
Support SSSD and optionally LDAP (#438)
sjpb Jan 9, 2025
5193ba2
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 9, 2025
d2e18d0
bump image
bertiethorpe Jan 10, 2025
3b09bd1
Fix various typos in documentation
priteau Jan 10, 2025
438ed3a
adjust check_slurm logic to deal with idle* state
bertiethorpe Jan 10, 2025
37c1dce
Fix nightly cleanup to deal with duplicate server names
bertiethorpe Jan 13, 2025
9b1bf12
Update nightly-cleanup.yml
bertiethorpe Jan 13, 2025
f1fd75e
Update nightly-cleanup.yml
bertiethorpe Jan 13, 2025
edbcebc
Fix tag determination
bertiethorpe Jan 13, 2025
fd5cbf9
pause in workflow to debug slurm state
bertiethorpe Jan 14, 2025
f661c7f
debug wait on failure
bertiethorpe Jan 14, 2025
662f5ef
Merge pull request #532 from stackhpc/fix/nightly-cleanup
bertiethorpe Jan 14, 2025
c95006b
Merge pull request #530 from stackhpc/fix-typos
sd109 Jan 14, 2025
329e054
Fix environment creation steps
priteau Jan 14, 2025
2cac614
Merge pull request #531 from stackhpc/fix-environment-creation
sd109 Jan 14, 2025
81c316a
allow empty compute_init_enable list
bertiethorpe Jan 14, 2025
bccc88b
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 14, 2025
9897f29
bump images
bertiethorpe Jan 14, 2025
c7722c1
Merge pull request #518 from stackhpc/feat/compute-init-cookiecutter
bertiethorpe Jan 15, 2025
257e685
Document required security groups (#534)
priteau Jan 15, 2025
e8f1cbe
Bump Zenith client to latest from azimuth-cloud namespace (#437)
m-bull Jan 15, 2025
1e5e105
fix yaml formatting in operations docs
sjpb Jan 15, 2025
a347b90
Merge pull request #535 from stackhpc/docs/ops-yaml-typo
bertiethorpe Jan 15, 2025
5f7e48f
Enable image builds to install extra packages by default (#536)
sjpb Jan 15, 2025
fd0c7f6
merge conflicts
wtripp180901 Jan 17, 2025
4254e28
moved vtest pre-hook to nrel
wtripp180901 Jan 30, 2025
445ee45
Merge branch 'vtest-v1.157-cleanup' into vtest-v1.157-merge
wtripp180901 Jan 30, 2025
11bcf83
Merge branch 'vtest-v1.157-cleanup' into vtest-v1.157-merge
wtripp180901 Jan 30, 2025
139 changes: 139 additions & 0 deletions .github/workflows/extra.yml
@@ -0,0 +1,139 @@
+name: Test extra build
+on:
+  workflow_dispatch:
+  push:
+    branches:
+      - main
+    paths:
+      - 'environments/.stackhpc/terraform/cluster_image.auto.tfvars.json'
+      - 'ansible/roles/doca/**'
+      - 'ansible/roles/cuda/**'
+      - 'ansible/roles/lustre/**'
+      - '.github/workflows/extra.yml'
+  pull_request:
+    paths:
+      - 'environments/.stackhpc/terraform/cluster_image.auto.tfvars.json'
+      - 'ansible/roles/doca/**'
+      - 'ansible/roles/cuda/**'
+      - 'ansible/roles/lustre/**'
+      - '.github/workflows/extra.yml'
+
+jobs:
+  doca:
+    name: extra-build
+    concurrency:
+      group: ${{ github.workflow }}-${{ github.ref }}-${{ matrix.build.image_name }} # to branch/PR + OS
+      cancel-in-progress: true
+    runs-on: ubuntu-22.04
+    strategy:
+      fail-fast: false # allow other matrix jobs to continue even if one fails
+      matrix: # build RL8, RL9
+        build:
+          - image_name: openhpc-extra-RL8
+            source_image_name_key: RL8 # key into environments/.stackhpc/terraform/cluster_image.auto.tfvars.json
+            inventory_groups: doca,cuda,lustre
+            volume_size: 30 # needed for cuda
+          - image_name: openhpc-extra-RL9
+            source_image_name_key: RL9
+            inventory_groups: doca,cuda,lustre
+            volume_size: 30 # needed for cuda
+    env:
+      ANSIBLE_FORCE_COLOR: True
+      OS_CLOUD: openstack
+      CI_CLOUD: ${{ vars.CI_CLOUD }} # default from repo settings
+      ARK_PASSWORD: ${{ secrets.ARK_PASSWORD }}
+
+    steps:
+      - uses: actions/checkout@v2
+
+      - name: Load current fat images into GITHUB_ENV
+        # see https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/workflow-commands-for-github-actions#example-of-a-multiline-string
+        run: |
+          {
+            echo 'FAT_IMAGES<<EOF'
+            cat environments/.stackhpc/terraform/cluster_image.auto.tfvars.json
+            echo EOF
+          } >> "$GITHUB_ENV"
+
+      - name: Record settings
+        run: |
+          echo CI_CLOUD: ${{ env.CI_CLOUD }}
+          echo FAT_IMAGES: ${FAT_IMAGES}
+
+      - name: Setup ssh
+        run: |
+          set -x
+          mkdir ~/.ssh
+          echo "${{ secrets[format('{0}_SSH_KEY', env.CI_CLOUD)] }}" > ~/.ssh/id_rsa
+          chmod 0600 ~/.ssh/id_rsa
+        shell: bash
+
+      - name: Add bastion's ssh key to known_hosts
+        run: cat environments/.stackhpc/bastion_fingerprints >> ~/.ssh/known_hosts
+        shell: bash
+
+      - name: Install ansible etc
+        run: dev/setup-env.sh
+
+      - name: Write clouds.yaml
+        run: |
+          mkdir -p ~/.config/openstack/
+          echo "${{ secrets[format('{0}_CLOUDS_YAML', env.CI_CLOUD)] }}" > ~/.config/openstack/clouds.yaml
+        shell: bash
+
+      - name: Setup environment
+        run: |
+          . venv/bin/activate
+          . environments/.stackhpc/activate
+
+      - name: Build fat image with packer
+        id: packer_build
+        run: |
+          set -x
+          . venv/bin/activate
+          . environments/.stackhpc/activate
+          cd packer/
+          packer init .
+
+          PACKER_LOG=1 packer build \
+            -on-error=${{ vars.PACKER_ON_ERROR }} \
+            -var-file=$PKR_VAR_environment_root/${{ env.CI_CLOUD }}.pkrvars.hcl \
+            -var "source_image_name=${{ fromJSON(env.FAT_IMAGES)['cluster_image'][matrix.build.source_image_name_key] }}" \
+            -var "image_name=${{ matrix.build.image_name }}" \
+            -var "inventory_groups=${{ matrix.build.inventory_groups }}" \
+            -var "volume_size=${{ matrix.build.volume_size }}" \
+            openstack.pkr.hcl
+
+      - name: Get created image names from manifest
+        id: manifest
+        run: |
+          . venv/bin/activate
+          IMAGE_ID=$(jq --raw-output '.builds[-1].artifact_id' packer/packer-manifest.json)
+          while ! openstack image show -f value -c name $IMAGE_ID; do
+            sleep 5
+          done
+          IMAGE_NAME=$(openstack image show -f value -c name $IMAGE_ID)
+          echo "image-name=${IMAGE_NAME}" >> "$GITHUB_OUTPUT"
+          echo "image-id=$IMAGE_ID" >> "$GITHUB_OUTPUT"
+          echo $IMAGE_ID > image-id.txt
+          echo $IMAGE_NAME > image-name.txt
+
+      - name: Make image usable for further builds
+        run: |
+          . venv/bin/activate
+          openstack image unset --property signature_verified "${{ steps.manifest.outputs.image-id }}"
+
+      - name: Delete image for automatically-run workflows
+        run: |
+          . venv/bin/activate
+          openstack image delete "${{ steps.manifest.outputs.image-id }}"
+        if: ${{ github.event_name != 'workflow_dispatch' }}
+
+      - name: Upload manifest artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: image-details-${{ matrix.build.image_name }}
+          path: |
+            ./image-id.txt
+            ./image-name.txt
+          overwrite: true
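A note on the `Load current fat images into GITHUB_ENV` step in `.github/workflows/extra.yml`: the tfvars JSON spans several lines, so it cannot be written as a plain `NAME=value` line and is wrapped in the `NAME<<DELIMITER` form instead. The sketch below shows that pattern in isolation. The helper name `write_multiline_env`, the scratch file, and the image name are assumptions for illustration and are not part of the workflow.

```shell
# Sketch only: write_multiline_env is a hypothetical helper, not part of the
# workflow. It appends NAME<<EOF ... EOF to an env file, which is the
# delimiter form GitHub Actions documents for multiline values.
write_multiline_env() {
  name="$1"
  value="$2"
  env_file="$3"
  {
    echo "${name}<<EOF"
    printf '%s\n' "${value}"
    echo EOF
  } >> "${env_file}"
}

# Usage against a scratch file rather than the real $GITHUB_ENV.
# The JSON is a made-up stand-in for cluster_image.auto.tfvars.json.
scratch_env=$(mktemp)
write_multiline_env FAT_IMAGES '{"cluster_image": {"RL9": "example-image"}}' "${scratch_env}"
cat "${scratch_env}"
```

A fixed `EOF` delimiter is fine as long as no line of the value is exactly `EOF`; if that could happen, a less likely delimiter would be the safer choice.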
50 changes: 22 additions & 28 deletions .github/workflows/fatimage.yml
@@ -15,36 +15,25 @@ jobs:
   openstack:
     name: openstack-imagebuild
     concurrency:
-      group: ${{ github.workflow }}-${{ github.ref }}-${{ matrix.os_version }}-${{ matrix.build }} # to branch/PR + OS + build
+      group: ${{ github.workflow }}-${{ github.ref }}-${{ matrix.build.image_name }} # to branch/PR + OS
       cancel-in-progress: true
     runs-on: ubuntu-22.04
     strategy:
       fail-fast: false # allow other matrix jobs to continue even if one fails
-      matrix: # build RL8+OFED, RL9+OFED, RL9+OFED+CUDA versions
-        os_version:
-          - RL8
-          - RL9
+      matrix: # build RL8, RL9
         build:
-          - openstack.openhpc
-          - openstack.openhpc-cuda
-        exclude:
-          - os_version: RL8
-            build: openstack.openhpc-cuda
+          - image_name: openhpc-RL8
+            source_image_name: Rocky-8-GenericCloud-Base-8.10-20240528.0.x86_64.qcow2
+            inventory_groups: control,compute,login,update
+          - image_name: openhpc-RL9
+            source_image_name: Rocky-9-GenericCloud-Base-9.5-20241118.0.x86_64.qcow2
+            inventory_groups: control,compute,login,update
     env:
       ANSIBLE_FORCE_COLOR: True
       OS_CLOUD: openstack
       CI_CLOUD: ${{ github.event.inputs.ci_cloud }}
-      SOURCE_IMAGES_MAP: |
-        {
-          "RL8": {
-            "openstack.openhpc": "rocky-latest-RL8",
-            "openstack.openhpc-cuda": "rocky-latest-cuda-RL8"
-          },
-          "RL9": {
-            "openstack.openhpc": "rocky-latest-RL9",
-            "openstack.openhpc-cuda": "rocky-latest-cuda-RL9"
-          }
-        }
+      ARK_PASSWORD: ${{ secrets.ARK_PASSWORD }}
+      LEAFCLOUD_PULP_PASSWORD: ${{ secrets.LEAFCLOUD_PULP_PASSWORD }}

     steps:
       - uses: actions/checkout@v2
@@ -90,13 +79,11 @@ jobs:

           PACKER_LOG=1 packer build \
             -on-error=${{ vars.PACKER_ON_ERROR }} \
-            -only=${{ matrix.build }} \
             -var-file=$PKR_VAR_environment_root/${{ env.CI_CLOUD }}.pkrvars.hcl \
-            -var "source_image_name=${{ env.SOURCE_IMAGE }}" \
+            -var "source_image_name=${{ matrix.build.source_image_name }}" \
+            -var "image_name=${{ matrix.build.image_name }}" \
+            -var "inventory_groups=${{ matrix.build.inventory_groups }}" \
             openstack.pkr.hcl
-        env:
-          PKR_VAR_os_version: ${{ matrix.os_version }}
-          SOURCE_IMAGE: ${{ fromJSON(env.SOURCE_IMAGES_MAP)[matrix.os_version][matrix.build] }}

       - name: Get created image names from manifest
         id: manifest
@@ -107,14 +94,21 @@ jobs:
             sleep 5
           done
           IMAGE_NAME=$(openstack image show -f value -c name $IMAGE_ID)
+          echo "image-name=${IMAGE_NAME}" >> "$GITHUB_OUTPUT"
+          echo "image-id=$IMAGE_ID" >> "$GITHUB_OUTPUT"
           echo $IMAGE_ID > image-id.txt
           echo $IMAGE_NAME > image-name.txt

+      - name: Make image usable for further builds
+        run: |
+          . venv/bin/activate
+          openstack image unset --property signature_verified "${{ steps.manifest.outputs.image-id }}"
+
       - name: Upload manifest artifact
         uses: actions/upload-artifact@v4
         with:
-          name: image-details-${{ matrix.build }}-${{ matrix.os_version }}
+          name: image-details-${{ matrix.build.image_name }}
           path: |
             ./image-id.txt
             ./image-name.txt
-          overwrite: true
+          overwrite: true
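A note on the `Get created image names from manifest` step, which `fatimage.yml` and `extra.yml` share: its `while ! openstack image show ...; do sleep 5; done` loop polls until the image is visible and has no upper bound of its own. The sketch below shows one possible bounded variant. The helper `wait_for` and the attempt count in the commented usage line are assumptions for illustration, not something this PR contains.

```shell
# Sketch only: wait_for is a hypothetical helper, not part of the workflow.
# It reruns a command until it succeeds or max_attempts is reached.
wait_for() {
  max_attempts="$1"
  delay_seconds="$2"
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "gave up after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${delay_seconds}"
  done
}

# In the workflow this might read (untested against a real cloud):
#   wait_for 60 5 openstack image show -f value -c name "$IMAGE_ID"
```

Whether a cap is wanted here is a judgment call; the job-level timeout of the runner already bounds the loop eventually, so this mainly shortens the time to a clear failure message.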
60 changes: 43 additions & 17 deletions .github/workflows/nightly-cleanup.yml
@@ -1,17 +1,8 @@
 name: Cleanup CI clusters
 on:
   workflow_dispatch:
-    inputs:
-      ci_cloud:
-        description: 'Select the CI_CLOUD'
-        required: true
-        type: choice
-        options:
-          - LEAFCLOUD
-          - SMS
-          - ARCUS
   schedule:
-    - cron: '0 20 * * *' # Run at 8PM - image sync runs at midnight
+    - cron: '0 21 * * *' # Run at 9PM - image sync runs at midnight

 jobs:
   ci_cleanup:
@@ -52,20 +43,55 @@ jobs:
       - name: Find CI clusters
         run: |
           . venv/bin/activate
-          CI_CLUSTERS=$(openstack server list | grep --only-matching 'slurmci-RL.-[0-9]\+' | sort | uniq)
-          echo "ci_clusters=${CI_CLUSTERS}" >> GITHUB_ENV
+          CI_CLUSTERS=$(openstack server list | grep --only-matching 'slurmci-RL.-[0-9]\+' | sort | uniq || true)
+          echo "DEBUG: Raw CI clusters: $CI_CLUSTERS"
+
+          if [[ -z "$CI_CLUSTERS" ]]; then
+            echo "No matching CI clusters found."
+          else
+            # Flatten multiline value so can be passed as env var
+            CI_CLUSTERS_FORMATTED=$(echo "$CI_CLUSTERS" | tr '\n' ' ' | sed 's/ $//')
+            echo "DEBUG: Formatted CI clusters: $CI_CLUSTERS_FORMATTED"
+            echo "ci_clusters=$CI_CLUSTERS_FORMATTED" >> $GITHUB_ENV
+          fi
         shell: bash

       - name: Delete clusters if control node not tagged with keep
         run: |
           . venv/bin/activate
-          for cluster_prefix in ${CI_CLUSTERS}
+          if [[ -z ${ci_clusters} ]]; then
+            echo "No clusters to delete."
+            exit 0
+          fi
+
+          for cluster_prefix in ${ci_clusters}
           do
-            TAGS=$(openstack server show ${cluster_prefix}-control --column tags --format value)
-            if [[ $TAGS =~ "keep" ]]; then
-              echo "Skipping ${cluster_prefix} - control instance is tagged as keep"
+            echo "Processing cluster: $cluster_prefix"
+            # Get all servers with the matching name for control node
+            CONTROL_SERVERS=$(openstack server list --name ${cluster_prefix}-control --format json)
+            SERVER_COUNT=$(echo "$CONTROL_SERVERS" | jq length)
+
+            if [[ $SERVER_COUNT -gt 1 ]]; then
+              echo "Multiple servers found for control node '${cluster_prefix}-control'. Checking tags for each..."
+
+              for server in $(echo "$CONTROL_SERVERS" | jq -r '.[].ID'); do
+                # Get tags for each control node
+                TAGS=$(openstack server show "$server" --column tags --format value)
+
+                if [[ $TAGS =~ "keep" ]]; then
+                  echo "Skipping ${cluster_prefix} (server ${server}) - control instance is tagged as keep"
+                else
+                  ./dev/delete-cluster.py ${cluster_prefix} --force
+                fi
+              done
             else
-              yes | ./dev/delete-cluster.py ${cluster_prefix}
+              # If only one server, extract its tags and proceed
+              TAGS=$(echo "$CONTROL_SERVERS" | jq -r '.[0].Tags')
+              if [[ $TAGS =~ "keep" ]]; then
+                echo "Skipping ${cluster_prefix} - control instance is tagged as keep"
+              else
+                ./dev/delete-cluster.py ${cluster_prefix} --force
+              fi
             fi
           done
         shell: bash
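A note on the `Find CI clusters` change in `nightly-cleanup.yml`: the old step appended to a file literally named `GITHUB_ENV` (no `$`), and a `grep` with no matches could fail the step, since the GitHub Actions documentation describes `shell: bash` as running with `-eo pipefail`. The new version adds `|| true` and flattens the list to one line so it fits a single `name=value` entry. The sketch below isolates that filter-and-flatten logic. The function name `flatten_clusters` and the server names are made up for illustration, and `grep -oE` is swapped in for the workflow's `--only-matching` with `\+` purely for portability.

```shell
# Sketch only: flatten_clusters is a hypothetical name. It reads server-list
# text on stdin and prints the unique cluster prefixes on one line.
flatten_clusters() {
  # grep exits 1 when nothing matches; `|| true` keeps that from aborting a
  # script that runs with `set -e -o pipefail`.
  clusters=$(grep -oE 'slurmci-RL.-[0-9]+' | sort | uniq || true)
  if [ -z "${clusters}" ]; then
    return 0
  fi
  # Join the lines with spaces and drop the trailing space.
  printf '%s\n' "${clusters}" | tr '\n' ' ' | sed 's/ $//'
}

# Made-up server names standing in for `openstack server list` output.
printf '%s\n' \
  'slurmci-RL9-123-control' \
  'slurmci-RL9-123-compute-0' \
  'slurmci-RL8-45-control' | flatten_clusters
```

The two RL9 servers collapse to a single `slurmci-RL9-123` prefix, so each cluster is visited once by the deletion loop.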