CI: Add regression test for A100s #93

Open

julianakwan wants to merge 24 commits into develop from enhancement/gpu-regression-test

Conversation

@julianakwan (Contributor) commented Feb 27, 2025

This PR adds the Binary BH regression test to the GitLab CI script and also updates the current test build with the USE_HDF5 = TRUE flag (so that those tests can be carried out as well). I also wrapped the srun commands with flock to prevent more than one job from being submitted to the interactive queue at a time, which would cause the srun to fail.
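
For reference, the locking pattern looks roughly like the sketch below; the lock-file path, partition, and srun arguments are placeholders rather than the exact values used in the CI script:

```bash
# Serialise access to the interactive QOS: flock blocks until the lock is
# free, so concurrent pipelines wait their turn instead of having srun fail.
LOCK_FILE="$HOME/qos_intr.lock"   # placeholder name and location

flock "$LOCK_FILE" \
    srun --partition=ampere --gres=gpu:1 --time=00:30:00 ./run_regression_test.sh
```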

NB: This only closes Issue #95 because I would still like to add a regression test for the PVC build.

@julianakwan julianakwan added the enhancement New feature or request label Feb 27, 2025
@julianakwan julianakwan self-assigned this Feb 27, 2025
@mirenradia mirenradia linked an issue Mar 3, 2025 that may be closed by this pull request
@julianakwan julianakwan changed the title from "Enhancement/gpu regression test" to "CI: Add regression test for A100s" Mar 3, 2025
@mirenradia (Member) left a comment

Thanks Juliana for this PR. I know that you've not had the most fun experience with the GitLab CI/CD pipelines!

Just a couple of minor changes.

Ideally I would like to split this pipeline up into multiple stages (e.g. build and then run) but, if we want to re-use the previously built binaries, we need support for "artifacts". This requires the gitlab-runner command to be available on CSD3, so let's defer this to future work.

@mirenradia (Member)

For some reason, #76 has been added to this PR and I can't seem to remove it.

@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from d607f84 to 81471d6 on March 11, 2025 at 18:35
@julianakwan (Contributor, Author)

Thanks for your review, @mirenradia!

Summary of changes:

  • The version of HDF5 loaded is now compatible with RHEL8. However, it was built with MPI, so I now run the tests with USE_MPI=TRUE. This also means I had to change the GNUmakefile to use the environment variable SLURM_NTASKS if it is defined (the number of ranks was previously hard-coded to 2), and to override the system version of HDF5 by defining HDF5_HOME (see the sketch after this list).
  • I have renamed LOCKED_FILE to QOS_INTR_LOCK_FILE and changed its location to $HOME.
  • I have fixed the location of AMREX_HOME in the case where control flows to the else branch.
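
As a rough sketch of that rank-count logic, expressed in shell rather than in the GNUmakefile itself (the HDF5 path, launcher, and executable name are placeholders):

```bash
# Use the Slurm-provided task count if it is set, otherwise fall back to
# the previous hard-coded value of 2.
NTASKS="${SLURM_NTASKS:-2}"

# Override the system HDF5 with the module-provided, MPI-enabled build.
# The path is a placeholder, not the actual CSD3 module location.
export HDF5_HOME="/path/to/mpi-enabled/hdf5"

# Launch the test on NTASKS MPI ranks (launcher and binary name illustrative).
mpiexec -n "${NTASKS}" ./regression_test.ex params_test.ini
```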

@julianakwan julianakwan requested a review from mirenradia March 12, 2025 22:32
@mirenradia (Member)

I think something has gone wrong with the history on this branch and a load of old commits have been brought in. Would you be able to fix it?

@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from 1539a28 to 209132c on March 17, 2025 at 11:13
@mirenradia (Member)

Now that you've set it up to look at the SLURM_NTASKS environment variable, we should change the number of tasks to 2 in SRUN_FLAGS.
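
i.e. something along the lines of the sketch below, where every flag other than --ntasks=2 is a placeholder for whatever SRUN_FLAGS already contains:

```bash
# Sketch only: request 2 tasks so that SLURM_NTASKS=2 inside the job.
SRUN_FLAGS="--ntasks=2 --partition=ampere --gres=gpu:1 --time=00:30:00"
```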

If we cared about performance, we might look at enabling the Nvidia CUDA Multi-Process Service (MPS), which is available on the CSD3 A100s, but this will require running the command

nvidia-cuda-mps-control -d

in each interactive job and thus require us to change how we launch these interactive jobs. I don't think it's worth bothering with this.

@mirenradia (Member)

@julianakwan, can you unlink #76 from this PR? I think it will autoclose once this is merged.

@julianakwan (Contributor, Author)

> @julianakwan, can you unlink #76 from this PR? I think it will autoclose once this is merged.

For whatever reason, GitHub will not let me unlink #76 (it is greyed out for me). I can unlink #95 but that's the opposite of what we want!

@mirenradia (Member)

Maybe if you edit the original description to remove the "close Issue 76" bit?

@julianakwan (Contributor, Author)

> Now that you've set it up to look at the SLURM_NTASKS environment variable, we should change the number of tasks to 2 in SRUN_FLAGS.

I thought it didn't need 2 tasks so I changed it to use whatever Slurm parameters we were setting! :) No prob, I'll go ahead and update it.

> If we cared about performance, we might look at enabling the Nvidia CUDA Multi-Process Service (MPS), which is available on the CSD3 A100s, but this will require running the command
>
> nvidia-cuda-mps-control -d
>
> in each interactive job and thus require us to change how we launch these interactive jobs. I don't think it's worth bothering with this.

I think the regression test is pretty small, so it's probably not a great representation of our capacity for GPU workloads... I think ideally we would have something like the ExCALIBUR Spack/ReFrame testing as part of our CI.

@mirenradia (Member)

> I thought it didn't need 2 tasks so I changed it to use whatever Slurm parameters we were setting! :) No prob, I'll go ahead and update it.

It doesn't, but we might as well use 2 tasks, particularly for the regression test.

For the unit tests launch line, can you try launching with salloc?
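
For example, assuming salloc is allowed for this QOS (the partition, resources, and script name below are placeholders):

```bash
# Sketch: obtain an allocation with salloc, then launch the unit tests with
# srun inside it. All flags and the script name are illustrative.
salloc --partition=ampere --gres=gpu:1 --time=00:15:00 \
    srun ./run_unit_tests.sh
```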

@mirenradia (Member)

I think we can still stick to 1 GPU (I had assumed this before, hence my suggestion of Nvidia MPS). AMReX will complain, but it might speed up throughput of jobs in the queue.

@mirenradia (Member) commented Mar 21, 2025

I have been investigating the problem of fcompare sometimes not being able to open the plot file (e.g. this run). I think the problem is more general on the CSD3 NFS storage which provides /home and I have opened a ticket with our service desk. A couple of workarounds I can think of:

  1. Add sleep 30 after the srun command which runs the regression test. This improves the reliability but I think it might still fail occasionally.
  2. The service account on CSD3 which runs the CI has access to the dp002 RDS storage. We could try running the regression test there instead. This is Lustre rather than NFS so I wouldn't expect this problem to also affect this storage.

@mirenradia (Member)

Another workaround is to use a while loop to check for the existence of the file and only execute the fcompare bit once it exists. I guess we'd need to put a timeout in so that it doesn't loop for the full 1 hour GitLab CI pipeline time limit if the file never gets written for whatever reason.
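
A minimal sketch of that workaround; the plot-file name, reference file, and timeout are hypothetical:

```bash
# Poll for the plot file (an AMReX plotfile is a directory) and bail out
# after a fixed timeout instead of using up the whole pipeline time limit.
PLOT_FILE="plt00010"        # hypothetical plotfile name
TIMEOUT=300                 # seconds
elapsed=0

while [ ! -d "$PLOT_FILE" ] && [ "$elapsed" -lt "$TIMEOUT" ]; do
    sleep 10
    elapsed=$((elapsed + 10))
done

if [ ! -d "$PLOT_FILE" ]; then
    echo "Timed out waiting for $PLOT_FILE" >&2
    exit 1
fi

./fcompare "$PLOT_FILE" reference_plotfile   # reference name is a placeholder
```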

@julianakwan (Contributor, Author)

> I have been investigating the problem of fcompare sometimes not being able to open the plot file (e.g. this run). I think the problem is more general on the CSD3 NFS storage which provides /home and I have opened a ticket with our service desk.

Did the service desk give you any advice on your ticket?

> 1. Add `sleep 30` after the `srun` command which runs the regression test. This improves the reliability but I think it might still fail occasionally.
>
> 2. The service account on CSD3 which runs the CI has access to the dp002 RDS storage. We could try running the regression test there instead. This is Lustre rather than NFS so I wouldn't expect this problem to also affect this storage.

I can try these in the meantime. I think some combination of 1. and

> Another workaround is to use a while loop to check for the existence of the file and only execute the fcompare bit once it exists.

might work

@mirenradia (Member)

> Did the service desk give you any advice on your ticket?

They suggested using the sync command or the fsync() C function. The former didn't help, and the latter requires using the C file API and having access to a file descriptor, which C++ streams don't give you.

I think we will just have to use some of the workarounds I suggested above.

@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch 2 times, most recently from 6dbbbb4 to 47bf6dd on April 15, 2025 at 14:15
@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from c89c7ef to b4ef2d1 on April 16, 2025 at 10:10
@mirenradia (Member)

I've just installed ccache on the CSD3 service account. Would you be able to add USE_CCACHE=TRUE to the BUILD_CONFIG?
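
i.e. something like the sketch below, where the existing flags are only an illustration of what BUILD_CONFIG might already contain:

```bash
# Sketch only: add USE_CCACHE=TRUE to the existing build options so repeat
# CI builds can reuse cached object files. The other flags are placeholders.
BUILD_CONFIG="USE_CUDA=TRUE USE_MPI=TRUE USE_HDF5=TRUE USE_CCACHE=TRUE"
make -j 8 $BUILD_CONFIG
```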

Commit messages added to this branch:

This will add an environment variable AMREX_HOME for AMReX to be installed so the Plotfile tools can be accessed later on without having to navigate the directory structure.

This will revert back to the original directory structure on Arcus, but I have stored the AMReX directory so I can reference it again later to build the fcompare tool.

This compares the output of params_test.ini with a pre-existing file in the .github/workflows/data directory.

I removed the ls to check the output of the test run. Also, instead of building all AMReX plot tools, I've added project=fcompare to the make command so only fcompare is built.

- Start another job called csd3-dawn
- Only set up the environment modules for now

- Add line to start an interactive session on Dawn
- Build fcompare within interactive session
- Also fix a typo from previous commit, should be projects=...

- Dawn modules are now only loaded at the start of an interactive job

- Interactive jobs via salloc don't seem to be allowed so switching to srun instead

Switch to version not compiled with MPI support because the tests don't use MPI, also turn off HDF5 for the BinaryBH regression test.

Only one job can be submitted to the interactive queue at once, so I've added a file lock on a dummy file, file.lock, such that the srun commands are submitted to the queue one at a time.

This is in case the srun command within flock fails. Also change the location of the locked file to be common to both times flock is called (so I can delete it).

This will also:
  - Build tests with MPI
  - Fix definition of AMREX_HOME
  - Clean up flock and LOCKED_FILE variable

This commit will:
  - Change run to use the number of available MPI ranks as per SLURM_NTASKS. If SLURM_NTASKS is not defined, then leave the number of tasks as 2.
@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from 5d08a3e to 3d4bd57 on April 17, 2025 at 14:06
@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from 3d4bd57 to ae62429 on April 17, 2025 at 14:37
Labels: enhancement (New feature or request)
Linked issue that may be closed by merging this PR: Add regression test to CSD3 A100 pipeline
2 participants