CI: Add regression test for A100s #93

Merged
merged 35 commits into develop from enhancement/gpu-regression-test
Jun 25, 2025

Conversation

@julianakwan (Contributor) commented Feb 27, 2025

This PR adds the Binary BH regression test to the GitLab CI script and also updates the current test build with the USE_HDF5=TRUE flag (so that those tests can be carried out as well). I also wrapped the srun commands with flock to prevent more than one job submitting to the interactive queue at a time, which would cause the srun to fail.
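
As a rough sketch of the flock pattern (the lock-file path, SRUN_FLAGS, and executable name below are illustrative rather than the actual CI values):

```bash
# flock takes an exclusive lock on the lock file before running srun, so a
# second pipeline blocks here instead of submitting a competing job to the
# interactive queue. Paths, flags and the executable name are illustrative.
flock "${HOME}/qos_intr.lock" \
    srun ${SRUN_FLAGS} ./Tests3d.gnu.DEBUG.CUDA.MPI.ex
```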

NB: This only closes Issue #95 because I would still like to add a regression test for the PVC build.

@julianakwan julianakwan added the enhancement (New feature or request) label Feb 27, 2025
@julianakwan julianakwan self-assigned this Feb 27, 2025
@mirenradia mirenradia linked an issue Mar 3, 2025 that may be closed by this pull request
@julianakwan julianakwan changed the title from "Enhancement/gpu regression test" to "CI: Add regression test for A100s" Mar 3, 2025
@mirenradia (Member) left a comment

Thanks Juliana for this PR. I know that you've not had the most fun experience with the GitLab CI/CD pipelines!

Just a couple of minor changes.

Ideally I would like to split this pipeline up into multiple stages (e.g. build and then run) but, if we want to re-use the previously built binaries, we need support for "artifacts". This requires the gitlab-runner command to be available on CSD3 so let's defer this for future work.

@mirenradia (Member)

For some reason, #76 has been added to this PR and I can't seem to remove it.

@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from d607f84 to 81471d6 on March 11, 2025 18:35
@julianakwan (Contributor, Author)

Thanks for your review, @mirenradia!

Summary of changes:

  • The version of HDF5 loaded is now compatible with RHEL8. However, it was built with MPI, so I now run the tests with USE_MPI=TRUE. This also meant changing the GNUmakefile to use the environment variable SLURM_NTASKS if it is defined (the number of ranks was previously hard-coded to 2) and overriding the system version of HDF5 by defining HDF5_HOME (see the sketch after this list).
  • I have renamed LOCKED_FILE to QOS_INTR_LOCK_FILE and changed its location to $HOME
  • I have fixed the location of AMREX_HOME for the case where control flows to the else branch
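
As a rough shell sketch of the launch logic in the first bullet (the 2-rank fallback matches the old hard-coded value; the HDF5 path is just a placeholder, not the actual CSD3 module location):

```bash
# Sketch only: use SLURM_NTASKS when Slurm defines it, otherwise fall back to
# the previous hard-coded value of 2 ranks.
NTASKS="${SLURM_NTASKS:-2}"
# Override the system HDF5 with the MPI-enabled module build (placeholder path).
export HDF5_HOME="/path/to/mpi-enabled/hdf5"
srun --ntasks="${NTASKS}" ./Tests3d.gnu.DEBUG.CUDA.MPI.ex
```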

@julianakwan julianakwan requested a review from mirenradia March 12, 2025 22:32
@mirenradia (Member)

I think something has gone wrong with the history on this branch and a load of old commits have been brought in. Would you be able to fix it?

@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from 1539a28 to 209132c on March 17, 2025 11:13
@mirenradia (Member)

Now that you've set it up to look at the SLURM_NTASKS environment variable, we should change the number of tasks to 2 in SRUN_FLAGS.

If we cared about performance, we might look at enabling the Nvidia CUDA Multi-Process Service (MPS) which is available on the CSD3 A100s but this will require running the command

nvidia-cuda-mps-control -d

in each interactive job and thus require us to change how we launch these interactive jobs. I don't think it's worth bothering with this.
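
For reference, a rough sketch of what that would involve, assuming the commands below actually execute on the allocated A100 node (which is exactly the change to how we launch the interactive jobs):

```bash
# Sketch only; rank count and executable name are assumptions, not CI settings.
nvidia-cuda-mps-control -d                        # start the MPS control daemon
srun --ntasks=2 ./Tests3d.gnu.DEBUG.CUDA.MPI.ex   # ranks share the GPU via MPS
echo quit | nvidia-cuda-mps-control               # shut the daemon down afterwards
```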

@mirenradia (Member)

@julianakwan, can you unlink #76 from this PR? I think it will autoclose once this is merged.

@julianakwan (Contributor, Author)

@julianakwan, can you unlink #76 from this PR? I think it will autoclose once this is merged.

For whatever reason, GitHub will not let me unlink #76 (it is greyed out for me). I can unlink #95 but that's the opposite of what we want!

@mirenradia (Member)

Maybe if you edit the original description to remove the "close Issue 76" bit?

@julianakwan (Contributor, Author)

Now that you've set it up to look at the SLURM_NTASKS environment variable, we should change the number of tasks to 2 in SRUN_FLAGS.

I thought it didn't need 2 tasks so I changed it to use whatever Slurm parameters we were setting! :) No prob, I'll go ahead and update it.

If we cared about performance, we might look at enabling the Nvidia CUDA Multi-Process Service (MPS) which is available on the CSD3 A100s but this will require running the command

nvidia-cuda-mps-control -d

in each interactive job and thus require us to change how we launch these interactive jobs. I don't think it's worth bothering with this.

I think the regression test is pretty small, so probably not a great representation of our capacity for GPU workloads... Ideally we would have something like the ExCALIBUR Spack/ReFrame testing as part of our CI.

@mirenradia (Member)

I thought it didn't need 2 tasks so I changed it to use whatever Slurm parameters we were setting! :) No prob, I'll go ahead and update it.

It doesn't but we might as well use 2 tasks, particularly for the regression test.

For the unit tests launch line, can you try launching with salloc?

@mirenradia (Member)

I think we can still stick to 1 GPU (I had assumed this before, hence why I suggested Nvidia MPS). AMReX will complain but it might speed up throughput of jobs in the queue.

@mirenradia (Member) commented Mar 21, 2025

I have been investigating the problem of fcompare sometimes not being able to open the plot file (e.g. this run). I think the problem is more general on the CSD3 NFS storage which provides /home and I have opened a ticket with our service desk. A couple of workarounds I can think of:

  1. Add sleep 30 after the srun command which runs the regression test. This improves the reliability but I think it might still fail occasionally.
  2. The service account on CSD3 which runs the CI has access to the dp002 RDS storage. We could try running the regression test there instead. This is Lustre rather than NFS so I wouldn't expect this problem to also affect this storage.

@mirenradia (Member)

Another workaround is to use a while loop to check for the existence of the file and only execute the fcompare bit once it exists. I guess we'd need to add a timeout so that it doesn't loop for the full 1 h GitLab CI pipeline time limit if the file never gets written for whatever reason.
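
Something along these lines, perhaps (the plot-file name, timeout, and fcompare arguments below are placeholders):

```bash
# Sketch only: wait for the plot file to appear before running fcompare, but
# give up after 10 minutes rather than spinning for the whole pipeline limit.
PLOT_FILE="BinaryBH_plt000002"   # placeholder name
WAITED=0
until [ -e "${PLOT_FILE}" ] || [ "${WAITED}" -ge 600 ]; do
    sleep 10
    WAITED=$((WAITED + 10))
done
if [ ! -e "${PLOT_FILE}" ]; then
    echo "Timed out waiting for ${PLOT_FILE}" >&2
    exit 1
fi
./fcompare "${REFERENCE_PLOT_FILE}" "${PLOT_FILE}"   # reference path is a placeholder
```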

@julianakwan (Contributor, Author)

I have been investigating the problem of fcompare sometimes not being able to open the plot file (e.g. this run). I think the problem is more general on the CSD3 NFS storage which provides /home and I have opened a ticket with our service desk.

Did the service desk give you any advice on your ticket?

1. Add `sleep 30` after the `srun` command which runs the regression test. This improves the reliability but I think it might still fail occasionally.

2. The service account on CSD3 which runs the CI has access to the dp002 RDS storage. We could try running the regression test there instead. This is Lustre rather than NFS so I wouldn't expect this problem to also affect this storage.

I can try these in the meantime. I think some combination of 1. and

Another workaround is to use a while loop to check for existence of the file and only execute the fcompare bit once it exists.

might work

@mirenradia (Member)

Did the service desk give you any advice on your ticket?

They suggested using the sync command or the fsync() C function. The former didn't help, and the latter requires you to be using the C file API and have access to a file descriptor, which C++ doesn't give you.

I think we will just have to use [some] of the workarounds I suggested above.

@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch 2 times, most recently from 6dbbbb4 to 47bf6dd on April 15, 2025 14:15
@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from c89c7ef to b4ef2d1 on April 16, 2025 10:10
@mirenradia (Member)

I've just installed ccache for the CSD3 service account. Would you be able to add USE_CCACHE=TRUE to the BUILD_CONFIG?
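
Presumably something like this, where the other flags in BUILD_CONFIG are guesses based on this thread rather than the actual CI value:

```bash
# Assumed contents for illustration only; the real BUILD_CONFIG lives in the CI script.
BUILD_CONFIG="USE_MPI=TRUE USE_CUDA=TRUE USE_HDF5=TRUE USE_CCACHE=TRUE"
make -j 8 ${BUILD_CONFIG}
```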

To share built items between stages, I've moved them to a folder outside of the usual GitLab CI location. This is because the execs are too large to use GitLab artifacts.

This will also add a formatting check for shell scripts that is consistent with the GitLab style guide as a pre-commit hook.

The directory name now contains the CI_PIPELINE_ID variable so subsequent jobs won't overwrite the contents of the directory. I also added an extra line to the test stage to remove the directory afterwards.

I believe BINARY_DIR is a more descriptive name for the location of the executables produced in the test stage. I've also placed the removal of this directory in the specific after_script of the test stage.
@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from 748327e to 96d9c0c on May 16, 2025 16:01
This will move the definition of BINARY_DIR to the before_script so there is no need for a dotenv artifact. Also I changed the ordering of the while loop so it is hopefully clearer when things go wrong with the filesystem.
@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch 5 times, most recently from e826f82 to 43f4ee3 on May 21, 2025 10:22
@julianakwan (Contributor, Author)

Could you use $CI_PIPELINE_IID (or $CI_PIPELINE_ID) and a100 in the STORE variable so that it's unique for each pipeline (and partition for when we do #96)? I think we should clean this directory in after_script.

Could you also move some of the extra YAML files/scripts into a new .gitlab subdirectory so that the top level directory is a bit cleaner? (I'm sure @KAClough will be happy about another top level directory 😈)

I've resolved both of these issues. There is now a separate directory called .gitlab which contains .gitlab-ci-common.yml.
Also:

  • The plot files are now output to the RDS DiRAC dp002 storage in a special directory called grteclyn-ci-svc. It is write-protected from other members of dp002, so no one should be able to delete our stuff. Each run will output to its own subdirectory called a100-test-$CI_PIPELINE_ID, so named because the job is called a100-test. I propose to use the naming convention <job/stage name>-$CI_PIPELINE_ID but I am open to other suggestions.
  • I have done some refactoring of the environment variables. $STORE has been renamed to $BINARY_DIR. Note that $HOME is undefined under the global variables: section, so any variable that needs it should be defined in the script or before_script section. The refactoring also means that I have removed artifacts altogether ($BINARY_DIR is now defined in the common before_script inside .gitlab/.gitlab-ci-common.yml).
  • I have done some refactoring of the loops to make it clearer that YAML treats each loop as a single line. I removed the - in front and replaced it with the multiline folded block style > character, which replaces each newline with a space. I found this helpful for my own understanding, so maybe it will be of use to other people starting out with CI.
  • I removed the common after_script, since csd3-a100-test needs its own set of clean-up (for the builds and the outputs) and any specific after_script defined within a stage will overwrite the common after_script. (You cannot get it to do a merge; I have checked.) Also note that variables defined in the before_script or script sections cannot be accessed in the after_script, as it is run as a separate instance. I think the variables here could be tidied up, so I am open to suggestions, but I do not believe the after_script will allow artifacts. The tidy-up needs to go in the after_script and not the script section because it needs to remove the stuff we've made regardless of whether the job fails or not.

The output directory is not being deleted because the if statement is never evaluated.
@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from 3d168a0 to b5e20b5 on May 22, 2025 14:08
@mirenradia (Member) left a comment

Why did you add .gitlab-ci-dawn.yml in b82aa8d? I don't think it's used elsewhere so, to keep the history clean, can you remove it from this PR (feel free to move it onto a separate branch for when we tackle #96)?

Ideally it would be nice to move the building of the test executable to the "build" stage but, without resorting to using artifacts again, I'm not sure how to do that given that it needs to be run from where it is built in the Tests subdirectory.

The directory storing the binaries and the outputs wasn't being deleted because the after_script doesn't have access to environment variables defined in the before_script and in the script.
@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from b5e20b5 to d2de39f on June 13, 2025 15:51
@julianakwan (Contributor, Author) commented Jun 13, 2025

Why did you add .gitlab-ci-dawn.yml in b82aa8d? I don't think it's used elsewhere so, to keep the history clean, can you remove it from this PR (feel free to move it onto a separate branch for when we tackle #96)?

Ok, this file should be removed now. I had to delete the local copy of that branch and start again. 😮‍💨 Because I keep rebasing to clean up my git history, my git log is a bit messed up.

Ideally it would be nice to move the building of the test executable to the "build" stage but, without resorting to using artifacts again, I'm not sure how to do that given that it needs to be run from where it is built in the Tests subdirectory.

I could move the building of the tests to the csd3-a100-build stage and then move Tests3d.gnu.DEBUG.CUDA.MPI.ex back to the Tests directory in the csd3-a100-test stage.

Did you update the token? I won't mess around with it if it is going to fail because the token is expired.

@mirenradia (Member)

Ok, this file should be removed now. I had to delete the local copy of that branch and start again. 😮‍💨 Because I keep rebasing to clean up my git history, my git log is a bit messed up.

Weird. Hopefully it's fixed now?

I could move the building of the tests to the csd3-a100-build stage and then move Tests3d.gnu.DEBUG.CUDA.MPI.ex back to the Tests directory in the csd3-a100-test stage.

Yes, maybe this is a bit better.

Did you update the token? I won't mess around with it if it is going to fail because the token is expired.

Actually, I think the token is only used by GitLab to send the pipeline status to GitHub. If you look at your last commit, the status of the GitLab pipeline is no longer part of the status checks.

@julianakwan julianakwan requested a review from mirenradia June 16, 2025 14:58
@julianakwan (Contributor, Author) commented Jun 16, 2025

Ok, this file should be removed now. I had to delete the local copy of that branch and start again. 😮‍💨 Because I keep rebasing to clean up my git history, my git log is a bit messed up.

Weird. Hopefully it's fixed now?

Yeah! I have .gitlab-ci-dawn.yml on a separate branch now.

I could move the building of the tests to the csd3-a100-build stage and then move Tests3d.gnu.DEBUG.CUDA.MPI.ex back to the Tests directory in the csd3-a100-test stage.

Yes, maybe this is a bit better.

Done! The test build is now in the build stage and then the test stage moves it back to the right directory.
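
In outline (simplified, not the exact CI script; BINARY_DIR and the executable name are the ones discussed above):

```bash
# Sketch of the test stage: retrieve the unit-test executable the build stage
# left in ${BINARY_DIR} and run it from the Tests directory it expects.
mv "${BINARY_DIR}/Tests3d.gnu.DEBUG.CUDA.MPI.ex" Tests/
cd Tests
srun ${SRUN_FLAGS} ./Tests3d.gnu.DEBUG.CUDA.MPI.ex
```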

Did you update the token? I won't mess around with it if it is going to fail because the token is expired.

Actually, I think the token is only used by GitLab to send the pipeline status to GitHub. If you look at your last commit, the status of the GitLab pipeline is no longer part of the status checks.

Thanks - good to know. Yeah, I noticed that it wasn't being tracked anymore.

@mirenradia (Member)

Could you remove the trailing whitespaces in .gitlab-ci.yml? clang-format picks this up for C++ source code but I guess we should add a pre-commit/CI check for other files.

@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from bc212b5 to 8ece58a on June 23, 2025 16:40
@julianakwan (Contributor, Author) commented Jun 23, 2025

Could you remove the trailing whitespaces in .gitlab-ci.yml?

Fixed.

clang-format picks this up for C++ source code but I guess we should add a pre-commit/CI check for other files.

This belongs in a separate PR. I have two solutions: test/amrex-style-files and test/yamllint. I have opened a draft PR for each (#123 and #124) and you can pick the one you like the best.

@mirenradia (Member) left a comment

Looks good to me. Sorry about the delay in reviewing this.

@mirenradia mirenradia merged commit 2ef4d7e into develop Jun 25, 2025
56 checks passed
@mirenradia mirenradia deleted the enhancement/gpu-regression-test branch June 25, 2025 10:42
Labels: enhancement (New feature or request)

Successfully merging this pull request may close these issues.

Add regression test to CSD3 A100 pipeline