CI: Add regression test for A100s #93
base: develop
Thanks Juliana for this PR. I know that you've not had the most fun experience with the GitLab CI/CD pipelines!
Just a couple of minor changes.
Ideally I would like to split this pipeline up into multiple stages (e.g. `build` and then `run`) but, if we want to re-use the previously built binaries, we need support for "artifacts". This requires the `gitlab-runner` command to be available on CSD3, so let's defer this for future work.
For some reason, #76 has been added to this PR and I can't seem to remove it.
d607f84
to
81471d6
Compare
Thanks for your review @mirenradia! Summary of changes:
I think something has gone wrong with the history on this branch and a load of old commits have been brought in. Would you be able to fix it?
1539a28
to
209132c
Compare
Now that you've set it up to look at the
If we cared about performance, we might look at enabling the Nvidia CUDA Multi-Process Service (MPS), which is available on the CSD3 A100s, but this would require running the command
in each interactive job and thus require us to change how we launch these interactive jobs. I don't think it's worth bothering with this.
@julianakwan, can you unlink #76 from this PR? I think it will autoclose once this is merged.
For whatever reason, GitHub will not let me unlink #76 (it is greyed out for me). I can unlink #95 but that's the opposite of what we want!
Maybe if you edit the original description to remove the "close Issue 76" bit?
I thought it didn't need 2 tasks so I changed it to use whatever Slurm parameters we were setting! :) No prob, I'll go ahead and update it.
I think the regression test is pretty small, so it's probably not a great representation of our capacity for GPU workloads... I think ideally we would have something like the ExCALIBUR Spack/ReFrame testing as part of our CI.
It doesn't but we might as well use 2 tasks, particularly for the regression test. For the unit tests launch line, can you try launching with
I think we can still stick to 1 GPU (I had assumed this before, hence why I suggested Nvidia MPS)? AMReX will complain but it might speed up throughput of jobs in the queue.
I have been investigating the problem of fcompare sometimes not being able to open the plot file (e.g. this run). I think the problem is more general on the CSD3 NFS storage which provides
Another workaround is to use a while loop to check for existence of the file and only execute the
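The while-loop workaround described above could be sketched roughly as follows. The helper name `wait_for_file`, the default timeout and the poll interval are illustrative assumptions, not taken from the CI script in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): poll until a path exists or a
# timeout expires. Works for plain files and directories alike.
# Usage: wait_for_file <path> [timeout_seconds] [poll_interval_seconds]
# Both numeric arguments are assumed to be whole seconds.
wait_for_file() {
    local path="$1"
    local timeout="${2:-60}"
    local interval="${3:-1}"
    local waited=0

    while [ ! -e "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Timed out after ${timeout}s waiting for ${path}" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

A call such as `wait_for_file "$PLOTFILE" 120 && run_comparison` would then gate the comparison step on the file being visible; `PLOTFILE` and `run_comparison` are placeholders here. Since NFS clients cache directory lookups, this is only a mitigation and may still need to be combined with the other workarounds.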
Did the service desk give you any advice on your ticket?
I can try these in the meantime. I think some combination of 1. and
might work
They suggested using the
I think we will just have to use some of the workarounds I suggested above.
6dbbbb4
to
47bf6dd
Compare
c89c7ef
to
b4ef2d1
Compare
I've just installed
This will add an environment variable AMREX_HOME for AMReX to be installed so the Plotfile tools can be accessed later on without having to navigate the directory structure.
This will revert back to the original directory structure on Arcus but I have stored the AMReX directory so I can reference it again later to build the fcompare tool.
This compares the output of params_test.ini with a pre-existing file in the .github/workflows/data directory
I removed the ls to check the output of the test run. Also, instead of building all AMReX plot tools I've added project=fcompare to the make command so only fcompare is built.
- Start another job called csd3-dawn
- Only set up the environment modules for now

- Add line to start an interactive session on Dawn
- Build fcompare within interactive session
- Also fix a typo from previous commit, should be projects=...

- Dawn modules are now only loaded at the start of an interactive job

- Interactive jobs via salloc don't seem to be allowed so switching to srun instead
Switch to version not compiled with MPI support because the tests don't use MPI, also turn off HDF5 for the BinaryBH regression test.
Only one job can be submitted to the interactive queue at once, so I've added a file lock on a dummy file, file.lock, such that the srun commands are submitted to the queue one at a time
This is in case the srun command within flock fails. Also change the location of the locked file to be common to both times flock is called (so I can delete it).
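The locking described in these commit messages could look something like the sketch below. It assumes the util-linux `flock` utility is available; the wrapper name `run_locked` and the default lock path are illustrative, and only the names `file.lock` and `LOCKED_FILE` come from the commit messages.

```shell
#!/usr/bin/env bash
# Assumed lock location: the commits only mention a dummy file "file.lock"
# and a LOCKED_FILE variable, not where the file lives.
LOCKED_FILE="${LOCKED_FILE:-${TMPDIR:-/tmp}/file.lock}"

# Hypothetical wrapper: run a command while holding an exclusive lock on
# LOCKED_FILE. A second caller blocks until the first command has exited.
# flock creates the lock file if it does not exist and returns the exit
# status of the wrapped command.
run_locked() {
    flock "$LOCKED_FILE" "$@"
}
```

For example, `run_locked srun --ntasks=2 ./some_test` (the `srun` arguments are placeholders) would queue behind any other job holding the same lock. Because the lock is tied to the open file descriptor, it is released when the wrapped command exits, whether it succeeds or fails.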
This will also:
- Build tests with MPI
- Fix definition of AMREX_HOME
- Clean up flock and LOCKED_FILE variable

This commit will:
- Change run to use the number of available MPI ranks as per SLURM_NTASKS. If SLURM_NTASKS is not defined, then leave the number of tasks as 2.
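The fallback on `SLURM_NTASKS` can be expressed with a shell default expansion. This is a minimal sketch; the function name `mpi_ranks` is hypothetical and not taken from the CI script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number of MPI ranks to launch with.
# Uses SLURM_NTASKS when it is set and non-empty, otherwise falls back to 2.
mpi_ranks() {
    echo "${SLURM_NTASKS:-2}"
}
```

The launch line could then pass `"$(mpi_ranks)"` as its task count. Note that `:-` also applies the default when the variable is set but empty, which is slightly broader than "not defined".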
5d08a3e
to
3d4bd57
Compare
3d4bd57
to
ae62429
Compare
This PR adds the Binary BH regression test to the GitLab CI script and also updates the current test build with the `USE_HDF5 = TRUE` flag (so that those tests can be carried out as well). I also wrapped the `srun` commands with `flock` to prevent more than one job being submitted to the interactive queue at a time, which would cause the `srun` to fail.
NB: This only closes Issue #95 because I would still like to add a regression test for the PVC build.