-
Notifications
You must be signed in to change notification settings - Fork 7
CI: Add regression test for A100s #93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Juliana for this PR. I know that you've not had the most fun experience with the GitLab CI/CD pipelines!
Just a couple of minor changes.
Ideally I would like to split this pipeline up into multiple stages (e.g. build
and then run
) but, if we want to re-use the previously built binaries, we need support for "artifacts". This requires the gitlab-runner
command to be available on CSD3 so let's defer this for future work.
For some reason, #76 has been added to this PR and I can't seem to be able to remove it. |
d607f84
to
81471d6
Compare
Thanks for your review @mirenradia ! Summary of changes:
|
I think something has gone wrong with the history on this branch and a load of old commits have been brought in. Would you be able to fix it? |
1539a28
to
209132c
Compare
Now that you've set it up to look at the If we cared about performance, we might look at enabling the Nvidia CUDA Multi-Process Service (MPS) which is available on the CSD3 A100s but this will require running the command
in each interactive job and thus require us to change how we launch these interactive jobs. I don't think it's worth bothering with this. |
@julianakwan, can you unlink #76 from this PR? I think it will autoclose once this is merged. |
For whatever reason, GitHub will not let me unlink #76 (it is greyed out for me). I can unlink #95 but that's the opposite of what we want! |
Maybe if you edit the original description to remove the "close Issue 76" bit? |
I thought it didn't need 2 tasks so I changed it to use whatever Slurm parameters we were setting! :) No prob, I'll go ahead and update it.
I think the regression test is pretty small so probably not a great representation of our capacity for GPU workloads...I think ideally we would have something like the ExCALIBUR Spack/ReFrame testing as part of our CI. |
It doesn't but we might as well use 2 tasks, particularly for the regression test. For the unit tests launch line, can you try launching with |
I think we can still stick to 1 GPU (I had assumed this before hence why I suggested Nvidia MPS)? AMReX will complain but it might speed up throughput of jobs in the queue. |
I have been investigating the problem of fcompare sometimes not being able to open the plot file (e.g. this run). I think the problem is more general on the CSD3 NFS storage which provides
|
Another workaround is to use a while loop to check for existence of the file and only execute the |
Did the service desk give you any advice on your ticket?
I can try these in the meantime. I think some combination of 1. and
might work |
They suggested using the I think we will just have to use [some] of the workarounds I suggested above. |
6dbbbb4
to
47bf6dd
Compare
c89c7ef
to
b4ef2d1
Compare
I've just installed |
To share built items between stages, I've moved them to a folder outside of the usual GitLab CI location. This is because the execs are too large to use GitLab artifacts.
This will also add a formatting check for shell scripts that is consistent with the GitLab style guide as a pre-commit hook
The directory name now contains the CI_PIPELINE_ID variable so subsequent jobs won't overwrite the contents of the directory. I also added an extra line to the test stage to remove the directory afterwards.
I believe BINARY_DIR is a more descriptive name for the location of the executables produced in the test stage. I've also placed the removal of this directory in the specific after_script of the test stage.
748327e
to
96d9c0c
Compare
This will move the definition of BINARY_DIR to the before_script so there is no need for a dotenv artifact. Also I changed the ordering of the while loop so it is hopefully clearer when things go wrong with the filesystem.
e826f82
to
43f4ee3
Compare
I've resolved both of these issues. There is now a separate directory called
|
The output directory is not being deleted because the if statement is never evaluated
3d168a0
to
b5e20b5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you add .gitlab-ci-dawn.yml
in b82aa8d? I don't think it's used elsewhere so, to keep the history clean, can you remove it from this PR (feel free to move it onto a separate branch for when we tackle #96)?
Ideally it would be nice to move the building of the test executable to the "build" stage but, without resorting to using artifacts again, I'm not sure how to do that given that it needs to be run from where it is built in the Tests
subdirectory.
The directory storing the binaries and the outputs weren't being deleted because the after_script doesn't have access to environment variables defined in the before_script and in the script
b5e20b5
to
d2de39f
Compare
Ok, this file should be removed now. I had to delete the local copy of that branch and start again. 😮💨 Because I keep rebasing to clean up my git history, my git log is a bit messed up.
I could move the building of the tests to the Did you update the token? I won't mess around with it if it is going to fail because the token is expired. |
Weird. Hopefully it's fixed now?
Yes, maybe this is a bit better.
Actually, I think the token is only used by GitLab to send the pipeline status to GitHub. If you look at your last commit, the status of the GitLab pipeline is no longer part of the status checks. |
Yeah! I have
Done! The test build is now in the
Thanks - good to know. Yeah, I noticed that it wasn't being tracked anymore. |
Could you remove the trailing whitespaces in |
bc212b5
to
8ece58a
Compare
Fixed.
This belongs in a separate PR. I have two solutions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Sorry about the delay in reviewing this.
This PR adds the Binary BH regression test to the GitLab CI script and also updates the current test build with the
USE_HDF5 = TRUE
flag (so that those tests can be carried out as well). I also wrapped thesrun
commands withflock
to prevent more than one job submitting to the interactive queue at time, which will cause thesrun
to fail.NB: This only closes Issue #95 because I would still like to add a regression test for the PVC build.