CI: Add regression test for A100s #93

Open

julianakwan wants to merge 24 commits into develop from enhancement/gpu-regression-test

Conversation

@julianakwan (Contributor) commented Feb 27, 2025

This PR adds the Binary BH regression test to the GitLab CI script and also updates the current test build with the USE_HDF5 = TRUE flag (so that those tests can be carried out as well). I also wrapped the srun commands with flock to prevent more than one job from being submitted to the interactive queue at a time, which would cause the srun to fail.
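
For reference, the locking pattern looks roughly like the sketch below; the lock-file path, partition, and srun arguments are placeholders rather than the exact values used in the CI script:

```bash
# Serialise access to the interactive QOS: flock blocks until the lock is
# free, so concurrent pipelines wait their turn instead of having srun fail.
LOCK_FILE="$HOME/qos_intr.lock"   # placeholder name and location

flock "$LOCK_FILE" \
    srun --partition=ampere --gres=gpu:1 --time=00:30:00 ./run_regression_test.sh
```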

NB: This only closes Issue #95 because I would still like to add a regression test for the PVC build.

@julianakwan julianakwan added the enhancement New feature or request label Feb 27, 2025
@julianakwan julianakwan self-assigned this Feb 27, 2025
@mirenradia mirenradia linked an issue Mar 3, 2025 that may be closed by this pull request
@julianakwan julianakwan changed the title from "Enhancement/gpu regression test" to "CI: Add regression test for A100s" Mar 3, 2025
@mirenradia (Member) left a comment

Thanks Juliana for this PR. I know that you've not had the most fun experience with the GitLab CI/CD pipelines!

Just a couple of minor changes.

Ideally I would like to split this pipeline up into multiple stages (e.g. build and then run) but, if we want to re-use the previously built binaries, we need support for "artifacts". This requires the gitlab-runner command to be available on CSD3, so let's defer this to future work.

@mirenradia (Member)

For some reason, #76 has been added to this PR and I can't seem to remove it.

@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from d607f84 to 81471d6 on March 11, 2025 at 18:35
@julianakwan (Contributor, Author)

Thanks for your review, @mirenradia!

Summary of changes:

  • The version of HDF5 loaded is now compatible with RHEL8. However, it was built with MPI, so I now run the tests with USE_MPI=TRUE. This also means I had to change the GNUmakefile to use the environment variable SLURM_NTASKS if it is defined (the number of ranks was previously hard-coded to 2), and to override the system version of HDF5 by defining HDF5_HOME (see the sketch after this list).
  • I have renamed LOCKED_FILE to QOS_INTR_LOCK_FILE and changed its location to $HOME.
  • I have fixed the location of AMREX_HOME in the case where control flows to the else branch.
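
As a rough sketch of that rank-count logic, expressed in shell rather than in the GNUmakefile itself (the HDF5 path, launcher, and executable name are placeholders):

```bash
# Use the Slurm-provided task count if it is set, otherwise fall back to
# the previous hard-coded value of 2.
NTASKS="${SLURM_NTASKS:-2}"

# Override the system HDF5 with the module-provided, MPI-enabled build.
# The path is a placeholder, not the actual CSD3 module location.
export HDF5_HOME="/path/to/mpi-enabled/hdf5"

# Launch the test on NTASKS MPI ranks (launcher and binary name illustrative).
mpiexec -n "${NTASKS}" ./regression_test.ex params_test.ini
```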

@julianakwan julianakwan requested a review from mirenradia March 12, 2025 22:32
@mirenradia (Member)

I think something has gone wrong with the history on this branch and a load of old commits have been brought in. Would you be able to fix it?

@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from 1539a28 to 209132c on March 17, 2025 at 11:13
@mirenradia (Member)

Now that you've set it up to look at the SLURM_NTASKS environment variable, we should change the number of tasks to 2 in SRUN_FLAGS.
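
i.e. something along the lines of the sketch below, where every flag other than --ntasks=2 is a placeholder for whatever SRUN_FLAGS already contains:

```bash
# Sketch only: request 2 tasks so that SLURM_NTASKS=2 inside the job.
SRUN_FLAGS="--ntasks=2 --partition=ampere --gres=gpu:1 --time=00:30:00"
```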

If we cared about performance, we might look at enabling the Nvidia CUDA Multi-Process Service (MPS), which is available on the CSD3 A100s, but this will require running the command

nvidia-cuda-mps-control -d

in each interactive job and thus require us to change how we launch these interactive jobs. I don't think it's worth bothering with this.

@mirenradia (Member)

@julianakwan, can you unlink #76 from this PR? I think it will autoclose once this is merged.

@julianakwan (Contributor, Author)

> @julianakwan, can you unlink #76 from this PR? I think it will autoclose once this is merged.

For whatever reason, GitHub will not let me unlink #76 (it is greyed out for me). I can unlink #95 but that's the opposite of what we want!

@mirenradia (Member)

Maybe if you edit the original description to remove the "close Issue 76" bit?

@julianakwan (Contributor, Author)

> Now that you've set it up to look at the SLURM_NTASKS environment variable, we should change the number of tasks to 2 in SRUN_FLAGS.

I thought it didn't need 2 tasks so I changed it to use whatever Slurm parameters we were setting! :) No prob, I'll go ahead and update it.

> If we cared about performance, we might look at enabling the Nvidia CUDA Multi-Process Service (MPS), which is available on the CSD3 A100s, but this will require running the command
>
> nvidia-cuda-mps-control -d
>
> in each interactive job and thus require us to change how we launch these interactive jobs. I don't think it's worth bothering with this.

I think the regression test is pretty small, so it's probably not a great representation of our capacity for GPU workloads... I think ideally we would have something like the ExCALIBUR Spack/ReFrame testing as part of our CI.

@mirenradia (Member)

> I thought it didn't need 2 tasks so I changed it to use whatever Slurm parameters we were setting! :) No prob, I'll go ahead and update it.

It doesn't, but we might as well use 2 tasks, particularly for the regression test.

For the unit tests launch line, can you try launching with salloc?
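
For example, assuming salloc is allowed for this QOS (the partition, resources, and script name below are placeholders):

```bash
# Sketch: obtain an allocation with salloc, then launch the unit tests with
# srun inside it. All flags and the script name are illustrative.
salloc --partition=ampere --gres=gpu:1 --time=00:15:00 \
    srun ./run_unit_tests.sh
```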

@mirenradia (Member)

I think we can still stick to 1 GPU (I had assumed this before, hence my suggestion of Nvidia MPS). AMReX will complain, but it might speed up throughput of jobs in the queue.

@mirenradia (Member) commented Mar 21, 2025

I have been investigating the problem of fcompare sometimes not being able to open the plot file (e.g. this run). I think the problem is more general on the CSD3 NFS storage which provides /home and I have opened a ticket with our service desk. A couple of workarounds I can think of:

  1. Add sleep 30 after the srun command which runs the regression test. This improves the reliability but I think it might still fail occasionally.
  2. The service account on CSD3 which runs the CI has access to the dp002 RDS storage. We could try running the regression test there instead. This is Lustre rather than NFS so I wouldn't expect this problem to also affect this storage.

@mirenradia (Member)

Another workaround is to use a while loop to check for the existence of the file and only execute the fcompare bit once it exists. I guess we'd need to put a timeout in so that it doesn't loop for the full 1 hour GitLab CI pipeline time limit if the file never gets written for whatever reason.
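
A minimal sketch of that workaround; the plot-file name, reference file, and timeout are hypothetical:

```bash
# Poll for the plot file (an AMReX plotfile is a directory) and bail out
# after a fixed timeout instead of using up the whole pipeline time limit.
PLOT_FILE="plt00010"        # hypothetical plotfile name
TIMEOUT=300                 # seconds
elapsed=0

while [ ! -d "$PLOT_FILE" ] && [ "$elapsed" -lt "$TIMEOUT" ]; do
    sleep 10
    elapsed=$((elapsed + 10))
done

if [ ! -d "$PLOT_FILE" ]; then
    echo "Timed out waiting for $PLOT_FILE" >&2
    exit 1
fi

./fcompare "$PLOT_FILE" reference_plotfile   # reference name is a placeholder
```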

@julianakwan (Contributor, Author)

> I have been investigating the problem of fcompare sometimes not being able to open the plot file (e.g. this run). I think the problem is more general on the CSD3 NFS storage which provides /home and I have opened a ticket with our service desk.

Did the service desk give you any advice on your ticket?

> 1. Add `sleep 30` after the `srun` command which runs the regression test. This improves the reliability but I think it might still fail occasionally.
>
> 2. The service account on CSD3 which runs the CI has access to the dp002 RDS storage. We could try running the regression test there instead. This is Lustre rather than NFS so I wouldn't expect this problem to also affect this storage.

I can try these in the meantime. I think some combination of 1. and

> Another workaround is to use a while loop to check for the existence of the file and only execute the fcompare bit once it exists.

might work

@mirenradia (Member)

> Did the service desk give you any advice on your ticket?

They suggested using the sync command or the fsync() C function. The former didn't help, and the latter requires using the C file API and having access to a file descriptor, which C++ streams don't give you.

I think we will just have to use some of the workarounds I suggested above.

@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch 2 times, most recently from 6dbbbb4 to 47bf6dd on April 15, 2025 at 14:15
@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from c89c7ef to b4ef2d1 on April 16, 2025 at 10:10
@mirenradia (Member)

I've just installed ccache on the CSD3 service account. Would you be able to add USE_CCACHE=TRUE to the BUILD_CONFIG?
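
i.e. something like the sketch below, where the existing flags are only an illustration of what BUILD_CONFIG might already contain:

```bash
# Sketch only: add USE_CCACHE=TRUE to the existing build options so repeat
# CI builds can reuse cached object files. The other flags are placeholders.
BUILD_CONFIG="USE_CUDA=TRUE USE_MPI=TRUE USE_HDF5=TRUE USE_CCACHE=TRUE"
make -j 8 $BUILD_CONFIG
```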

Commit messages added to this branch:

This will add an environment variable AMREX_HOME for AMReX to be installed so the Plotfile tools can be accessed later on without having to navigate the directory structure.

This will revert back to the original directory structure on Arcus, but I have stored the AMReX directory so I can reference it again later to build the fcompare tool.

This compares the output of params_test.ini with a pre-existing file in the .github/workflows/data directory.

I removed the ls to check the output of the test run. Also, instead of building all AMReX plot tools, I've added project=fcompare to the make command so only fcompare is built.

- Start another job called csd3-dawn
- Only set up the environment modules for now

- Add line to start an interactive session on Dawn
- Build fcompare within interactive session
- Also fix a typo from previous commit, should be projects=...

- Dawn modules are now only loaded at the start of an interactive job

- Interactive jobs via salloc don't seem to be allowed so switching to srun instead

Switch to version not compiled with MPI support because the tests don't use MPI, also turn off HDF5 for the BinaryBH regression test.

Only one job can be submitted to the interactive queue at once, so I've added a file lock on a dummy file, file.lock, such that the srun commands are submitted to the queue one at a time.

This is in case the srun command within flock fails. Also change the location of the locked file to be common to both times flock is called (so I can delete it).

This will also:
  - Build tests with MPI
  - Fix definition of AMREX_HOME
  - Clean up flock and LOCKED_FILE variable

This commit will:
  - Change run to use the number of available MPI ranks as per SLURM_NTASKS. If SLURM_NTASKS is not defined, then leave the number of tasks as 2.
@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from 5d08a3e to 3d4bd57 on April 17, 2025 at 14:06
@julianakwan julianakwan force-pushed the enhancement/gpu-regression-test branch from 3d4bd57 to ae62429 on April 17, 2025 at 14:37
Labels: enhancement (New feature or request)
Linked issue that may be closed by merging this PR: Add regression test to CSD3 A100 pipeline
2 participants