Sets Up Autotesting on SNL Machines #426
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| name: 'Show workflow trigger' | ||
| description: 'Prints what triggered this workflow' | ||
|
|
||
| runs: | ||
| using: "composite" | ||
| steps: | ||
| - name: Print trigger info | ||
| uses: actions/github-script@v7 | ||
| with: | ||
| script: | | ||
| const eventName = context.eventName; | ||
| const actor = context.actor || 'unknown'; // Default to 'unknown' if actor is not defined | ||
| let eventAction = 'N/A'; | ||
|
|
||
| // Determine the event action based on the event type | ||
| if (eventName === 'pull_request') { | ||
| eventAction = context.payload.action || 'N/A'; | ||
| } else if (eventName === 'pull_request_review') { | ||
| eventAction = context.payload.review.state || 'N/A'; | ||
| } else if (eventName === 'workflow_dispatch') { | ||
| eventAction = 'manual trigger'; | ||
| } else if (eventName === 'schedule') { | ||
| eventAction = 'scheduled trigger'; | ||
| } | ||
| const gh_ref = context.ref || 'N/A'; | ||
| console.log(`The job was triggered by a ${eventName} event.`); | ||
| console.log(` - Event action: ${eventAction}`); | ||
| console.log(` - Triggered by: ${actor}`); | ||
| console.log(` - GH ref is: ${gh_ref}`); |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,123 @@ | ||
| # Autotester2 (AT2) SNL Workflow for MAM4xx | ||
|
|
||
| This document contains a brief description of how AT2 is used to automate | ||
| testing on SNL hardware. | ||
| Additionally, any helpful notes and TODOs may be kept here to assist developers. | ||
|
|
||
| ## Overview | ||
|
|
||
| AT2 is a Sandia-developed project for automating testing via GitHub Actions to | ||
| be run on self-hosted runners on the SNL network. | ||
| Part of what AT2 does is control access using information about the repository, | ||
| organization, user, etc. obtained via the GitHub API. | ||
| This is done for security/policy reasons and ensures that only those with | ||
| approved SNL computing accounts can run the CI code on SNL hardware. | ||
|
|
||
| ### Test Hardware and Compiler Configurations | ||
|
|
||
| | Test Name | GPU Brand | GPU Type | Microarchitecture | Compute Capability | Machine | Compilers | | ||
| |-|-|-|-|-|-|-| | ||
| | gcc_12-3-0_cuda_12-1 | NVIDIA | H100 | Hopper | 9.0 | blake | `gcc` 12.3.0/`nvcc` 12.1.105 | | ||
|
|
||
| ### The Flow of the CI Workflow | ||
|
|
||
| AT2 runs on the target SNL machine and makes a handful of self-hosted runners | ||
| available to the MAM4xx repo. | ||
| This is all controlled by the **MAM4xx** SNL entity account that is linked to the | ||
| **mam4xxSNL** GitHub account. | ||
| Each runner stays in a "holding pattern" until it is assigned a job via | ||
| GitHub Actions. | ||
| The holding pattern pulls the testing image from the AT2 GitLab | ||
| repo (if necessary), runs the related container for 3 minutes, and then tears down and | ||
| starts over. | ||
| As of now, the image is of a UBI 8 system, with Spack-installed compilers and | ||
| all of the requisite TPLs to clone/build/run MAM4xx. | ||
|
|
||
| #### Triggering the Testing Workflow | ||
|
|
||
| This autotesting workflow is triggered by opening a pull request to `main` and | ||
| also by a handful of actions on such a PR that is already open, including: | ||
|
|
||
| - `reopened` | ||
| - `ready_for_review` | ||
| - I.e., converted to ***Ready for Review*** from ***Draft*** | ||
| - `synchronize` | ||
| - E.g., pushing a new commit or force pushing after rebase | ||
|
|
||
| The workflow may also be run manually by members of the `snl-testing` | ||
| team--that is, via | ||
|
|
||
| > **Actions** -> `<SNL-AT2 Workflow Run/Job>` -> **Re-run `[all,this]` job(s)**. | ||
|
|
||
| The AT2 configuration on `blake` currently attempts to keep 3 runners available | ||
| to accept jobs at all times. | ||
| This workflow is configured to allow concurrent testing, so up to 3 test-matrix | ||
| configurations can run at once. | ||
| The concurrency setting is also configured to kill any active job if another | ||
| instance of this workflow is started for the same PR ref. | ||
|
|
||
| ##### Other Types of Job Control | ||
|
|
||
| - If a PR contains changes to the `.github` directory, a member of the | ||
| `snl-testing-admins` team must add the `CI-AT2_special_approval` tag to the | ||
| PR in order to kick off the autotesting. | ||
| - For changes unrelated to the `.github` directory, any PR that is submitted | ||
| by a member of the `snl-testing` team and *contains only commits* from | ||
| members of that team will automatically trigger this autotesting. | ||
| - In the case that the PR is submitted by someone who is not a member of the | ||
| `snl-testing` team or contains commits from someone outside of that team, | ||
| an approving review by someone on the `snl-testing` team is required to | ||
| trigger autotesting. | ||
|
|
||
| ###### Disclaimer | ||
|
|
||
| The above is according to Mike's current understanding of AT2 and may contain | ||
| minor inaccuracies. | ||
| This will be updated accordingly upon confirmation. | ||
|
|
||
|
|
||
| ## Development Details | ||
|
|
||
| Most of the required configuration is provided by the AT2 docs and | ||
| instructional Confluence page (on the Sandia network :confused:--reach out if | ||
| you need access). | ||
| However, some non-obvious choices and configurations are listed here. | ||
|
|
||
| - To add some info to the testing output, we employ a custom action, cribbed | ||
| from E3SM/EAMxx, that prints out the workflow's trigger. | ||
|
|
||
| ### Hacks | ||
|
|
||
| - For whatever reason, Skywalker does not like building in the | ||
| `gcc_12-3-0_cuda_12-1` container for the H100 GPU. | ||
| - This appears to be an issue of the (Haero?) build not auto-detecting the | ||
| correct Compute Capability (CC 9.0 => `sm_90`). | ||
| - To overcome this, we first obtain the CC flag via `nvidia-smi` within the | ||
| testing container. | ||
| - Then, we employ `sed` to manually change the `default_arch="sm_<xyz>"` of | ||
| the Haero-provided `nvcc_wrapper` (`haero_install/bin/nvcc_wrapper`). | ||
| - We follow up with a quick `grep` to confirm this. | ||
|
|
||
| ### Tokens | ||
|
|
||
| - AT2 requires 2 fine-grained tokens for the **mam4xxSNL** account from the | ||
| `eagles-project` GitHub Organization in order to access information related | ||
| to the `mam4xx` repo. | ||
| - One token used to fetch and read/write runner information. | ||
| - **Expires 11 April 2026** | ||
| - One token used to fetch and read repository information via the API. | ||
| - **Expires 2 May 2025** | ||
|
|
||
| ## TODO | ||
|
|
||
| - [ ] Update job control section of README after the behavior is made clear. | ||
| - @mjschmdt271 | ||
| - [ ] Include a script to generate plots from within testing container? | ||
| - @jaelynlitz? | ||
| - [ ] Unify all CI into a single top-level yaml file that calls the sub-cases. | ||
| - This should provide finer control over what runs and when. | ||
| - [ ] Add testing for AMD GPUs on `caraway`. | ||
|
|
||
| ### Low-priority | ||
|
|
||
| - [ ] Add CPU testing on `mappy` because "heck, why not?" |
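The job-control rules in the README above can be read as a small decision table. The sketch below is one possible encoding of them, subject to the README's own disclaimer that the description may be inaccurate. The function name `at2_gate` and its yes/no arguments are hypothetical and are not part of AT2.

```shell
#!/bin/sh
# Sketch only: at2_gate is a hypothetical helper, not an AT2 command.
# Arguments (each "yes" or "no"):
#   $1  PR touches the .github directory
#   $2  PR author is a member of snl-testing
#   $3  every commit in the PR is from a member of snl-testing
# Prints what is needed before autotesting starts:
#   label   an snl-testing-admins member must add CI-AT2_special_approval
#   auto    testing starts automatically
#   review  an approving review from an snl-testing member is required
at2_gate() {
  touches_github="$1"
  author_in_team="$2"
  commits_in_team="$3"
  if [ "$touches_github" = "yes" ]; then
    echo "label"
  elif [ "$author_in_team" = "yes" ] && [ "$commits_in_team" = "yes" ]; then
    echo "auto"
  else
    echo "review"
  fi
}

# Example: a team member's PR that includes a commit from an outside contributor.
at2_gate no yes no   # prints "review"
```

Checking the `.github` case first reflects the order in which the README lists the rules. Whether AT2 itself evaluates them in that order is not confirmed here.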
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,104 @@ | ||
| name: gcc_12-3_cuda_12-1 | ||
|
|
||
| on: | ||
| workflow_call | ||
|
|
||
| jobs: | ||
| gcc-cuda: | ||
| runs-on: [self-hosted, m4xci-snl-cuda, cuda, gcc] | ||
| # NOTE: with fail-fast: true, a failure in one matrix job cancels the remaining jobs | ||
| # set fail-fast: false to prioritize extra info over speed, if that makes more sense | ||
| continue-on-error: false | ||
| strategy: | ||
| fail-fast: true | ||
| matrix: | ||
| build-type: [Debug, Release] | ||
| fp-precision: [single, double] | ||
| name: gcc-cuda / ${{ matrix.build-type }} - ${{ matrix.fp-precision }} | ||
| steps: | ||
| - name: Check out the repository | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| persist-credentials: false | ||
| show-progress: false | ||
| submodules: recursive | ||
| - name: Cloning Haero | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| repository: eagles-project/haero | ||
| submodules: recursive | ||
| path: haero_src | ||
| - name: Show action trigger | ||
| uses: ./.github/actions/show-workflow-trigger | ||
| - name: Get CUDA Arch | ||
| # NOTE: for now, only running on an H100 machine, but keep anyway | ||
| run: | | ||
| # Ensure nvidia-smi is available | ||
| if ! command -v nvidia-smi &> /dev/null; then | ||
| echo "nvidia-smi could not be found. Please ensure you have Nvidia drivers installed." | ||
| exit 1 | ||
| fi | ||
|
|
||
| # Get the GPU model from nvidia-smi, and set env for next step | ||
| gpu_model=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1) | ||
| case "$gpu_model" in | ||
| *"H100"*) | ||
| echo "H100 detected--setting Hopper90 architecture" | ||
| echo "Hopper=ON" >> $GITHUB_ENV | ||
| echo "CUDA_ARCH=90" >> $GITHUB_ENV | ||
| ARCH=90 | ||
| ;; | ||
| *"A100"*) | ||
| echo "A100 detected--setting Ampere80 architecture" | ||
| echo "Ampere=ON" >> $GITHUB_ENV | ||
| echo "CUDA_ARCH=80" >> $GITHUB_ENV | ||
| ;; | ||
| *"V100"*) | ||
| echo "V100 detected--setting Volta70 architecture" | ||
| echo "Volta=ON" >> $GITHUB_ENV | ||
| echo "CUDA_ARCH=70" >> $GITHUB_ENV | ||
| ;; | ||
| *) | ||
| echo "Unsupported GPU model: $gpu_model" | ||
| exit 1 | ||
| ;; | ||
| esac | ||
| - name: Building Haero (${{ matrix.build-type }}, ${{ matrix.fp-precision }} precision) | ||
| run: | | ||
| cmake -S haero_src -B haero_build \ | ||
| -DCMAKE_BUILD_TYPE=${{ matrix.build-type }} \ | ||
| -DCMAKE_INSTALL_PREFIX="haero_install" \ | ||
| -DCMAKE_C_COMPILER=gcc \ | ||
| -DCMAKE_CXX_COMPILER=g++ \ | ||
| -DHAERO_ENABLE_MPI=OFF \ | ||
| -DHAERO_ENABLE_GPU=ON \ | ||
| -DHAERO_PRECISION=${{ matrix.fp-precision }} | ||
| cd haero_build | ||
| make -j | ||
| make install | ||
| - name: Set nvcc_wrapper Arch | ||
| run: | | ||
| sed -i s/default_arch=\"sm_70\"/default_arch=\"sm_"$CUDA_ARCH"\"/g `pwd`/haero_install/bin/nvcc_wrapper | ||
| echo "====================================" | ||
| grep -i "default_arch=" `pwd`/haero_install/bin/nvcc_wrapper | ||
| - name: Configuring MAM4xx (${{ matrix.build-type }}, ${{ matrix.fp-precision }} precision) | ||
| run: | | ||
| cmake -S . -B build \ | ||
| -DCMAKE_CXX_COMPILER=`pwd`/haero_install/bin/nvcc_wrapper \ | ||
| -DCMAKE_C_COMPILER=gcc \ | ||
| -DCMAKE_INSTALL_PREFIX=`pwd`/install \ | ||
| -DCMAKE_BUILD_TYPE=${{ matrix.build-type }} \ | ||
| -DMAM4XX_HAERO_DIR=`pwd`/haero_install \ | ||
| -DNUM_VERTICAL_LEVELS=72 \ | ||
| -DENABLE_COVERAGE=OFF \ | ||
| -DENABLE_SKYWALKER=ON \ | ||
| -DCMAKE_CUDA_ARCHITECTURES=$CUDA_ARCH \ | ||
| -G "Unix Makefiles" | ||
| - name: Building MAM4xx (${{ matrix.build-type }}, ${{ matrix.fp-precision }} precision) | ||
| run: | | ||
| cd build | ||
| make | ||
| - name: Running tests (${{ matrix.build-type }}, ${{ matrix.fp-precision }} precision) | ||
| run: | | ||
| cd build | ||
| ctest -V --output-on-failure |
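The `Set nvcc_wrapper Arch` step above only rewrites the wrapper when its default is exactly `sm_70`. A slightly more general variant, matching the README's `default_arch="sm_<xyz>"` wording, might look like the sketch below. The helper names `cc_to_sm` and `set_default_arch` are hypothetical, and the demo runs against a stand-in file rather than a real `nvcc_wrapper`.

```shell
#!/bin/sh
# Sketch only: cc_to_sm and set_default_arch are hypothetical helper names.

# Convert a compute capability such as "9.0" into an nvcc arch string ("sm_90").
# On a GPU node the input could come from something like
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1
# assuming the installed driver supports the compute_cap query field.
cc_to_sm() {
  echo "sm_$(printf '%s' "$1" | tr -d '.')"
}

# Rewrite default_arch="sm_<anything>" in the given wrapper script,
# then confirm the new value is present (non-zero exit status if not).
set_default_arch() {
  wrapper="$1"
  arch="$2"
  tmp="$(mktemp)"
  sed "s/default_arch=\"sm_[0-9a-z]*\"/default_arch=\"${arch}\"/" "$wrapper" > "$tmp"
  # Overwrite in place so the wrapper keeps its permissions.
  cat "$tmp" > "$wrapper"
  rm -f "$tmp"
  grep -q "default_arch=\"${arch}\"" "$wrapper"
}

# Demo on a stand-in file with the default the workflow expects.
demo_wrapper="$(mktemp)"
echo 'default_arch="sm_70"' > "$demo_wrapper"
set_default_arch "$demo_wrapper" "$(cc_to_sm 9.0)"
cat "$demo_wrapper"   # prints: default_arch="sm_90"
```

Matching any existing `sm_` value means the step would keep working if the wrapper's default changed. The trailing `grep -q` makes the step fail if the substitution did not take effect, instead of continuing silently.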
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| name: SNL-AT2 | ||
|
|
||
| on: | ||
| # Runs on PRs against main | ||
| pull_request: | ||
| branches: [ main ] | ||
| types: [opened, synchronize, ready_for_review, reopened] | ||
| paths: | ||
| # first, yes to these | ||
| - '.github/workflows/at2_snl.yml' | ||
| - 'src/mam4xx/**' | ||
| - 'src/tests/**' | ||
| - 'src/validation/**' | ||
| # second, no to these | ||
| - '!src/tests/data/**' | ||
| # not sure whether this should be disabled--keep for now | ||
| # - '!src/validation/mam_x_validation/**' | ||
|
|
||
| # Manual run | ||
| workflow_dispatch: | ||
|
|
||
| # # Add schedule trigger for nightly runs at midnight MT (Standard Time) | ||
| # schedule: | ||
| # - cron: '0 7 * * *' # Runs at 7 AM UTC, which is midnight MT during Standard Time | ||
|
|
||
| concurrency: | ||
| # Two runs are in the same group if they are testing the same git ref | ||
| # - if trigger=pull_request, the ref is refs/pull/<PR_NUMBER>/merge | ||
| # - for other triggers, the ref is the branch tested | ||
| group: ${{ github.workflow }}-${{ github.ref }} | ||
| cancel-in-progress: true | ||
|
|
||
| jobs: | ||
| gcc-cuda: | ||
| uses: | ||
| ./.github/workflows/at2_gcc-cuda.yml | ||
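To illustrate the `concurrency` block above: the group key is the workflow name joined to the git ref, so two runs for the same PR fall into the same group and, with `cancel-in-progress: true`, the newer run cancels the older one. The sketch below only mimics how the key string is built. The function name `concurrency_group` is hypothetical, and PR number 427 is made up for contrast.

```shell
#!/bin/sh
# Sketch only: mimics how "${{ github.workflow }}-${{ github.ref }}" expands.
concurrency_group() {
  printf '%s-%s' "$1" "$2"
}

# Two runs for the same PR (e.g., opened, then a new push) ...
run_a="$(concurrency_group "SNL-AT2" "refs/pull/426/merge")"
run_b="$(concurrency_group "SNL-AT2" "refs/pull/426/merge")"
# ... and a run for a different, hypothetical PR.
run_c="$(concurrency_group "SNL-AT2" "refs/pull/427/merge")"

if [ "$run_a" = "$run_b" ]; then
  echo "same group: the newer run cancels the older one"
fi
if [ "$run_a" != "$run_c" ]; then
  echo "different groups: the runs proceed independently"
fi
```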
I wonder if we should name this 'AT' only?
The name of the software product that facilitates this is "AT2 (Autotester2)"--however, what we call it on our side doesn't make a bit of difference to me