29 changes: 29 additions & 0 deletions .github/actions/show-workflow-trigger/action.yml
@@ -0,0 +1,29 @@
name: 'Show workflow trigger'
description: 'Prints what triggered this workflow'

runs:
  using: "composite"
  steps:
    - name: Print trigger info
      uses: actions/github-script@v7
      with:
        script: |
          const eventName = context.eventName;
          const actor = context.actor || 'unknown'; // Default to 'unknown' if actor is not defined
          let eventAction = 'N/A';

          // Determine the event action based on the event type
          if (eventName === 'pull_request') {
            eventAction = context.payload.action || 'N/A';
          } else if (eventName === 'pull_request_review') {
            eventAction = context.payload.review.state || 'N/A';
          } else if (eventName === 'workflow_dispatch') {
            eventAction = 'manual trigger';
          } else if (eventName === 'schedule') {
            eventAction = 'scheduled trigger';
          }
          const gh_ref = context.ref || 'N/A'; // the ref lives on `context`; `github` is the API client
          console.log(`The job was triggered by a ${eventName} event.`);
          console.log(` - Event action: ${eventAction}`);
          console.log(` - Triggered by: ${actor}`);
          console.log(` - GH ref is: ${gh_ref}`);
123 changes: 123 additions & 0 deletions .github/workflows/README.md
@@ -0,0 +1,123 @@
# Autotester2 (AT2) SNL Workflow for MAM4xx

This document contains a brief description of how AT2 is used to automate
testing on SNL hardware.
Additionally, any helpful notes and TODOs may be kept here to assist developers.

## Overview

AT2 is a Sandia-developed project for automating testing via GitHub Actions to
be run on self-hosted runners on the SNL network.
Part of what AT2 does is control access using information about the repository,
organization, user, etc. obtained via the GitHub API.
This is done for security/policy reasons and ensures that only those with
approved SNL computing accounts can run the CI code on SNL hardware.

### Test Hardware and Compiler Configurations

| Test Name | GPU Brand | GPU Type | Microarchitecture | Compute Capability | Machine | Compilers |
|-|-|-|-|-|-|-|
| gcc_12-3-0_cuda_12-1 | NVIDIA | H100 | Hopper | 9.0 | blake | `gcc` 12.3.0/`nvcc` 12.1.105 |

### The Flow of the CI Workflow

AT2 runs on the target SNL machine and makes a handful of self-hosted runners
available to the MAM4xx repo.
This is all controlled by the **MAM4xx** SNL entity account that is linked to the
**mam4xxSNL** GitHub account.
Each runner stays in a "holding pattern" until it is assigned a job via
GitHub Actions.
The holding pattern pulls the testing image from the AT2 GitLab
repo (if necessary), runs the related container for 3 minutes, and then tears
it down and starts over.
As of now, the image is a UBI 8 system with Spack-installed compilers and
all of the requisite TPLs (third-party libraries) to clone/build/run MAM4xx.

#### Triggering the Testing Workflow

This autotesting workflow is triggered by opening a pull request to `main` and
also by a handful of actions on such a PR that is already open, including:

- `reopened`
- `ready_for_review`
  - I.e., converted to ***Ready for Review*** from ***Draft***
- `synchronize`
  - E.g., pushing a new commit or force pushing after rebase

The workflow may also be run manually by members of the `snl-testing`
team--that is, via

> **Actions** -> `<SNL-AT2 Workflow Run/Job>` -> **Re-run `[all,this]` job(s)**.

The AT2 configuration on `blake` currently attempts to keep 3 runners available
to accept jobs at all times.
This workflow is configured to allow concurrent testing, so up to 3 test-matrix
configurations can run at once.
The concurrency setting is also configured to cancel any active run if another
instance of this workflow is started for the same PR ref.
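
As a rough illustration of why a second push to the same PR cancels the first run, the sketch below builds the same `<workflow>-<ref>` key that the workflow's concurrency group uses. The function name `concurrency_group` and the PR numbers are hypothetical; GitHub evaluates the real key itself, so this is only a model of the grouping.

```shell
# Sketch only: model of the concurrency key "<workflow name>-<git ref>".
# `concurrency_group` is a hypothetical helper, not part of AT2 or GitHub.
concurrency_group() {
  local workflow="$1" ref="$2"
  printf '%s-%s\n' "$workflow" "$ref"
}

# Two runs on the same PR share a key, so the newer run cancels the older one.
first_push=$(concurrency_group "SNL-AT2" "refs/pull/42/merge")
second_push=$(concurrency_group "SNL-AT2" "refs/pull/42/merge")
# A run on a different PR gets its own key and is left alone.
other_pr=$(concurrency_group "SNL-AT2" "refs/pull/43/merge")
```

Runs whose keys differ never cancel each other; they only compete for the available runners.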

##### Other Types of Job Control

- If a PR contains changes to the `.github` directory, a member of the
  `snl-testing-admins` team must add the `CI-AT2_special_approval` tag to the
  PR in order to kick off the autotesting.
- For changes unrelated to the `.github` directory, any PR that is submitted
  by a member of the `snl-testing` team and *contains only commits* from
  members of that team will automatically trigger this autotesting.
- If the PR is submitted by someone who is not a member of the
  `snl-testing` team or contains commits from someone outside of that team,
  an approving review by someone on the `snl-testing` team is required to
  trigger autotesting.
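
The three rules above can be read as a small decision procedure. The sketch below encodes them as written here; it is not AT2's actual implementation. The function name `at2_would_run` and its yes/no arguments are hypothetical, and it assumes the special-approval tag alone is sufficient for `.github` changes.

```shell
# Sketch only: the job-control rules as stated in this README.
# Every argument is "yes" or "no"; all names are hypothetical.
at2_would_run() {
  local touches_dot_github="$1" has_special_tag="$2"
  local author_on_team="$3" all_commits_from_team="$4" approved_by_team="$5"

  # Rule 1: changes under .github need the admin-applied tag.
  if [ "$touches_dot_github" = yes ]; then
    if [ "$has_special_tag" = yes ]; then return 0; else return 1; fi
  fi

  # Rule 2: author and every commit come from the snl-testing team.
  if [ "$author_on_team" = yes ] && [ "$all_commits_from_team" = yes ]; then
    return 0
  fi

  # Rule 3: otherwise an approving review from the team is required.
  if [ "$approved_by_team" = yes ]; then return 0; else return 1; fi
}
```

For example, `at2_would_run no no yes no no` fails because one commit comes from outside the team and no approving review has been given yet.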

###### Disclaimer

The above is according to Mike's current understanding of AT2 and may contain
minor inaccuracies.
This will be updated accordingly upon confirmation.


## Development Details

Most of the required configuration is provided by the AT2 docs and
instructional Confluence page (on the Sandia network :confused:--reach out if
you need access).
However, some non-obvious choices and configurations are listed here.

- To add some info to the testing output, we employ a custom action, cribbed
from E3SM/EAMxx, that prints out the workflow's trigger.

### Hacks

- For whatever reason, Skywalker does not like building in the
  `gcc_12-3-0_cuda_12-1` container for the H100 GPU.
  - This appears to be an issue of the (Haero?) build not auto-detecting the
    correct Compute Capability (CC 9.0 => `sm_90`).
  - To overcome this, we first obtain the CC flag via `nvidia-smi` within the
    testing container.
  - Then, we employ `sed` to manually change the `default_arch="sm_<xyz>"` of
    the Haero-provided `nvcc_wrapper` (`haero_install/bin/nvcc_wrapper`).
  - We follow up with a quick `grep` to confirm this.
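
The detect-then-patch steps above can be sketched as two small shell functions. This is a minimal sketch, not the workflow's exact code: the function names are hypothetical, GNU `sed` is assumed for `-i`, and the pattern matches any `sm_<digits>` value instead of one specific default.

```shell
# Sketch only: hypothetical helpers mirroring the steps described above.

# Map a GPU model string (as printed by
# `nvidia-smi --query-gpu=name --format=csv,noheader`) to its sm_ number.
gpu_name_to_arch() {
  case "$1" in
    *H100*) echo 90 ;;
    *A100*) echo 80 ;;
    *V100*) echo 70 ;;
    *) echo "Unsupported GPU model: $1" >&2; return 1 ;;
  esac
}

# Rewrite the default_arch="sm_<digits>" line in the given nvcc_wrapper,
# then confirm with grep that the new value is actually present.
patch_default_arch() {
  local wrapper="$1" arch="$2"
  sed -i -E "s/default_arch=\"sm_[0-9]+\"/default_arch=\"sm_${arch}\"/" "$wrapper"
  grep -q "default_arch=\"sm_${arch}\"" "$wrapper"
}
```

Inside the testing container, the two would be chained roughly as `arch=$(gpu_name_to_arch "$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)")` followed by `patch_default_arch haero_install/bin/nvcc_wrapper "$arch"`. Because the `grep` result is the function's return status, a wrapper without the expected line makes the step fail.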

### Tokens

- AT2 requires 2 fine-grained tokens for the **mam4xxSNL** account from the
  `eagles-project` GitHub Organization in order to access information related
  to the `mam4xx` repo.
  - One token is used to fetch and read/write runner information.
    - **Expires 11 April 2026**
  - One token is used to fetch and read repository information via the API.
    - **Expires 2 May 2025**
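
Since both tokens expire on fixed dates, a reminder check may be handy. The sketch below computes the number of days between two dates. The helper `days_until` is hypothetical and assumes GNU `date` (for the `-d` option); nothing like it exists in AT2 or this repo.

```shell
# Sketch only: days from a reference date to an expiry date (YYYY-MM-DD).
# `days_until` is a hypothetical helper; it assumes GNU date's -d option.
days_until() {
  local expiry="$1" today="${2:-$(date -u +%F)}"
  local exp_s now_s
  exp_s=$(date -u -d "$expiry" +%s)
  now_s=$(date -u -d "$today" +%s)
  echo $(( (exp_s - now_s) / 86400 ))
}
```

Calling `days_until 2026-04-11` with no second argument measures from the current UTC date; a negative result means the token has already expired.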

## TODO

- [ ] Update job control section of README after the behavior is made clear.
  - @mjschmdt271
- [ ] Include a script to generate plots from within testing container?
  - @jaelynlitz?
- [ ] Unify all CI into a single top-level yaml file that calls the sub-cases.
  - This should provide finer control over what runs and when.
- [ ] Add testing for AMD GPUs on `caraway`.

### Low-priority

- [ ] Add CPU testing on `mappy` because "heck, why not?"
104 changes: 104 additions & 0 deletions .github/workflows/at2_gcc-cuda.yml
@@ -0,0 +1,104 @@
name: gcc_12-3_cuda_12-1

on:
  workflow_call

jobs:
  gcc-cuda:
    runs-on: [self-hosted, m4xci-snl-cuda, cuda, gcc]
    # NOTE: with fail-fast enabled, a failure in one matrix job cancels the others;
    # set fail-fast: false to prioritize extra info over speed
    continue-on-error: false
    strategy:
      fail-fast: true
      matrix:
        build-type: [Debug, Release]
        fp-precision: [single, double]
    name: gcc-cuda / ${{ matrix.build-type }} - ${{ matrix.fp-precision }}
    steps:
      - name: Check out the repository
        uses: actions/checkout@v4
        with:
          persist-credentials: false
          show-progress: false
          submodules: recursive
      - name: Cloning Haero
        uses: actions/checkout@v4
        with:
          repository: eagles-project/haero
          submodules: recursive
          path: haero_src
      - name: Show action trigger
        uses: ./.github/actions/show-workflow-trigger
      - name: Get CUDA Arch
        # NOTE: for now, only running on an H100 machine, but keep anyway
        run: |
          # Ensure nvidia-smi is available
          if ! command -v nvidia-smi &> /dev/null; then
            echo "nvidia-smi could not be found. Please ensure you have Nvidia drivers installed."
            exit 1
          fi

          # Get the GPU model from nvidia-smi, and set env for next step
          gpu_model=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)
          case "$gpu_model" in
            *"H100"*)
              echo "H100 detected--setting Hopper90 architecture"
              echo "Hopper=ON" >> $GITHUB_ENV
              echo "CUDA_ARCH=90" >> $GITHUB_ENV
              ;;
            *"A100"*)
              echo "A100 detected--setting Ampere80 architecture"
              echo "Ampere=ON" >> $GITHUB_ENV
              echo "CUDA_ARCH=80" >> $GITHUB_ENV
              ;;
            *"V100"*)
              echo "V100 detected--setting Volta70 architecture"
              echo "Volta=ON" >> $GITHUB_ENV
              echo "CUDA_ARCH=70" >> $GITHUB_ENV
              ;;
            *)
              echo "Unsupported GPU model: $gpu_model"
              exit 1
              ;;
          esac
      - name: Building Haero (${{ matrix.build-type }}, ${{ matrix.fp-precision }} precision)
        run: |
          cmake -S haero_src -B haero_build \
            -DCMAKE_BUILD_TYPE=${{ matrix.build-type }} \
            -DCMAKE_INSTALL_PREFIX="haero_install" \
            -DCMAKE_C_COMPILER=gcc \
            -DCMAKE_CXX_COMPILER=g++ \
            -DHAERO_ENABLE_MPI=OFF \
            -DHAERO_ENABLE_GPU=ON \
            -DHAERO_PRECISION=${{ matrix.fp-precision }}
          cd haero_build
          make -j
          make install
      - name: Set nvcc_wrapper Arch
        run: |
          sed -i s/default_arch=\"sm_70\"/default_arch=\"sm_"$CUDA_ARCH"\"/g `pwd`/haero_install/bin/nvcc_wrapper
          echo "===================================="
          grep -i "default_arch=" `pwd`/haero_install/bin/nvcc_wrapper
      - name: Configuring MAM4xx (${{ matrix.build-type }}, ${{ matrix.fp-precision }} precision)
        run: |
          cmake -S . -B build \
            -DCMAKE_CXX_COMPILER=`pwd`/haero_install/bin/nvcc_wrapper \
            -DCMAKE_C_COMPILER=gcc \
            -DCMAKE_INSTALL_PREFIX=`pwd`/install \
            -DCMAKE_BUILD_TYPE=${{ matrix.build-type }} \
            -DMAM4XX_HAERO_DIR=`pwd`/haero_install \
            -DNUM_VERTICAL_LEVELS=72 \
            -DENABLE_COVERAGE=OFF \
            -DENABLE_SKYWALKER=ON \
            -DCMAKE_CUDA_ARCHITECTURES=$CUDA_ARCH \
            -G "Unix Makefiles"
      - name: Building MAM4xx (${{ matrix.build-type }}, ${{ matrix.fp-precision }} precision)
        run: |
          cd build
          make
      - name: Running tests (${{ matrix.build-type }}, ${{ matrix.fp-precision }} precision)
        run: |
          cd build
          ctest -V --output-on-failure
36 changes: 36 additions & 0 deletions .github/workflows/at2_snl.yml
@@ -0,0 +1,36 @@
name: SNL-AT2
Review comments on the line above:

Contributor: I wonder if we should name this 'AT' only?

Collaborator (Author): The name of the software product that facilitates this is "AT2 (Autotester2)"--however, what we call it on our side doesn't make a bit of difference to me.

on:
  # Runs on PRs against main
  pull_request:
    branches: [ main ]
    types: [opened, synchronize, ready_for_review, reopened]
    paths:
      # first, yes to these
      - '.github/workflows/at2_snl.yml'
      # directory filters need a trailing /** to match the files inside them
      - 'src/mam4xx/**'
      - 'src/tests/**'
      - 'src/validation/**'
      # second, no to these
      - '!src/tests/data/**'
      # not sure whether this should be disabled--keep for now
      # - '!src/validation/mam_x_validation/**'

  # Manual run
  workflow_dispatch:

  # # Add schedule trigger for nightly runs at midnight MT (Standard Time)
  # schedule:
  #   - cron: '0 7 * * *' # Runs at 7 AM UTC, which is midnight MT during Standard Time

concurrency:
  # Two runs are in the same group if they are testing the same git ref
  # - if trigger=pull_request, the ref is refs/pull/<PR_NUMBER>/merge
  # - for other triggers, the ref is the branch tested
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  gcc-cuda:
    uses:
      ./.github/workflows/at2_gcc-cuda.yml