eagles-project · mam4xxSNL · Apr 16, 2025 · Apr 15, 2025 · Apr 15, 2025 · Apr 16, 2025
diff --git a/.github/actions/show-workflow-trigger/action.yml b/.github/actions/show-workflow-trigger/action.yml
@@ -0,0 +1,29 @@
+name: 'Show workflow trigger'
+description: 'Prints what triggered this workflow'
+
+runs:
+  using: "composite"
+  steps:
+    - name: Print trigger info
+      uses: actions/github-script@v7
+      with:
+        script: |
+          const eventName = context.eventName;
+          const actor = context.actor || 'unknown';  // Default to 'unknown' if actor is not defined
+          let eventAction = 'N/A';
+
+          // Determine the event action based on the event type
+          if (eventName === 'pull_request') {
+            eventAction = context.payload.action || 'N/A';
+          } else if (eventName === 'pull_request_review') {
+            eventAction = context.payload.review.state || 'N/A';
+          } else if (eventName === 'workflow_dispatch') {
+            eventAction = 'manual trigger';
+          } else if (eventName === 'schedule') {
+            eventAction = 'scheduled trigger';
+          }
+          const gh_ref = context.ref || 'N/A';  // the ref lives on `context`; `github` is the Octokit client
+          console.log(`The job was triggered by a ${eventName} event.`);
+          console.log(`  - Event action: ${eventAction}`);
+          console.log(`  - Triggered by: ${actor}`);
+          console.log(`  - GH ref is:    ${gh_ref}`);
diff --git a/.github/workflows/README.md b/.github/workflows/README.md
@@ -0,0 +1,123 @@
+# Autotester2 (AT2) SNL Workflow for MAM4xx
+
+This document contains a brief description of how AT2 is used to automate
+testing on SNL hardware.
+Additionally, any helpful notes and TODOs may be kept here to assist developers.
+
+## Overview
+
+AT2 is a Sandia-developed project for automating testing via GitHub Actions to
+be run on self-hosted runners on the SNL network.
+Part of what AT2 does is control access using information about the repository,
+organization, user, etc. obtained via the GitHub API.
+This is done for security/policy reasons and ensures that only those with
+approved SNL computing accounts can run the CI code on SNL hardware.
+
+### Test Hardware and Compiler Configurations
+
+| Test Name | GPU Brand | GPU Type | Microarchitecture | Compute Capability | Machine | Compilers |
+|-|-|-|-|-|-|-|
+| gcc_12-3-0_cuda_12-1 | NVIDIA | H100 | Hopper | 9.0 | blake | `gcc` 12.3.0/`nvcc` 12.1.105 |
+
+### The Flow of the CI Workflow
+
+AT2 runs on the target SNL machine and makes a handful of self-hosted runners
+available to the MAM4xx repo.
+This is all controlled by the **MAM4xx** SNL entity account that is linked to the
+**mam4xxSNL** GitHub account.
+Each runner stays in a "holding pattern" until it is assigned a job via
+GitHub Actions.
+The holding pattern pulls the testing image from the AT2 GitLab
+repo (if necessary), runs the related container for 3 minutes, and then tears down and
+starts over.
+As of now, the image is of a UBI 8 system, with Spack-installed compilers and
+all of the requisite TPLs to clone/build/run MAM4xx.
+
+#### Triggering the Testing Workflow
+
+This autotesting workflow is triggered by opening a pull request to `main` and
+also by a handful of actions on such a PR that is already open, including:
+
+- `reopened`
+- `ready_for_review`
+  - I.e., converted to ***Ready for Review*** from ***Draft***
+- `synchronize`
+  - E.g., pushing a new commit or force pushing after rebase
+
+The workflow may also be run manually by members of the `snl-testing`
+team--that is, via
+
+> **Actions** -> `<SNL-AT2 Workflow Run/Job>` -> **Re-run `[all,this]` job(s)**.
+
+The AT2 configuration on `blake` currently attempts to keep 3 runners available
+to accept jobs at all times.
+This workflow is configured to allow concurrent testing, so up to 3 test-matrix
+configurations can run at once.
+The concurrency setting is also configured to kill any active job if another
+instance of this workflow is started for the same PR ref.
+
+##### Other Types of Job Control
+
+- If a PR contains changes to the `.github` directory, a member of the
+  `snl-testing-admins` team must add the `CI-AT2_special_approval` tag to the
+  PR in order to kick off the autotesting.
+- For changes unrelated to the `.github` directory, any PR that is submitted
+  by a member of the `snl-testing` team and *contains only commits* from
+  members of that team automatically triggers this autotesting.
+- In the case that the PR is submitted by someone who is not a member of the
+  `snl-testing` team or contains commits from someone outside of that team,
+  an approving review by someone on the `snl-testing` team is required to
+  trigger autotesting.
+
+###### Disclaimer
+
+The above is according to Mike's current understanding of AT2 and may contain
+minor inaccuracies.
+This will be updated accordingly upon confirmation.
+
+
+## Development Details
+
+Most of the required configuration is provided by the AT2 docs and
+instructional Confluence page (on the Sandia network :confused:--reach out if
+you need access).
+However, some non-obvious choices and configurations are listed here.
+
+- To add some info to the testing output, we employ a custom action, cribbed
+  from E3SM/EAMxx, that prints out the workflow's trigger.
+
+### Hacks
+
+- For whatever reason, Skywalker does not like building in the
+  `gcc_12-3-0_cuda_12-1` container for the H100 GPU.
+  - This appears to be an issue of the (Haero?) build not auto-detecting the
+    correct Compute Capability (CC 9.0 => `sm_90`).
+  - To overcome this, we first obtain the CC flag via `nvidia-smi` within the
+    testing container.
+  - Then, we employ `sed` to manually change the `default_arch="sm_<xyz>"` of
+    the Haero-provided `nvcc_wrapper` (`haero_install/bin/nvcc_wrapper`).
+  - We follow up with a quick `grep` to confirm this.
+
+### Tokens
+
+- AT2 requires 2 fine-grained tokens for the **mam4xxSNL** account from the
+  `eagles-project` GitHub Organization in order to access information related
+  to the `mam4xx` repo.
+  - One token is used to fetch and read/write runner information.
+    - **Expires 11 April 2026**
+  - One token is used to fetch and read repository information via the API.
+    - **Expires 2 May 2025**
+
+## TODO
+
+- [ ] Update job control section of README after the behavior is made clear.
+  - @mjschmdt271
+- [ ] Include a script to generate plots from within testing container?
+  - @jaelynlitz?
+- [ ] Unify all CI into a single top-level yaml file that calls the sub-cases.
+  - This should provide finer control over what runs and when.
+- [ ] Add testing for AMD GPUs on `caraway`.
+
+### Low-priority
+
+- [ ] Add CPU testing on `mappy` because "heck, why not?"
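Review note on the README's "Hacks" section: the `nvcc_wrapper` patch it describes can be tried outside CI against a stand-in file. The sketch below is an illustration under stated assumptions, not the workflow's actual step: `set_default_arch` is a hypothetical helper name, the temporary file stands in for `haero_install/bin/nvcc_wrapper`, and `sed -i` is assumed to be GNU sed (as on the Linux test image).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: rewrite default_arch="sm_<NN>" in a wrapper script,
# whatever the current <NN> is, then confirm the new value is present.
set_default_arch() {
  local wrapper="$1" arch="$2"
  sed -i "s/default_arch=\"sm_[0-9]*\"/default_arch=\"sm_${arch}\"/g" "$wrapper"
  grep -q "default_arch=\"sm_${arch}\"" "$wrapper"
}

# Stand-in for haero_install/bin/nvcc_wrapper (assumption: one matching line).
tmp_wrapper="$(mktemp)"
printf 'default_arch="sm_70"\n' > "$tmp_wrapper"

set_default_arch "$tmp_wrapper" 90
patched_line="$(grep 'default_arch=' "$tmp_wrapper")"
echo "$patched_line"
rm -f "$tmp_wrapper"
```

Matching on `sm_[0-9]*` rather than a fixed value means the patch still applies if the wrapper's shipped default changes, and the trailing `grep -q` makes the helper fail when nothing was rewritten.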
diff --git a/.github/workflows/at2_gcc-cuda.yml b/.github/workflows/at2_gcc-cuda.yml
@@ -0,0 +1,104 @@
+name: gcc_12-3_cuda_12-1
+
+on:
+  workflow_call
+
+jobs:
+  gcc-cuda:
+    runs-on:  [self-hosted, m4xci-snl-cuda, cuda, gcc]
+    # with fail-fast: true (below), the remaining matrix jobs are cancelled once one fails
+    # NOTE: prioritizes speed over extra info; set fail-fast: false to run every configuration
+    continue-on-error: false
+    strategy:
+      fail-fast: true
+      matrix:
+        build-type: [Debug, Release]
+        fp-precision: [single, double]
+    name: gcc-cuda / ${{ matrix.build-type }} - ${{ matrix.fp-precision }}
+    steps:
+      - name: Check out the repository
+        uses: actions/checkout@v4
+        with:
+          persist-credentials: false
+          show-progress: false
+          submodules: recursive
+      - name: Cloning Haero
+        uses: actions/checkout@v4
+        with:
+          repository: eagles-project/haero
+          submodules: recursive
+          path: haero_src
+      - name: Show action trigger
+        uses: ./.github/actions/show-workflow-trigger
+      - name: Get CUDA Arch
+        # NOTE: for now, only running on an H100 machine, but keep anyway
+        run: |
+          # Ensure nvidia-smi is available
+          if ! command -v nvidia-smi &> /dev/null; then
+              echo "nvidia-smi could not be found. Please ensure you have Nvidia drivers installed."
+              exit 1
+          fi
+
+          # Get the GPU model from nvidia-smi, and set env for next step
+          gpu_model=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)
+          case "$gpu_model" in
+              *"H100"*)
+                  echo "H100 detected--setting Hopper90 architecture"
+                  echo "Hopper=ON" >> $GITHUB_ENV
+                  echo "CUDA_ARCH=90" >> $GITHUB_ENV
+                  ARCH=90
+                  ;;
+              *"A100"*)
+                  echo "A100 detected--setting Ampere80 architecture"
+                  echo "Ampere=ON" >> $GITHUB_ENV
+                  echo "CUDA_ARCH=80" >> $GITHUB_ENV
+                  ;;
+              *"V100"*)
+                  echo "V100 detected--setting Volta70 architecture"
+                  echo "Volta=ON" >> $GITHUB_ENV
+                  echo "CUDA_ARCH=70" >> $GITHUB_ENV
+                  ;;
+              *)
+                  echo "Unsupported GPU model: $gpu_model"
+                  exit 1
+                  ;;
+          esac
+      - name: Building Haero (${{ matrix.build-type }}, ${{ matrix.fp-precision }} precision)
+        run: |
+          cmake -S haero_src -B haero_build \
+            -DCMAKE_BUILD_TYPE=${{ matrix.build-type }} \
+            -DCMAKE_INSTALL_PREFIX="haero_install" \
+            -DCMAKE_C_COMPILER=gcc \
+            -DCMAKE_CXX_COMPILER=g++ \
+            -DHAERO_ENABLE_MPI=OFF \
+            -DHAERO_ENABLE_GPU=ON \
+            -DHAERO_PRECISION=${{ matrix.fp-precision }}
+          cd haero_build
+          make -j
+          make install
+      - name: Set nvcc_wrapper Arch
+        run: |
+          sed -i "s/default_arch=\"sm_[0-9]*\"/default_arch=\"sm_${CUDA_ARCH}\"/g" "$(pwd)/haero_install/bin/nvcc_wrapper"
+          echo "===================================="
+          grep -i "default_arch=" `pwd`/haero_install/bin/nvcc_wrapper
+      - name: Configuring MAM4xx (${{ matrix.build-type }}, ${{ matrix.fp-precision }} precision)
+        run: |
+          cmake -S . -B build \
+            -DCMAKE_CXX_COMPILER=`pwd`/haero_install/bin/nvcc_wrapper \
+            -DCMAKE_C_COMPILER=gcc \
+            -DCMAKE_INSTALL_PREFIX=`pwd`/install \
+            -DCMAKE_BUILD_TYPE=${{ matrix.build-type }} \
+            -DMAM4XX_HAERO_DIR=`pwd`/haero_install \
+            -DNUM_VERTICAL_LEVELS=72 \
+            -DENABLE_COVERAGE=OFF \
+            -DENABLE_SKYWALKER=ON \
+            -DCMAKE_CUDA_ARCHITECTURES=$CUDA_ARCH \
+            -G "Unix Makefiles"
+      - name: Building MAM4xx (${{ matrix.build-type }}, ${{ matrix.fp-precision }} precision)
+        run: |
+          cd build
+          make
+      - name: Running tests (${{ matrix.build-type }}, ${{ matrix.fp-precision }} precision)
+        run: |
+          cd build
+          ctest -V --output-on-failure
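Review note on the "Get CUDA Arch" step: the model-to-architecture mapping can be checked without a GPU by pulling the `case` statement into a function and feeding it sample names. This is a minimal sketch, assuming the function name `cuda_arch_for_model` (hypothetical) and sample GPU name strings chosen for illustration. The real step reads the name from `nvidia-smi`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: map a GPU name (as nvidia-smi would report it)
# to the CUDA architecture number used for sm_<NN>.
cuda_arch_for_model() {
  case "$1" in
    *H100*) echo 90 ;;
    *A100*) echo 80 ;;
    *V100*) echo 70 ;;
    *) echo "Unsupported GPU model: $1" >&2; return 1 ;;
  esac
}

# Sample names are illustrative, not captured from the test machine.
h100_arch="$(cuda_arch_for_model "NVIDIA H100 PCIe")"
a100_arch="$(cuda_arch_for_model "NVIDIA A100-SXM4-80GB")"
echo "H100 -> sm_${h100_arch}, A100 -> sm_${a100_arch}"
```

Keeping the mapping in one function gives a single place to add a new GPU, and the unsupported case returns non-zero so a calling step fails early.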
diff --git a/.github/workflows/at2_snl.yml b/.github/workflows/at2_snl.yml
@@ -0,0 +1,36 @@
+name: SNL-AT2
+
+on:
+  # Runs on PRs against main
+  pull_request:
+    branches: [ main ]
+    types: [opened, synchronize, ready_for_review, reopened]
+    paths:
+      # first, yes to these
+      - '.github/workflows/at2_snl.yml'
+      - 'src/mam4xx/**'
+      - 'src/tests/**'
+      - 'src/validation/**'
+      # second, no to these
+      - '!src/tests/data/**'
+      # not sure whether this should be disabled--keep for now
+      # - '!src/validation/mam_x_validation/**'
+
+  # Manual run
+  workflow_dispatch:
+
+  # # Add schedule trigger for nightly runs at midnight MT (Standard Time)
+  # schedule:
+  #   - cron: '0 7 * * *'  # Runs at 7 AM UTC, which is midnight MT during Standard Time
+
+concurrency:
+  # Two runs are in the same group if they are testing the same git ref
+  #  - if trigger=pull_request, the ref is refs/pull/<PR_NUMBER>/merge
+  #  - for other triggers, the ref is the branch tested
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  gcc-cuda:
+    uses:
+      ./.github/workflows/at2_gcc-cuda.yml
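Review note on the `concurrency` block: the group key is the workflow name joined to the git ref, so two runs for the same PR share a key and the older one is cancelled. The sketch below only illustrates how the key is composed. `concurrency_group` is a hypothetical helper and the PR numbers are made up.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper mirroring: group: <github.workflow>-<github.ref>
concurrency_group() {
  printf '%s-%s\n' "$1" "$2"
}

# PR numbers 101 and 102 are invented for illustration.
first_push="$(concurrency_group "SNL-AT2" "refs/pull/101/merge")"
second_push="$(concurrency_group "SNL-AT2" "refs/pull/101/merge")"
other_pr="$(concurrency_group "SNL-AT2" "refs/pull/102/merge")"

if [ "$first_push" = "$second_push" ]; then
  echo "same group: the older run is cancelled"
fi
if [ "$first_push" != "$other_pr" ]; then
  echo "different group: both runs proceed"
fi
```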