@jmsexton03 jmsexton03 commented Nov 10, 2025

Summary

Comprehensive build system improvements for Cray/HPC systems with automatic detection and configuration. Adds 7 auto-fixes for common Cray build issues (CUDA+EKAT, fcompare, GPU-aware MPI, NetCDF). Includes developer utilities for organized testing, enhanced diagnostics with CMake 3.25+ log levels, and machine-specific profiles for Perlmutter, Frontier, Polaris, and Aurora. Detailed documentation is being developed in #2708 and will include a simple table of working builds.


Details

Cray/HPC System Support

  • Auto-detects Cray environments and applies 7 build fixes (CUDA flags, MPI GTL libraries, NetCDF paths, HDF5 configuration)
  • Prevents MPI detection hangs with bare wrappers; manual MPI setup for Cray MPICH 8.x
  • Machine profiles with standardized module loading (Perlmutter, Frontier, Polaris, Aurora)
  • Auto-detects GPU architectures from CRAY_ACCEL_TARGET and environment variables
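
As a rough illustration of the last bullet, GPU architecture detection from CRAY_ACCEL_TARGET could look something like the sketch below. The function name `detect_gpu_arch` is hypothetical, and the value patterns (`nvidia80`, `amd_gfx90a`) are assumptions based on common `craype-accel-*` modules rather than the logic in this PR.

```shell
#!/bin/bash
# Hypothetical helper: map a CRAY_ACCEL_TARGET value to "<backend> <arch>".
# The value patterns below (nvidia80, amd_gfx90a) are assumptions, not
# taken from the PR's CMake code.
detect_gpu_arch() {
    # Use the first argument if given, otherwise the environment variable.
    local target="${1:-${CRAY_ACCEL_TARGET:-}}"
    case "$target" in
        nvidia[0-9]*) echo "cuda ${target#nvidia}" ;;   # nvidia80 -> cuda 80
        amd_gfx*)     echo "hip ${target#amd_}" ;;      # amd_gfx90a -> hip gfx90a
        "")           echo "none" ;;                    # no accelerator module loaded
        *)            echo "unknown $target" ;;         # leave the decision to the user
    esac
}
```

Calling `detect_gpu_arch` with no argument reads the current environment, so a profile script could use its output to choose a backend without overriding anything the user set explicitly.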

Build System Enhancements

  • Added distclean target (removes CMake cache/artifacts) and uninstall target with manifest tracking
  • Expanded .gitignore for build artifacts (build_/, install_/, CMakeCache.txt, generated files)
  • Wrapper scripts for clean+build workflows (interactive and CI/auto modes)
  • Robust ERF_DIR auto-detection with multi-method fallback and verification
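
To make the "multi-method fallback and verification" idea concrete, here is a minimal sketch of how ERF_DIR detection might be layered. The function names, the order of methods, and the use of a top-level `CMakeLists.txt` as the verification marker are all assumptions for illustration, not the PR's actual implementation.

```shell
#!/bin/bash
# Hypothetical sketch of layered ERF_DIR detection. The marker file and the
# order of the methods are assumptions, not the PR's actual logic.
is_erf_root() {
    # Verification: a candidate only counts if it contains a CMakeLists.txt.
    [ -n "$1" ] && [ -f "$1/CMakeLists.txt" ]
}

find_erf_dir() {
    local start candidate
    # Resolve the starting directory to an absolute path.
    start="$(cd "${1:-$PWD}" && pwd)" || return 1
    # Method 1: respect a user-provided ERF_DIR if it verifies.
    if is_erf_root "${ERF_DIR:-}"; then echo "$ERF_DIR"; return 0; fi
    # Method 2: ask git for the repository root (empty if git is unavailable).
    candidate="$(git -C "$start" rev-parse --show-toplevel 2>/dev/null)" || candidate=""
    if is_erf_root "$candidate"; then echo "$candidate"; return 0; fi
    # Method 3: walk up the directory tree from the starting point.
    candidate="$start"
    while [ -n "$candidate" ] && [ "$candidate" != "/" ]; do
        if is_erf_root "$candidate"; then echo "$candidate"; return 0; fi
        candidate="$(dirname "$candidate")"
    done
    echo "ERF_DIR not found; please set ERF_DIR explicitly" >&2
    return 1
}
```

Checking the user-provided value first keeps the detection from overriding an explicit setting, and the shared verification step means a wrong guess from one method falls through to the next instead of being accepted.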

NetCDF & Dependency Detection

  • Cascading pkg-config fallback for NetCDF variants (netcdf → netcdf-cxx4_parallel → netcdf_parallel)
  • GMake adds MPICH_DIR to PKG_CONFIG_PATH; NOAHMP tries netcdf-fortran → netcdf-fortran_parallel
  • Enhanced FindNetCDF with detection logging and helpful error messages
  • Auto-suggests module load commands on detection failures
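
The cascading pkg-config fallback described above can be sketched in a few lines of shell. The candidate order is taken from the first bullet; the function name `find_netcdf_pkg`, the `PKG_CONFIG` override, and the wording of the error message are hypothetical.

```shell
#!/bin/bash
# Sketch of a cascading pkg-config lookup. The candidate names and their
# order come from the list above; everything else is illustrative.
find_netcdf_pkg() {
    # Allow the pkg-config executable to be overridden (assumption).
    local pc="${PKG_CONFIG:-pkg-config}" name
    for name in netcdf netcdf-cxx4_parallel netcdf_parallel; do
        # --exists returns success only if the named package is known.
        if "$pc" --exists "$name" 2>/dev/null; then
            echo "$name"
            return 0
        fi
    done
    echo "NetCDF not found via pkg-config; try loading a NetCDF module first" >&2
    return 1
}
```

A caller could then run `pkg-config --cflags --libs "$(find_netcdf_pkg)"` to obtain flags for whichever variant was found.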

Enhanced Diagnostics

  • CMake 3.25+ log levels (--log-level=VERBOSE/DEBUG/TRACE) with hierarchical message context
  • Generates cray_detected_config.cmake reference file showing auto-detected settings
  • Detection attempt logging in FindNetCDF and MPI configuration
  • Helpful error messages with resolution steps and auto-suggested machine profiles
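
For anyone trying the new diagnostics, a small wrapper like the one below could translate a level name into the CMake command-line flags. The function `cmake_log_flags` is hypothetical; `--log-level` and `--log-context` are standard CMake options, and the list of accepted levels follows the CMake manual.

```shell
#!/bin/bash
# Hypothetical wrapper: validate a log level and print matching CMake flags.
cmake_log_flags() {
    local level
    # Accept lower-case input and default to STATUS, CMake's normal level.
    level="$(printf '%s' "${1:-STATUS}" | tr '[:lower:]' '[:upper:]')"
    case "$level" in
        ERROR|WARNING|NOTICE|STATUS|VERBOSE|DEBUG|TRACE)
            # --log-context prints the message context attached to each message.
            echo "--log-level=$level --log-context" ;;
        *)
            echo "unknown log level: $1" >&2
            return 1 ;;
    esac
}
```

A configure step could then be written as `cmake $(cmake_log_flags verbose) -B build_erf ..`, leaving the rest of the command line unchanged.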

@jmsexton03 jmsexton03 mentioned this pull request Nov 12, 2025
@jmsexton03
Contributor Author

Questions

  1. Should this auto-detection be on by default? It tries not to override any user-provided variables, and a CMake flag has been added to turn it on or off.
  2. Any feedback on the completeness of the flags below from a kitchen-sink / physics / IO perspective?
  3. After I've rerun all the tests with the current code and updated the table with the succeeding hashes (in this comment and in Docs PR #2708), can someone test something similar to the following on Kestrel:
source Build/machines/perlmutter_erf.profile
./cmake.sh
make distclean
./cmake_cuda.sh
make distclean
./cmake_with_kokkos_many_cuda.sh
rm -rf build_erf

Goal

The goal of this PR is to make a script like the following work across systems, so that the only things you need to change are the physics and IO flags of interest and the specific GPU backend you're requesting (see, for example, https://github.com/jmsexton03/ERF/blob/add_craype_defaults_cmake/Build/cmake.sh or https://github.com/jmsexton03/ERF/blob/add_craype_defaults_cmake/Build/cmake_with_kokkos_many.sh).

#!/bin/bash

# Example CMake configuration script that assumes Cray detection

cmake -DCMAKE_INSTALL_PREFIX:PATH=./install_erf \
      -DMPIEXEC_PREFLAGS:STRING=--oversubscribe \
      -DCMAKE_BUILD_TYPE:STRING=Release \
      -DERF_DIM:STRING=3 \
      -DERF_ENABLE_FFT:BOOL=ON \
      -DERF_ENABLE_NETCDF:BOOL=ON \
      -DERF_ENABLE_HDF5:BOOL=ON \
      -DERF_ENABLE_RRTMGP:BOOL=ON \
      -DERF_ENABLE_SHOC:BOOL=OFF \
      -DERF_ENABLE_MPI:BOOL=ON \
      -DERF_ENABLE_CUDA:BOOL=OFF \
      -DERF_ENABLE_HIP:BOOL=OFF \
      -DERF_ENABLE_SYCL:BOOL=OFF \
      -DERF_ENABLE_TESTS:BOOL=ON \
      -DERF_ENABLE_FCOMPARE:BOOL=ON \
      -DERF_ENABLE_DOCUMENTATION:BOOL=OFF \
      -DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=ON \
      -B build_erf ..

cmake --build build_erf -j10 -v
cmake --install build_erf --prefix=install_erf

Since SHOC and P3 each require an additional setup step, I'm aiming to test those separately.
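
One way to express the "only change the backend" idea is a small helper that emits the three backend flags from the script above with at most one switched on. This is a sketch only: the function name `erf_backend_flags` and the `cpu` keyword are hypothetical, while the flag names are the ones used in the script.

```shell
#!/bin/bash
# Hypothetical helper: print the three GPU backend flags with at most one ON.
erf_backend_flags() {
    local backend="${1:-cpu}" cuda=OFF hip=OFF sycl=OFF
    case "$backend" in
        cpu)  ;;              # all GPU backends stay OFF
        cuda) cuda=ON ;;
        hip)  hip=ON ;;
        sycl) sycl=ON ;;
        *) echo "unknown backend: $backend" >&2; return 1 ;;
    esac
    echo "-DERF_ENABLE_CUDA:BOOL=$cuda -DERF_ENABLE_HIP:BOOL=$hip -DERF_ENABLE_SYCL:BOOL=$sycl"
}
```

In a configure script, the three hard-coded backend lines could then be replaced by `$(erf_backend_flags cuda)` (or `hip`, `sycl`, `cpu`), which also prevents two backends from being enabled at once.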

Docs table

ERF provides several build scripts optimized for different systems and architectures. This table shows which scripts have been tested and verified on each system. Verified builds are marked with the git commit hash where they were last tested.

Build Script Perlmutter Frontier Aurora Polaris Kestrel RegtestCPU RegtestGPU
cmake.sh Untested Untested Untested Untested Untested Untested Untested
cmake_with_kokkos_many.sh Untested Untested Untested Untested Untested Untested Untested
cmake_with_kokkos_many_cuda.sh Untested Untested Untested Untested
cmake_with_kokkos_many_noradiation_hip.sh Untested
cmake_with_kokkos_many_sycl.sh Untested
Perlmutter/build_erf_with_shoc_cuda_Perlmutter.sh Untested Untested Untested Untested Untested Untested

Note: The build_erf_with_shoc_cuda_Perlmutter.sh script is being tested for cross-site compatibility with auto-detection enabled. A simplified version may work across CUDA-enabled HPC sites (Perlmutter, Polaris, Kestrel) with -DCRAY_AUTO_DETECTION=ON.
