diff --git a/README.md b/README.md
index 11bb12f9..f568fc1e 100644
--- a/README.md
+++ b/README.md
@@ -7,6 +7,13 @@
 A metapackage for a unified
 [conda environment](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)
 for analysis and other pre- and post-processing for the Energy Exascale Earth
 System Model (E3SM).
 
+## Documentation
+
+[Latest documentation](http://docs.e3sm.org/e3sm-unified/main) for users,
+developers and maintainers.
+
+## Getting Started
+
 E3SM-Unified currently supports Linux and OSX, and python >=3.9,<3.11.
 Support for Windows is not planned.
diff --git a/dev-speck.txt b/dev-spec.txt
similarity index 100%
rename from dev-speck.txt
rename to dev-spec.txt
diff --git a/docs/contributing.md b/docs/contributing.md
new file mode 100644
index 00000000..3e51252c
--- /dev/null
+++ b/docs/contributing.md
@@ -0,0 +1,55 @@
+# Contributing & Community
+
+We welcome contributions and feedback from all users and developers of
+E3SM-Unified. Whether you're updating packages, improving documentation, or
+reporting issues, your input helps strengthen the environment and its
+community.
+
+---
+
+## Ways to Contribute
+
+### ✏️ Documentation
+
+* Suggest improvements to the user guide or technical docs.
+* Fix typos or clarify instructions.
+* Add usage examples for tools you use regularly.
+
+### 🚀 Suggest or Update Packages
+
+* Request new tools or features by opening a GitHub Issue.
+* Propose version updates by:
+  * Editing the E3SM Confluence pages defining the next E3SM-Unified version
+    (if you have access)
+  * Or editing the `meta.yaml` (for conda packages) or `defaults.cfg` (for
+    Spack packages) and making a pull request (if you don't have access to
+    E3SM's Confluence pages).
+
+### ⚙️ Development & Testing
+
+* Help test release candidates on supported platforms.
+* Report issues you encounter.
+* Contribute improvements to tools in the E3SM ecosystem (e.g., `mache`,
+  `mpas-analysis`, `zppy`, `e3sm_diags`).
+
+---
+
+## Getting Started
+
+1. 
Fork the [e3sm-unified GitHub repository](https://github.com/E3SM-Project/e3sm-unified). +2. Create a new branch for your changes. +3. Submit a pull request (PR). +4. Tag reviewers as needed (e.g., `@xylar`). + +We recommend following our naming conventions for release branches (e.g., +`update-to-1.12.0`). + +--- + +## Communication + +* GitHub Issues: [E3SM-Unified GitHub](https://github.com/E3SM-Project/e3sm-unified/issues) +* Slack: `#e3sm-help-postproc` +* Email: [xylar@lanl.gov](mailto:xylar@lanl.gov) + +Have questions about where to start? Just ask on Slack or open an issue! diff --git a/docs/index.md b/docs/index.md index 9f5881b2..863fcdc8 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,11 +1,49 @@ +# Welcome to E3SM-Unified Documentation + ```{image} logo/e3sm_unified_logo_200.png :align: center :width: 200 px ``` -# E3SM-Unfied +E3SM-Unified is a unified conda + Spack environment developed to support pre- +and post-processing workflows for the Energy Exascale Earth System Model +(E3SM). It bundles commonly used analysis, visualization, and workflow tools +into a single portable environment. + +This documentation is for both **users** of the E3SM-Unified environment and +**developers** who build, test, and deploy it on supported platforms. 
+ +--- + +## πŸš€ Start Here + +* πŸ“– [Introduction](introduction.md): What E3SM-Unified is and why you should + use it +* πŸ§ͺ [Quickstart Guide](quickstart.md): Load the environment and start using + tools + +--- + +## πŸ“š Contents + +* πŸ’  [Using E3SM-Unified Tools](using-tools.md) +* πŸ§ͺ [Testing Release Candidates](testing-release-candidates.md) +* 🚚 [The E3SM-Unified Release Workflow](releasing/release-workflow.md) +* πŸ“¦ [Package Catalog](packages.md) +* ❓ [Troubleshooting & FAQs](troubleshooting.md) +* 🀝 [Contributing & Community](contributing.md) + +--- + +## πŸ’¬ Get Help + +```{admonition} Support +- Slack: #e3sm-help-postproc +- GitHub Issues: [E3SM-Unified on GitHub](https://github.com/E3SM-Project/e3sm-unified/issues) +- Maintainer contact: xylar@lanl.gov +``` + +--- -A metapackage for a unified -[conda environment](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) -for analysis and other pre- and post-processing for the Energy Exascale Earth -System Model (E3SM). +This documentation is maintained by the E3SM Infrastructure Team and +contributors across the community. diff --git a/docs/introduction.md b/docs/introduction.md new file mode 100644 index 00000000..1bc59955 --- /dev/null +++ b/docs/introduction.md @@ -0,0 +1,55 @@ +# Introduction to E3SM-Unified + +```{image} logo/e3sm_unified_logo_200.png +:align: center +:width: 200 px +``` + +## What is E3SM-Unified? + +E3SM-Unified is a unified conda-based environment that provides pre- and +post-processing tools for the Energy Exascale Earth System Model (E3SM). It is +designed to streamline the analysis, visualization, and transformation of model +output for scientists and developers. + +This environment bundles together a curated set of Python and compiled tools +that work well across supported platforms, particularly on E3SM-managed +high-performance computing (HPC) systems. 
+ +## Key Features + +* Combines dozens of packages into one environment, eliminating setup friction. +* Maintains consistency across HPC platforms (Anvil, Chrysalis, Compy, etc.). +* Offers both Conda and Spack-installed components for MPI performance. +* Fully open source and community-maintained via GitHub. + +## Common Use Cases + +* Diagnostics and evaluation (e.g., `e3sm_diags`, `MPAS-Analysis`) +* CMIP output conversion (`e3sm_to_cmip`) +* Time series generation, viewer creation, and archiving (`zppy`, `zstash`) +* Domain generation and mesh visualization (`cime_gen_domain`, `mosaic`) + +## Supported Platforms + +E3SM-Unified is available on many E3SM-supported systems: + +* Andes +* Anvil +* Chrysalis +* Compy +* Dane +* Frontier +* Perlmutter +* Polaris (ALCF) +* Ruby + +It can also be installed on Linux or macOS laptops for limited use (see +[Quickstart Guide](quickstart.md)). Windows is not supported. + +## Getting Help + +* [Quickstart Guide](quickstart.md) +* GitHub: [E3SM-Unified repository](https://github.com/E3SM-Project/e3sm-unified) +* Slack: `#e3sm-help-postproc` +* Issues/questions: GitHub Issues or contact [xylar@lanl.gov](mailto:xylar@lanl.gov) diff --git a/docs/logo/e3sm_unified_logo.png b/docs/logo/e3sm_unified_logo.png index 45a6f044..ba534b7e 100644 Binary files a/docs/logo/e3sm_unified_logo.png and b/docs/logo/e3sm_unified_logo.png differ diff --git a/docs/logo/e3sm_unified_logo_200.png b/docs/logo/e3sm_unified_logo_200.png index 227fe2e8..01ef7e99 100644 Binary files a/docs/logo/e3sm_unified_logo_200.png and b/docs/logo/e3sm_unified_logo_200.png differ diff --git a/docs/packages.md b/docs/packages.md new file mode 100644 index 00000000..e69de29b diff --git a/docs/quickstart.md b/docs/quickstart.md new file mode 100644 index 00000000..5ccf8b95 --- /dev/null +++ b/docs/quickstart.md @@ -0,0 +1,98 @@ +# Quickstart Guide + +```{note} +E3SM-Unified is supported only on Linux, OSX and HPC platforms. It is **not** +supported on Windows. 
+``` + +## Accessing E3SM-Unified on Supported Machines + +On most E3SM-supported HPC systems, E3SM-Unified is already installed and +ready to use via an activation script. + +### Example Activation Commands + +```bash +# Andes +source /ccs/proj/cli115/software/e3sm-unified/load_latest_e3sm_unified_andes.sh + +# Anvil +source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_anvil.sh + +# Chrysalis +source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh + +# Compy +source /share/apps/E3SM/conda_envs/load_latest_e3sm_unified_compy.sh + +# Dane +source /usr/workspace/e3sm/apps/e3sm-unified/load_latest_e3sm_unified_dane.sh + +# Frontier +source /ccs/proj/cli115/software/e3sm-unified/load_latest_e3sm_unified_frontier.sh + +# Perlmutter +source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh + +# Polaris (ALCF) +source /lus/grand/projects/E3SMinput/soft/e3sm-unified/load_latest_e3sm_unified_polaris.sh + +# Ruby +source /usr/workspace/e3sm/apps/e3sm-unified/load_latest_e3sm_unified_ruby.sh +``` + +Once the script is sourced, you'll have access to all the tools provided by +E3SM-Unified in your environment. + +## Verifying Installation + +After activation, you can verify that the environment is correctly loaded by +testing if major packages are importable: + +```python +python -c "import xarray, e3sm_diags, mpas_analysis, zppy" +``` + +## Running on Compute Nodes (Optional but Recommended) + +Many E3SM-Unified tools (e.g., MOAB, MPAS-Analysis, NCO, TempestRemap) benefit +from running on compute nodes using MPI-enabled system builds. + +Check your system documentation for how to launch interactive compute sessions +(e.g., `srun`, `salloc`, or `qsub`). + +## Installing E3SM-Unified on an Unsupported System + +E3SM-Unified is not officially supported on Linux or Mac laptops or +workstations, but users can install it using `miniforge3`. 
+
+### Step-by-Step (Linux/macOS):
+
+```bash
+# Install miniforge3
+wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
+bash "Miniforge3-$(uname)-$(uname -m).sh"
+
+# Create a new environment
+conda create -n e3sm-unified -c conda-forge e3sm-unified
+
+# Activate it
+conda activate e3sm-unified
+```
+
+Note: On macOS with M1/M2 chips, install the x86\_64 version and use Rosetta 2
+for compatibility.
+
+---
+
+## Related Pages
+
+* [Introduction to E3SM-Unified](introduction.md)
+* [Using E3SM-Unified Tools](using-tools.md)
+* [Troubleshooting](troubleshooting.md)
+
+```{admonition} Need Help?
+- Slack: #e3sm-help-postproc
+- GitHub Issues: https://github.com/E3SM-Project/e3sm-unified/issues
+- Email: xylar@lanl.gov
+```
diff --git a/docs/releasing/adding-new-machines.md b/docs/releasing/adding-new-machines.md
new file mode 100644
index 00000000..7fb75105
--- /dev/null
+++ b/docs/releasing/adding-new-machines.md
@@ -0,0 +1,91 @@
+# Adding a New Machine
+
+Support for a new HPC machine in E3SM-Unified requires coordinated updates
+across multiple tools — primarily in
+[`mache`](https://github.com/E3SM-Project/mache), but also in the E3SM Spack
+fork and deployment scripts.
+
+This page provides guidance for E3SM-Unified maintainers and infrastructure
+developers integrating new machines into the release and deployment workflow.
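For orientation, the kind of machine config file `mache` expects looks roughly
like the sketch below. This is illustrative only: the section and option names
are modeled on existing machine configs, and the machine name `newmachine` and
all values are placeholders, not a working configuration.

```cfg
# newmachine.cfg -- hypothetical example for illustration only
[parallel]
# batch system and node layout that mache should report for this machine
system = slurm
account = e3sm
cores_per_node = 64

[e3sm_unified]
# where E3SM-Unified environments and activation scripts get installed
group = e3sm
base_path = /path/to/e3sm-unified
```

Before filling in real values, compare against the config file of a similar,
already-supported machine in `mache`.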
+
+---
+
+## 🔗 Main Mache Documentation
+
+Most of the process is already documented in the official `mache` developer
+guide:
+
+* [Adding a New Machine](https://docs.e3sm.org/mache/main/developers_guide/adding_new_machine.html)
+* [Adding Spack Support](https://docs.e3sm.org/mache/main/developers_guide/spack.html)
+
+Start in `mache` to:
+
+* Add a machine-specific config file (e.g., `pm-cpu.cfg`)
+* Add hostname detection logic in `discover.py`
+* Create Spack templates for supported compiler/MPI stacks
+* Optionally add shell script templates for environment setup
+
+> ⚠️ Machines not listed in the E3SM
+> [`config_machines.xml`](https://github.com/E3SM-Project/E3SM/blob/master/cime_config/machines/config_machines.xml)
+> must first be added upstream before `mache` can support them.
+
+---
+
+## 🧩 Integration with E3SM-Unified Deployment
+
+After updating `mache`, you'll need to:
+
+1. **Reference your `mache` branch in E3SM-Unified Deployment**
+
+   * Use the `--mache_fork` and `--mache_branch` flags to deploy using the
+     updated branch
+   * Confirm the new machine is recognized and templates are applied correctly
+
+2. **Update Spack if needed**
+
+   * If new versions of external tools are required, update the
+     [`spack_for_mache_<mache version>`](spack-updates.md) branch of the
+     [E3SM Spack fork](https://github.com/E3SM-Project/spack)
+
+---
+
+## ✅ Testing Your Changes
+
+Use the standard test deployment approach from
+[Deploying on HPCs](deploying-on-hpcs.md):
+
+```bash
+cd e3sm_supported_machines
+./deploy_e3sm_unified.py --conda ~/miniforge3 \
+    --mache_fork <fork> \
+    --mache_branch <branch>
+```
+
+You can also supply these flags:
+
+```
+    --machine <machine> \
+    --compiler <compiler> \
+    --mpi <mpi> \
+```
+
+but they should not be needed if you have set things up in `mache` correctly.
+
+During testing, focus on:
+
+* Spack external package detection and successful builds
+* Shell script generation and activation behavior
+* Module compatibility and performance of tools like `zppy` and `e3sm_diags`
+
+---
+
+## 💡 Tips and Best Practices
+
+* Reuse YAML templates from similar machines to minimize effort
+* Add common system tools as `buildable: false` in the Spack environment
+* Avoid identifying machines using environment variables unless absolutely
+  necessary. Instead use the hostnames for login and compute nodes if
+  possible
+* Use `utils/update_cime_machine_config.py` to verify `mache` remains in sync
+  with E3SM
+
+---
+
+➡ Next: [Publishing the Final Release](finalizing-release.md)
diff --git a/docs/releasing/conda-vs-spack.md b/docs/releasing/conda-vs-spack.md
new file mode 100644
index 00000000..8762061a
--- /dev/null
+++ b/docs/releasing/conda-vs-spack.md
@@ -0,0 +1,106 @@
+# How Conda and Spack Work Together in E3SM-Unified
+
+E3SM-Unified uses a hybrid approach that combines Conda and Spack to build
+and deploy a comprehensive software environment for E3SM analysis and
+diagnostics. This page explains the motivation for this strategy, how the
+components interact, and the shared infrastructure that supports both
+E3SM-Unified and related projects.
+
+---
+
+## Why Combine Conda and Spack?
+
+Each tool solves a different part of the problem:
+
+### ✅ Conda
+
+* Excellent for managing Python packages and their dependencies
+* Supports rapid installation and reproducibility
+* Compatible with conda-forge and custom channels (e.g., `e3sm`)
+* User-friendly interface, especially for scientists and developers
+
+### ✅ Spack
+
+* Designed for building performance-sensitive HPC software
+* Allows fine-grained control over compilers, MPI implementations, and system
+  libraries
+* Better suited for tools written in Fortran/C/C++ with MPI dependencies
+  (e.g., NCO, MOAB, TempestRemap)
+
+### ❗ The Challenge
+
+Neither system alone is sufficient:
+
+* Conda cannot reliably build or run MPI-based binaries across multiple nodes
+  on HPC systems. In our experience, Conda's MPI implementations often fail
+  even for multi-task jobs on a single node, making them unsuitable for
+  high-performance parallel workflows
+* Spack lacks strong support for modern Python environments and is generally
+  harder to use for scientists accustomed to Conda-based workflows. While
+  conda-forge provides access to tens of thousands of Python packages, Spack
+  offers far fewer, meaning many familiar scientific tools are not readily
+  available through Spack alone
+
+---
+
+## Architecture: How They Work Together
+
+E3SM-Unified environments:
+
+1. Use **Conda** to install the core Python tools and lightweight dependencies
+2. Rely on **Spack** to build performance-critical tools outside Conda
+3. Are bundled into a single workflow that ensures compatibility across both
+
+System-specific setup scripts (e.g., `load_latest_e3sm_unified_<machine>.sh`)
+ensure both components are activated correctly.
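Because both Conda and system/Spack builds can provide the same command name,
it can be useful to check which one your shell actually resolves. The snippet
below is a generic sketch, not part of E3SM-Unified; the tool name `ncremap`
and the prefix `/opt/conda` are placeholders.

```python
import shutil


def build_origin(tool: str, conda_prefix: str) -> str:
    """Report whether `tool` on PATH resolves inside a conda prefix
    (a Conda-installed build) or elsewhere (e.g., a system/Spack build)."""
    path = shutil.which(tool)
    if path is None:
        return f"{tool}: not found on PATH"
    origin = "conda" if path.startswith(conda_prefix) else "system/spack"
    return f"{tool}: {path} ({origin})"


# Illustrative call; in a real session you might pass os.environ["CONDA_PREFIX"]
print(build_origin("ncremap", "/opt/conda"))
```

Running this before and after sourcing an activation script shows how the
script reorders PATH between the Conda and Spack halves of the environment.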
+ +For MPI-based tools: + +* The tools are built with Spack using system compilers and MPI +* Users automatically access these builds when running on compute nodes + +--- + +## Shared Infrastructure + +E3SM-Unified, Polaris, and Compass all rely on the same key components: + +* [`mache`](https://github.com/E3SM-Project/mache): A configuration library + for detecting machine-specific settings (modules, compilers, paths) +* [E3SM's Spack fork](https://github.com/E3SM-Project/spack): Centralized + control over package versions and build settings +* Conda: Used consistently to install `mache`, lightweight tools, and Python + dependencies + +This shared foundation ensures reproducibility and consistency across +workflows, testbeds, and developer tools in the E3SM ecosystem. + +--- + +## Future Alternatives + +As complexity grows, other strategies may be worth evaluating: + +### Option: **E4S (Extreme-scale Scientific Software Stack)** + +* Spack-based stack of curated HPC tools +* E4S environments aim to replace the need for manual Spack+Conda integration +* May offer better long-term sustainability, but lacks Python focus today + +πŸ”— [Explore E4S](https://e4s.io) + +### Other Approaches (less suitable currently): + +* Pure Spack builds (harder for Python workflows) +* Pure Conda builds (harder for HPC performance tools) +* Containers (portability gains, but complex for HPC integration) + +--- + +## Summary + +The hybrid Conda + Spack model in E3SM-Unified balances ease of use with HPC +performance. While more complex to maintain, it provides flexibility, +compatibility, and performance across diverse systems. Shared infrastructure +(like `mache` and E3SM's Spack fork) reduces duplication across projects and +streamlines the release process. 
diff --git a/docs/releasing/creating-rcs/overview.md b/docs/releasing/creating-rcs/overview.md new file mode 100644 index 00000000..e82eb322 --- /dev/null +++ b/docs/releasing/creating-rcs/overview.md @@ -0,0 +1,59 @@ +# Creating Release Candidates + +E3SM-Unified and its core dependencies follow a structured release process +that relies on **release candidates (RCs)**. These pre-release versions are +used for testing and validation before an official release is finalized and +deployed. + +This section describes how to create RCs for both individual dependencies +(like `e3sm_diags` or `mpas-analysis`) and for the `e3sm-unified` metapackage +itself. It also includes tools and tips for troubleshooting build failures. + +--- + +## What Is a Release Candidate? + +A release candidate (RC) is a build intended for testing before a final +release. RCs allow us to validate compatibility across the E3SM analysis stack +and to ensure that tools and environments function correctly on supported HPC +platforms. + +RC packages are published to special Conda labels (like `e3sm_diags_dev` or + `e3sm_unified_dev`) to keep them separate from stable releases. + +--- + +## Overview of the Process + +There are two major workflows: + +### 1. Creating RCs for Dependency Packages + +These are individual tools like `e3sm_diags`, `zppy`, or `mpas_analysis` that +are used within the E3SM-Unified environment. + +Go to: [Creating RCs for Dependency Packages](rc-dependencies.md) + +--- + +### 2. Creating an RC for E3SM-Unified + +This involves assembling a full test environment based on specific versions of +all dependencies β€” including RCs. + +Go to: [Creating an E3SM-Unified RC](rc-e3sm-unified.md) + +--- + +### 3. Troubleshooting Build Failures + +Solving Conda environments during builds can fail for complex or subtle +reasons. This section provides detailed strategies for debugging, including +use of `conda_first_failure.py`. 
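The bisection idea behind `conda_first_failure.py` can be sketched in a few
lines. This is a toy illustration, not the actual script: a fake `solves()`
predicate stands in for the real dry-run conda solve.

```python
# Toy sketch of the bisection behind conda_first_failure.py. Assumes failures
# are monotonic: once a prefix of the spec list fails to solve, every longer
# prefix fails too.

def first_failure(specs, solves):
    """Return the first spec whose addition makes the solve fail, or None."""
    if solves(specs):
        return None  # the full list solves; nothing to bisect
    lo, hi = 1, len(specs)  # search for the shortest failing prefix length
    while lo < hi:
        mid = (lo + hi) // 2
        if solves(specs[:mid]):
            lo = mid + 1  # first `mid` specs still solve; failure is later
        else:
            hi = mid
    return specs[lo - 1]


# Example: pretend any list containing both "foo" and "bar" is unsolvable.
specs = ["python", "numpy", "foo", "xarray", "bar", "dask"]
print(first_failure(specs, lambda s: not ({"foo", "bar"} <= set(s))))  # bar
```

Each probe here stands in for one dry-run environment solve, which is why
bisection is much faster than removing packages one at a time.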
+ +Go to: [Troubleshooting Conda Build Failures](rc-troubleshooting.md) + +--- + +Each page includes step-by-step examples, commands, and best practices +tailored to the E3SM-Unified release workflow. diff --git a/docs/releasing/creating-rcs/rc-dependencies.md b/docs/releasing/creating-rcs/rc-dependencies.md new file mode 100644 index 00000000..bb304082 --- /dev/null +++ b/docs/releasing/creating-rcs/rc-dependencies.md @@ -0,0 +1,127 @@ +# Creating RCs for Dependency Packages + +This page describes how to create release candidates (RCs) for packages that +are included in the E3SM-Unified environment, such as `e3sm_diags`, +`mpas-analysis`, `zppy`, and `zstash`. + +We use `e3sm_diags` as a concrete example, but the process is similar for all +E3SM-developed dependencies. + +--- + +## Step-by-Step: Creating an RC for `e3sm_diags` + +### 1. Tag a Release Candidate in the Source Repo + +Go to the source repository: +[E3SM Diags GitHub](https://github.com/E3SM-Project/e3sm_diags) + +Create a release tag: + +```bash +git checkout main +git fetch --all -p +git reset --hard origin/main +git tag v3.0.0rc1 +git push origin v3.0.0rc1 +``` + +**Note:** + +* `e3sm_diags` uses a `v` prefix in version tags (e.g., `v3.0.0rc1`) as part + of its established convention. +* For new packages, it’s recommended to follow + [Semantic Versioning](https://semver.org/) and omit the `v` prefix (i.e., + tag as `3.0.0rc1`). + +--- + +### 2. Prepare the Feedstock PR + +Go to the conda-forge feedstock for `e3sm_diags`: +[E3SM Diags Feedstock](https://github.com/conda-forge/e3sm_diags-feedstock) + +If a `dev` branch does not already exist: + +* Clone the feedstock repo locally +* Create a new branch off `main` called `dev` +* Push it to the origin + + **Note:** By making no changes from the `main` branch, you ensure that no + new packages will be created when you push the `dev` branch to the origin + +### 3. Fork the Feedstock and Create a PR + +1. Fork the feedstock repo to your GitHub account. 
+2. In your fork, create a new branch (e.g., `update-v3.0.0rc1`).
+
+   **Important:** Do **not** create branches directly on the main conda-forge
+   feedstock. All changes should go through a pull request from your personal
+   fork. Creating a branch on the main feedstock can trigger package builds
+   before your updates have been properly tested or reviewed.
+
+3. Edit `recipe/meta.yaml`:
+
+* Update the `version` field to match your RC tag, without the `v` prefix
+  (e.g., `3.0.0rc1` for tag `v3.0.0rc1`)
+* Set the `sha256` hash. To determine the hash, you need to download the
+  source file on a Linux (e.g. HPC) machine and run `sha256sum` on it. For
+  some reason, Macs seem to produce an incorrect hash.
+* Update dependencies if needed (e.g., pin to RC versions of other tools)
+
+4. If you created the `dev` branch above and no previous release candidates
+   have been added, you will need to add `recipe/conda_build_config.yaml` with
+   contents like:
+
+   ``` yaml
+   channel_targets:
+   - conda-forge e3sm_diags_dev
+   ```
+
+   The label is the name of the package with any `-` replaced by `_`, followed
+   by `_dev`.
+
+5. Commit the changes and push them to the branch on your fork (unless editing
+   on GitHub directly).
+
+6. Open a pull request:
+
+   * **Source:** your RC branch on your fork (head repository)
+   * **Target:** the `dev` branch on the conda-forge feedstock (base
+     repository)
+
+---
+
+### 4. Merge the PR Once CI Passes
+
+After CI completes successfully:
+
+* Review the logs if needed
+* Merge the PR into the `dev` branch
+
+The RC build will now be published to:
+
+```
+conda-forge/label/e3sm_diags_dev
+```
+
+You can test the RC by installing it like so:
+
+```bash
+conda install -c conda-forge/label/e3sm_diags_dev e3sm_diags
+```
+
+---
+
+## Summary
+
+Creating an RC for a dependency involves:
+
+1. Tagging the source repository
+2. Opening a PR on the feedstock targeting the `dev` branch
+3. 
Waiting for CI to pass, then merging
+
+This process enables E3SM-Unified maintainers to incorporate the RC version of
+your package into a unified test build.
+
+➡ Next: [Creating an RC for E3SM-Unified](rc-e3sm-unified.md)
diff --git a/docs/releasing/creating-rcs/rc-e3sm-unified.md b/docs/releasing/creating-rcs/rc-e3sm-unified.md
new file mode 100644
index 00000000..1e0f7a33
--- /dev/null
+++ b/docs/releasing/creating-rcs/rc-e3sm-unified.md
@@ -0,0 +1,174 @@
+# Creating an RC for E3SM-Unified
+
+Once release candidates (RCs) of core E3SM packages (like `e3sm_diags`,
+`mpas-analysis`, etc.) have been published, an RC version of the E3SM-Unified
+metapackage can be built and tested.
+
+This guide walks through creating a release candidate of `e3sm-unified` based
+on those RC dependencies.
+
+---
+
+## 1. Create a Branch for the New Version
+
+Create a feature branch on your fork of `e3sm-unified`, typically called:
+
+```bash
+update-to-<version>
+```
+
+Example:
+
+```bash
+git checkout -b update-to-1.12.0
+```
+
+---
+
+## 2. Update the Conda Recipe
+
+Edit `recipes/e3sm-unified/meta.yaml`:
+
+* Update the `version` field to match the RC version (e.g., `1.12.0rc1`)
+* Update the list of dependencies and versions, including RCs
+* Be sure to include the correct version for each core tool (e.g.,
+  `e3sm_diags`, `mpas-analysis`, etc.)
+
+---
+
+## 3. Regenerate the Build Matrix
+
+Run the matrix generator script to define combinations of Python and MPI:
+
+```bash
+cd recipes/e3sm-unified/configs
+rm *.yaml
+python generate.py
+```
+
+This produces matrix files like:
+
+* `mpi_mpich_python3.10.yaml`
+* `mpi_hpc_python3.10.yaml`
+
+---
+
+## 4. Edit `build_package.bash`
+
+Update the channel list to include dev labels for any packages still in RC
+form. 
For example:
+
+```bash
+channels="-c conda-forge/label/chemdyg_dev \
+          -c conda-forge/label/e3sm_diags_dev \
+          -c conda-forge/label/mache_dev \
+          -c conda-forge/label/mpas_analysis_dev \
+          -c conda-forge/label/zppy_dev \
+          -c conda-forge/label/zstash_dev \
+          -c conda-forge"
+```
+
+Then define which matrix files to test. For example:
+
+```bash
+for file in configs/mpi_mpich_python3.10.yaml configs/mpi_hpc_python3.10.yaml
+do
+    conda build -m $file --override-channels $channels .
+done
+```
+
+Make sure:
+
+* You use `--override-channels` to isolate testing to dev packages
+* You only include dev labels for packages with RCs — use stable versions
+  otherwise
+
+---
+
+## 5. Build and Troubleshoot
+
+Run the script:
+
+```bash
+bash build_package.bash
+```
+
+If builds fail, consult the
+[Troubleshooting Conda Build Failures](rc-troubleshooting.md) guide.
+This includes how to use `conda_first_failure.py` to debug dependency
+resolution issues.
+
+---
+
+## 6. Make a draft PR
+
+Push the branch to your fork of `e3sm-unified` and make a draft PR to the
+main `e3sm-unified` repo. Use that PR to document progress and highlight
+important version updates in this release for the public (those without
+access to E3SM's Confluence pages). See
+[this example](https://github.com/E3SM-Project/e3sm-unified/pull/125).
+
+---
+
+## 7. Keeping updated on Confluence
+
+As deployment and testing progress, you need to make sure that the packages
+in your `update-to-<version>` branch match the
+[agreed-upon versions on Confluence](https://e3sm.atlassian.net/wiki/spaces/DOC/pages/129732419/Packages+in+the+E3SM+Unified+conda+environment#Next-versions).
+Maintainers of dependencies will need to inform you as new release candidates
+or final releases become available, preferably by updating Confluence and also
+sending a Slack message or email.
+
+As testing nears completion, it is also time to draft a release note, similar
+to [this example](https://e3sm.atlassian.net/wiki/spaces/DOC/pages/4908515329/E3SM-Unified+1.11.0+release+notes).
+Ask maintainers of any of the main E3SM-Unified packages that have been
+updated since the last release to describe (**briefly and with minimal
+jargon**) what is new in their package that would be of interest to users.
+
+---
+
+## 8. Tag and Publish the RC
+
+After test builds are successful:
+
+### Tag a Release Candidate
+
+Tag your `update-to-<version>` branch in the `e3sm-unified` repo:
+
+```bash
+git checkout update-to-1.12.0
+git tag 1.12.0rc1
+git remote add E3SM-Project/e3sm-unified git@github.com:E3SM-Project/e3sm-unified.git
+git fetch --all -p
+git push E3SM-Project/e3sm-unified 1.12.0rc1
+```
+
+### Create a Conda-Forge PR
+
+1. Fork the [`e3sm-unified-feedstock`](https://github.com/conda-forge/e3sm-unified-feedstock)
+
+2. Create a new branch in your fork (e.g., `update-1.12.0rc1`)
+
+3. Edit `recipe/meta.yaml`:
+
+   * Update the `version` field (e.g., `1.12.0rc1`)
+   * Update all dependencies to match the versions in your
+     `update-to-<version>` branch of the `e3sm-unified` repo
+
+   ⚠️ **Reminder:** The feedstock’s `meta.yaml` is the authoritative source
+   for the Conda package. The one in the `e3sm-unified` repo is for testing
+   and provenance only.
+
+4. Open a PR from your fork → `dev` branch on the feedstock
+
+5. Merge once CI passes
+
+The RC package will now be available under the label:
+
+```
+conda-forge/label/e3sm_unified_dev
+```
+
+It’s ready to be tested and deployed on HPC systems.
+ +➑ Next: [Deploying on HPCs for Testing](../testing/deploying-on-hpcs.md) diff --git a/docs/releasing/creating-rcs/rc-troubleshooting.md b/docs/releasing/creating-rcs/rc-troubleshooting.md new file mode 100644 index 00000000..d8abc96e --- /dev/null +++ b/docs/releasing/creating-rcs/rc-troubleshooting.md @@ -0,0 +1,119 @@ +# Troubleshooting Conda Build Failures + +When building a release candidate (RC) of E3SM-Unified, it's common to +encounter solver or build failures due to dependency conflicts, pinning +mismatches, or version incompatibilities. + +This page outlines common issues and how to debug them effectively. + +--- + +## Common Failure: Conda Solver Errors + +The most frequent issue occurs during environment solving: + +```bash +ResolvePackageNotFound: + - some_package=1.2.3 +``` + +Or more subtly: + +```bash +Found conflicts! Looking for incompatible packages. +...UnsatisfiableError: The following specifications were found to be incompatible... +``` + +These often stem from: + +* Conflicting dependencies across packages +* Incompatible versions due to partially-completed conda-forge migrations +* Conda environment solver hitting internal limits + +--- + +## Strategy: Use `conda_first_failure.py` + +To help identify the root cause, E3SM-Unified provides: + +``` +recipes/e3sm-unified/conda_first_failure.py +``` + +This utility performs a dry-run install using a list of dependencies, then +uses bisection to find the first package that causes solver failure. + +### Usage + +1. Copy the list of dependencies from `meta.yaml` β†’ `build:` section into a + text file (e.g., `specs.txt`) + +2. Constrain the python version to a single minor version, e.g.: + + ``` yaml + - python >=3.10,<3.11 + ``` + +3. 
Remove or replace any jinja2 templating or conflicting selector comments,
+   for example:
+
+   * replace `{{ mpi_prefix }}` with `mpi_mpich` or `nompi`
+   * depending on the python version you are testing, pick only one of:
+     ``` yaml
+     - pyproj 3.6.1  # [py<310]
+     - pyproj 3.7.0  # [py>=310]
+     ```
+
+4. Run the script:
+
+```bash
+python conda_first_failure.py specs.txt
+```
+
+5. The script will print the **first package** that causes a conflict.
+
+### Interpreting Results
+
+* The failing package might not be the root issue — it may simply conflict
+  with another dependency in the list.
+* To explore this, move the failing package to the **top** of the list and
+  re-run the script. The new failure likely points to a **conflicting pair**.
+* Examine the dependencies (via the respective conda-forge feedstocks) of
+  the conflicting pair of packages to see if you can understand the conflict.
+
+---
+
+## Advanced Debugging Tips
+
+* Add transitive dependencies to the `specs.txt` file (e.g., if the problem
+  might involve `hdf5`, add `libnetcdf`, `netcdf4`, etc.)
+* Compare dependency trees using:
+
+```bash
+conda create --dry-run -n test-env <packages>
+```
+
+* Use `conda search <package>` with `--info` to inspect available versions
+  and build strings
+
+---
+
+## When You’re Stuck
+
+As the E3SM-Unified maintainer, you're likely the most experienced person on
+the team when it comes to Conda packaging and dependency resolution.
+
+If the conflict is particularly subtle or deep within upstream packages:
+
+* Dig into transitive dependencies (e.g., `conda-tree` can be useful)
+* Inspect recent changes to pinned versions in conda-forge
+* Examine dependency metadata from feedstocks
+
+When further help is needed, reach out directly to:
+
+* Other E3SM tool maintainers (e.g., for `e3sm_diags`, `mpas_analysis`, etc.)
+* Colleagues with Spack or Conda-forge packaging experience
+
+Ultimately, it’s up to the release engineer to resolve these issues through
+investigation, collaboration, or temporary workarounds until a proper fix is
+found.
diff --git a/docs/releasing/finalizing-release.md b/docs/releasing/finalizing-release.md
new file mode 100644
index 00000000..2db8d26f
--- /dev/null
+++ b/docs/releasing/finalizing-release.md
@@ -0,0 +1,124 @@
+# Publishing the Final Release
+
+Once all dependencies have been tested and validated, and the E3SM-Unified
+release candidate (RC) has passed testing across the relevant HPC systems, the
+final release can be published. This page outlines the process of finalizing
+and distributing an official E3SM-Unified release.
+
+---
+
+## ✅ Pre-Release Checklist
+
+Before publishing:
+
+* [ ] All RC versions of dependencies (e.g., `e3sm_diags`, `zppy`, `mache`)
+  have been released with final version tags and conda-forge packages
+* [ ] Final version of `e3sm-unified` has been created and built on conda-forge
+* [ ] Final deployments have been completed on all target HPC machines
+* [ ] Smoke testing and key workflows (e.g., `zppy`, `mpas_analysis`) have
+  been validated
+
+---
+
+## Step-by-Step Finalization
+
+### 1. Remove RC Labels
+
+Edit `recipes/e3sm-unified/meta.yaml` and:
+
+* Replace RC versions of dependencies (e.g., `3.0.0rc2`) with final versions
+  (e.g., `3.0.0`) in both `meta.yaml` and `default.cfg`
+* Bump the `e3sm-unified` version accordingly (e.g., from `1.12.0rc3` to
+  `1.12.0`) in `meta.yaml` and `e3sm_supported_machines/shared.py`
+
+Commit the changes to your `update-to-<version>` branch.
+
+### 2. Tag Final Release in Source Repo
+
+If you followed the suggested workflow under
+[Creating an RC for E3SM-Unified](creating-rcs/rc-e3sm-unified.md), you should
+have a draft PR from your `update-to-<version>` branch that documents the
+changes. Merge this PR into `main` so the release history and testing context
+are preserved.
+ +Then, go to `Releases` on the right on the +[main page](https://github.com/E3SM-Project/e3sm-unified) of the repo and +click `Draft a new release` at the top. + +Document the changes in this version (hopefully just copy-paste from the +description of your recently merged PR), similar to +[this example](https://github.com/E3SM-Project/e3sm-unified/releases/tag/1.11.0). + +### 3. Submit Final Feedstock PR + +Go to the [e3sm-unified-feedstock](https://github.com/conda-forge/e3sm-unified-feedstock): + +* Open a pull request from your fork +* Update the version number and `sha256` hash. +* Target the `main` branch (not `dev`) +* Ensure final versions of all dependencies are listed + +Once CI passes, merge the PR. + +This will trigger CI to publish the new release to the standard conda-forge +channel. You typically need to wait as long as an hour after packages have +built for them to become available for installation. You can watch +[this page](https://anaconda.org/conda-forge/e3sm-unified/files) +to see when files appear and how many downloads they have. Once all files have +been built and show 2 or more downloads, you should be good to proceed with +final deployment. + +### 4. Deploy Final Release on HPC Systems + +Use the same process as during RC testing, but now with the `--release` flag: + +```bash +./deploy_e3sm_unified.py --conda ~/miniforge3 --release +``` + +This creates new activation scripts like: + +* `load_e3sm_unified__.sh` + +Also generates symlinks like: + +* `load_latest_e3sm_unified_.sh` + +### 5. 
Announce the Release + +Share the release: + +* πŸ“ **Confluence**: [like this example](https://e3sm.atlassian.net/wiki/spaces/DOC/pages/4908515329/E3SM-Unified+1.11.0+release+notes) +* **Email** to [E3SM All-hands](https://e3sm.atlassian.net/wiki/spaces/ED/pages/818381294/Email+Lists) list (same contents as Confluence page) +* πŸ“£ **Slack** (`#e3sm-help-postproc`) with release highlights + +Be sure to include: + +* Final versions of core E3SM-developed packages (e.g., `mpas_analysis`, + `zppy`) +* List of supported HPC machines and activation instructions +* Summary of major changes, fixes, and new features + +--- + +## πŸ” Post-Release Maintenance + +On each supported machine: + +* Clean up outdated `test_...` activation scripts +* Remove conda and spack environments for E3SM-Unified RCs +* Delete the `update-to-` branch +* Move the contents on Confluence describing the + [current version](https://e3sm.atlassian.net/wiki/spaces/DOC/pages/129732419/Packages+in+the+E3SM+Unified+conda+environment#Current-Version) + to the top of the + [previous versions](https://e3sm.atlassian.net/wiki/spaces/DOC/pages/3236233332/Packages+in+previous+versions+E3SM+Unified+conda+environment) page +* Copy the contents of the next version to be the new current version +* Update the version under "next version" and remove all bold (to indicate + that, as a starting point, no updates have been made to any packages in the + next version) +* Move any release notes for older E3SM-Unified versions into the Confluence + subdirectory for [previous versions](https://e3sm.atlassian.net/wiki/spaces/DOC/pages/3236233332/Packages+in+previous+versions+E3SM+Unified+conda+environment).
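The first two cleanup items can be scripted as a dry run before deleting anything on a shared system. A sketch only: the base directory and the naming pattern for release-candidate activation scripts are assumptions modeled on the script names used elsewhere in these docs.

```python
from pathlib import Path
import tempfile

def find_rc_leftovers(base: Path):
    """Find release-candidate activation scripts left over from testing."""
    return sorted(base.glob("test_e3sm_unified_*rc*"))

# Demo against a throwaway directory rather than a real install location.
base = Path(tempfile.mkdtemp())
(base / "test_e3sm_unified_1.11.0rc3_chrysalis.sh").touch()
(base / "load_e3sm_unified_1.11.0_chrysalis.sh").touch()  # release script: keep

for script in find_rc_leftovers(base):
    print("would remove:", script.name)
```

Printing rather than deleting makes it easy to review the list (and share it on Slack) before removing anything.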
+ +--- + +➑ Next: [Maintaining Past Versions](maintaining-past-versions.md) diff --git a/docs/releasing/maintaining-past-versions.md b/docs/releasing/maintaining-past-versions.md new file mode 100644 index 00000000..2082997b --- /dev/null +++ b/docs/releasing/maintaining-past-versions.md @@ -0,0 +1,98 @@ +# Maintaining Past Versions + +After a new version of E3SM-Unified is released, older versions may still be +in use for months or years by analysis workflows, diagnostic pipelines, or +collaborators working on older datasets. This page outlines best practices for +keeping past versions available and usable. + +--- + +## 🎯 Goals + +* Ensure long-term reproducibility +* Avoid breaking existing workflows +* Minimize overhead for maintainers +* Free up limited disk space when required + +--- + +## πŸ”’ Avoid Breaking Changes + +### Don’t Delete Spack or Conda Environments + +E3SM-Unified installs are isolated by version. Do not delete directories like: + +```bash +/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.11.0/ +/lcrc/soft/climate/e3sm-unified/spack/e3sm_unified_1.11.0_chrysalis_gnu_mpich/ +``` + +These environments may be used by others via scripts, batch jobs, or notebooks. + +**Exception**: If the environment is broken beyond repair and cannot be +recreated, it should be removed. If there is no more disk space for software, +the oldest environments must be deleted to make room for new ones. Use your +best judgment and document removals on Confluence. + +### Don’t Remove Activation Scripts + +Keep activation scripts for previous versions (e.g., +`load_e3sm_unified_1.11.0_chrysalis.sh`) in place. + +**Exception**: If the environment has been removed, it is safe to remove the +associated activation scripts. 
+ +--- + +## 🧹 What Can Be Removed + +### Test Environments + +You can safely delete environments or activation scripts for +**release candidates**: + +* `test_e3sm_unified_1.11.0rc3_*.sh` +* Conda environments like `test_e3sm_unified_install` + +These were used only during internal testing and should be removed when they +are no longer needed to free up disk space. + +### Intermediate Build Artifacts + +Temporary logs or caches (e.g., from failed deployments) can be removed to +save space. + +--- + +## πŸ” Rebuilding Past Versions + +If a past version breaks due to: + +* OS upgrades +* Module stack changes +* File system reorganizations + +...you may need to rebuild that version. Follow these steps: + +1. Check out the appropriate tag in the `e3sm-unified` repo (e.g., `1.11.0`) +2. Use `deploy_e3sm_unified.py` with the `--version` flag (as a precaution): + +```bash +./deploy_e3sm_unified.py --conda ~/miniforge3 --version 1.11.0 --release --recreate +``` + +You may run into difficulty solving for older conda environments, e.g., because +of packages that have been marked as broken in the interim. At some point, it +may simply not be possible to recreate older E3SM-Unified conda environments +because of this. + +--- + +## πŸ’¬ Communication + +* Coordinate cleanup of old versions via Slack (`#e3sm-help-postproc`) +* Use Confluence notes to document version removals or rebuilds + +--- + +Back to: [Publishing the Final Release](finalizing-release.md) diff --git a/docs/releasing/planning-updates.md b/docs/releasing/planning-updates.md new file mode 100644 index 00000000..f5dab30e --- /dev/null +++ b/docs/releasing/planning-updates.md @@ -0,0 +1,194 @@ +# Planning Package Updates + +Before each release of E3SM-Unified, the Infrastructure Team works with the +broader E3SM community to decide which packages and versions should be +included. This planning stage helps ensure compatibility across tools and +supports evolving analysis and diagnostic workflows.
+ +*Note: Access to Confluence and Slack is limited to E3SM/BER collaborators.* + +--- + +## Where Planning Happens + +* **Confluence Discussion Page**: Most planning takes place on the "Next + Version" page in the internal E3SM Documentation space. For those with + access, use [this link](https://e3sm.atlassian.net/wiki/spaces/DOC/pages/129732419/Packages+in+the+E3SM+Unified+conda+environment#Next-versions) +* **GitHub Issues/PRs**: Occasionally, suggestions or discussions take place + in issues or pull requests on the + [E3SM-Unified GitHub repository](https://github.com/E3SM-Project/e3sm-unified). + This is the main avenue for community members without access to E3SM's + Confluence pages. +* **Slack (`#e3sm-help-postproc`)**: For quick suggestions or discussion + prompts for those with access to E3SM's or BER's Slack spaces. + +--- + +## Types of Updates + +### βœ… New Packages + +* Tools that have become important to E3SM workflows +* Visualization, diagnostics, or file conversion utilities + +### ⬆️ Version Updates + +* Upgrading packages already in the environment to more recent releases +* Ensuring compatibility with latest E3SM output formats or Python versions + +### ❌ Package Removal + +* Rare, but sometimes necessary for deprecated tools or packages no longer + maintained + +--- + +## Making Suggestions + +The best way to suggest a package or version change: + +1. Edit the **Confluence table** for the upcoming version (if you have access) +2. If not, open an issue on GitHub with your suggestion and rationale +3. 
Optional: Tag maintainers or post on Slack to coordinate + +When requesting a new package, please include: + +* Package name and version +* Maintainer or expert point of contact (if known) +* Why it's useful for E3SM workflows + +--- + +## Final Selection + +The final list of packages is curated by the Infrastructure Team based on: + +* Compatibility +* Stability of upstream packages +* Success in testing +* Community need and usage patterns + +Once the list is mostly settled, the team begins creating release candidates +using the [Creating Release Candidates](creating-rcs.md) workflow. + +--- + +## πŸ“¦ Managing Version Pins During Conda-Forge Migrations + +E3SM-Unified often needs to coordinate with conda-forge's centralized version +pinning. Many packages used by E3SM-Unified are governed by +[global pins](https://conda-forge.org/docs/maintainer/pinning_deps/) (exact +required versions) in: + +* [`conda_build_config.yaml`](https://github.com/conda-forge/conda-forge-pinning-feedstock/blob/main/recipe/conda_build_config.yaml) + +### πŸ”€ What Happens During a Migration? + +Conda-forge frequently upgrades pinned versions of core libraries (e.g., +`hdf5`, `libnetcdf`, `proj`) via version migrations. These are tracked under: + +* [`migrations/`](https://github.com/conda-forge/conda-forge-pinning-feedstock/tree/main/recipe/migrations/) + +For example, the migration to `hdf5` 1.14.6 is described here: + +* [`hdf51146.yaml`](https://github.com/conda-forge/conda-forge-pinning-feedstock/blob/c78051a2495698e9e612860efe058eb7e39fc528/recipe/migrations/hdf51146.yaml) + +### ⚠ Why This Matters + +If your dependencies are built against different versions of a migrating +library, you can end up with **incompatible binary builds** that silently fail +or break at runtime β€” especially with low-level C/C++ or Fortran dependencies. + +### 🧠 How to Handle It + +During planning, for any core dependency that is pinned: + +1. Check if it's listed in a migration YAML file. +2. 
Determine how far the migration has progressed on the + [conda-forge status page](https://conda-forge.org/status/): + + * If **all** of E3SM-Unified’s dependencies have adopted the new version, + use it. + * If **none** have, stick with the current version in + `conda_build_config.yaml`. + * If **some** have and some haven’t: + βž” You must *freeze to the old version* and *manually rebuild* any + migrated packages against that version (typically merging to a branch + on the conda-forge feedstock other than `main`). + +**Note:** Some packages in E3SM-Unified directly depend on pinned libraries +like `hdf5` or `libnetcdf` β€” that is, these pinned packages are +*dependencies of E3SM-Unified's dependencies*. + +If E3SM-Unified requires an older version of such a package (e.g., `nco` or +`moab`), and that version has only been built on conda-forge with an older +version of the pinned library, you may encounter compatibility issues during +the build. + +In these cases, it is often easier to upgrade the E3SM-Unified dependency to +a newer version that was built against the newer pinned library β€” as long as +that version is still compatible with the rest of the environment. This avoids +the complexity of manually rebuilding older versions with newer core libraries. + +### πŸŒ€ Multiple Migrations + +When multiple overlapping migrations are in progress (e.g., `hdf5`, +`libnetcdf`), assess each separately but prioritize compatibility. This is +often one of the trickiest parts of managing an E3SM-Unified release candidate. + +### πŸ“¦ E3SM-Unified Dependencies Affected by Conda-Forge Pins + +The ones in **bold** are those where we provide pins of our own and special care +must be taken. The rest either are unconstrained in E3SM-Unified or use version +constraints without strict pins (i.e., `>=` or `<` rather than an exact version), +so less care is required. 
+ +* ffmpeg +* **hdf5** +* **libnetcdf** +* numpy +* **proj** +* **python** +* scipy + +This list may evolve from release to release as new packages are added or +pinned more strictly. + +--- + +## πŸ”„ Other Places That Require Updates + +In addition to updating versions in `meta.yaml` and the conda-forge feedstocks, +the following deployment-related files should also be kept in sync: + +### πŸ“¦ `e3sm_supported_machines/default.cfg` + +This file specifies the versions of key packages (both Spack-built and +Conda-installed) used during deployment. + +**Best Practice:** Package versions listed in `default.cfg` should typically +match the versions in `recipes/e3sm-unified/meta.yaml`, unless there's a clear +technical reason to diverge (e.g., system module incompatibilities or build +issues with a newer version). + +**ESMPy:** We have found that `ESMPy` built with system compilers is not +compatible with `xesmf`, which is used by `xcdat` and `e3sm_diags`. As a +result, the current best practice is to set `esmpy = None` in `default.cfg`. + +Maintainers should update these entries as part of planning and testing a new +release candidate. + +### πŸ› οΈ `e3sm_supported_machines/shared.py` + +The version of E3SM-Unified being deployed is hard-coded here: + +```python +parser.add_argument("--version", dest="version", default="1.11.1", + help="The version of E3SM-Unified to deploy") +``` + +This value should be updated manually for **each new release candidate** and +final release to reflect the current version being tested or deployed. + +> 🧩 Note: We plan to automate this step in the future, but for now it must be +updated manually.
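Until then, a rough sketch of what that automation could look like, assuming the version string appears only in the argparse default shown above (the real layout of `shared.py` may differ):

```python
import re

def bump_version(source: str, new_version: str) -> str:
    """Rewrite the hard-coded default version in the argparse call."""
    return re.sub(r'default="[^"]+"', f'default="{new_version}"', source, count=1)

# Stand-in for the contents of shared.py (illustrative, not the real file).
snippet = '''parser.add_argument("--version", dest="version", default="1.11.1",
                    help="The version of E3SM-Unified to deploy")'''

print(bump_version(snippet, "1.12.0rc1"))
```

In practice such a script would read `shared.py`, apply the substitution, and write the file back; `count=1` guards against touching any other `default="..."` in the file.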
diff --git a/docs/releasing/release-workflow.md b/docs/releasing/release-workflow.md new file mode 100644 index 00000000..d659d9c3 --- /dev/null +++ b/docs/releasing/release-workflow.md @@ -0,0 +1,102 @@ +# The E3SM-Unified Release Workflow + +Releasing a new version of E3SM-Unified is an iterative, collaborative process +involving Conda, Spack, and coordinated deployment across HPC systems. This +guide serves as a roadmap for each stage in the workflow. + +Whether you're updating packages, building release candidates, or testing on +HPC platforms, this section documents the steps needed to bring a new version +of E3SM-Unified from planning to full deployment. + +--- + +## Overview of the Workflow + +The release process typically follows this progression: + +1. **[How Conda and Spack Work Together in E3SM-Unified](conda-vs-spack.md)** +2. **[Planning Package Updates](planning-updates.md)** +3. **[Creating Release Candidates](creating-rcs/overview.md)** +4. **[Deployment and Testing](testing/overview.md)** +5. **[Adding a New Machine](adding-new-machines.md)** +6. **[Finalizing the Release](finalizing-release.md)** +7. **[Maintaining Past Versions](maintaining-past-versions.md)** + +Each of these steps is detailed in its own page. See below for a high-level +summary. + +--- + +## 1. How Conda and Spack Work Together in E3SM-Unified + +Why does E3SM-Unified use both Conda and Spack? What roles do they each serve? +Before you start, it's critical to understand how these two systems work +together. + +πŸ”— [Read more](conda-vs-spack.md) + +--- + +## 2. Planning Package Updates + +Updates are driven by the needs of the E3SM community, typically discussed via +Confluence or GitHub. This step documents how to propose new packages or +changes to existing ones. + +πŸ”— [Read more](planning-updates.md) + +--- + +## 3. Creating Release Candidates + +This step covers: + +* Making RCs for core tools (e.g., E3SM Diags, MPAS-Analysis, zppy) +* Building an `e3sm-unified` RC + +πŸ”— [Read more](creating-rcs/overview.md) + +--- + +## 4. 
Deploying and Testing on HPCs + +Before full deployment, release candidates are installed on a subset of HPC +platforms for iterative testing and validation. This stage often requires +extensive coordination and may involve debugging and extending the Spack build +workflow, updating the E3SM Spack fork, and modifying `mache` to support new +systems or changes in machine configurations. + +Testing includes everything from basic imports to full `zppy` workflows. This +is a collaborative effort, with the full iterative process often spanning +several weeks to a few months. + +πŸ”— [Read more](testing/overview.md) + +--- + +## 5. Adding a New Machine + +Most of the work for adding a new machine takes place in `mache`. Here we +provide notes on adding new HPCs that are specific to E3SM-Unified. + +πŸ”— [Read more](adding-new-machines.md) + +--- + +## 6. Finalizing the Release + +Once all RCs pass testing: + +* Make final releases of all dependencies +* Publish the final E3SM-Unified conda package +* Deploy across all supported HPC machines +* Announce the release to the community + +πŸ”— [Read more](finalizing-release.md) + +--- + +## 7. Maintaining Past Versions + +Older versions of E3SM-Unified sometimes require maintenance (repairs or +deletion). + +πŸ”— [Read more](maintaining-past-versions.md) diff --git a/docs/releasing/testing/deploying-on-hpcs.md b/docs/releasing/testing/deploying-on-hpcs.md new file mode 100644 index 00000000..703ed8e7 --- /dev/null +++ b/docs/releasing/testing/deploying-on-hpcs.md @@ -0,0 +1,224 @@ +# Deploying on HPCs + +Once a release candidate of E3SM-Unified is ready, it must be deployed and +tested on HPC systems using a combination of Spack and Conda-based tools. +Deployment scripts and configurations live within the `e3sm_supported_machines` +directory of the E3SM-Unified repo. + +This document explains the deployment workflow, what needs to be updated, and +how to test and validate the install. 
+ +--- + +## Deployment Components + +Deployment happens via the following components: + +### πŸ”§ `deploy_e3sm_unified.py` + +* The main entry point for deploying E3SM-Unified +* Installs the combined Conda + Spack environment on supported systems +* Reads deployment config from `default.cfg` and shared logic in `shared.py` + +You can find the full list of command-line flags with: + +```bash +./deploy_e3sm_unified.py --help +``` + +You must supply `--conda` at a minimum. This is the path to a conda +installation (typically in your home directory) where the deployment tool +can create a conda environment (`temp_e3sm_unified_install`) used to install +E3SM-Unified. This environment includes the `mache` package, which can +automatically recognize the machine you are on and configure accordingly. + +For release builds (but not release candidates), you should supply +`--release`. If this flag is **not** supplied, the activation scripts +created during deployment will start with `test_e3sm_unified_...` whereas +the release versions will be called `load_latest_e3sm_unified_...` and +`load_e3sm_unified_...`. + +Other flags are optional and will be discussed below. + +### πŸ“ `default.cfg` + +* Specifies which packages and versions to install via Spack as well as the + versions of some conda packages required in the installation environment + (notably `mache`) +* Version numbers here should match `meta.yaml` unless diverging for a reason +* A special case is `esmpy = None`, required so ESMPy comes from conda-forge, + not Spack. + +### βš™οΈ `shared.py` + +* Contains logic shared between `deploy_e3sm_unified.py` and `bootstrap.py` +* Defines the version of E3SM-Unified to deploy (hard-coded) + +### 🧰 `bootstrap.py` + +* Used by `deploy_e3sm_unified.py` to build and configure environments once + the `temp_e3sm_unified_install` conda environment has been created. 
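One subtlety with the `esmpy = None` entry described above: `.cfg` values are read as plain strings, so `None` here is a sentinel string that deployment logic must compare against, not Python's `None`. A small illustration (the config fragment and section name are invented, not the real `default.cfg` contents):

```python
import configparser

# Invented fragment in the style of default.cfg (not the real file).
cfg_text = """
[e3sm_unified]
esmpy = None
"""

cfg = configparser.ConfigParser()
cfg.read_string(cfg_text)

value = cfg.get("e3sm_unified", "esmpy")
# The sentinel string 'None' means: take ESMPy from conda-forge, not Spack.
from_conda_forge = (value == "None")
print(value, from_conda_forge)
```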
+ +### πŸ§ͺ Templates + +The `e3sm_supported_machines/templates` subdirectory contains jinja2 templates +used during deployment. + +* Build script template: + + * `build.template`: Used during deployment to build and install versions of + the following packages using system compilers and MPI (if requested): + + ```bash + mpi4py + ilamb + esmpy + xesmf + ``` + + * Maintainers may need to add new packages to the template over time. + Typically, the dependencies here are python-based but use system compilers + and/or MPI. Spack must not install Python itself, as this would conflict + with the Conda-managed Python environment. All Python packages need to be + installed into the Conda environment. + +* Activation script templates: + + * `load_e3sm_unified.sh.template` + * `load_e3sm_unified.csh.template` + * Since E3SM itself cannot be built when E3SM-Unified is active, these + scripts set: + + ```bash + CIME_MODEL="ENVIRONMENT_RUNNING_E3SM_UNIFIED_USE_ANOTHER_TERMINAL" + ``` + + This is supposed to tell users that they cannot build E3SM with this + terminal window (because E3SM-Unified is loaded) and they should open + a new one. Some users have not found this very intuitive but we don't + currently have a better way for E3SM to detect that E3SM-Unified is active. + * These scripts also detect whether the user is on a compute or login node + via `$SLURM_JOB_ID` or `$COBALT_JOBID` environment variables (which should + only be set on compute nodes). + * Maintainers will need to edit these scripts to support new queuing systems + (e.g. PBS). + +--- + +## Typical Deployment Steps + +1. **Update config files**: + + * Set the target version in `shared.py` + * Update `default.cfg` with package versions (Spack + Conda) + * Update `mache` config files (see [Updating `mache`](mache-updates.md)) + +2. 
**Test the build** on one or more HPC machines: + + ```bash + cd e3sm_supported_machines + ./deploy_e3sm_unified.py --conda ~/miniforge3 + ``` + + **Note:** This can take a lot of time. If the connection to the HPC machine + is not stable, you should use `screen` or similar to preserve your + connection and you should pipe the output to a log file, e.g.: + + ```bash + ./deploy_e3sm_unified.py --conda ~/miniforge3 | tee deploy.log + ``` + + **Note:** It is not recommended that you try to deploy E3SM-Unified + simultaneously on two different machines that share the same base conda + environment (e.g. Anvil and Chrysalis). The two deployments will step on + each other's toes. + +3. **Check terminal output** and validate that: + + * Spack built the expected packages + * Conda environment was created and activated + * Activation scripts were generated and symlinked correctly + * Permissions have been updated successfully (read only for everyone + except the E3SM-Unified maintainer) + +4. **Manually test** tools in the installed environment + + * Load via: `source test_e3sm_unified__.sh` + * Run tools like `zppy`, `e3sm_diags`, `mpas_analysis` + +5. **Deploy more broadly** once core systems pass testing + +--- + +## Optional flags to `deploy_e3sm_unified.py` + +Here, we start with the flags that a maintainer is most likely to need, with +less useful flags at the bottom. + +* `--recreate`: Rebuilds the Conda environment if it already exists. This will + also recreate the installation environment `temp_e3sm_unified_install`. + + Note: This will **not** rebuild Spack packages from scratch. To do that, + manually delete the corresponding Spack directory before running the + deployment script again.
These directories are typically located under: + + ``` + spack/e3sm_unified____ + ``` + +* `--mache_fork` and `--mache_branch`: It is common to need to co-develop + E3SM-Unified and `mache`, and it is impractical to tag a release candidate + and build the associated conda-forge package every time. Instead, use these + flags to point to your fork and branch of `mache` to install into both + the installation and testing `conda` environments. **Do not use this + for release deployments.** + +* `--tmpdir`: Set the `$TMPDIR` environment variable for Spack to use in case + `/tmp` is full or not a desirable place to install. + +* `--version`: Typically you want to deploy the latest release candidate or + release, which should be the hard-coded default. You can set this to + a different value to perform a deployment of an earlier version if needed. + +* `--python`: Deploy with a different version of python than specified in + `default.cfg` + +* `-m` or `--machine`: Specify the machine if `mache` did not detect it + correctly for some reason. + +* `-c` or `--compiler`: Specify a different compiler than the default. To + determine the default compiler, find the machine under + [mache's machine config files](https://github.com/E3SM-Project/mache/tree/main/mache/machines). + To determine which other compilers are supported, look at the list of + [mache spack templates](https://github.com/E3SM-Project/mache/tree/main/mache/spack/templates) + (`yaml` files). + +* `-i` or `--mpi`: Similar to compilers, use this flag to specify an MPI + variant other than the default. As above, you can determine the defaults + and supported alternatives by looking in the configs and templates in + `mache`. + +* `-f` or `--config_file`: You can provide a config file to override defaults + from `default.cfg` or the config file for the specific machine from `mache`. + Use this with caution because this approach will be hard for other + maintainers to reproduce in the future. 
+ +* `--use_local`: Typically not useful but can be used in a pinch if you have + built conda packages locally in the installation you pointed to with `--conda` + and want to use them in the deployment. + +--- + +## Notes for Maintainers + +* A partial deployment is expected during RC testing; not all systems must be + built initially. Chrysalis and Perlmutter are good places to start. +* Always ensure that the E3SM spack fork has a `spack_for_mache_` + branch (e.g. `spack_for_mache_1.32.0`) for the version of `mache` you are + testing (e.g. `mache` 1.32.0rc1). +* Be aware of potential permission or filesystem issues when writing to + shared software locations. + +--- + +➑ Next: [Troubleshooting Deployment](troubleshooting-deploy.md) diff --git a/docs/releasing/testing/mache-updates.md b/docs/releasing/testing/mache-updates.md new file mode 100644 index 00000000..1ba37b2c --- /dev/null +++ b/docs/releasing/testing/mache-updates.md @@ -0,0 +1,156 @@ +# Updating `mache` + +`mache` is the configuration library used by E3SM-Unified (and related +projects like Polaris and Compass) to determine machine-specific settings, +including module environments and Spack configurations. + +During each E3SM-Unified release, it is often necessary to: + +* Add support for new machines +* Update Spack environment templates for existing systems +* Create release candidates and final versions of `mache` + +This page outlines the steps for maintaining and updating `mache` during the +release process.
+ +--- + +## Repo Location + +πŸ”— [https://github.com/E3SM-Project/mache](https://github.com/E3SM-Project/mache) + +--- + +## When to Update `mache` + +You should update `mache` when: + +* A supported machine has changed modules or compilers +* New machines are being targeted for deployment +* Spack YAML templates fall out of sync with system configurations +* You need to test new combinations of compiler + MPI + module environments + +Each change should be tested by deploying a release candidate of E3SM-Unified. + +--- + +## Key Tasks + +### 1. Update config options + +Each HPC machine supported by E3SM-Unified has a +[config file in `mache`](https://github.com/E3SM-Project/mache/tree/main/mache/machines). + +The config file has a section `[e3sm_unified]`, e.g.: + +```cfg +# Options related to deploying an e3sm-unified conda environment on supported +# machines +[e3sm_unified] + +# the unix group for permissions for the e3sm-unified conda environment +group = cels + +# the compiler set to use for system libraries +compiler = gnu + +# the system MPI library +mpi = openmpi + +# the path to the directory where activation scripts, the base environment, and +# system libraries will be deployed +base_path = /lcrc/soft/climate/e3sm-unified + +# whether to use E3SM modules for hdf5, netcdf-c, netcdf-fortran and pnetcdf +# (spack modules are used otherwise) +use_e3sm_hdf5_netcdf = False +``` + +These config options control the default deployment behavior, including the +Unix `group` that the E3SM-Unified environment will belong to, the +`compiler` and `mpi` library used to build E3SM-Unified Spack packages by +default, the `base_path` under which the conda and spack environments as well +as the activation scripts will be installed, and whether that machine will +use E3SM's version of `hdf5`, `netcdf-c`, `netcdf-fortran`, `parallel-netcdf`, +etc. or install them from Spack. + +### 2. 
Edit Spack Templates + +Spack environment templates live in: + +``` +mache/spack/templates/__.yaml +``` + +Edit these files to reflect updated system modules or new toolchains. +If adding a new machine, copy an existing `yaml` file to use as a template. + +Use the utility script to assist: +πŸ”— [utils/update_cime_machine_config.py README](https://github.com/E3SM-Project/mache/blob/main/utils/README.md) + +This script can be used to download the latest version of the +`config_machines.xml` file from E3SM's master branch, then compare it to the +previous version stored in `mache`, showing changes related to supported +machines. + +You should make the changes associated with the differences that this utility +displays in the appropriate `mache/spack/templates` files. You should then copy `new_config_machines.xml` into `mache/cime_machine_config/config_machines.xml` +as the new reference set of machine configurations that `mache` is in sync +with. + +--- + +### 3. Create a Release Candidate + +Use the typical GitHub flow: + +```bash +git checkout -b update-to-1.32.0 +# Make changes +# Push branch and open PR +``` + +Once the PR is reviewed and merged: + +* Tag a release candidate (e.g., `1.32.0rc1`) +* Publish it to conda-forge under `mache_dev` (by merging a PR that targets + the `dev` branch) + +This RC will be referenced in the E3SM-Unified build process. + +**Note:** As we will discuss later, it is also possible to test E3SM-Unified +with a development branch of `mache` available on GitHub. However, it is +always cleaner to use a release candidate. + +--- + +### 4. 
Finalize the Release + +Once testing across all platforms is complete: + +* Create a final version tag (e.g., `1.32.0`) +* Always use [semantic versioning](https://semver.org/) +* Submit a PR to `mache-feedstock` to update the recipe (this time targeting + the `main` branch) +* Merge once CI passes + +Afterward, update any references to the RC version in the E3SM-Unified repo to +point to the final release. + +--- + +## Best Practices + +* Be liberal in what system tools (`tar`, `CMake`, etc.) are defined as + `buildable: false` in Spack environments. Anything Spack doesn't have to + build saves time and avoids potential build errors due to inconsistent + toolchain assumptions. +* Regularly sync templates with actual E3SM production configurations +* Validate changes via test deployments of E3SM-Unified (or Polaris or Compass) + before tagging final versions. +* New mache releases will need to be made as needed by any of the + **downstream** repos β€” currently E3SM-Unified, Polaris, and Compass. + +--- + +➑ Next: [Deploying on HPCs](deploying-on-hpcs.md) diff --git a/docs/releasing/testing/overview.md b/docs/releasing/testing/overview.md new file mode 100644 index 00000000..ae5bdf49 --- /dev/null +++ b/docs/releasing/testing/overview.md @@ -0,0 +1,72 @@ +# Deployment and Testing Overview + +Once a release candidate (RC) of E3SM-Unified has been successfully built, it +must be thoroughly tested across supported HPC systems before a full release +can occur. This phase ensures compatibility with system modules, +performance-critical tools, and real-world analysis workflows. 
+ +This section documents the full testing and deployment process, including how +to: + +* Update the E3SM Spack fork to support new versions +* Maintain and release new versions of `mache` for system-specific Spack + configurations +* Deploy RCs and full releases of E3SM-Unified on supported HPC platforms +* Identify and resolve deployment issues + +--- + +## Phased Deployment Strategy + +Testing typically begins with a **partial deployment** of an E3SM-Unified RC +to a few key HPC systems. Once core functionality and package compatibility +are verified, a **full deployment** to all supported machines is performed. + +Each iteration involves collaboration between the Infrastructure Team and tool +maintainers to: + +* Validate that tools like `zppy`, `e3sm_diags`, and `mpas-analysis` run + correctly +* Confirm compatibility with system MPI, compilers, and Python versions +* Identify mismatches or conflicts in environment resolution + +--- + +## Key Components of the Deployment Process + +The following steps and infrastructure are used when testing and deploying a +new release: + +### πŸ› οΈ [Updating the E3SM Spack Fork](spack-updates.md) + +* Add new versions of performance-critical tools (e.g., NCO, ESMF, MOAB) +* Create `spack_for_mache_` branches for use in `mache` + +### 🧩 [Updating `mache`](mache-updates.md) + +* Keep system-specific Spack environment templates in sync with E3SM module + stacks +* Create RC and final releases of `mache` +* Use `utils/update_cime_machine_config.py` to streamline updates + +### πŸš€ [Deploying on HPCs](deploying-on-hpcs.md) + +* Use the `deploy_e3sm_unified.py` script and template infrastructure in + `e3sm_supported_machines` +* Build environments and activation scripts tailored to each system + +### πŸ§ͺ [Troubleshooting Deployment Issues](troubleshooting-deploy.md) + +* Resolve Spack build failures and MPI/compiler mismatches +* Address problems with activation, modules, or symbolic links +* Common pitfalls in `default.cfg` 
or `shared.py` configuration
+
+---
+
+## Audience
+
+This section is primarily intended for E3SM-Unified maintainers and release
+engineers. Familiarity with Spack, Conda, and HPC system environments is
+assumed.
+
+➑ Start with: [Updating the E3SM Spack Fork](spack-updates.md)
diff --git a/docs/releasing/testing/spack-updates.md b/docs/releasing/testing/spack-updates.md
new file mode 100644
index 00000000..29c80450
--- /dev/null
+++ b/docs/releasing/testing/spack-updates.md
@@ -0,0 +1,122 @@
+# Updating the E3SM Spack Fork
+
+E3SM-Unified relies on a custom fork of Spack to build performance-critical
+software components that are not managed by Conda. This fork includes
+specialized packages (e.g., `moab`, `tempestremap`, `esmf`) and system-aware
+configurations to support a wide range of HPC environments.
+
+This page outlines the steps for updating and managing the E3SM Spack fork
+during an E3SM-Unified release cycle.
+
+---
+
+## Repo Location
+
+The E3SM Spack fork lives at:
+πŸ”— [https://github.com/E3SM-Project/spack](https://github.com/E3SM-Project/spack)
+
+---
+
+## Key Tasks
+
+### 1. Add or Update Package Versions
+
+You may need to:
+
+* Add new versions of packages like `nco`, `moab`, `esmf`, `tempestremap`, etc.
+* Update build configurations, variants, or patches
+* Rebase onto new releases of the main [spack repo](https://github.com/spack/spack)
+
+Follow Spack’s standard packaging conventions. Builds will typically be tested
+as part of E3SM-Unified deployment (or deployment of Polaris or Compass), so
+no further testing is usually necessary or practical.
+
+After changes are validated, push them to the appropriate branch or branches
+(see next section).
+
+---
+
+### 2. Create `spack_for_mache_` Branches
+
+The main development branch on E3SM's Spack fork is `develop`. 
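As an aside on naming: the deployment scripts derive the Spack branch name
from the `mache` version in use, dropping any `rc` suffix. A minimal shell
sketch of that convention (the version string here is just an example):

```shell
# Derive the Spack branch name for a given mache version.
# Release-candidate suffixes (e.g. "rc1") are stripped, mirroring
# what the deployment scripts do when picking the branch.
mache_version="1.32.0rc1"
branch="spack_for_mache_${mache_version%%rc*}"
echo "${branch}"   # prints: spack_for_mache_1.32.0
```

So both `1.32.0rc1` and `1.32.0` map to the same branch,
`spack_for_mache_1.32.0`.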
Each release of
+`mache` also references a specific Spack branch named:
+
+```
+spack_for_mache_
+```
+
+Example:
+
+```
+spack_for_mache_1.32.0
+```
+
+To create one from a local clone of the E3SM spack repo:
+
+```bash
+git checkout develop
+git checkout -b spack_for_mache_1.32.0
+git push origin spack_for_mache_1.32.0
+```
+This ensures that the version of `mache` used for deployment has a stable and
+reproducible Spack reference. During development of a `mache` version, this
+also lets you make potentially breaking changes to `spack_for_mache_`
+for testing without breaking the `develop` branch. (Make sure to always push
+your changes to `origin` so they are available during E3SM-Unified deployment.)
+
+**Note**: Your `spack_for_mache_` branch name should not include
+`rc` even if you are testing a release candidate of `mache` as part of your
+E3SM-Unified deployment. The deployment scripts automatically strip off the
+`rc` part when determining the name of the appropriate spack branch.
+
+Once you have a relatively stable `spack_for_mache_` branch, you can
+push the changes you have made to `develop` so they are available for future
+`mache` versions and other users of E3SM's spack fork.
+
+```bash
+git checkout develop
+git reset --hard spack_for_mache_1.32.0
+git push origin develop
+```
+Please be careful not to use `git push --force` here. You should only be
+adding new commits, not changing the history of `develop`.
+
+### 3. Rebasing `develop` onto Spack Releases
+
+One important maintenance task for the E3SM Spack fork is to keep it up-to-date
+with the [main Spack repo](https://github.com/spack/spack). This requires
+interactively rebasing the `develop` branch onto the release, selecting only
+commits authored within the E3SM Spack fork (i.e., excluding upstream Spack
+commits), and troubleshooting any merge conflicts that arise.
+
+Because this will involve a force-push, it is important to coordinate with
+other users of the fork. 
Make an issue similar to
+[this example](https://github.com/E3SM-Project/spack/issues/36) and ping
+relevant developers to arrange a good time for the update.
+
+```bash
+git checkout develop
+git remote add upstream git@github.com:spack/spack.git
+git fetch --all -p
+git rebase -i v0.23.1
+# edit the list of commits so the first is "Add v2.1.0 to v2.1.6 to TempestRemap"
+git push --force origin develop
+```
+
+You may wish to perform the rebase using a new branch (e.g.,
+`rebase-onto-v0.23.1`) that you can point to in the issue you post to
+coordinate with other developers. This way, you can ask for guidance if you
+are unsure about the way you resolved any merge conflicts that arose.
+
+---
+
+## Best Practices
+
+* Keep `develop` clean and stable β€” avoid experimental changes
+* Use branches to track specific `mache` releases
+* Coordinate with other E3SM package maintainers when rebasing the `develop`
+  branch or updating shared packages
+
+---
+
+➑ Next: [Updating `mache`](mache-updates.md)
diff --git a/docs/releasing/testing/troubleshooting-deploy.md b/docs/releasing/testing/troubleshooting-deploy.md
new file mode 100644
index 00000000..e8e2dfcd
--- /dev/null
+++ b/docs/releasing/testing/troubleshooting-deploy.md
@@ -0,0 +1,149 @@
+# Troubleshooting Deployment
+
+Even with well-maintained tools, deployment of E3SM-Unified on HPC systems
+often encounters system-specific or environment-specific problems. This page
+outlines common categories of issues and how to diagnose and resolve them.
+
+This is an evolving list. Please make PRs to add descriptions of issues
+you have encountered and solutions you have found.
+
+---
+
+## 1. πŸ› οΈ Spack Build Failures
+
+### Common Causes
+
+* Missing or incompatible system modules (`cmake`, `perl`, `bison`, etc.) 
+* Outdated Spack package definitions in the `spack_for_mache_`
+  branch on the E3SM fork
+* Spack build cache pollution
+* Environment not set correctly for Spack to detect compilers/libraries
+
+### Solutions
+
+* If Spack is attempting to build common system tools (`cmake`, `tar`, etc.),
+  add their system versions to the Spack templates in `mache` with
+  `buildable: false`, so the system versions are used instead; this saves
+  time and prevents build problems.
+* Check with `spack find`, `spack config get compilers`, and
+  `spack config get modules`
+* Load required modules manually before re-running
+* Rebuild: `spack uninstall -y ` or delete the full deployment
+  directory
+* Double-check you are using the correct `spack_for_mache_` branch
+
+---
+
+## 2. πŸ”’ Activation Script or Module Issues
+
+### Symptoms
+
+* Scripts not found or symlinks broken
+* Compute node not detected
+
+### Fixes
+
+* Inspect Jinja2 templates for logic errors (especially for new systems)
+* Re-run deployment with `--recreate`
+* Validate compute node detection logic (`$SLURM_JOB_ID`, `$COBALT_JOBID`,
+  etc.)
+* For new schedulers (e.g., PBS), extend template logic accordingly
+
+---
+
+## 3. 🚫 Conda Environment Problems
+
+### Symptoms
+
+* Conda fails to resolve dependencies
+* Environments install but are missing key packages
+
+### Fixes
+
+* Run with `--recreate` to force a rebuild
+* Inspect logs carefully for root cause messages
+* Use `recipes/e3sm-unified/conda_first_failure.py` to bisect failing specs
+* Check for channel mismatches or conflicting dev-label dependencies
+
+---
+
+## 4. πŸ’Ύ Filesystem and Permission Issues
+
+### Symptoms
+
+* Scripts not executable by collaborators
+* Environment directories not group-readable
+
+### Fixes
+
+* Run: `chmod -R g+rx` and `chgrp -R ` as needed
+* Confirm deployment messages show permission updates succeeded
+* Use `ls -l` to inspect group ownership and mode bits
+* You may need to coordinate with administrators or previous maintainers to
+  set permissions (e.g. 
if you do not have write permission to contents + under the E3SM-Unified base environment) + +--- + +## 5. 🧰 `mache` Configuration Problems + +### Symptoms + +* Unknown machine error during deployment +* Spack fails to load environment due to incorrect module list + +### Fixes + +* Ensure the correct `mache` version or branch is being installed +* Ensure that the machine has been added to `mache` both under + [machine config files](https://github.com/E3SM-Project/mache/tree/main/mache/machines) + and in the logic for + [machine discovery](https://github.com/E3SM-Project/mache/blob/main/mache/discover.py) +* Validate updates to `config_machines.xml` and spack YAML templates +* Use `utils/update_cime_machine_config.py` to compare against upstream E3SM + config + +--- + +## 6. πŸͺ– Spack Caching and Environment Contamination + +### Symptoms + +* Builds complete but produce incorrect or stale binaries +* Environment behaves inconsistently between deploys + +### Fixes + +* Clear Spack caches manually if needed +* Always deploy from a clean `$TMPDIR` and fresh clone if unsure +* Delete the entire directory: + + ```bash + rm -rf spack/e3sm_unified____ + ``` + +--- + +## 7. ⚠️ Common Fix: Full Clean + Re-run + +When in doubt, remove and rebuild everything: + +```bash +rm -rf /spack/e3sm_unified____ +./deploy_e3sm_unified.py --conda ~/miniforge3 --recreate +``` + +This often resolves cases where previous state is interfering with a clean +build. + +--- + +## πŸ“Ž Related Tools & Tips + +* Use `screen`, `tmux`, or `nohup` for long deployments +* Always log output: `... 
| tee deploy.log`
+* Validate final symlinks and paths manually after deployment
+* Document which system + compiler + MPI variants have been tested
+
+---
+
+Back to: [Deploying on HPCs](deploying-on-hpcs.md)
diff --git a/docs/testing-release-candidates.md b/docs/testing-release-candidates.md
new file mode 100644
index 00000000..e69de29b
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
new file mode 100644
index 00000000..3917b0ec
--- /dev/null
+++ b/docs/troubleshooting.md
@@ -0,0 +1,74 @@
+# Troubleshooting & FAQs
+
+This page collects common problems users encounter when working with
+E3SM-Unified and how to resolve them. If you encounter an issue not listed
+here, please reach out via Slack or GitHub.
+
+---
+
+## Common Issues
+
+### "Permission Denied" When Installing E3SM-Unified
+
+**Symptom:**
+
+```
+OSError(13, 'Permission denied')
+```
+
+**Cause:** You're likely trying to install E3SM-Unified into a system-wide
+Python or Conda environment you don't have write access to.
+
+**Solution:**
+Install Miniforge3 in your home directory and create an environment locally;
+see the [Quickstart Guide](quickstart.md).
+
+---
+
+### "Module Not Found" When Importing Packages
+
+**Symptom:**
+
+```python
+ModuleNotFoundError: No module named 'e3sm_diags'
+```
+
+**Cause:** E3SM-Unified may not be activated correctly, or you're in a
+different shell/session.
+
+**Solution:**
+Re-source the appropriate `load_latest_e3sm_unified_.sh` script and
+retry.
+
+---
+
+### MPI-based Tools Fail on Login Nodes
+
+**Symptom:** Tools like `mpas_analysis` or `nco` crash with MPI errors.
+
+**Cause:** These tools are compiled with system MPI (or launch other tools
+that use system MPI) and require execution on compute nodes.
+
+**Solution:** Launch a batch job or an interactive compute session with
+`srun`, `salloc` or `qsub`, depending on your machine.
+
+---
+
+## Tips & Best Practices
+
+* Always check you're in the correct conda environment. 
+* On HPC systems, prefer running MPI-enabled tools on compute nodes.
+* If installing locally, make sure to create a clean environment with the
+  latest version of E3SM-Unified.
+* Refer to the [Quickstart Guide](quickstart.md) for environment setup
+  instructions.
+
+---
+
+## Still Need Help?
+
+```{admonition} Support
+- Slack: #e3sm-help-postproc
+- GitHub Issues: [E3SM-Unified on GitHub](https://github.com/E3SM-Project/e3sm-unified/issues)
+- Email: xylar@lanl.gov
+```
diff --git a/docs/using-toos.md b/docs/using-toos.md
new file mode 100644
index 00000000..cd4bde49
--- /dev/null
+++ b/docs/using-toos.md
@@ -0,0 +1,74 @@
+# Using E3SM-Unified Tools
+
+This section provides usage tips and examples for the most common tools
+included in the E3SM-Unified environment. It is aimed at both new and
+experienced users who want to understand how to apply these tools for analysis,
+diagnostics, and processing tasks.
+
+---
+
+## MPAS-Analysis
+
+> TODO: Add examples for configuring and running MPAS-Analysis, and interpreting output.
+
+* Description
+* Typical use cases
+* Running on HPC compute nodes
+* Tips for customizing config files
+
+---
+
+## E3SM Diags (`e3sm_diags`)
+
+> TODO: Add instructions and sample runs for generating diagnostics from E3SM output.
+
+* Description
+* Example input/output
+* Viewer and web-ready outputs
+
+---
+
+## e3sm\_to\_cmip
+
+> TODO: Describe how to use this tool to convert E3SM output to CMIP-compliant NetCDF.
+
+* Key arguments and configurations
+* Output structure
+* Known caveats and workarounds
+
+---
+
+## zppy & zppy-interfaces
+
+> TODO: Guide users through setting up and running zppy workflows.
+
+* YAML config examples
+* Common tasks (`ts`, `e3sm_to_cmip`, `tc_analysis`, etc.)
+* New features in zppy-interfaces
+
+---
+
+## zstash
+
+> TODO: Describe how to archive and retrieve data efficiently using zstash. 
+ +* Parallel archiving best practices +* Example usage +* Non-blocking mode with Globus + +--- + +## Additional Tools + +> TODO: Brief overviews or links to additional packages: + +* `nco`, `cdo` +* `cime_gen_domain`, `mosaic`, `livvkit` +* `tempest-remap`, `moab`, `geometric_features` + +--- + +## Related Pages + +* [Quickstart Guide](quickstart.md) +* [Package Catalog](packages.md) diff --git a/recipes/e3sm-unified/conda_first_failure.py b/recipes/e3sm-unified/conda_first_failure.py index b60940b5..2970880e 100755 --- a/recipes/e3sm-unified/conda_first_failure.py +++ b/recipes/e3sm-unified/conda_first_failure.py @@ -1,113 +1,107 @@ #!/usr/bin/env python + import subprocess +import argparse + + +def parse_specs(specfile): + """Parse the spec file, returning a list of specs.""" + specs = [] + with open(specfile, "r") as f: + for line in f: + raw = line.strip() + if not raw or raw.startswith("#"): + continue + if "{{" in raw or "}}" in raw: + raise ValueError( + "Jinja2 templating ({{ or }}) found in spec file; this " + "is not supported." 
+ ) + # Remove leading '-' and whitespace + spec = raw.lstrip('-').strip() + # Remove trailing comments + if '#' in spec: + spec = spec.split('#', 1)[0].strip() + if spec: + specs.append(spec) + return specs + + +def find_first_failure(specs, base_command, timeout): + highest_valid = -1 + lowest_invalid = len(specs) + end_index = len(specs) + last_failed_output = None + while highest_valid != lowest_invalid-1: + subset_specs = specs[0:end_index] + print(f'last: {subset_specs[-1]}') + + command = base_command + subset_specs + + try: + result = subprocess.run( + command, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + timeout=timeout, + text=True, + ) + output = result.stdout + if isinstance(output, bytes): + output = output.decode("utf-8", errors="replace") + if result.returncode == 0: + highest_valid = end_index-1 + end_index = int(0.5 + 0.5*(highest_valid + lowest_invalid)) + 1 + print(' Succeeded!') + else: + last_failed_output = output + lowest_invalid = end_index-1 + end_index = int(0.5 + 0.5*(highest_valid + lowest_invalid)) + 1 + print(' Failed!') + except subprocess.TimeoutExpired: + last_failed_output = "Timed out!" + lowest_invalid = end_index-1 + end_index = int(0.5 + 0.5*(highest_valid + lowest_invalid)) + 1 + print(' Failed!') + print( + f' valid: {highest_valid}, invalid: {lowest_invalid}, ' + f'end: {end_index}' + ) + return lowest_invalid, last_failed_output + + +def main(): + parser = argparse.ArgumentParser( + description="Find first failing conda package spec." 
+ ) + parser.add_argument( + "specfile", + help="Path to file containing conda specs (one per line)" + ) + parser.add_argument( + "--timeout", + type=int, + default=240, + help="Timeout for each conda dry-run (seconds)" + ) + args = parser.parse_args() -specs = ['python=3.10', - 'ncvis-climate 2023.09.12', - 'libnetcdf 4.9.2 mpi_mpich_*', - 'chemdyg 0.1.4', - 'e3sm_diags 2.9.0', - 'e3sm_to_cmip 1.11.0', - 'geometric_features 1.2.0', - 'globus-cli >=3.15.0', - 'ilamb 2.7', - 'ipython', - 'jupyter', - 'livvkit 3.0.1', - 'mache 1.17.0', - 'moab 5.5.1', - 'mpas-analysis 1.9.1rc1', - 'mpas_tools 0.27.0', - 'nco 5.1.9', - 'pcmdi_metrics 2.3.1', - 'tempest-remap 2.2.0', - 'tempest-extremes 2.2.1', - 'xcdat 0.6.1', - 'zppy 2.3.1', - 'zstash 1.4.2rc1', - 'mpich', - 'blas', - 'bottleneck', - 'cartopy >=0.17.0', - 'cdat_info 8.2.1', - 'cdms2 3.1.5', - 'cdtime 3.1.4', - 'cdutil 8.2.1', - 'cmocean', - 'dask 2023.6.0', - 'dogpile.cache', - 'eofs', - 'esmf 8.4.2 mpi_mpich_*', - 'esmpy 8.4.2', - 'f90nml', - 'ffmpeg', - 'genutil 8.2.1', - 'globus-sdk', - 'gsw', - 'hdf5 1.14.2 mpi_mpich_*', - 'ipygany', - 'lxml', - 'matplotlib 3.7.1', - 'metpy', - 'mpi4py', - 'nb_conda', - 'nb_conda_kernels', - 'ncview 2.1.8', - 'ncvis-climate 2023.09.12', - 'netcdf4 1.6.4 nompi_*', - 'notebook <7.0.0', - 'numpy >1.13', - 'output_viewer 1.3.3', - 'pillow', - 'plotly', - 'progressbar2', - 'proj 9.3.1', - 'pyevtk', - 'pyproj 3.6.1', - 'pyremap', - 'pytest', - 'pywavelets', - 'scikit-image', - 'scipy >=0.9.0', - 'shapely', - 'sympy >=0.7.6', - 'tabulate', - 'xarray 2023.5.0', - 'xesmf', - 'cython', - 'cf-units >=2.0.0', - 'psutil', - 'pandas' - ] + specs = parse_specs(args.specfile) -base_command = ['conda', 'create', '-y', '-n', 'dry-run', '--dry-run', - '--override-channels', - '-c', 'conda-forge/label/mpas_analysis_dev', - '-c', 'conda-forge/label/zstash_dev', - '-c', 'conda-forge'] + base_command = ['conda', 'create', '-y', '-n', 'dry-run', '--dry-run'] -prevEnd = None -highestValid = -1 
-lowestInvalid = len(specs) -endIndex = len(specs) -while highestValid != lowestInvalid-1: - subset_specs = specs[0:endIndex] - print('last: {}'.format(subset_specs[-1])) + lowest_invalid, last_failed_output = find_first_failure( + specs, base_command, args.timeout + ) - command = base_command + subset_specs + if lowest_invalid == len(specs): + print('No failures!') + else: + print(f'First failing package: {specs[lowest_invalid]}') + if last_failed_output: + print("\n--- Output from last failed attempt ---\n") + print(last_failed_output) - try: - subprocess.check_call(command, stdout=subprocess.DEVNULL) - highestValid = endIndex-1 - endIndex = int(0.5 + 0.5*(highestValid + lowestInvalid)) + 1 - print(' Succeeded!') - except subprocess.CalledProcessError: - lowestInvalid = endIndex-1 - endIndex = int(0.5 + 0.5*(highestValid + lowestInvalid)) + 1 - print(' Failed!') - print(' valid: {}, invalid: {}, end: {}'.format( - highestValid, lowestInvalid, endIndex)) -if lowestInvalid == len(specs): - print('No failures!') -else: - print('First failing package: {}'.format(specs[lowestInvalid])) +if __name__ == "__main__": + main()
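To make the bisection in `find_first_failure` easier to reason about, here is
a self-contained sketch of the same search loop, with a plain predicate
standing in for the `conda create --dry-run` call (the spec names are made
up):

```python
def first_failure(specs, ok):
    """Locate the index of the first spec whose inclusion makes the
    prefix fail, assuming failures are monotonic (once a prefix fails,
    all longer prefixes fail too)."""
    highest_valid = -1            # largest i where specs[:i + 1] succeeds
    lowest_invalid = len(specs)   # smallest known i where specs[:i + 1] fails
    end_index = len(specs)
    while highest_valid != lowest_invalid - 1:
        if ok(specs[:end_index]):
            highest_valid = end_index - 1
        else:
            lowest_invalid = end_index - 1
        # same midpoint arithmetic as find_first_failure above
        end_index = int(0.5 + 0.5 * (highest_valid + lowest_invalid)) + 1
    return lowest_invalid


# 'bad' plays the role of an unsatisfiable spec
specs = ['a', 'b', 'bad', 'c']
print(first_failure(specs, lambda subset: 'bad' not in subset))  # prints: 2
```

If no prefix fails, the returned index equals `len(specs)`, which corresponds
to the "No failures!" case in the script.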