Commit 539e670

add a post
1 parent 189982e commit 539e670

2 files changed

+87
-2
lines changed

_posts/2021-07-29-E3SM-user-defined-test.md

Lines changed: 16 additions & 2 deletions
@@ -14,10 +14,15 @@ Before merging a Pull Request to master, the model integrators need to perform a
1414
- First create a branch off master then switch to that branch
1515
- Say we want to do a testing suite called "e3sm_land_developer", the next step is to create baseline cases based on the master code:
1616

17-
- in `cime/scripts` directory, do `./create_test e3sm_land_developer --baseline-root /path/to/baseline/case/dir -b 07e202a -t 07e202a --project E3SM --walltime 00:30:00 -g -v -j 4`
17+
- in `cime/scripts` directory, do
18+
```
19+
./create_test e3sm_land_developer --baseline-root /path/to/baseline/case/dir -b 07e202a -t 07e202a --project E3SM --walltime 00:30:00 -g -v -j 4
20+
```
1821
- here `07e202a` is the commit hash of the master branch
1922
- Then make changes to the code, commit them, and run another set of tests for comparison
20-
- `./create_test e3sm_land_developer --baseline-root /path/to/baseline/case/dir -b 07e202a -t 6ebc21d --project E3SM --walltime 00:30:00 -c -v -j 4`
23+
```
24+
./create_test e3sm_land_developer --baseline-root /path/to/baseline/case/dir -b 07e202a -t 6ebc21d --project E3SM --walltime 00:30:00 -c -v -j 4
25+
```
2126
- here `6ebc21d` is the commit hash of the new commit
2227
- note that `-g` (generate baselines) is replaced by `-c` (compare against baselines) in the command line, indicating this is a comparison test
2328
- The test runs generate a number of new folders in the scratch directory for the simulation results. Wait until the tests complete, then go to the scratch directory, where a new file will have been generated; in this case it should be `cs.status.6ebc21d`. Execute this file to see the comparison report showing whether the tests passed or failed. Here is an example of a passed test:
@@ -75,3 +80,12 @@ Then this configuration with new features is available for testing. Simply go to
7580
The last part of the command `.elm-bgc_features` indicates this is a "testmod" type of test.
7681

7782
If you want to make it official, add this line to the top of `tests.py` in `.../E3SM/cime_config`, so that in the future everyone will need to pass this test for their new developments.
83+
84+
### Single PEM test
85+
By definition, a PEM test is a bit-for-bit (b4b) test across different PE layouts. However, the test may crash if you provide only a test name such as `ne4pg2_ne4pg2.I1850CNPRDCTCBCTOP`, because the machine may not have a working PE configuration for it. A good practice is therefore to leverage an existing testing layout.
86+
87+
Here's an example for Chrysalis:
88+
```
89+
./create_test PEM.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.chrysalis_intel --pesfile /gpfs/fs1/home/e3smtest/jenkins/workspace/ACME_chrysalis_next/E3SM_test_scripts/../E3SM/cime_config/testmods_dirs/config_pes_tests.xml
90+
```
91+
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
1+
---
2+
layout: post
3+
title: Negative Runoff Quick Fix
4+
date: 2025-11-03 17:45:00 # Adjust date/time as needed
5+
author: Tian # Or your name
6+
comments: true
7+
categories: [Programming, E3SM, AI, PR] # Example categories
8+
---
9+
10+
This is the technical note for the [Negative Runoff Quick Fix PR](https://github.com/E3SM-Project/E3SM/pull/7809). The PR implements a feature to handle negative runoff (qgwl) from ELM, which can cause negative discharge to the ocean. The draft of this document was generated by Claude and then revised by me.
11+
12+
## Key Implementation Details
13+
- Feature controlled by namelist flag: `redirect_negative_qgwl = .true.`
14+
- Main code in `src/riverroute/RtmMod.F90` (lines 2406-3174)
15+
- Uses sparse packing approach with reprosum for bit-for-bit reproducibility across PE layouts
16+
- Two scenarios based on global net qgwl (positive sum + negative sum)
17+
18+
**Scenario A** (net_global_qgwl ≥ 0):
19+
- Proportionally reduce positive qgwl cells to offset negative cells
20+
- Zero out negative qgwl cells
21+
- Scaling factor: `net_global_qgwl / global_positive_qgwl_sum`
22+
- No outlet redistribution needed (conservation achieved by proportional reduction)
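The Scenario A arithmetic can be sketched in a few lines of Python. This is only an illustration of the scaling rule described above, not the Fortran code in `RtmMod.F90`; the function name `redirect_scenario_a` and the sample values are hypothetical, and a plain Python `sum` stands in for the reproducible global sum.

```python
def redirect_scenario_a(qgwl):
    """Zero negative cells and scale positive cells so the global total is kept.

    Hypothetical helper for illustration; plain sums replace reprosum here.
    """
    positive_sum = sum(q for q in qgwl if q > 0.0)
    negative_sum = sum(q for q in qgwl if q < 0.0)
    net = positive_sum + negative_sum
    if net < 0.0:
        raise ValueError("Scenario A requires net global qgwl >= 0")
    if positive_sum == 0.0:
        # Nothing positive to scale; every cell is already zero.
        return [0.0 for _ in qgwl]
    scale = net / positive_sum  # net_global_qgwl / global_positive_qgwl_sum
    return [q * scale if q > 0.0 else 0.0 for q in qgwl]


# Made-up values: positive sum 10, negative sum -3, net 7, scale 0.7.
qgwl = [4.0, -1.0, 6.0, -2.0]
adjusted = redirect_scenario_a(qgwl)
print(adjusted, sum(adjusted))
```

Because every positive cell is multiplied by the same factor, the adjusted field sums to the original net value (up to rounding), which is why no outlet redistribution is needed in this scenario.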
23+
24+
**Scenario B** (net_global_qgwl < 0):
25+
- Zero all qgwl input
26+
- Redistribute deficit to all outlets weighted by discharge
27+
- Each outlet receives: `correction = (outlet_discharge / total_outlet_discharge) × |net_global_qgwl|`
28+
- Warning issued if redistribution changes total outlet discharge by >5%
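A minimal Python sketch of the Scenario B bookkeeping follows. It assumes the correction is subtracted from each outlet's discharge (the note only says the deficit is redistributed), and that the 5% check compares the deficit with the total outlet discharge. The function name `redirect_scenario_b`, the `warn_fraction` parameter, and the sample values are hypothetical.

```python
def redirect_scenario_b(qgwl, outlet_discharge, warn_fraction=0.05):
    """Zero all qgwl and spread the deficit over outlets, weighted by discharge.

    Hypothetical helper; subtracting the correction is an assumption.
    """
    net = sum(qgwl)
    if net >= 0.0:
        raise ValueError("Scenario B requires net global qgwl < 0")
    total = sum(outlet_discharge)
    if total <= 0.0:
        raise ValueError("total outlet discharge must be positive")
    deficit = abs(net)
    # correction = (outlet_discharge / total_outlet_discharge) * |net_global_qgwl|
    corrections = [d / total * deficit for d in outlet_discharge]
    new_outlets = [d - c for d, c in zip(outlet_discharge, corrections)]
    warn = deficit / total > warn_fraction
    return [0.0] * len(qgwl), new_outlets, warn


# Made-up values: net qgwl is -4, total outlet discharge is 40.
zeroed, new_outlets, warn = redirect_scenario_b([1.0, -3.0, -2.0], [30.0, 10.0])
print(zeroed, new_outlets, warn)
```

Under these assumptions the total outlet discharge drops by exactly the deficit, so water is conserved, and the warning flag is raised here because the change is 10% of the total.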
29+
30+
## Lesson Learned: MPI Sum vs Reprosum: Critical for Bit-for-Bit Reproducibility
31+
32+
### The Problem with Standard MPI_Allreduce
33+
34+
Standard MPI reduction operations (`MPI_Allreduce`, `MPI_Reduce`) are **NOT bit-for-bit reproducible** across different PE (processor) layouts because:
35+
36+
1. **Non-associativity of floating-point addition**: in general `(a + b) + c ≠ a + (b + c)`, because each addition rounds its result
37+
2. **PE layout affects summation order**:
38+
- 160 PEs: might sum as `((PE0 + PE1) + (PE2 + PE3)) + ...`
39+
- 320 PEs: might use the same tree shape, `((PE0 + PE1) + (PE2 + PE3)) + ...`, but each PE now holds a different subset of cells, so the values are grouped differently
40+
3. **Result**: Same simulation with different PE counts produces slightly different (~1e-13) global sums
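The effect of the layout on a floating-point sum can be reproduced without MPI. The sketch below mimics a PE layout by splitting a list into equal chunks, summing each chunk, and then summing the partial results. The helper names and the values are made up, and the values are deliberately extreme so the difference is obvious rather than at the ~1e-13 level. An explicit loop is used instead of Python's built-in `sum`, which may apply compensated summation on recent Python versions.

```python
def naive_sum(values):
    """Left-to-right floating-point sum with no compensation."""
    total = 0.0
    for v in values:
        total += v
    return total


def layout_sum(values, n_pes):
    """Sum per-'PE' chunks first, then combine the partial sums."""
    chunk = len(values) // n_pes
    partials = [naive_sum(values[i * chunk:(i + 1) * chunk]) for i in range(n_pes)]
    return naive_sum(partials)


# Contrived values whose exact sum is 2.0.
values = [1e16, 1.0, -1e16, 1.0]
one_pe = layout_sum(values, 1)
two_pes = layout_sum(values, 2)
print(one_pe, two_pes, one_pe == two_pes)
```

Neither layout recovers the exact sum, and the two layouts disagree with each other, which is the behavior a PEM test is designed to catch.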
41+
42+
### The Reprosum Solution
43+
44+
The `shr_reprosum_calc` function from CESM's shared utilities provides **bit-for-bit reproducible sums** using integer vector representation.
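The reason integers help is that integer addition is exact and therefore independent of ordering. The Python sketch below is a simplified fixed-point illustration of that idea only; it is not the algorithm used by `shr_reprosum_calc`, and `SCALE_BITS`, `to_fixed`, and `fixed_point_sum` are hypothetical names chosen for this example.

```python
SCALE_BITS = 60  # hypothetical fixed-point resolution of 2**-60


def to_fixed(x):
    """Convert a float to an integer count of 2**-SCALE_BITS units."""
    return round(x * 2**SCALE_BITS)


def fixed_point_sum(values, n_pes):
    """Sum per-'PE' chunks as integers, then convert the total back to float."""
    chunk = len(values) // n_pes
    partials = [
        sum(to_fixed(v) for v in values[i * chunk:(i + 1) * chunk])
        for i in range(n_pes)
    ]
    # Integer addition is exact, so the grouping of partial sums cannot matter.
    return sum(partials) / 2**SCALE_BITS


# Contrived values whose exact sum is 2.0.
values = [1e16, 1.0, -1e16, 1.0]
print(fixed_point_sum(values, 1), fixed_point_sum(values, 2))
```

With the integer representation both layouts return the same result. The trade-off in this toy version is that contributions smaller than the chosen resolution are rounded away before summation.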
45+
46+
## Lesson Learned: PEM Test Fix: The Near-Zero Value Problem
47+
48+
**Root Cause of PEM Failure:**
49+
50+
The negative qgwl redistribution initially failed PEM tests because near-zero values (~±1e-20) were classified inconsistently:
51+
52+
1. **Upstream floating-point operations**: Different PE layouts produce slightly different rounding in ELM calculations
53+
2. **Sign flipping**: Value could be +1e-20 in 160 PE layout, -1e-20 in 320 PE layout
54+
3. **Classification inconsistency**: `if (qgwl > 0.0_r8)` classified these differently → different positive/negative sums
55+
4. **Non-reproducibility propagation**: Different sums → different scaling factors → different outputs across timesteps
56+
57+
**Example from actual logs:**
58+
```
59+
160 PEs: Positive sum = 1.2345678901234560E+03
60+
320 PEs: Positive sum = 1.2345678901234567E+03 (differs by ~7e-13)
61+
```
62+
63+
The negative sum was identical (bit-for-bit), but the positive sum differed because near-zero values were classified differently.
64+
65+
**Solution: Sparse Packing with Threshold**
66+
67+
Following the pattern from the MPAS ocean model, we implemented:
68+
69+
1. **Threshold parameter**: `TINYVALUE_s = 1.0e-14_r8`
70+
2. **Sparse packing**: Only include values with `abs(qgwl) > TINYVALUE_s` in reprosum
71+
3. **Consistent classification**: Use threshold in all comparisons (`qgwl > TINYVALUE_s` instead of `qgwl > 0.0_r8`)
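The classification problem and the threshold fix can be illustrated with a small Python sketch. The two lists stand in for the same field computed on two PE layouts, differing only in the sign of a near-zero value. The helper names and sample values are hypothetical, and treating the negative side symmetrically (`q < -threshold`) is an assumption based on the `abs(qgwl) > TINYVALUE_s` rule above.

```python
TINYVALUE = 1.0e-14  # mirrors TINYVALUE_s from the note


def positive_cells(qgwl, threshold):
    """Values classified as positive for a given threshold."""
    return [q for q in qgwl if q > threshold]


def negative_cells(qgwl, threshold):
    """Values classified as negative (symmetric rule, an assumption)."""
    return [q for q in qgwl if q < -threshold]


# Same field on two hypothetical layouts; only the near-zero sign differs.
layout_a = [2.0, +1e-20, -3.0]
layout_b = [2.0, -1e-20, -3.0]

# Comparing against zero: the near-zero value lands in different lists.
print(positive_cells(layout_a, 0.0), positive_cells(layout_b, 0.0))

# Comparing against the threshold: both layouts agree.
print(positive_cells(layout_a, TINYVALUE), positive_cells(layout_b, TINYVALUE))
print(negative_cells(layout_a, TINYVALUE), negative_cells(layout_b, TINYVALUE))
```

With the threshold, the near-zero value is excluded from both the positive and the negative set on every layout, so the inputs to the reproducible sum are the same regardless of PE count.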
