add a post

hydrotian · hydrotian · commit 539e67010306 · 2025-11-03T15:55:54.000-08:00
diff --git a/_posts/2021-07-29-E3SM-user-defined-test.md b/_posts/2021-07-29-E3SM-user-defined-test.md
@@ -14,10 +14,15 @@ Before merging a Pull Request to master, the model integrators need to perform a
 - First create a branch off master then switch to that branch
 - Say we want to do a testing suite called "e3sm_land_developer", the next step is to create baseline cases based on the master code:
 
-  - in `cime/scripts` directory, do `./create_test e3sm_land_developer --baseline-root /path/to/baseline/case/dir -b 07e202a -t 07e202a --project E3SM --walltime 00:30:00 -g -v -j 4`
+  - in `cime/scripts` directory, do 
+  ```
+  ./create_test e3sm_land_developer --baseline-root /path/to/baseline/case/dir -b 07e202a -t 07e202a --project E3SM --walltime 00:30:00 -g -v -j 4
+  ```
   - here `07e202a` is the hashtag for the master
 - Then make changes to the code, commit it, and do another set of test for comparison
-  - `./create_test e3sm_land_developer --baseline-root /path/to/baseline/case/dir -b 07e202a -t 6ebc21d --project E3SM --walltime 00:30:00 -c -v -j 4`
+  ```
+  ./create_test e3sm_land_developer --baseline-root /path/to/baseline/case/dir -b 07e202a -t 6ebc21d --project E3SM --walltime 00:30:00 -c -v -j 4
+  ```
   - here `6ebc21d` is the hashtag for the new commit
   - note in the command line the `-g` is changed to `-c`, indicating this is a comparison test
 - The test runs will generate a number of new folders in the scratch directory for the simulation results. Wait until the tests are completed, then go to the scratch directory, you will find a new file is generated. In this case, it should be `cs.status.6ebc21d`. Execute this file, you will see the comparison report showing if the tests are passed or failed. Here is an example for a passed test:
@@ -75,3 +80,12 @@ Then this configuration with new features is available for testing. Simply go to
 The last part of the command `.elm-bgc_features` indicates this is a "testmod" type of test.
 
 If you want to make it official, just add this line to the top of the `tests.py` in `.../E3SM/cime_config`. So that in the future everyone will need to pass this test for their new developments.
+
+### Single PEM test
+By definition, a PEM test is a bit-for-bit (b4b) test across different PE layouts. However, the test can crash if you provide only a test name such as `ne4pg2_ne4pg2.I1850CNPRDCTCBCTOP`, because the machine may not have a working PE configuration for it. A good practice is to leverage an existing testing layout.
+
+Here's an example for Chrysalis:
+```
+./create_test PEM.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.chrysalis_intel --pesfile /gpfs/fs1/home/e3smtest/jenkins/workspace/ACME_chrysalis_next/E3SM_test_scripts/../E3SM/cime_config/testmods_dirs/config_pes_tests.xml
+```
+
diff --git a/_posts/2025-11-03-negative-runoff-quick-fix.md b/_posts/2025-11-03-negative-runoff-quick-fix.md
@@ -0,0 +1,71 @@
+---
+layout: post
+title: Negative Runoff Quick Fix
+date: 2025-11-03 17:45:00
+author: Tian
+comments: true
+categories: [Programming, E3SM, AI, PR]
+---
+
+This is the Technical Note for the [Negative Runoff Quick Fix PR](https://github.com/E3SM-Project/E3SM/pull/7809). This PR implements a feature to handle negative runoff (qgwl) from ELM, which can cause negative discharge to the ocean. The draft of this document was generated by Claude and then revised by me.
+
+## Key Implementation Details
+- Feature controlled by namelist flag: `redirect_negative_qgwl = .true.`
+- Main code in `src/riverroute/RtmMod.F90` (lines 2406-3174)
+- Uses sparse packing approach with reprosum for bit-for-bit reproducibility across PE layouts
+- Two scenarios based on global net qgwl (positive sum + negative sum)
+
+**Scenario A** (net_global_qgwl ≥ 0):
+- Proportionally reduce positive qgwl cells to offset negative cells
+- Zero out negative qgwl cells
+- Scaling factor: `net_global_qgwl / global_positive_qgwl_sum`
+- No outlet redistribution needed (conservation achieved by proportional reduction)
+
+**Scenario B** (net_global_qgwl < 0):
+- Zero all qgwl input
+- Redistribute deficit to all outlets weighted by discharge
+- Each outlet receives: `correction = (outlet_discharge / total_outlet_discharge) × |net_global_qgwl|`
+- Warning issued if redistribution changes total outlet discharge by >5%
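The two scenarios above can be sketched in a few lines of serial Python. This is an illustration only, not the Fortran in `RtmMod.F90`: the function and variable names are hypothetical, plain loops stand in for the reprosum-based global sums, and the sketch returns the per-outlet correction magnitude without assuming how the model applies it.

```python
def redistribute_negative_qgwl(qgwl, outlet_discharge):
    """Return (adjusted_qgwl, outlet_corrections) for one time step.

    qgwl: per-cell runoff values, possibly negative (hypothetical input).
    outlet_discharge: per-outlet discharge, assumed to have a positive total.
    """
    positive_sum = 0.0
    negative_sum = 0.0
    for q in qgwl:
        if q > 0.0:
            positive_sum += q
        elif q < 0.0:
            negative_sum += q
    net = positive_sum + negative_sum

    if net >= 0.0:
        # Scenario A: shrink positive cells so they absorb the negative total,
        # then zero the negative cells. The adjusted field sums to `net`.
        scale = net / positive_sum if positive_sum > 0.0 else 0.0
        adjusted = [q * scale if q > 0.0 else 0.0 for q in qgwl]
        corrections = [0.0] * len(outlet_discharge)
    else:
        # Scenario B: zero all qgwl input and split |net| across the outlets,
        # weighted by each outlet's share of the total discharge.
        adjusted = [0.0] * len(qgwl)
        total = 0.0
        for d in outlet_discharge:
            total += d
        corrections = [d / total * abs(net) for d in outlet_discharge]
    return adjusted, corrections


# Scenario A: net = 10 - 5 = 5, so positive cells are scaled by 0.5.
print(redistribute_negative_qgwl([4.0, 6.0, -5.0], [1.0, 3.0]))
# Scenario B: net = 1 - 4 = -3, split 1:3 between the two outlets.
print(redistribute_negative_qgwl([1.0, -4.0], [1.0, 3.0]))
```

In Scenario A the adjusted cells sum to the net global value, which is why no outlet redistribution is needed there.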
+
+## Lesson Learned: MPI Sum vs Reprosum: Critical for Bit-for-Bit Reproducibility
+
+### The Problem with Standard MPI_Allreduce
+
+Standard MPI reduction operations (`MPI_Allreduce`, `MPI_Reduce`) are **NOT bit-for-bit reproducible** across different PE (processor) layouts because:
+
+1. **Non-associativity of floating-point addition**: `(a + b) + c ≠ a + (b + c)` due to rounding errors
+2. **PE layout affects summation order**:
+   - 160 PEs: might sum as `((PE0 + PE1) + (PE2 + PE3)) + ...`
+   - 320 PEs: the same cells are split into twice as many local partial sums, so the additions are grouped differently
+3. **Result**: Same simulation with different PE counts produces slightly different (~1e-13) global sums
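The effect described in this list can be reproduced without MPI. The sketch below is a hypothetical serial stand-in: `layout_sum` mimics a reduction by summing contiguous chunks (one per "PE") and then summing the partial results. The input values are deliberately extreme so the difference is visible; real model differences are far smaller, as noted above.

```python
def naive_sum(values):
    """Left-to-right floating-point sum (explicit loop, no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def layout_sum(values, n_pes):
    """Sum `values` as if split evenly across `n_pes` ranks.

    Assumes len(values) is divisible by n_pes. Each rank sums its own
    chunk, then the partial sums are combined in rank order.
    """
    chunk = len(values) // n_pes
    partials = [naive_sum(values[i * chunk:(i + 1) * chunk])
                for i in range(n_pes)]
    return naive_sum(partials)


# Non-associativity in its smallest form.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False

# Same data, two "PE layouts", two different answers.
values = [1e16, 1.0, -1e16, 1.0]
print(layout_sum(values, 2))  # 0.0
print(layout_sum(values, 4))  # 1.0
```

Neither layout returns the exact sum of 2.0, and the two layouts disagree with each other, which is the behavior a PEM test is designed to catch.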
+
+### The Reprosum Solution
+
+The `shr_reprosum_calc` routine from CESM's shared utilities provides **bit-for-bit reproducible sums** by using an integer vector representation, so the result does not depend on the order of the additions.
+
+## Lesson Learned: PEM Test Fix: The Near-Zero Value Problem
+
+**Root Cause of PEM Failure:**
+
+The negative qgwl redistribution initially failed PEM tests because near-zero values (~±1e-20) were classified inconsistently:
+
+1. **Upstream floating-point operations**: Different PE layouts produce slightly different rounding in ELM calculations
+2. **Sign flipping**: A value could be +1e-20 in the 160 PE layout and -1e-20 in the 320 PE layout
+3. **Classification inconsistency**: `if (qgwl > 0.0_r8)` classified these differently → different positive/negative sums
+4. **Non-reproducibility propagation**: Different sums → different scaling factors → different outputs across timesteps
+
+**Example from actual logs:**
+```
+160 PEs: Positive sum = 1.2345678901234560E+03
+320 PEs: Positive sum = 1.2345678901234567E+03  (differs by ~7e-13)
+```
+
+The negative sum was identical (bit-for-bit), but the positive sum differed because near-zero values were classified differently.
+
+**Solution: Sparse Packing with Threshold**
+
+Following the pattern from the MPAS ocean model, we implemented:
+
+1. **Threshold parameter**: `TINYVALUE_s = 1.0e-14_r8`
+2. **Sparse packing**: Only include values with `abs(qgwl) > TINYVALUE_s` in reprosum
+3. **Consistent classification**: Use threshold in all comparisons (`qgwl > TINYVALUE_s` instead of `qgwl > 0.0_r8`)
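The three steps above can be illustrated with a small serial sketch. It is not the Fortran from the PR: the helper names are hypothetical, the two input lists are made-up stand-ins for the same field computed on two PE layouts, and only the threshold value is taken from the post. It shows the classification becoming consistent, not the logged sums.

```python
TINYVALUE = 1.0e-14  # mirrors TINYVALUE_s from the list above


def count_positive(qgwl, tiny):
    """Number of cells classified as positive for a given threshold."""
    return sum(1 for q in qgwl if q > tiny)


def pack_sparse(qgwl, tiny=TINYVALUE):
    """Return (indices, values) of cells whose magnitude exceeds `tiny`.

    Only these packed values would be passed on to the global sum.
    """
    indices = [i for i, q in enumerate(qgwl) if abs(q) > tiny]
    return indices, [qgwl[i] for i in indices]


# The middle cell is rounding noise whose sign depends on the layout.
layout_160 = [2.5, +1e-20, -1.5]
layout_320 = [2.5, -1e-20, -1.5]

# Comparing against zero: the two layouts disagree on the positive set.
print(count_positive(layout_160, 0.0), count_positive(layout_320, 0.0))

# Comparing against the threshold: both layouts agree.
print(count_positive(layout_160, TINYVALUE),
      count_positive(layout_320, TINYVALUE))

# The packed inputs to the sum are identical as well.
print(pack_sparse(layout_160) == pack_sparse(layout_320))
```

Because the noisy cell is dropped before classification, both layouts feed the same set of values into the reproducible sum.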