| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: Negative Runoff Quick Fix |
| 4 | +date: 2025-11-03 17:45:00 # Adjust date/time as needed |
| 5 | +author: Tian # Or your name |
| 6 | +comments: true |
| 7 | +categories: [Programming, E3SM, AI, PR] # Example categories |
| 8 | +--- |
| 9 | + |
| 10 | +This is the Technical Note for the [Negative Runoff Quick Fix PR](https://github.com/E3SM-Project/E3SM/pull/7809). This PR implements a feature to handle negative runoff (`qgwl`) from ELM, which can cause negative discharge to the ocean. The draft of this document was generated by Claude and then revised by me. |
| 11 | + |
| 12 | +## Key Implementation Details |
| 13 | +- Feature controlled by namelist flag: `redirect_negative_qgwl = .true.` |
| 14 | +- Main code in `src/riverroute/RtmMod.F90` (lines 2406-3174) |
| 15 | +- Uses sparse packing approach with reprosum for bit-for-bit reproducibility across PE layouts |
| 16 | +- Two scenarios based on global net qgwl (positive sum + negative sum) |
| 17 | + |
| 18 | +**Scenario A** (net_global_qgwl ≥ 0): |
| 19 | +- Proportionally reduce positive qgwl cells to offset negative cells |
| 20 | +- Zero out negative qgwl cells |
| 21 | +- Scaling factor: `net_global_qgwl / global_positive_qgwl_sum` |
| 22 | +- No outlet redistribution needed (conservation achieved by proportional reduction) |
| 23 | + |
| 24 | +**Scenario B** (net_global_qgwl < 0): |
| 25 | +- Zero all qgwl input |
| 26 | +- Redistribute deficit to all outlets weighted by discharge |
| 27 | +- Each outlet receives: `correction = (outlet_discharge / total_outlet_discharge) × |net_global_qgwl|` |
| 28 | +- Warning issued if redistribution changes total outlet discharge by >5% |
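The two scenarios can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not the Fortran in `RtmMod.F90`: the function name `redirect_negative_qgwl`, the list-based inputs, and the guards against division by zero are assumptions made for the example, and how the correction is applied to the outlet discharge is not shown.

```python
def redirect_negative_qgwl(qgwl, outlet_discharge):
    """Return (adjusted_qgwl, outlet_corrections) for one timestep.

    Hypothetical helper: plain Python lists stand in for the model fields.
    """
    pos_sum = sum(q for q in qgwl if q > 0.0)
    neg_sum = sum(q for q in qgwl if q < 0.0)
    net = pos_sum + neg_sum  # global net qgwl

    if net >= 0.0:
        # Scenario A: scale positive cells down, zero the negative cells.
        scale = net / pos_sum if pos_sum > 0.0 else 0.0  # guard is an assumption
        adjusted = [q * scale if q > 0.0 else 0.0 for q in qgwl]
        corrections = [0.0] * len(outlet_discharge)
    else:
        # Scenario B: zero all qgwl, spread |net| over outlets by discharge.
        adjusted = [0.0] * len(qgwl)
        total = sum(outlet_discharge)
        corrections = [
            (d / total) * abs(net) if total > 0.0 else 0.0  # guard is an assumption
            for d in outlet_discharge
        ]
    return adjusted, corrections


# Scenario A: net = 10 - 5 = 5, so positive cells are scaled by 0.5.
print(redirect_negative_qgwl([4.0, 6.0, -5.0], [10.0, 30.0]))
# Scenario B: net = 1 - 4 = -3, split 1:3 between the two outlets.
print(redirect_negative_qgwl([1.0, -4.0], [10.0, 30.0]))
```

In Scenario A the adjusted cells sum to the original net value, which is why no outlet redistribution is needed.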
| 29 | + |
| 30 | +## Lesson Learned: MPI Sum vs Reprosum: Critical for Bit-for-Bit Reproducibility |
| 31 | + |
| 32 | +### The Problem with Standard MPI_Allreduce |
| 33 | + |
| 34 | +Standard MPI reduction operations (`MPI_Allreduce`, `MPI_Reduce`) are **NOT bit-for-bit reproducible** across different PE (processor) layouts because: |
| 35 | + |
| 36 | +1. **Non-associativity of floating-point addition**: `(a + b) + c ≠ a + (b + c)` due to rounding errors |
| 37 | +2. **PE layout affects summation order**: |
| 38 | + - 160 PEs: might sum as `((PE0 + PE1) + (PE2 + PE3)) + ...` |
| 39 | + - 320 PEs: the same cells are spread over twice as many partial sums, so the values are grouped and added in a different order |
| 40 | +3. **Result**: Same simulation with different PE counts produces slightly different (~1e-13) global sums |
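The non-associativity is easy to reproduce outside the model. The Python snippet below is a generic illustration with made-up values; the two groupings stand in for two different ways of distributing the same three numbers across PEs.

```python
a, b, c = 0.1, 0.2, 0.3

grouping_1 = (a + b) + c  # e.g. a and b reduced together first
grouping_2 = a + (b + c)  # e.g. b and c reduced together first

# The two results are not bit-for-bit identical.
print(grouping_1 == grouping_2)
# The difference is on the order of one rounding error.
print(abs(grouping_1 - grouping_2))
```

A difference this small is harmless for a single sum, but it breaks bit-for-bit comparisons and can grow once the sum feeds a scaling factor applied at every timestep.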
| 41 | + |
| 42 | +### The Reprosum Solution |
| 43 | + |
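| 44 | +The `shr_reprosum_calc` routine from CESM's shared utilities provides **bit-for-bit reproducible sums** by representing each floating-point value as a vector of integers; integer addition is exact, so the result does not depend on summation order. |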
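To see why an integer representation removes the dependence on summation order, here is a deliberately simplified Python sketch. It is not the algorithm used by `shr_reprosum_calc`: the function name `fixed_point_sum` and the choice of 80 fractional bits are assumptions for illustration, and Python's arbitrary-precision integers stand in for the integer vectors.

```python
import random


def fixed_point_sum(values, frac_bits=80):
    """Order-independent sum using a fixed-point integer accumulator.

    Each value is scaled by 2**frac_bits and rounded to an integer once,
    independently of the other values. Integer addition is exact, so the
    total is the same for any ordering or grouping of the inputs.
    """
    scale = 1 << frac_bits
    total = 0
    for v in values:
        total += round(v * scale)  # per-value rounding, order-independent
    return total / scale  # one final rounding back to floating point


random.seed(0)
values = [random.uniform(-1.0e3, 1.0e3) for _ in range(1000)]
shuffled = values[:]
random.shuffle(shuffled)

# Same bits regardless of the order in which values are accumulated.
print(fixed_point_sum(values) == fixed_point_sum(shuffled))
```

The price is extra work per value and a fixed resolution (here 2^-80); a production routine has to choose the integer range from the data.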
| 45 | + |
| 46 | +## Lesson Learned: PEM Test Fix: The Near-Zero Value Problem |
| 47 | + |
| 48 | +**Root Cause of PEM Failure:** |
| 49 | + |
| 50 | +The negative qgwl redistribution initially failed PEM tests because near-zero values (~±1e-20) were classified inconsistently: |
| 51 | + |
| 52 | +1. **Upstream floating-point operations**: Different PE layouts produce slightly different rounding in ELM calculations |
| 53 | +2. **Sign flipping**: A value could be +1e-20 in the 160 PE layout and -1e-20 in the 320 PE layout |
| 54 | +3. **Classification inconsistency**: `if (qgwl > 0.0_r8)` classified these differently → different positive/negative sums |
| 55 | +4. **Non-reproducibility propagation**: Different sums → different scaling factors → different outputs across timesteps |
| 56 | + |
| 57 | +**Example from actual logs:** |
| 58 | +``` |
| 59 | +160 PEs: Positive sum = 1.2345678901234560E+03 |
| 60 | +320 PEs: Positive sum = 1.2345678901234567E+03 (differs by ~7e-13) |
| 61 | +``` |
| 62 | + |
| 63 | +The negative sum was identical (bit-for-bit), but the positive sum differed because near-zero values were classified differently. |
| 64 | + |
| 65 | +**Solution: Sparse Packing with Threshold** |
| 66 | + |
| 67 | +Following the pattern from the MPAS ocean model, we implemented: |
| 68 | + |
| 69 | +1. **Threshold parameter**: `TINYVALUE_s = 1.0e-14_r8` |
| 70 | +2. **Sparse packing**: Only include values with `abs(qgwl) > TINYVALUE_s` in reprosum |
| 71 | +3. **Consistent classification**: Use threshold in all comparisons (`qgwl > TINYVALUE_s` instead of `qgwl > 0.0_r8`) |
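The effect of the threshold can be sketched in Python as follows. The helper names `sparse_pack` and `split_by_sign` and the sample values are hypothetical; only the threshold value and the comparison pattern come from the list above.

```python
TINYVALUE_S = 1.0e-14  # threshold from the list above


def sparse_pack(qgwl_values, threshold=TINYVALUE_S):
    """Keep only values whose magnitude exceeds the threshold."""
    return [q for q in qgwl_values if abs(q) > threshold]


def split_by_sign(qgwl_values, threshold=TINYVALUE_S):
    """Classify packed values as positive or negative using the threshold."""
    packed = sparse_pack(qgwl_values, threshold)
    positives = [q for q in packed if q > threshold]
    negatives = [q for q in packed if q < -threshold]
    return positives, negatives


# The same cells under two PE layouts: the near-zero value flips sign.
layout_160 = [2.0, 1.0e-20, -3.0]
layout_320 = [2.0, -1.0e-20, -3.0]

# With the threshold, both layouts classify identically.
print(split_by_sign(layout_160) == split_by_sign(layout_320))
# With a zero threshold, the near-zero value lands in different groups.
print(split_by_sign(layout_160, 0.0) == split_by_sign(layout_320, 0.0))
```

Because the noisy values never enter either sum, the inputs to the reproducible sum are the same set of numbers in every layout.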