|
| 1 | +# cam4 (QPC4) bit-for-bit difference: root cause is constituent registration order |
| 2 | + |
| 3 | +**Status:** root cause found and **proven**. Decision requested from CAM-SIMA. |
| 4 | +**Date:** 2026-06-11 **Author:** D. Heinzeller |
| 5 | + |
| 6 | +## Executive summary |
| 7 | + |
| 8 | +The CAM-SIMA test |
| 9 | +`SMS_D_Ln9.mpasa120_mpasa120.QPC4.derecho_intel.cam-outfrq_analy_ic_cam4` |
| 10 | +(full cam4 physics on the MPAS dynamical core) fails its bit-for-bit (b4b) |
| 11 | +comparison against the capgen baseline. The difference is **machine-epsilon |
| 12 | +roundoff**: state and flux fields agree to 14–17 significant digits, and the |
| 13 | +only large reported differences are in RK-microphysics *ratio* diagnostics (e.g. `FWAUT`, |
| 14 | +RMS ≈ 4.24e-2), which are ratios of two near-zero autoconversion rates and so |
| 15 | +amplify any roundoff. The same behavior is seen with both the GNU and Intel compilers. |
| 16 | + |
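To illustrate why a ratio diagnostic can differ at the percent level while state fields agree to roughly 15 digits, here is a minimal Python sketch. It is a toy model, not the RK microphysics: the `threshold`, the `rate` function, and the sample values are all hypothetical, chosen as powers of two so that every operation is exact.

```python
# Toy illustration only: not the RK autoconversion scheme or the FWAUT formula.
ULP = 2.0 ** -62        # spacing of doubles just above 2**-10
threshold = 2.0 ** -10  # hypothetical autoconversion threshold


def rate(q):
    """Hypothetical rate: the excess of q over the threshold, floored at zero."""
    return max(q - threshold, 0.0)


def ratio(q_num, q_den):
    """Ratio of two near-zero rates, standing in for a ratio diagnostic."""
    return rate(q_num) / rate(q_den)


# Two inputs sitting a few ulps above the threshold, so both rates are tiny.
q_num = threshold + 4 * ULP
q_den = threshold + 8 * ULP

# Perturb one input by a single ulp, i.e. a roundoff-sized change.
q_num_perturbed = q_num + ULP

state_rel_change = (q_num_perturbed - q_num) / q_num
ratio_before = ratio(q_num, q_den)
ratio_after = ratio(q_num_perturbed, q_den)
ratio_rel_change = (ratio_after - ratio_before) / ratio_before

print(f"relative change in input: {state_rel_change:.3e}")
print(f"ratio before / after:     {ratio_before} / {ratio_after}")
print(f"relative change in ratio: {ratio_rel_change:.3e}")
```

In this toy setup a one-ulp change in the input (relative size about 2.2e-16) moves the ratio from 0.5 to 0.625, a 25% change, because both rates are only a few ulps in size.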
| 17 | +The physics source, `suite_cam4.xml`, and `src/data/registry.xml` are |
| 18 | +**byte-identical** between the two builds. The difference is purely in the |
| 19 | +generated CCPP caps. We have traced it to a single cause and **proven** it: |
| 20 | + |
| 21 | +> **capgen-ng registers the advected constituents in a different order than the |
| 22 | +> original capgen.** Specifically, `cloud_liquid` and `cloud_ice` |
| 23 | +> are swapped. This changes the floating-point summation order in the energy/water |
| 24 | +> thermodynamic diagnostics, which the energy fixer then spreads across all columns |
| 25 | +> as a tiny, pervasive heating — the source of the b4b difference. |
| 26 | +
|
| 27 | +A one-off patch that forces capgen-ng's advected water species into the |
| 28 | +original-capgen order makes **QPC4 bit-for-bit identical** to the baseline. |
| 29 | + |
| 30 | +## The difference (runtime constituent list, `debug_output = 2`) |
| 31 | + |
| 32 | +| index | original capgen (baseline) | capgen-ng | |
| 33 | +|------:|----------------------------|-----------| |
| 34 | +| 1 | **cloud_liquid** (advected) | **cloud_ice** (advected) | |
| 35 | +| 2 | **cloud_ice** (advected) | **cloud_liquid** (advected) | |
| 36 | +| 3 | water_vapor (advected) | water_vapor (advected) | |
| 37 | +| 4–10 | CFC12, O3, CH4, O2, N2O, CFC11, CO2 | CFC12, O2, CH4, CO2, O3, N2O, CFC11 | |
| 38 | + |
| 39 | +Indices 1–3 are the advected water species; 4–10 are non-advected trace gases. |
| 40 | +Only the advected block matters (see Mechanism). `water_vapor` is index 3 in |
| 41 | +both, so the only advected difference is the **cloud_liquid ↔ cloud_ice swap**. |
| 42 | + |
| 43 | +## Mechanism |
| 44 | + |
| 45 | +1. `air_composition` builds `thermodynamic_active_species_idx` by walking the |
| 46 | + advected constituents in **constituent-index order**. |
| 47 | +2. `get_hydrostatic_energy` (`cam_thermo`) sums the water species in that order. |
| 48 | + Baseline sums `cloud_liquid + cloud_ice + water_vapor`; capgen-ng sums |
| 49 | + `cloud_ice + cloud_liquid + water_vapor`. Same values, **different FP order**. |
| 50 | +3. The resulting machine-eps difference in total energy/water is picked up by the |
| 51 | + global energy fixer (`check_energy_fix`), which redistributes it as a uniform |
| 52 | + heating across all columns. From that point the two runs differ at roundoff |
| 53 | + level everywhere, surfacing loudly only in ratio diagnostics like `FWAUT`. |
| 54 | + |
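A minimal Python sketch of the order sensitivity described in step 2. IEEE 754 addition is commutative, so swapping the first two terms of a bare three-term sum cannot change the result; the order only matters when the species are accumulated onto a running total that is already nonzero. Whether and where the model does this is an assumption here (the `start` argument stands in for such a total), and the values are hypothetical powers of two chosen to make the rounding deterministic, not realistic mixing ratios.

```python
def accumulate(start, terms):
    """Add terms to start one at a time, left to right, in double precision."""
    total = start
    for term in terms:
        total += term
    return total


# Hypothetical values (exact powers of two), not real mixing ratios.
cloud_liquid = 2.0 ** -53
cloud_ice = 2.0 ** -52
water_vapor = 0.25

baseline_order = [cloud_liquid, cloud_ice, water_vapor]
capgen_ng_order = [cloud_ice, cloud_liquid, water_vapor]

# Starting from zero, the swap is invisible because a + b == b + a exactly.
bare_baseline = accumulate(0.0, baseline_order)
bare_capgen_ng = accumulate(0.0, capgen_ng_order)

# Starting from a nonzero running total (an assumption), the intermediate
# roundings differ and the two results end up one ulp apart.
running_baseline = accumulate(1.0, baseline_order)
running_capgen_ng = accumulate(1.0, capgen_ng_order)

print("bare sums equal:    ", bare_baseline == bare_capgen_ng)
print("running sums equal: ", running_baseline == running_capgen_ng)
print("running difference: ", running_capgen_ng - running_baseline)
```

With these values the two running totals differ by exactly `2**-52`, which is the kind of last-bit difference a global fixer could then spread to every column.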
| 55 | +`air_composition.F90` and `cam_constituents.F90` are byte-identical between the |
| 56 | +two builds, so the entire difference originates in the registration order the |
| 57 | +generated cap produces. In the CCPP framework, registration order is the |
| 58 | +hash-table iteration order in `ccpp_model_constituents_t%lock_table` (advected |
| 59 | +packed first) — i.e. an arbitrary, generator-dependent order, not a deliberate |
| 60 | +physical ordering. |
| 61 | + |
| 62 | +## Proof |
| 63 | + |
| 64 | +Forcing capgen-ng's advected water species into the baseline order |
| 65 | +`[cloud_liquid = 1, cloud_ice = 2, water_vapor = 3]` (a flag-guarded one-off |
| 66 | +patch in the framework's `ccp_model_const_table_lock`) makes QPC4 reproduce the |
| 67 | +original-capgen baseline **bit-for-bit** (cprnc: all fields identical). This |
| 68 | +isolates constituent ordering as the *sole* cause. The patch is stored as `ccpp_constituent_prop_mod.F90.patch` in the top-level directory of the `feature/capgen-ng` ccpp-framework branch: |
| 69 | + |
| 70 | +``` |
| 71 | +--- capgen-ng/src/ccpp_constituent_prop_mod.F90 |
| 72 | ++++ capgen-ng/src/ccpp_constituent_prop_mod.F90 |
| 73 | +@@ -1392,6 +1392,17 @@ |
| 74 | + type(ccpp_constituent_properties_t), pointer :: cprop |
| 75 | + character(len=dimname_len) :: dimname |
| 76 | + character(len=*), parameter :: subname = 'ccp_model_const_table_lock' |
| 77 | ++ ! === ONE-OFF cam4 constituent-reorder experiment === |
| 78 | ++ ! When .true., force the cam4 advected water species into original-capgen |
| 79 | ++ ! order [cloud_liquid=1, cloud_ice=2, water_vapor=3] instead of hash-table |
| 80 | ++ ! order, to prove the FWAUT b4b diff is driven purely by constituent order. |
| 81 | ++ ! Only the 3 cam4 water-species std-names are remapped; everything else keeps |
| 82 | ++ ! its normal hash-order index, so other suites are unaffected unless they |
| 83 | ++ ! advect exactly these names. Flip to .false. (or delete) to restore. |
| 84 | ++ logical, parameter :: l_const_reorder = .true. |
| 85 | ++ integer :: const_pos |
| 86 | ++ character(len=512) :: sname_reorder |
| 87 | ++ ! === end experiment === |
| 88 | + |
| 89 | + astat = 0 |
| 90 | + errcode_local = 0 |
| 91 | +@@ -1460,9 +1471,24 @@ |
| 92 | + errcode_local = errcode_local + 1 |
| 93 | + exit |
| 94 | + end if |
| 95 | +- call cprop%set_const_index(index_advect, & |
| 96 | ++ ! === ONE-OFF cam4 constituent-reorder experiment === |
| 97 | ++ const_pos = index_advect |
| 98 | ++ if (l_const_reorder) then |
| 99 | ++ call cprop%standard_name(sname_reorder, & |
| 100 | ++ errcode=errcode, errmsg=errmsg) |
| 101 | ++ select case (trim(sname_reorder)) |
| 102 | ++ case ('cloud_liquid_water_mixing_ratio_wrt_moist_air_and_condensed_water') |
| 103 | ++ const_pos = 1 |
| 104 | ++ case ('cloud_ice_mixing_ratio_wrt_moist_air_and_condensed_water') |
| 105 | ++ const_pos = 2 |
| 106 | ++ case ('water_vapor_mixing_ratio_wrt_moist_air_and_condensed_water') |
| 107 | ++ const_pos = 3 |
| 108 | ++ end select |
| 109 | ++ end if |
| 110 | ++ call cprop%set_const_index(const_pos, & |
| 111 | + errcode=errcode, errmsg=errmsg) |
| 112 | +- call this%const_metadata(index_advect)%set(cprop) |
| 113 | ++ call this%const_metadata(const_pos)%set(cprop) |
| 114 | ++ ! === end experiment === |
| 115 | + else |
| 116 | + index_const = index_const + 1 |
| 117 | + if (index_const > num_vars) then |
| 118 | +``` |
| 119 | + |
| 120 | +## Assessment — neither order is "wrong" |
| 121 | + |
| 122 | +Both builds register the same constituents with identical properties; the |
| 123 | +ordering is not physically meaningful, and the resulting solutions are |
| 124 | +roundoff-equivalent and both physically correct. The b4b failure reflects only |
| 125 | +that capgen-ng's (arbitrary) order differs from the (equally arbitrary) order |
| 126 | +the capgen baseline happened to produce. |
| 127 | + |
| 128 | +## Decision requested |
| 129 | + |
| 130 | +To resolve QPC4 (and any other test case that is sensitive to constituent order), we propose: |
| 131 | + |
| 132 | +1. Give capgen-ng a **deterministic, documented** constituent-registration order |
| 133 | + (e.g. water vapor first, with a clear rule for how constituents land in the |
| 134 | + array) — replacing today's hash-bucket order. |
| 135 | +2. Adopt the new documented order and **re-baseline** the affected CAM-SIMA cases once. |
| 136 | + |
| 137 | +The temporary proof patch will be removed once the path is agreed. |
| 138 | + |
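As one possible shape for the deterministic rule in item 1, the following Python sketch sorts constituents by a documented key: advected before non-advected, water vapor first among the advected species, then alphabetical by standard name. This is only an illustration of the idea, not a proposal for the actual rule or its implementation; `Constituent`, `registration_key`, and `registration_order` are hypothetical names, and the trace gases use the short labels from the table above in place of their real standard names.

```python
from typing import NamedTuple


class Constituent(NamedTuple):
    std_name: str
    advected: bool


SUFFIX = "_mixing_ratio_wrt_moist_air_and_condensed_water"
WATER_VAPOR = "water_vapor" + SUFFIX
CLOUD_LIQUID = "cloud_liquid_water" + SUFFIX
CLOUD_ICE = "cloud_ice" + SUFFIX


def registration_key(c):
    """Hypothetical rule: advected first, water vapor first, then by name."""
    return (not c.advected, c.std_name != WATER_VAPOR, c.std_name)


def registration_order(constituents):
    """Return standard names in an order independent of the input order."""
    return [c.std_name for c in sorted(constituents, key=registration_key)]


def build(advected, trace_gases):
    """Build a constituent list from advected and non-advected names."""
    return ([Constituent(name, True) for name in advected]
            + [Constituent(name, False) for name in trace_gases])


# The two runtime orders from the table above, used as inputs.
baseline = build([CLOUD_LIQUID, CLOUD_ICE, WATER_VAPOR],
                 ["CFC12", "O3", "CH4", "O2", "N2O", "CFC11", "CO2"])
capgen_ng = build([CLOUD_ICE, CLOUD_LIQUID, WATER_VAPOR],
                  ["CFC12", "O2", "CH4", "CO2", "O3", "N2O", "CFC11"])

for position, name in enumerate(registration_order(capgen_ng), start=1):
    print(position, name)
```

Both input orders map to the same output order under this key, which is the property the proposal is after. Note that this particular rule matches neither of today's orders, so it would still require the one-time re-baseline in item 2.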
| 139 | +## Artifacts |
| 140 | + |
| 141 | +- **Patch (git diff):** `<FILL IN: path to the .patch / repo+commit>` — |
| 142 | + reproduce with |
| 143 | + `git -C EXT/cam-sima-ng/ccpp_framework/capgen-ng diff src/ccpp_constituent_prop_mod.F90`. |
| 144 | +- **Run directories (Derecho):** |
| 145 | + - Baseline (original capgen): `<FILL IN>` |
| 146 | + - capgen-ng, unpatched (shows the FWAUT diff): `<FILL IN>` |
| 147 | + - capgen-ng + reorder patch (**b4b**): `<FILL IN>` |
| 148 | +- **cprnc summaries:** |
| 149 | + - unpatched vs baseline: `<FILL IN: FWAUT RMS ≈ 4.24e-2, state fields ~15 digits>` |
| 150 | + - patched vs baseline: `<FILL IN: all fields identical (b4b)>` |
| 151 | +- **Constituent lists (`debug_output = 2`, `atm.log`):** as tabulated above. |