|
| 1 | +# Plan: Cache ref-sequence-derived quantities in Run, share with Subruns |
| 2 | + |
| 3 | +**Status: SHELVED** — not worth the surgical risk during paper review. |
| 4 | + |
| 5 | +## Problem |
| 6 | + |
| 7 | +`calc_cum_Q_l_for_sequence` and `calc_state_frequencies_per_partition_of` |
| 8 | +are O(L) scans over `ref_sequence` that depend only on `ref_sequence` |
| 9 | +and `evo` — quantities that are identical across all subruns and |
| 10 | +already known at the `Run` level. |
| 11 | + |
| 12 | +All subruns share the same `ref_sequence`: in `repartition()` (line |
| 13 | +136 of `core/run.cpp`), every subtree gets `subtree.ref_sequence = |
| 14 | +ref_seq` where `ref_seq = tree_.ref_sequence`. No subrun ever |
| 15 | +modifies `ref_sequence` — subruns only modify mutations, missations, |
| 16 | +topology, and node times. (The `DCHECK` at line 246 of `reassemble()` |
| 17 | +only checks the root-containing subrun, so it's not strong |
| 18 | +confirmation by itself; the real guarantee is that no code path in |
| 19 | +`Subrun` writes to `ref_sequence`.) |
| 20 | + |
| 21 | +Each subrun independently recomputes these in |
| 22 | +`Subrun::recalc_derived_quantities()` after every global move cycle |
| 23 | +(~1,000 times over 20M steps for 200 tips). With `num_parts = 1` |
| 24 | +(default), there is one redundant recomputation per cycle; with N |
| 25 | +parts, N redundant recomputations. |
| 26 | + |
| 27 | +`Run::recalc_derived_quantities()` (called at line 629 in |
| 28 | +`do_mcmc_steps`) already calls `derive_evo()` and computes |
| 29 | +`evo_`-dependent quantities *before* `push_global_params_to_subruns()` |
| 30 | +(line 641). So by the time subruns need these values, the `Run` |
| 31 | +already has the current `evo_`. |
| 32 | + |
| 33 | +## Profiling notes |
| 34 | + |
| 35 | +Initial profiling with default `--v0-log-every` (= steps/10000) |
| 36 | +exaggerated the cost of these functions because each |
| 37 | +`do_mcmc_steps()` call triggers a `repartition()`, and with |
| 38 | +`step_granularity = 2000` there were ~10,000 repartitions in a 20M |
| 39 | +step run. In realistic runs, `--v0-log-every` is much larger and |
| 40 | +repartitions are far less frequent. |
| 41 | + |
| 42 | +With realistic repartition frequency (`--v0-log-every 200000`, |
| 43 | +~100 repartitions over 20M steps, `06_missing_data` sim_000): |
| 44 | + |
| 45 | +- **97B cpu_core cycles, 17s wall time** (vs 255B / 39s with default |
| 46 | + log-every) |
| 47 | +- `calc_cum_Q_l_for_sequence`: **10.7%** (was 20.3%) |
| 48 | +- `calc_state_frequencies_per_partition_of`: **7.1%** (was 13.9%) |
| 49 | +- Combined: **17.8%** (was 34%) |
| 50 | +- Expected speedup from caching: **~1.2x** (was ~1.3-1.5x) |
| 51 | + |
| 52 | +For comparison, the top MCMC move hotspots with realistic frequency: |
| 53 | +- `inner_node_displace_move`: 9.5% |
| 54 | +- `__nextafter`: 4.9% |
| 55 | +- `__ieee754_log_fma`: 4.1% |
| 56 | +- `calc_Ttwiddle_beta_a`: 2.7% |
| 57 | +- `branch_reform_move`: 2.3% |
| 58 | + |
| 59 | +## Goal |
| 60 | + |
| 61 | +Compute `ref_cum_Q_l` and |
| 62 | +`state_frequencies_of_ref_sequence_per_partition` once in `Run` after |
| 63 | +`evo_` changes, and share them read-only with subruns via const |
| 64 | +pointers, so subruns skip the redundant O(L) recomputations entirely. |
| 65 | + |
| 66 | +## Changes |
| 67 | + |
| 68 | +### 1. Cache in `Run::recalc_derived_quantities()` |
| 69 | + |
| 70 | +Add two cached members to `Run`: |
| 71 | + |
| 72 | +```cpp |
| 73 | +mutable std::vector<double> ref_cum_Q_l_{}; |
| 74 | +mutable Partition_vector<Seq_vector<int>> |
| 75 | + state_frequencies_of_ref_sequence_per_partition_{}; |
| 76 | +``` |
| 77 | + |
| 78 | +In `Run::recalc_derived_quantities()`, compute them right after |
| 79 | +`derive_evo()`: |
| 80 | + |
| 81 | +```cpp |
| 82 | +ref_cum_Q_l_ = calc_cum_Q_l_for_sequence(tree_.ref_sequence, evo_); |
| 83 | +state_frequencies_of_ref_sequence_per_partition_ = |
| 84 | + calc_state_frequencies_per_partition_of(tree_.ref_sequence, evo_); |
| 85 | +``` |
| 86 | +
|
| 87 | +Note: `Run` already has `state_frequencies_of_ref_sequence_` |
| 88 | +(unpartitioned, `Seq_vector<int>`), computed via |
| 89 | +`calc_cur_state_frequencies_of_ref_sequence()` which calls |
| 90 | +`calc_state_frequencies_per_partition_of(...)[0]`. Update it to |
| 91 | +derive from the cached partitioned version: |
| 92 | +
|
| 93 | +```cpp |
| 94 | +state_frequencies_of_ref_sequence_ = |
| 95 | + state_frequencies_of_ref_sequence_per_partition_[0]; |
| 96 | +``` |
| 97 | + |
| 98 | +This works because `state_frequencies_of_ref_sequence_` is only used |
| 99 | +in the single-partition case (the mpox hack has 2 partitions but does |
| 100 | +not use this member for anything partition-sensitive — verify). |
| 101 | + |
| 102 | +### 2. Share cached values with subruns via const pointers |
| 103 | + |
| 104 | +Add const-pointer members to `Subrun`: |
| 105 | + |
| 106 | +```cpp |
| 107 | +const std::vector<double>* shared_ref_cum_Q_l_ = nullptr; |
| 108 | +const Partition_vector<Seq_vector<int>>* |
| 109 | + shared_state_frequencies_of_ref_sequence_per_partition_ = nullptr; |
| 110 | +``` |
| 111 | + |
| 112 | +Add a single setter: |
| 113 | + |
| 114 | +```cpp |
| 115 | +auto set_shared_ref_seq_derived_quantities( |
| 116 | + const std::vector<double>& ref_cum_Q_l, |
| 117 | + const Partition_vector<Seq_vector<int>>& |
| 118 | + state_frequencies_of_ref_sequence_per_partition) -> void { |
| 119 | + shared_ref_cum_Q_l_ = &ref_cum_Q_l; |
| 120 | + shared_state_frequencies_of_ref_sequence_per_partition_ = |
| 121 | + &state_frequencies_of_ref_sequence_per_partition; |
| 122 | +} |
| 123 | +``` |
| 124 | +
|
| 125 | +In `push_global_params_to_subruns()`, call this after `set_evo()`: |
| 126 | +
|
| 127 | +```cpp |
| 128 | +subrun.set_shared_ref_seq_derived_quantities( |
| 129 | + ref_cum_Q_l_, |
| 130 | + state_frequencies_of_ref_sequence_per_partition_); |
| 131 | +``` |
| 132 | + |
| 133 | +### 3. Use shared values in `Subrun::recalc_derived_quantities()` |
| 134 | + |
| 135 | +In `Subrun::recalc_derived_quantities()`, use the shared values |
| 136 | +instead of recomputing: |
| 137 | + |
| 138 | +```cpp |
| 139 | +// Before: |
| 140 | +state_frequencies_of_ref_sequence_per_partition_ = |
| 141 | + calc_cur_state_frequencies_of_ref_sequence_per_partition(); |
| 142 | +ref_cum_Q_l_ = calc_cur_ref_cum_Q_l(); |
| 143 | + |
| 144 | +// After: |
| 145 | +if (shared_state_frequencies_of_ref_sequence_per_partition_) { |
| 146 | + state_frequencies_of_ref_sequence_per_partition_ = |
| 147 | + *shared_state_frequencies_of_ref_sequence_per_partition_; |
| 148 | +} else { |
| 149 | + state_frequencies_of_ref_sequence_per_partition_ = |
| 150 | + calc_cur_state_frequencies_of_ref_sequence_per_partition(); |
| 151 | +} |
| 152 | +if (shared_ref_cum_Q_l_) { |
| 153 | + ref_cum_Q_l_ = *shared_ref_cum_Q_l_; |
| 154 | +} else { |
| 155 | + ref_cum_Q_l_ = calc_cur_ref_cum_Q_l(); |
| 156 | +} |
| 157 | +``` |
| 158 | + |
| 159 | +The fallback (when pointers are null) keeps `Subrun` usable |
| 160 | +standalone, e.g. in tests. The `calc_cur_*` methods remain for |
| 161 | +`check_derived_quantities()` in debug builds. |
| 162 | + |
| 163 | +### 4. Lifetime safety |
| 164 | + |
| 165 | +The shared pointers point into `Run`'s mutable cached members. These |
| 166 | +are recomputed in `Run::recalc_derived_quantities()`, which runs at |
| 167 | +line 629 of `do_mcmc_steps` — *before* |
| 168 | +`push_global_params_to_subruns()` (line 641) and `run_local_moves()` |
| 169 | +(line 643). The cached values are stable for the entire local-moves |
| 170 | +phase. The pointers are refreshed every time |
| 171 | +`push_global_params_to_subruns()` is called, so they always point to |
| 172 | +the current values. |
| 173 | + |
| 174 | +After `repartition()`, `push_global_params_to_subruns()` runs before |
| 175 | +any subrun iteration (line 641 in the main loop), so newly |
| 176 | +constructed subruns will have their pointers set before first use. |
| 177 | + |
| 178 | +## Files changed |
| 179 | + |
| 180 | +- `core/run.h` — add `ref_cum_Q_l_` and |
| 181 | + `state_frequencies_of_ref_sequence_per_partition_` cached members |
| 182 | +- `core/run.cpp` — compute in `recalc_derived_quantities()`, pass in |
| 183 | + `push_global_params_to_subruns()`, derive |
| 184 | + `state_frequencies_of_ref_sequence_` from cached partitioned version |
| 185 | +- `core/subrun.h` — add shared const-pointer members and setter |
| 186 | +- `core/subrun.cpp` — use shared values in |
| 187 | + `recalc_derived_quantities()` when available |
| 188 | + |
| 189 | +## Validation |
| 190 | + |
| 191 | +1. **Baseline profile** (with realistic log-every): 97B cpu_core |
| 192 | + cycles, 17s wall time on `06_missing_data` sim_000, 20M steps, |
| 193 | + `--v0-log-every 200000`. |
| 194 | +2. Run the existing Delphy test suite (`build/release/tests/tests`). |
| 195 | +3. **After profile**: re-profile on the same dataset and workload, |
| 196 | + compare cpu_core cycles and wall time to the baseline. Expect |
| 197 | + `calc_cum_Q_l_for_sequence` and |
| 198 | + `calc_state_frequencies_per_partition_of` to drop from ~17.8% |
| 199 | + combined to near-zero in subrun context. |
| 200 | + |
| 201 | +## Expected impact |
| 202 | + |
| 203 | +With realistic repartition frequency, these two functions account for |
| 204 | +~17.8% of cpu_core time. With sharing, each is computed once per |
| 205 | +global move cycle (in `Run`) instead of once per subrun per cycle. |
| 206 | +Expected speedup: ~1.2x. |
0 commit comments