Skip to content

Commit ffb4728

Browse files
patvarillyclaude
andcommitted
Add shelved plan for caching ref-sequence-derived quantities
Profiling showed calc_cum_Q_l_for_sequence (10.7%) and calc_state_frequencies_per_partition_of (7.1%) as the top remaining hotspots. Both depend only on ref_sequence and evo, which are shared across all subruns. Caching in Run and sharing via const pointers would save ~17.8% of cpu_core time (~1.2x speedup). Shelved to avoid major surgery during paper review. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 4d1df71 commit ffb4728

File tree

1 file changed

+206
-0
lines changed

1 file changed

+206
-0
lines changed
Lines changed: 206 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,206 @@
1+
# Plan: Cache ref-sequence-derived quantities in Run, share with Subruns
2+
3+
**Status: SHELVED** — not worth the surgical risk during paper review.
4+
5+
## Problem
6+
7+
`calc_cum_Q_l_for_sequence` and `calc_state_frequencies_per_partition_of`
8+
are O(L) scans over `ref_sequence` that depend only on `ref_sequence`
9+
and `evo` — quantities that are identical across all subruns and
10+
already known at the `Run` level.
11+
12+
All subruns share the same `ref_sequence`: in `repartition()` (line
13+
136 of `core/run.cpp`), every subtree gets `subtree.ref_sequence =
14+
ref_seq` where `ref_seq = tree_.ref_sequence`. No subrun ever
15+
modifies `ref_sequence` — subruns only modify mutations, missations,
16+
topology, and node times. (The `DCHECK` at line 246 of `reassemble()`
17+
only checks the root-containing subrun, so it's not strong
18+
confirmation by itself; the real guarantee is that no code path in
19+
`Subrun` writes to `ref_sequence`.)
20+
21+
Each subrun independently recomputes these in
22+
`Subrun::recalc_derived_quantities()` after every global move cycle
23+
(~1,000 times over 20M steps for 200 tips). With `num_parts = 1`
24+
(default), there is one redundant recomputation per cycle; with N
25+
parts, N redundant recomputations.
26+
27+
`Run::recalc_derived_quantities()` (called at line 629 in
28+
`do_mcmc_steps`) already calls `derive_evo()` and computes
29+
`evo_`-dependent quantities *before* `push_global_params_to_subruns()`
30+
(line 641). So by the time subruns need these values, the `Run`
31+
already has the current `evo_`.
32+
33+
## Profiling notes
34+
35+
Initial profiling with default `--v0-log-every` (= steps/10000)
36+
exaggerated the cost of these functions because each
37+
`do_mcmc_steps()` call triggers a `repartition()`, and with
38+
`step_granularity = 2000` there were ~10,000 repartitions in a 20M
39+
step run. In realistic runs, `--v0-log-every` is much larger and
40+
repartitions are far less frequent.
41+
42+
With realistic repartition frequency (`--v0-log-every 200000`,
43+
~100 repartitions over 20M steps, `06_missing_data` sim_000):
44+
45+
- **97B cpu_core cycles, 17s wall time** (vs 255B / 39s with default
46+
log-every)
47+
- `calc_cum_Q_l_for_sequence`: **10.7%** (was 20.3%)
48+
- `calc_state_frequencies_per_partition_of`: **7.1%** (was 13.9%)
49+
- Combined: **17.8%** (was 34%)
50+
- Expected speedup from caching: **~1.2x** (was ~1.3-1.5x)
51+
52+
For comparison, the top MCMC move hotspots with realistic frequency:
53+
- `inner_node_displace_move`: 9.5%
54+
- `__nextafter`: 4.9%
55+
- `__ieee754_log_fma`: 4.1%
56+
- `calc_Ttwiddle_beta_a`: 2.7%
57+
- `branch_reform_move`: 2.3%
58+
59+
## Goal
60+
61+
Compute `ref_cum_Q_l` and
62+
`state_frequencies_of_ref_sequence_per_partition` once in `Run` after
63+
`evo_` changes, and share them read-only with subruns via const
64+
pointers, so subruns skip the redundant O(L) recomputations entirely.
65+
66+
## Changes
67+
68+
### 1. Cache in `Run::recalc_derived_quantities()`
69+
70+
Add two cached members to `Run`:
71+
72+
```cpp
73+
mutable std::vector<double> ref_cum_Q_l_{};
74+
mutable Partition_vector<Seq_vector<int>>
75+
state_frequencies_of_ref_sequence_per_partition_{};
76+
```
77+
78+
In `Run::recalc_derived_quantities()`, compute them right after
79+
`derive_evo()`:
80+
81+
```cpp
82+
ref_cum_Q_l_ = calc_cum_Q_l_for_sequence(tree_.ref_sequence, evo_);
83+
state_frequencies_of_ref_sequence_per_partition_ =
84+
calc_state_frequencies_per_partition_of(tree_.ref_sequence, evo_);
85+
```
86+
87+
Note: `Run` already has `state_frequencies_of_ref_sequence_`
88+
(unpartitioned, `Seq_vector<int>`), computed via
89+
`calc_cur_state_frequencies_of_ref_sequence()` which calls
90+
`calc_state_frequencies_per_partition_of(...)[0]`. Update it to
91+
derive from the cached partitioned version:
92+
93+
```cpp
94+
state_frequencies_of_ref_sequence_ =
95+
state_frequencies_of_ref_sequence_per_partition_[0];
96+
```
97+
98+
This works because `state_frequencies_of_ref_sequence_` is only used
99+
in the single-partition case (the mpox hack has 2 partitions but does
100+
not use this member for anything partition-sensitive — verify).
101+
102+
### 2. Share cached values with subruns via const pointers
103+
104+
Add const-pointer members to `Subrun`:
105+
106+
```cpp
107+
const std::vector<double>* shared_ref_cum_Q_l_ = nullptr;
108+
const Partition_vector<Seq_vector<int>>*
109+
shared_state_frequencies_of_ref_sequence_per_partition_ = nullptr;
110+
```
111+
112+
Add a single setter:
113+
114+
```cpp
115+
auto set_shared_ref_seq_derived_quantities(
116+
const std::vector<double>& ref_cum_Q_l,
117+
const Partition_vector<Seq_vector<int>>&
118+
state_frequencies_of_ref_sequence_per_partition) -> void {
119+
shared_ref_cum_Q_l_ = &ref_cum_Q_l;
120+
shared_state_frequencies_of_ref_sequence_per_partition_ =
121+
&state_frequencies_of_ref_sequence_per_partition;
122+
}
123+
```
124+
125+
In `push_global_params_to_subruns()`, call this after `set_evo()`:
126+
127+
```cpp
128+
subrun.set_shared_ref_seq_derived_quantities(
129+
ref_cum_Q_l_,
130+
state_frequencies_of_ref_sequence_per_partition_);
131+
```
132+
133+
### 3. Use shared values in `Subrun::recalc_derived_quantities()`
134+
135+
In `Subrun::recalc_derived_quantities()`, use the shared values
136+
instead of recomputing:
137+
138+
```cpp
139+
// Before:
140+
state_frequencies_of_ref_sequence_per_partition_ =
141+
calc_cur_state_frequencies_of_ref_sequence_per_partition();
142+
ref_cum_Q_l_ = calc_cur_ref_cum_Q_l();
143+
144+
// After:
145+
if (shared_state_frequencies_of_ref_sequence_per_partition_) {
146+
state_frequencies_of_ref_sequence_per_partition_ =
147+
*shared_state_frequencies_of_ref_sequence_per_partition_;
148+
} else {
149+
state_frequencies_of_ref_sequence_per_partition_ =
150+
calc_cur_state_frequencies_of_ref_sequence_per_partition();
151+
}
152+
if (shared_ref_cum_Q_l_) {
153+
ref_cum_Q_l_ = *shared_ref_cum_Q_l_;
154+
} else {
155+
ref_cum_Q_l_ = calc_cur_ref_cum_Q_l();
156+
}
157+
```
158+
159+
The fallback (when pointers are null) keeps `Subrun` usable
160+
standalone, e.g. in tests. The `calc_cur_*` methods remain for
161+
`check_derived_quantities()` in debug builds.
162+
163+
### 4. Lifetime safety
164+
165+
The shared pointers point into `Run`'s mutable cached members. These
166+
are recomputed in `Run::recalc_derived_quantities()`, which runs at
167+
line 629 of `do_mcmc_steps`*before*
168+
`push_global_params_to_subruns()` (line 641) and `run_local_moves()`
169+
(line 643). The cached values are stable for the entire local-moves
170+
phase. The pointers are refreshed every time
171+
`push_global_params_to_subruns()` is called, so they always point to
172+
the current values.
173+
174+
After `repartition()`, `push_global_params_to_subruns()` runs before
175+
any subrun iteration (line 641 in the main loop), so newly
176+
constructed subruns will have their pointers set before first use.
177+
178+
## Files changed
179+
180+
- `core/run.h` — add `ref_cum_Q_l_` and
181+
`state_frequencies_of_ref_sequence_per_partition_` cached members
182+
- `core/run.cpp` — compute in `recalc_derived_quantities()`, pass in
183+
`push_global_params_to_subruns()`, derive
184+
`state_frequencies_of_ref_sequence_` from cached partitioned version
185+
- `core/subrun.h` — add shared const-pointer members and setter
186+
- `core/subrun.cpp` — use shared values in
187+
`recalc_derived_quantities()` when available
188+
189+
## Validation
190+
191+
1. **Baseline profile** (with realistic log-every): 97B cpu_core
192+
cycles, 17s wall time on `06_missing_data` sim_000, 20M steps,
193+
`--v0-log-every 200000`.
194+
2. Run the existing Delphy test suite (`build/release/tests/tests`).
195+
3. **After profile**: re-profile on the same dataset and workload,
196+
compare cpu_core cycles and wall time to the baseline. Expect
197+
`calc_cum_Q_l_for_sequence` and
198+
`calc_state_frequencies_per_partition_of` to drop from ~17.8%
199+
combined to near-zero in subrun context.
200+
201+
## Expected impact
202+
203+
With realistic repartition frequency, these two functions account for
204+
~17.8% of cpu_core time. With sharing, each is computed once per
205+
global move cycle (in `Run`) instead of once per subrun per cycle.
206+
Expected speedup: ~1.2x.

0 commit comments

Comments
 (0)