-
Notifications
You must be signed in to change notification settings - Fork 0
Description
issue: 25
title: "Add --dag-quick-info flag to dagutil"
Problem
dagutil --dag-info computes expensive statistics (parsimony score distribution across all trees, pairwise RF distance distribution) that require enumerating the full weight distribution over all trees in the DAG. On production DAGs (194–296 taxa, ~1000–4000 nodes/edges), --dag-info runs for 20+ minutes at 100% CPU without completing.
The cheap stats (leaves, nodes, edges, tree_count, parsimony_min) are bundled together with the expensive distribution computations, so there is no way to get just the fast ones.
Root cause
The cost difference comes from how subtree_weight is instantiated:
- Cheap:
subtree_weight<tree_count_ops>andsubtree_weight<parsimony_score_ops>each do a single bottom-up DP pass in O(nodes + edges).tree_count_opssums alternatives within a clade and multiplies across clades;parsimony_score_opstakes the min within a clade and sums across clades. Both are memoized per-node. - Expensive:
subtree_weight<weight_accumulator<parsimony_score_ops>>andsubtree_weight<weight_accumulator<sum_rf_distance>>track the full distribution of weights. Theweight_accumulatorwrapper maintains aweight_counter(flat_map from score → count), and the cartesian-product step at each clade can cause combinatorial blowup in map size. This is what takes 20+ minutes.
Proposal
Add a --dag-quick-info flag that prints tree_count and parsimony_min in addition to the basic stats (leaves, nodes, edges) that are already printed unconditionally:
| Stat | Source | Cost | Notes |
|---|---|---|---|
leaves |
leaf_count(result) |
O(n) | Already printed unconditionally |
nodes |
node_count(result) |
O(1) | Already printed unconditionally |
edges |
edge_count(result) |
O(1) | Already printed unconditionally |
tree_count |
subtree_weight<tree_count_ops>::compute_weight_below() |
O(n + m) bottom-up DP | New with this flag |
parsimony_min |
subtree_weight<parsimony_score_ops>::compute_weight_below() |
O(n + m) bottom-up DP | New with this flag |
Skip: parsimony_all distribution, sum_rf_dist_all distribution.
Implementation
All changes are in tools/dagutil.cpp:
-
Add
dag_quick_infofield toargsstruct (line ~237):bool dag_quick_info = false;
-
Add CLI parsing (in
parse_args, after the--dag-infocase around line 279):} else if (arg == "--dag-quick-info") { a.dag_quick_info = true;
Note:
--dag-quick-infoshould NOT setprint_parsimonyorprint_rf_distance. -
Add quick-info analysis block (after the basic stats printout at line 391, before the existing
if (a.dag_info)block). Guard against double-printing when--dag-infois also set, since--dag-infoalready printstree_count:if (a.dag_quick_info && !a.dag_info) { tree_count_ops tc_ops; subtree_weight<tree_count_ops> tc_sw(result, a.seed); auto tree_count = tc_sw.compute_weight_below(root_idx, tc_ops); std::cout << "tree_count: " << tree_count << "\n"; parsimony_score_ops pops; subtree_weight<parsimony_score_ops> psw(result, a.seed); auto pmin = psw.compute_weight_below(root_idx, pops); std::cout << "parsimony_min: " << pmin << "\n"; }
-
Update usage string (line ~211):
Analysis: --dag-info Print all DAG statistics (tree count, parsimony, RF) --dag-quick-info Print fast DAG statistics (tree count, parsimony min) --parsimony Print parsimony score distribution --sum-rf-distance Print sum RF distance distribution -
--dag-infosubsumes--dag-quick-info: If both flags are given, the quick block is skipped (guarded by!a.dag_infoin step 3) and--dag-infoprints its superset of stats as usual. No error, no duplicate output.
Output format
leaves: 296
nodes: 3847
edges: 4102
tree_count: 183729182739182
parsimony_min: 1042
The tree_count line matches the existing --dag-info output exactly. The parsimony_min line uses a simpler format than --dag-info (which prints parsimony_min: score:1042, count:37). The quick-info version omits the count since computing it requires no extra work but the format is intentionally simpler — downstream parsers that extract the first integer after parsimony_min: will work with both formats.
Note on parsimony_min semantics
parsimony_score_ops::within_clade_accum picks the minimum-weight edges at each clade, so compute_weight_below with bare parsimony_score_ops returns the minimum parsimony score across all trees in the DAG. After --trim, this is the score of every remaining tree. This is the same value as the parsimony_min: score:X line from --dag-info, just without the count.
Out of scope for this issue: if the count of minimum-parsimony trees is also wanted cheaply, it can be obtained via subtree_weight::min_weight_count() which is also O(n + m) (it reuses the cached min-weight edges).
Motivation
Needed for experiment workflows (e.g. matsengrp/phyz#795) that measure DAG growth by comparing node/edge counts and tree_count before and after merging new trees. These workflows run dagutil on many DAGs and need results in seconds, not hours.
Test plan
Verify against the included test DAGs:
# Quick-info should complete instantly and print tree_count + parsimony_min
./build/dagutil --dag-quick-info --dag-pb data/madag/fluC_PB2_10taxa.pb --force-no-vcf
# Confirm --dag-info still works and its output is a superset of quick-info
./build/dagutil --dag-info --dag-pb data/madag/fluC_PB2_7taxa.pb --force-no-vcf
# Both flags together: no duplicate tree_count lines
./build/dagutil --dag-quick-info --dag-info --dag-pb data/madag/fluC_PB2_7taxa.pb --force-no-vcf | grep -c 'tree_count:'
# Expected: 1Acceptance criteria
-
dagutil --dag-quick-info --dag-pb foo.pbprints tree_count and parsimony_min (in addition to the unconditional leaves/nodes/edges) - Completes in < 1 second on the
data/madag/fluC_PB2_10taxa.pbtest DAG (and expected to scale linearly to 4000-node DAGs) -
--dag-infostill works unchanged - Both flags together produce no duplicate output lines
- Usage string updated
- Output format is machine-parseable (matches existing
key: valueformat)