Skip to content

Add --dag-quick-info flag to dagutil #25

@matsen

Description

@matsen

issue: 25
title: "Add --dag-quick-info flag to dagutil"

Problem

dagutil --dag-info computes expensive statistics (parsimony score distribution across all trees, pairwise RF distance distribution) that require enumerating the full weight distribution over all trees in the DAG. On production DAGs (194–296 taxa, ~1000–4000 nodes/edges), --dag-info runs for 20+ minutes at 100% CPU without completing.

The cheap stats (leaves, nodes, edges, tree_count, parsimony_min) are bundled together with the expensive distribution computations, so there is no way to get just the fast ones.

Root cause

The cost difference comes from how subtree_weight is instantiated:

  • Cheap: subtree_weight<tree_count_ops> and subtree_weight<parsimony_score_ops> each do a single bottom-up DP pass in O(nodes + edges). tree_count_ops sums alternatives within a clade and multiplies across clades; parsimony_score_ops takes the min within a clade and sums across clades. Both are memoized per-node.
  • Expensive: subtree_weight<weight_accumulator<parsimony_score_ops>> and subtree_weight<weight_accumulator<sum_rf_distance>> track the full distribution of weights. The weight_accumulator wrapper maintains a weight_counter (flat_map from score → count), and the cartesian-product step at each clade can cause combinatorial blowup in map size. This is what takes 20+ minutes.

Proposal

Add a --dag-quick-info flag that prints tree_count and parsimony_min in addition to the basic stats (leaves, nodes, edges) that are already printed unconditionally:

Stat Source Cost Notes
leaves leaf_count(result) O(n) Already printed unconditionally
nodes node_count(result) O(1) Already printed unconditionally
edges edge_count(result) O(1) Already printed unconditionally
tree_count subtree_weight<tree_count_ops>::compute_weight_below() O(n + m) bottom-up DP New with this flag
parsimony_min subtree_weight<parsimony_score_ops>::compute_weight_below() O(n + m) bottom-up DP New with this flag

Skip: parsimony_all distribution, sum_rf_dist_all distribution.

Implementation

All changes are in tools/dagutil.cpp:

  1. Add dag_quick_info field to args struct (line ~237):

    bool dag_quick_info = false;
  2. Add CLI parsing (in parse_args, after the --dag-info case around line 279):

    } else if (arg == "--dag-quick-info") {
      a.dag_quick_info = true;

    Note: --dag-quick-info should NOT set print_parsimony or print_rf_distance.

  3. Add quick-info analysis block (after the basic stats printout at line 391, before the existing if (a.dag_info) block). Guard against double-printing when --dag-info is also set, since --dag-info already prints tree_count:

    if (a.dag_quick_info && !a.dag_info) {
      tree_count_ops tc_ops;
      subtree_weight<tree_count_ops> tc_sw(result, a.seed);
      auto tree_count = tc_sw.compute_weight_below(root_idx, tc_ops);
      std::cout << "tree_count: " << tree_count << "\n";
    
      parsimony_score_ops pops;
      subtree_weight<parsimony_score_ops> psw(result, a.seed);
      auto pmin = psw.compute_weight_below(root_idx, pops);
      std::cout << "parsimony_min: " << pmin << "\n";
    }
  4. Update usage string (line ~211):

    Analysis:
      --dag-info              Print all DAG statistics (tree count, parsimony, RF)
      --dag-quick-info        Print fast DAG statistics (tree count, parsimony min)
      --parsimony             Print parsimony score distribution
      --sum-rf-distance       Print sum RF distance distribution
    
  5. --dag-info subsumes --dag-quick-info: If both flags are given, the quick block is skipped (guarded by !a.dag_info in step 3) and --dag-info prints its superset of stats as usual. No error, no duplicate output.

Output format

leaves: 296
nodes: 3847
edges: 4102
tree_count: 183729182739182
parsimony_min: 1042

The tree_count line matches the existing --dag-info output exactly. The parsimony_min line uses a simpler format than --dag-info (which prints parsimony_min: score:1042, count:37). The quick-info version omits the count since computing it requires no extra work but the format is intentionally simpler — downstream parsers that extract the first integer after parsimony_min: will work with both formats.

Note on parsimony_min semantics

parsimony_score_ops::within_clade_accum picks the minimum-weight edges at each clade, so compute_weight_below with bare parsimony_score_ops returns the minimum parsimony score across all trees in the DAG. After --trim, this is the score of every remaining tree. This is the same value as the parsimony_min: score:X line from --dag-info, just without the count.

Out of scope for this issue: if the count of minimum-parsimony trees is also wanted cheaply, it can be obtained via subtree_weight::min_weight_count() which is also O(n + m) (it reuses the cached min-weight edges).

Motivation

Needed for experiment workflows (e.g. matsengrp/phyz#795) that measure DAG growth by comparing node/edge counts and tree_count before and after merging new trees. These workflows run dagutil on many DAGs and need results in seconds, not hours.

Test plan

Verify against the included test DAGs:

# Quick-info should complete instantly and print tree_count + parsimony_min
./build/dagutil --dag-quick-info --dag-pb data/madag/fluC_PB2_10taxa.pb --force-no-vcf

# Confirm --dag-info still works and its output is a superset of quick-info
./build/dagutil --dag-info --dag-pb data/madag/fluC_PB2_7taxa.pb --force-no-vcf

# Both flags together: no duplicate tree_count lines
./build/dagutil --dag-quick-info --dag-info --dag-pb data/madag/fluC_PB2_7taxa.pb --force-no-vcf | grep -c 'tree_count:'
# Expected: 1

Acceptance criteria

  • dagutil --dag-quick-info --dag-pb foo.pb prints tree_count and parsimony_min (in addition to the unconditional leaves/nodes/edges)
  • Completes in < 1 second on the data/madag/fluC_PB2_10taxa.pb test DAG (and expected to scale linearly to 4000-node DAGs)
  • --dag-info still works unchanged
  • Both flags together produce no duplicate output lines
  • Usage string updated
  • Output format is machine-parseable (matches existing key: value format)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions