Add --dag-quick-info flag to dagutil

---
issue: 25
title: "Add --dag-quick-info flag to dagutil"
---

## Problem

`dagutil --dag-info` computes expensive statistics (parsimony score distribution across all trees, pairwise RF distance distribution) that require enumerating the full weight distribution over all trees in the DAG. On production DAGs (194–296 taxa, ~1000–4000 nodes/edges), `--dag-info` runs for 20+ minutes at 100% CPU without completing.

The cheap stats (leaves, nodes, edges, tree_count, parsimony_min) are bundled together with the expensive distribution computations, so there is no way to get just the fast ones.

### Root cause

The cost difference comes from how `subtree_weight` is instantiated:

- **Cheap:** `subtree_weight<tree_count_ops>` and `subtree_weight<parsimony_score_ops>` each do a single bottom-up DP pass in O(nodes + edges). `tree_count_ops` sums alternatives within a clade and multiplies across clades; `parsimony_score_ops` takes the min within a clade and sums across clades. Both are memoized per-node.
- **Expensive:** `subtree_weight<weight_accumulator<parsimony_score_ops>>` and `subtree_weight<weight_accumulator<sum_rf_distance>>` track the *full distribution* of weights. The `weight_accumulator` wrapper maintains a `weight_counter` (flat_map from score → count), and the cartesian-product step at each clade can cause combinatorial blowup in map size. This is what takes 20+ minutes.

## Proposal

Add a `--dag-quick-info` flag that prints `tree_count` and `parsimony_min` in addition to the basic stats (leaves, nodes, edges) that are already printed unconditionally:

| Stat | Source | Cost | Notes |
|------|--------|------|-------|
| `leaves` | `leaf_count(result)` | O(n) | Already printed unconditionally |
| `nodes` | `node_count(result)` | O(1) | Already printed unconditionally |
| `edges` | `edge_count(result)` | O(1) | Already printed unconditionally |
| `tree_count` | `subtree_weight<tree_count_ops>::compute_weight_below()` | O(n + m) bottom-up DP | **New with this flag** |
| `parsimony_min` | `subtree_weight<parsimony_score_ops>::compute_weight_below()` | O(n + m) bottom-up DP | **New with this flag** |

Skip: `parsimony_all` distribution, `sum_rf_dist_all` distribution.

### Implementation

All changes are in `tools/dagutil.cpp`:

1. **Add `dag_quick_info` field to `args` struct** (line ~237):
   ```cpp
   bool dag_quick_info = false;
   ```

2. **Add CLI parsing** (in `parse_args`, after the `--dag-info` case around line 279):
   ```cpp
   } else if (arg == "--dag-quick-info") {
     a.dag_quick_info = true;
   ```
   Note: `--dag-quick-info` should NOT set `print_parsimony` or `print_rf_distance`.

3. **Add quick-info analysis block** (after the basic stats printout at line 391, before the existing `if (a.dag_info)` block). Guard against double-printing when `--dag-info` is also set, since `--dag-info` already prints `tree_count`:
   ```cpp
   if (a.dag_quick_info && !a.dag_info) {
     tree_count_ops tc_ops;
     subtree_weight<tree_count_ops> tc_sw(result, a.seed);
     auto tree_count = tc_sw.compute_weight_below(root_idx, tc_ops);
     std::cout << "tree_count: " << tree_count << "\n";

     parsimony_score_ops pops;
     subtree_weight<parsimony_score_ops> psw(result, a.seed);
     auto pmin = psw.compute_weight_below(root_idx, pops);
     std::cout << "parsimony_min: " << pmin << "\n";
   }
   ```

4. **Update usage string** (line ~211):
   ```
   Analysis:
     --dag-info              Print all DAG statistics (tree count, parsimony, RF)
     --dag-quick-info        Print fast DAG statistics (tree count, parsimony min)
     --parsimony             Print parsimony score distribution
     --sum-rf-distance       Print sum RF distance distribution
   ```

5. **`--dag-info` subsumes `--dag-quick-info`**: If both flags are given, the quick block is skipped (guarded by `!a.dag_info` in step 3) and `--dag-info` prints its superset of stats as usual. No error, no duplicate output.

### Output format

```
leaves: 296
nodes: 3847
edges: 4102
tree_count: 183729182739182
parsimony_min: 1042
```

The `tree_count` line matches the existing `--dag-info` output exactly. The `parsimony_min` line uses a simpler format than `--dag-info` (which prints `parsimony_min: score:1042, count:37`). The quick-info version omits the count since computing it requires no extra work but the format is intentionally simpler — downstream parsers that extract the first integer after `parsimony_min:` will work with both formats.

### Note on `parsimony_min` semantics

`parsimony_score_ops::within_clade_accum` picks the minimum-weight edges at each clade, so `compute_weight_below` with bare `parsimony_score_ops` returns the minimum parsimony score across all trees in the DAG. After `--trim`, this is the score of every remaining tree. This is the same value as the `parsimony_min: score:X` line from `--dag-info`, just without the count.

Out of scope for this issue: if the count of minimum-parsimony trees is also wanted cheaply, it can be obtained via `subtree_weight::min_weight_count()` which is also O(n + m) (it reuses the cached min-weight edges).

## Motivation

Needed for experiment workflows (e.g. matsengrp/phyz#795) that measure DAG growth by comparing node/edge counts and tree_count before and after merging new trees. These workflows run dagutil on many DAGs and need results in seconds, not hours.

## Test plan

Verify against the included test DAGs:

```bash
# Quick-info should complete instantly and print tree_count + parsimony_min
./build/dagutil --dag-quick-info --dag-pb data/madag/fluC_PB2_10taxa.pb --force-no-vcf

# Confirm --dag-info still works and its output is a superset of quick-info
./build/dagutil --dag-info --dag-pb data/madag/fluC_PB2_7taxa.pb --force-no-vcf

# Both flags together: no duplicate tree_count lines
./build/dagutil --dag-quick-info --dag-info --dag-pb data/madag/fluC_PB2_7taxa.pb --force-no-vcf | grep -c 'tree_count:'
# Expected: 1
```

## Acceptance criteria

- [ ] `dagutil --dag-quick-info --dag-pb foo.pb` prints tree_count and parsimony_min (in addition to the unconditional leaves/nodes/edges)
- [ ] Completes in < 1 second on the `data/madag/fluC_PB2_10taxa.pb` test DAG (and expected to scale linearly to 4000-node DAGs)
- [ ] `--dag-info` still works unchanged
- [ ] Both flags together produce no duplicate output lines
- [ ] Usage string updated
- [ ] Output format is machine-parseable (matches existing `key: value` format)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add --dag-quick-info flag to dagutil #25

issue: 25
title: "Add --dag-quick-info flag to dagutil"

Problem

Root cause

Proposal

Implementation

Output format

Note on `parsimony_min` semantics

Motivation

Test plan

Acceptance criteria

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Stat	Source	Cost	Notes
`leaves`	`leaf_count(result)`	O(n)	Already printed unconditionally
`nodes`	`node_count(result)`	O(1)	Already printed unconditionally
`edges`	`edge_count(result)`	O(1)	Already printed unconditionally
`tree_count`	`subtree_weight<tree_count_ops>::compute_weight_below()`	O(n + m) bottom-up DP	New with this flag
`parsimony_min`	`subtree_weight<parsimony_score_ops>::compute_weight_below()`	O(n + m) bottom-up DP	New with this flag

Add --dag-quick-info flag to dagutil #25

Description

issue: 25 title: "Add --dag-quick-info flag to dagutil"

Problem

Root cause

Proposal

Implementation

Output format

Note on parsimony_min semantics

Motivation

Test plan

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

issue: 25
title: "Add --dag-quick-info flag to dagutil"

Note on `parsimony_min` semantics