# Finalize covdb scaling plan (535k samples)

## Purpose
This document summarizes the current behavior of `determine_hom_refs_from_covdb`, the scaling bottlenecks observed at very large sample counts, and the recommended changes that make the step tractable for ~535k samples while preserving semantics.

## Context
The finalize step updates a sparse mtDNA MatrixTable by using per‑sample coverage from `coverage.h5` to distinguish between **missing** and **hom‑ref** genotypes. The inputs are extremely sparse (observed missing fraction ~0.9993), which means any strategy that repeatedly scans the entire entry space, or attempts per‑entry random lookups, will not scale.

---

## What the current code does (summary)
The current `determine_hom_refs_from_covdb` implementation:

1. Maps each MT sample to a row index in `coverage.h5` and each MT position to a column index in `coverage.h5`.
2. Splits positions into blocks (`position_block_size`).
3. For each block:
   - Reads coverage for **all samples** × **positions in the block** from HDF5.
   - Builds a per‑block literal and annotates entries with `__cov` using:
     - `mt = mt.annotate_entries(__cov=hl.if_else(mt.__block == block_id, cov_expr, mt.__cov))`
4. After all blocks, applies the hom‑ref logic:
   - If `HL` is missing and `DP` exceeds the threshold, set `HL=0`, `FT=PASS`, and `DP=coverage`.
   - Otherwise, keep the entry missing.
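
Steps 1–2 can be sketched with plain dictionaries (a hypothetical illustration; the real code's names and data structures may differ):

```python
def build_index_maps(mt_samples, h5_samples, mt_positions, h5_positions,
                     position_block_size):
    """Map MT samples/positions to coverage.h5 row/column indices, and
    assign each position to a block of `position_block_size` columns."""
    h5_sample_idx = {s: i for i, s in enumerate(h5_samples)}
    h5_pos_idx = {p: j for j, p in enumerate(h5_positions)}

    sample_rows = {s: h5_sample_idx[s] for s in mt_samples}    # MT sample -> h5 row
    position_cols = {p: h5_pos_idx[p] for p in mt_positions}   # MT position -> h5 column
    # The block id comes from the position's column index, not its genomic coordinate.
    position_block = {p: position_cols[p] // position_block_size for p in mt_positions}
    return sample_rows, position_cols, position_block
```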
| 23 | + |
### Why this becomes slow
Each per‑block `annotate_entries` call is evaluated **across all entries**, even though only that block's entries should change. With ~16–20 blocks, this multiplies entry‑level work by roughly 16–20×, which is the main scaling problem at 535k samples.
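
A small counting model illustrates the multiplication (the numbers are hypothetical; `n_entries` stands for the full entry space):

```python
def annotate_cost(n_entries, n_blocks, block_local):
    """Count entry evaluations for per-block annotation.

    In the conditional pattern, every annotate-entries pass touches all
    entries; in the block-local pattern, each block touches only its own
    ~1/n_blocks slice, so the total is the entry count, once.
    """
    if block_local:
        # Each block evaluates only its slice of entries.
        return sum(n_entries // n_blocks for _ in range(n_blocks))
    # Each of the n_blocks annotations scans every entry.
    return n_entries * n_blocks
```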

---

## Recommended plan (scalable approach)
### A) Keep the same semantics, but avoid repeated full‑MT entry scans
**Key change:** apply entry annotations only to the rows in the current block, then recombine.

**High‑level pattern:**

- Compute `__block` once per row.
- For each block:
  - `mt_b = mt.filter_rows(mt.__block == block_id)`
  - Compute coverage for that block and annotate **only** `mt_b` entries.
  - Apply the hom‑ref logic on `mt_b`.
  - Checkpoint `mt_b`.
- Union the blocks via a **small fan‑in tree** (not a long linear chain).

This preserves sparsity and ensures each row/entry is processed **once**, not once per block.
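
The shape of this pattern can be sketched generically in Python. Here lists of rows stand in for MatrixTables, `transform` stands in for the block‑local coverage annotation plus hom‑ref logic, and `tree_union` plays the role of a fan‑in `union_rows`; the Hail calls themselves are not shown, so this only illustrates the plan's structure:

```python
def tree_union(parts, combine):
    """Combine parts pairwise (fan-in tree) instead of a linear chain,
    keeping the depth of the combined plan O(log n) rather than O(n)."""
    while len(parts) > 1:
        parts = [combine(parts[i], parts[i + 1]) if i + 1 < len(parts) else parts[i]
                 for i in range(0, len(parts), 2)]
    return parts[0]

def process_in_blocks(rows, block_of, transform, combine):
    """Partition rows by block id, transform each block independently
    (analogous to filter_rows + annotate + checkpoint per block),
    then recombine with a fan-in tree."""
    blocks = {}
    for row in rows:
        blocks.setdefault(block_of(row), []).append(row)
    processed = [transform(block_id, rs) for block_id, rs in sorted(blocks.items())]
    return tree_union(processed, combine)
```

Each row is handed to `transform` exactly once, which is the property the refactor is after.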

### B) Remove unnecessary shuffles
The current `mt.repartition(n_blocks, shuffle=True)` forces a global shuffle. In the block‑local pattern it is not needed; remove it, or replace it with a non‑shuffle repartition (e.g. `naive_coalesce`) only when required for output sizing.

### C) Avoid a global `__cov` entry field
Create `__cov` only inside each block MT (`mt_b`). This keeps the entry schema of the full MT from inflating and reduces IR size.

### D) Combine with **sample sharding**
At 535k samples, even a perfect block‑local refactor can still be heavy. The strongest scaling improvement comes from **sharding by samples**: run finalize per shard, then `union_cols` the shards and compute cohort statistics afterwards.

Recommended structure:

1. **Split columns** into N shards (e.g., 20–24).
2. For each shard, run the block‑local hom‑ref imputation (A–C above).
3. **Union columns** across shards.
4. Run cohort‑wide row annotations (AC/AN/AF, histograms, hap/pop splits) once.
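
The shard split in step 1 can be sketched as follows (the sample ids and shard count are hypothetical; in Hail the per‑shard selection would be a `filter_cols` or `choose_cols` call):

```python
def split_into_shards(sample_ids, n_shards):
    """Split the sample list into n_shards near-equal, disjoint chunks
    whose union covers every sample exactly once."""
    shards = [sample_ids[i::n_shards] for i in range(n_shards)]
    return [s for s in shards if s]  # drop empty shards when n_shards > len(sample_ids)
```

Strided slicing keeps shard sizes within one sample of each other, so no shard becomes a straggler purely by construction.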

---

## Why this scales to 535k samples
1. **No repeated full‑matrix entry scans**: each row/entry is processed once.
2. **Sparse‑preserving**: we never densify the MT; missing entries stay missing unless coverage supports hom‑ref.
3. **Reduced driver pressure**: block literals stay small and per‑shard.
4. **Parallelism**: sample shards scale horizontally; each shard’s workload is smaller and independent.
5. **Correctness preserved**: the hom‑ref logic is unchanged; only the execution plan is optimized.

---

## Before vs. after (behavioral differences)
| Aspect | Current code | Recommended code |
|---|---|---|
| Entry‑level evaluation | Full MT per block | Only rows in the current block |
| Shuffling | Global shuffle (`shuffle=True`) | Removed or minimized |
| `__cov` creation | Whole MT | Block‑local only |
| Recombination | Single MT updated per block | Union of block MTs (fan‑in) |
| Scaling | Work ~ #blocks × all entries | Work ~ all entries, once |

---

## Final notes
- The hom‑ref logic is **identical** to the current behavior.
- The plan avoids any “lazy per‑entry lookup” that would cause billions of random HDF5 reads.
- The strategy remains compatible with downstream `add_annotations` and other cohort‑wide statistics, which should be computed **after** unioning shards.

If needed, a follow‑on doc can include concrete code changes or a WDL wiring plan for shard/fan‑in orchestration.