# Assembly Regression Testing

End-to-end regression testing for assembly pipeline changes against Terra submissions.

## When to Use

Use this playbook when a PR makes functional changes to the assembly or variant-calling
pipeline (e.g., swapping variant callers, changing alignment parameters, modifying
consensus logic). It compares assembly outputs from old vs new code across hundreds
of real samples to validate equivalence or improvement.

## Prerequisites

- **gcloud CLI** -- authenticated with access to Terra workspace GCS buckets
- **mafft** -- for pairwise sequence alignment
- **Python** with pandas and matplotlib (e.g., a dataviz venv)
- **dsub** -- for running VADR batch jobs on GCP (see the `dsub-batch-jobs` skill)

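A quick preflight check for these tools can save a failed run later; a minimal sketch (tool names as listed above):

```python
import shutil

def check_prereqs(tools=("gcloud", "mafft", "dsub")):
    """Return the subset of required CLI tools missing from PATH."""
    return [t for t in tools if shutil.which(t) is None]
```

Run it before starting and install anything it reports missing.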
## Workflow

### Step 1: Set Up Terra Submissions (Manual)

The user must manually launch Terra submissions with the old and new code:
1. Run the pipeline on a representative dataset using the **main branch** Docker image
2. Run the same pipeline on the same dataset using the **feature branch** Docker image
3. Note the submission IDs and workspace bucket for both runs

### Step 2: Discover Paired Samples

Use `discover_pairs.py` to find all comparable old/new sample pairs by crawling
the GCS Cromwell output directories.

```bash
python discover_pairs.py \
    --bucket <workspace-bucket-id> \
    --old-sub <old-submission-id> \
    --new-sub <new-submission-id> \
    --output pairs.json
```

This produces a JSON mapping `sample_name -> {old_tsv, new_tsv}` for all samples
present in both submissions.

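Downstream steps can consume the mapping like this (a sketch; the `old_tsv`/`new_tsv` keys follow the description above):

```python
import json

def iter_pairs(path):
    """Yield (sample_name, old_tsv_uri, new_tsv_uri) from a pairs.json file."""
    with open(path) as fh:
        pairs = json.load(fh)
    for sample, uris in sorted(pairs.items()):
        yield sample, uris["old_tsv"], uris["new_tsv"]
```
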
### Step 3: Compare Assembly Outputs

Use `compare_sample_pair.py` to compare each sample pair. This script:
- Downloads assembly_stats TSVs from GCS
- Compares metrics (coverage, % reference covered, length, etc.)
- Downloads FASTA assemblies and aligns them with mafft
- Reports SNPs, indels (events and bp), ambiguity diffs, and terminal extensions

```bash
python compare_sample_pair.py \
    --old-tsv <gcs_uri> --new-tsv <gcs_uri> \
    --work-dir ./results/<sample> \
    --output-json ./results/<sample>.json
```

For batch processing, iterate over all entries in `pairs.json` and invoke
`compare_sample_pair.py` for each sample pair (e.g., via a small wrapper
script using `concurrent.futures` or `xargs`/GNU `parallel`).

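One possible shape for that wrapper, using `concurrent.futures` (a sketch; the flags mirror the invocation above, and the worker count is arbitrary):

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_cmd(sample, uris):
    """Build the compare_sample_pair.py command line for one sample."""
    return [
        "python", "compare_sample_pair.py",
        "--old-tsv", uris["old_tsv"],
        "--new-tsv", uris["new_tsv"],
        "--work-dir", f"./results/{sample}",
        "--output-json", f"./results/{sample}.json",
    ]

def run_all(pairs_path, workers=8):
    """Run all comparisons concurrently. Threads suffice because the work is
    subprocess- and network-bound. Returns the samples whose run failed."""
    with open(pairs_path) as fh:
        pairs = json.load(fh)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = pool.map(
            lambda kv: (kv[0], subprocess.run(build_cmd(*kv)).returncode),
            pairs.items(),
        )
        return [sample for sample, code in codes if code != 0]
```
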
### Step 4: Generate Report

Use `generate_report.py` to aggregate all per-sample JSONs into plots and a markdown report.

```bash
python generate_report.py \
    --results-dir ./results/ \
    --report-dir ./report/ \
    --workspace-name <name>
```

Outputs:
- Summary TSV with per-assembly metrics
- 8 plots (scatter plots, histograms, identity distribution)
- Markdown report with summary tables and divergent assembly details

### Step 5: (Optional) VADR Annotation Quality

For assemblies with internal indel differences, run VADR to assess whether the indels
cause frameshifts or other annotation problems. See the `dsub-batch-jobs` skill for
details on running batch jobs via dsub.

Use `run_vadr.sh` with dsub to run VADR on each FASTA:

```bash
dsub --provider google-cls-v2 \
    --project <gcp-project> --regions us-central1 \
    --image staphb/vadr:1.6.4 \
    --machine-type n1-highmem-4 \
    --script run_vadr.sh \
    --tasks vadr_tasks.tsv \
    --logging gs://<bucket>/vadr_logs/
```

VADR model parameters come from the viral-references repo:
https://github.com/broadinstitute/viral-references/blob/main/annotation/vadr/vadr-by-taxid.tsv

Use the taxid from the assembly_id (format: `sample_id-taxid`) to look up the
correct `vadr_opts`, `min_seq_len`, `max_seq_len`, `vadr_model_tar_url`, and
`vadr_model_tar_subdir`.

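The lookup can be sketched like this (assuming the TSV's taxid column is literally named `taxid`; verify against the actual file header):

```python
import csv

FIELDS = ("vadr_opts", "min_seq_len", "max_seq_len",
          "vadr_model_tar_url", "vadr_model_tar_subdir")

def taxid_of(assembly_id):
    """The taxid is the final dash-separated field of 'sample_id-taxid'."""
    return assembly_id.rsplit("-", 1)[1]

def vadr_params(taxid, tsv_path):
    """Find the VADR model parameters for a taxid in vadr-by-taxid.tsv."""
    with open(tsv_path) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if row["taxid"] == str(taxid):
                return {k: row[k] for k in FIELDS}
    raise KeyError(f"no VADR model registered for taxid {taxid}")
```
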
### Step 6: Post Results

Post the report as a PR comment. Before posting:
- **Self-review the proposed comment for confidential information** (sample names,
  internal paths, credentials, etc.). Ask the user if in doubt about what is safe
  to share publicly.
- Include plots as image attachments if the PR is on GitHub
- Attribute the analysis appropriately

## Key Patterns

### Per-Segment Alignment

Multi-segment genomes (e.g., influenza with 8 segments) must be aligned
**per-segment**, not as a single concatenated sequence. Otherwise, terminal
effects at segment boundaries get misclassified as internal indels.

The `compare_sample_pair.py` script handles this automatically: it pairs
segments by FASTA header, aligns each pair independently, analyzes each
alignment separately (so terminal effects stay terminal), and aggregates
the statistics.

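The pairing step can be sketched as follows (a hypothetical helper; the actual implementation in `compare_sample_pair.py` may differ):

```python
def pair_segments(old_seqs, new_seqs):
    """Pair segments by FASTA header.

    Each argument maps header -> sequence. Unmatched headers are returned
    separately so dropped segments are surfaced rather than silently ignored.
    """
    shared = sorted(old_seqs.keys() & new_seqs.keys())
    pairs = [(h, old_seqs[h], new_seqs[h]) for h in shared]
    old_only = sorted(old_seqs.keys() - new_seqs.keys())
    new_only = sorted(new_seqs.keys() - old_seqs.keys())
    return pairs, old_only, new_only
```

Each returned pair is then aligned and analyzed independently, which is what keeps terminal effects terminal.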
### Event Counting vs BP Counting

For indels, both counts matter:
- **BP count**: Total gap positions (e.g., "49 bp of indels")
- **Event count**: Contiguous gap runs (e.g., "13 indel events")

A single 26 bp insertion is 1 event but 26 bp. Event counts better reflect
the number of variant-calling decisions that differ between old and new code.

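Both counts fall out of the same gap-run scan over an aligned pair; a minimal sketch, assuming equal-length mafft output with `-` gap characters:

```python
import itertools

def indel_stats(aln_a, aln_b):
    """Count indel events (maximal gap runs) and gap bp across two aligned
    sequences. A single 26 bp insertion is 1 event and 26 bp."""
    def gap_runs(seq):
        events = bp = 0
        # groupby collapses consecutive gap/non-gap flags into maximal runs
        for is_gap, run in itertools.groupby(ch == "-" for ch in seq):
            n = sum(1 for _ in run)
            if is_gap:
                events += 1
                bp += n
        return events, bp

    ea, ba = gap_runs(aln_a)
    eb, bb = gap_runs(aln_b)
    return {"events": ea + eb, "bp": ba + bb}
```
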
### VADR Frameshift Cascade Detection

A single spurious 1 bp indel in a coding region causes a cascade of VADR alerts:
1. `FRAMESHIFT` -- the indel shifts the reading frame
2. `STOP_CODON` -- premature stop codon in the shifted frame
3. `UNEXPECTED_LENGTH` -- protein length doesn't match the model
4. `PEPTIDE_TRANSLATION_PROBLEM` -- for each downstream mature peptide

When comparing VADR alert counts, a large delta (e.g., 32 -> 1) usually means
one version has frameshift-causing indels that the other avoids. Check the
`.alt.list` files to confirm which genes are affected.

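A quick way to tally alerts per file when chasing such a delta (a sketch; it assumes `#`-prefixed lines in `.alt.list` files are headers, which you should verify against your VADR version's output):

```python
from pathlib import Path

def alert_counts(vadr_dir):
    """Count non-header lines per .alt.list file under vadr_dir."""
    counts = {}
    for p in sorted(Path(vadr_dir).glob("**/*.alt.list")):
        with open(p) as fh:
            counts[p.name] = sum(
                1 for line in fh if line.strip() and not line.startswith("#"))
    return counts
```
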
## Interpreting Results

### What to Look For

1. **Identity distribution**: Most assemblies should be 100% identical. Any
   assembly below 99.9% warrants investigation.
2. **SNP count = 0 for all assemblies**: Pipeline changes that only affect
   indel calling (e.g., swapping variant callers) should produce zero SNPs.
3. **Indel events**: The number and nature of indel differences. Are they in
   coding regions? Do they cause frameshifts?
4. **Coverage correlation**: Low-coverage samples (<10x) are the most likely to
   show differences between variant callers.
5. **VADR alert deltas**: Fewer alerts = more biologically plausible assembly.
   Large improvements (e.g., -31 alerts) strongly favor the new code.

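Checks 1 and 2 are mechanical enough to script over the per-sample JSONs; a sketch (the `identity` and `snps` keys are illustrative, so match them to what `compare_sample_pair.py` actually emits):

```python
def flag_divergent(results, min_identity=99.9):
    """Return {sample: reasons} for assemblies failing the checks above.

    `results` maps sample name -> metrics dict loaded from ./results/*.json.
    """
    flagged = {}
    for sample, m in results.items():
        reasons = []
        if m.get("identity", 100.0) < min_identity:
            reasons.append(f"identity {m['identity']}% below {min_identity}%")
        if m.get("snps", 0) > 0:
            reasons.append(f"{m['snps']} unexpected SNPs")
        if reasons:
            flagged[sample] = reasons
    return flagged
```
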
### Red Flags

- Assemblies present in old but missing in new (or vice versa)
- SNPs introduced where none existed before
- VADR alerts increasing significantly for the new code
- Differences concentrated in specific organisms/taxids