End-to-end regression testing for assembly pipeline changes against Terra submissions.
Use this playbook when a PR makes functional changes to the assembly or variant-calling pipeline (e.g., swapping variant callers, changing alignment parameters, modifying consensus logic). It compares assembly outputs from old vs new code across hundreds of real samples to validate equivalence or improvement.
- gcloud CLI -- authenticated with access to Terra workspace GCS buckets
- mafft -- for pairwise sequence alignment
- Python with pandas and matplotlib (e.g., a dataviz venv)
- dsub -- for running VADR batch jobs on GCP (see the dsub-batch-jobs skill)
The user must manually launch Terra submissions with old and new code:
- Run the pipeline on a representative dataset using the main branch Docker image
- Run the same pipeline on the same dataset using the feature branch Docker image
- Note the submission IDs and workspace bucket for both runs
Use discover_pairs.py to find all comparable old/new sample pairs by crawling
GCS Cromwell output directories.
```
python discover_pairs.py \
  --bucket <workspace-bucket-id> \
  --old-sub <old-submission-id> \
  --new-sub <new-submission-id> \
  --output pairs.json
```

This produces a JSON mapping sample_name -> {old_tsv, new_tsv} for all samples present in both submissions.
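The crawl can be approximated with `gcloud storage ls` plus a filename regex. A sketch only: the `submissions/<submission-id>` prefix and the `<sample>.assembly_stats.tsv` naming convention are assumptions here, so match them to your workflow's actual output layout (discover_pairs.py is the authoritative implementation).

```python
import re
import subprocess

def index_by_sample(uris):
    """Map sample name -> TSV URI, assuming a <sample>.assembly_stats.tsv
    naming convention (an assumption; adjust the regex to your outputs)."""
    tsvs = {}
    for uri in uris:
        m = re.search(r"/([^/]+)\.assembly_stats\.tsv$", uri.strip())
        if m:
            tsvs[m.group(1)] = uri.strip()
    return tsvs

def list_stats_tsvs(bucket, submission_id):
    """Recursively list one submission's outputs via `gcloud storage ls`."""
    uri = f"gs://{bucket}/submissions/{submission_id}/**"
    out = subprocess.run(["gcloud", "storage", "ls", uri],
                         capture_output=True, text=True, check=True).stdout
    return index_by_sample(out.splitlines())

def pair_submissions(bucket, old_sub, new_sub):
    """Intersect the two submissions' sample sets into the pairs.json shape."""
    old = list_stats_tsvs(bucket, old_sub)
    new = list_stats_tsvs(bucket, new_sub)
    return {s: {"old_tsv": old[s], "new_tsv": new[s]}
            for s in old.keys() & new.keys()}
```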
Use compare_sample_pair.py to compare each sample pair. This script:
- Downloads assembly_stats TSVs from GCS
- Compares metrics (coverage, % reference covered, length, etc.)
- Downloads FASTA assemblies and aligns them with mafft
- Reports SNPs, indels (events and bp), ambiguity diffs, and terminal extensions
```
python compare_sample_pair.py \
  --old-tsv <gcs_uri> --new-tsv <gcs_uri> \
  --work-dir ./results/<sample> \
  --output-json ./results/<sample>.json
```

For batch processing, iterate over all entries in pairs.json and invoke compare_sample_pair.py for each sample pair (e.g., via a small wrapper script using concurrent.futures or xargs/GNU parallel).
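A minimal concurrent.futures wrapper along those lines (a sketch: the paths mirror the single-sample invocation above, and the worker count is an arbitrary default):

```python
import concurrent.futures
import json
import subprocess
import sys

def build_cmd(sample, pair):
    """Command line for one sample pair, mirroring the invocation above."""
    return ["python", "compare_sample_pair.py",
            "--old-tsv", pair["old_tsv"],
            "--new-tsv", pair["new_tsv"],
            "--work-dir", f"./results/{sample}",
            "--output-json", f"./results/{sample}.json"]

def run_all(pairs_path, workers=8):
    """Run compare_sample_pair.py for every entry in pairs.json."""
    with open(pairs_path) as fh:
        pairs = json.load(fh)
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(subprocess.run, build_cmd(s, p), check=True): s
                   for s, p in pairs.items()}
        for fut in concurrent.futures.as_completed(futures):
            sample = futures[fut]
            try:
                fut.result()
                print(f"OK   {sample}")
            except subprocess.CalledProcessError:
                print(f"FAIL {sample}", file=sys.stderr)
```

Threads (not processes) are sufficient here because each worker just blocks on a subprocess; invoke as `run_all("pairs.json")`.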
Use generate_report.py to aggregate all per-sample JSONs into plots and a markdown report.
```
python generate_report.py \
  --results-dir ./results/ \
  --report-dir ./report/ \
  --workspace-name <name>
```

Outputs:
- Summary TSV with per-assembly metrics
- 8 plots (scatter plots, histograms, identity distribution)
- Markdown report with summary tables and divergent assembly details
For assemblies with internal indel differences, run VADR to assess whether indels
cause frameshifts or other annotation problems. See the dsub-batch-jobs skill for
details on running batch jobs via dsub.
Use run_vadr.sh with dsub to run VADR on each FASTA:
```
dsub --provider google-cls-v2 \
  --project <gcp-project> --regions us-central1 \
  --image staphb/vadr:1.6.4 \
  --machine-type n1-highmem-4 \
  --script run_vadr.sh \
  --tasks vadr_tasks.tsv \
  --logging gs://<bucket>/vadr_logs/
```

VADR model parameters come from the viral-references repo: https://github.com/broadinstitute/viral-references/blob/main/annotation/vadr/vadr-by-taxid.tsv
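The vadr_tasks.tsv file follows dsub's task-file convention: a header row declaring each column as `--env`, `--input`, or `--output-recursive`, then one row per task. The variable names below (SAMPLE_ID, INPUT_FASTA, OUTPUT_DIR) are hypothetical and must match whatever run_vadr.sh actually reads:

```tsv
--env SAMPLE_ID	--input INPUT_FASTA	--output-recursive OUTPUT_DIR
sampleA-2697049	gs://<bucket>/fastas/sampleA-2697049.fasta	gs://<bucket>/vadr_out/sampleA-2697049/
```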
Use the taxid from the assembly_id (format: sample_id-taxid) to look up the
correct vadr_opts, min_seq_len, max_seq_len, vadr_model_tar_url, and
vadr_model_tar_subdir.
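A sketch of that lookup (assumptions: the taxid is the final `-`-delimited field of the assembly_id, and the TSV keys rows by a `taxid` column; check the file's actual header row before relying on column names):

```python
import csv
import urllib.request

VADR_TSV = ("https://raw.githubusercontent.com/broadinstitute/viral-references/"
            "main/annotation/vadr/vadr-by-taxid.tsv")

def taxid_from_assembly_id(assembly_id):
    """assembly_id has the form <sample_id>-<taxid>; sample_id may itself
    contain hyphens, so split on the last one."""
    return assembly_id.rsplit("-", 1)[1]

def lookup_vadr_params(taxid, tsv_url=VADR_TSV):
    """Return the vadr-by-taxid.tsv row for a taxid, or None if absent.
    Assumes a 'taxid' column; inspect the real header before use."""
    with urllib.request.urlopen(tsv_url) as resp:
        rows = csv.DictReader(resp.read().decode().splitlines(), delimiter="\t")
        for row in rows:
            if row.get("taxid") == taxid:
                return row
    return None
```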
Post the report as a PR comment. Before posting:
- Self-review the proposed comment for confidential information (sample names, internal paths, credentials, etc.). Ask the user if in doubt about what is safe to share publicly.
- Include plots as image attachments if the PR is on GitHub
- Attribute the analysis appropriately
Multi-segment genomes (e.g., influenza with 8 segments) must be aligned per-segment, not as a single concatenated sequence. Otherwise, terminal effects at segment boundaries get misclassified as internal indels.
The compare_sample_pair.py script handles this automatically: it pairs
segments by FASTA header, aligns each pair independently, analyzes each
alignment separately (so terminal effects stay terminal), and aggregates
the statistics.
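The pairing step can be sketched as follows (a simplified illustration of the strategy, not compare_sample_pair.py's actual code; it matches segments on the first whitespace-delimited token of each header):

```python
def read_fasta(text):
    """Parse FASTA text into a {header: sequence} dict (header = first token)."""
    seqs, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif line and name is not None:
            seqs[name].append(line)
    return {h: "".join(parts) for h, parts in seqs.items()}

def pair_segments(old_fasta_text, new_fasta_text):
    """Pair segments by FASTA header so each segment can be aligned
    independently, keeping terminal effects terminal per segment."""
    old, new = read_fasta(old_fasta_text), read_fasta(new_fasta_text)
    shared = [(h, old[h], new[h]) for h in old if h in new]
    missing = set(old) ^ set(new)  # segments present on only one side
    return shared, missing
```

Each `(header, old_seq, new_seq)` tuple is then aligned and analyzed on its own.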
For indels, both counts matter:
- BP count: Total gap positions (e.g., "49 bp of indels")
- Event count: Contiguous gap runs (e.g., "13 indel events")
A single 26bp insertion is 1 event but 26 bp. Event counts better reflect the number of variant-calling decisions that differ between old and new code.
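Both counts fall out of the same gap-run scan over an aligned pair (a sketch assuming mafft's `-` gap character; terminal runs are treated as terminal extensions rather than internal indels, matching the distinction above):

```python
import re

def indel_stats(aln_old, aln_new, ignore_terminal=True):
    """Count indel events (contiguous gap runs) and bp (total gap columns)
    across both rows of a pairwise alignment; '-' marks a gap."""
    events = bp = 0
    for seq in (aln_old, aln_new):
        for run in re.finditer(r"-+", seq):
            if ignore_terminal and (run.start() == 0 or run.end() == len(seq)):
                continue  # leading/trailing gap run = terminal extension
            events += 1
            bp += run.end() - run.start()
    return events, bp
```

For example, a single internal 26 bp insertion in the new assembly appears as one 26-column gap run in the old row: 1 event, 26 bp.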
A single spurious 1bp indel in a coding region causes a cascade of VADR alerts:
- FRAMESHIFT -- the indel shifts the reading frame
- STOP_CODON -- premature stop codon in the shifted frame
- UNEXPECTED_LENGTH -- protein length doesn't match the model
- PEPTIDE_TRANSLATION_PROBLEM -- one for each downstream mature peptide
When comparing VADR alert counts, a large delta (e.g., 32 -> 1) usually means
one version has frameshift-causing indels that the other avoids. Check the
.alt.list files to confirm which genes are affected.
- Identity distribution: Most assemblies should be 100% identical. Any below 99.9% warrant investigation.
- SNP count = 0 for all assemblies: Pipeline changes that only affect indel calling (e.g., swapping variant callers) should produce zero SNPs.
- Indel events: The number and nature of indel differences. Are they in coding regions? Do they cause frameshifts?
- Coverage correlation: Low-coverage samples (<10x) are most likely to show differences between variant callers.
- VADR alert deltas: Fewer alerts = more biologically plausible assembly. Large improvements (e.g., -31 alerts) strongly favor the new code.
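For the identity metric, one common definition is matching non-gap columns over alignment length (a sketch; compare_sample_pair.py may define it differently, e.g. excluding terminal gaps):

```python
def percent_identity(aln_a, aln_b):
    """Percent identity over a pairwise alignment: identical non-gap
    columns divided by total alignment length."""
    assert len(aln_a) == len(aln_b), "aligned sequences must be equal length"
    matches = sum(a == b and a != "-" for a, b in zip(aln_a, aln_b))
    return 100.0 * matches / len(aln_a)
```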
Red flags that warrant investigation before merging:
- Assemblies present in old but missing in new (or vice versa)
- SNPs introduced where none existed before
- VADR alerts increasing significantly for the new code
- Differences concentrated in specific organisms/taxids