# Assembly Regression Testing

End-to-end regression testing for assembly pipeline changes against Terra submissions.

## When to Use

Use this playbook when a PR makes functional changes to the assembly or variant-calling
pipeline (e.g., swapping variant callers, changing alignment parameters, modifying
consensus logic). It compares assembly outputs from old vs new code across hundreds
of real samples to validate equivalence or improvement.

## Prerequisites

- **gcloud CLI** -- authenticated with access to Terra workspace GCS buckets
- **mafft** -- for pairwise sequence alignment
- **Python** with pandas and matplotlib (e.g., a dataviz venv)
- **dsub** -- for running VADR batch jobs on GCP (see the `dsub-batch-jobs` skill)

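A quick preflight check for these tools can save a failed run later; a minimal sketch (tool names as listed above):

```python
import shutil

def check_prereqs(tools=("gcloud", "mafft", "dsub")):
    """Return the subset of required CLI tools missing from PATH."""
    return [t for t in tools if shutil.which(t) is None]
```

Run it before starting and install anything it reports missing.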
## Workflow

### Step 1: Set Up Terra Submissions (Manual)

The user must manually launch Terra submissions with the old and new code:
1. Run the pipeline on a representative dataset using the **main branch** Docker image
2. Run the same pipeline on the same dataset using the **feature branch** Docker image
3. Note the submission IDs and workspace bucket for both runs

### Step 2: Discover Paired Samples

Use `discover_pairs.py` to find all comparable old/new sample pairs by crawling
the GCS Cromwell output directories.

```bash
python discover_pairs.py \
    --bucket <workspace-bucket-id> \
    --old-sub <old-submission-id> \
    --new-sub <new-submission-id> \
    --output pairs.json
```

This produces a JSON mapping `sample_name -> {old_tsv, new_tsv}` for all samples
present in both submissions.

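Downstream steps can consume the mapping like this (a sketch; the `old_tsv`/`new_tsv` keys follow the description above):

```python
import json

def iter_pairs(path):
    """Yield (sample_name, old_tsv_uri, new_tsv_uri) from a pairs.json file."""
    with open(path) as fh:
        pairs = json.load(fh)
    for sample, uris in sorted(pairs.items()):
        yield sample, uris["old_tsv"], uris["new_tsv"]
```
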
### Step 3: Compare Assembly Outputs

Use `compare_sample_pair.py` to compare each sample pair. This script:
- Downloads assembly_stats TSVs from GCS
- Compares metrics (coverage, % reference covered, length, etc.)
- Downloads FASTA assemblies and aligns them with mafft
- Reports SNPs, indels (events and bp), ambiguity diffs, and terminal extensions

```bash
python compare_sample_pair.py \
    --old-tsv <gcs_uri> --new-tsv <gcs_uri> \
    --work-dir ./results/<sample> \
    --output-json ./results/<sample>.json
```

For batch processing, iterate over all entries in `pairs.json` and invoke
`compare_sample_pair.py` for each sample pair (e.g., via a small wrapper
script using `concurrent.futures` or `xargs`/GNU `parallel`).

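One possible shape for that wrapper, using `concurrent.futures` (a sketch; the flags mirror the invocation above, and the worker count is arbitrary):

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_cmd(sample, uris):
    """Build the compare_sample_pair.py command line for one sample."""
    return [
        "python", "compare_sample_pair.py",
        "--old-tsv", uris["old_tsv"],
        "--new-tsv", uris["new_tsv"],
        "--work-dir", f"./results/{sample}",
        "--output-json", f"./results/{sample}.json",
    ]

def run_all(pairs_path, workers=8):
    """Run all comparisons concurrently. Threads suffice because the work is
    subprocess- and network-bound. Returns the samples whose run failed."""
    with open(pairs_path) as fh:
        pairs = json.load(fh)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = pool.map(
            lambda kv: (kv[0], subprocess.run(build_cmd(*kv)).returncode),
            pairs.items(),
        )
        return [sample for sample, code in codes if code != 0]
```
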
### Step 4: Generate Report

Use `generate_report.py` to aggregate all per-sample JSONs into plots and a markdown report.

```bash
python generate_report.py \
    --results-dir ./results/ \
    --report-dir ./report/ \
    --workspace-name <name>
```

Outputs:
- Summary TSV with per-assembly metrics
- 8 plots (scatter plots, histograms, identity distribution)
- Markdown report with summary tables and divergent assembly details

### Step 5: (Optional) VADR Annotation Quality

For assemblies with internal indel differences, run VADR to assess whether the indels
cause frameshifts or other annotation problems. See the `dsub-batch-jobs` skill for
details on running batch jobs via dsub.

Use `run_vadr.sh` with dsub to run VADR on each FASTA:

```bash
dsub --provider google-cls-v2 \
    --project <gcp-project> --regions us-central1 \
    --image staphb/vadr:1.6.4 \
    --machine-type n1-highmem-4 \
    --script run_vadr.sh \
    --tasks vadr_tasks.tsv \
    --logging gs://<bucket>/vadr_logs/
```

VADR model parameters come from the viral-references repo:
https://github.com/broadinstitute/viral-references/blob/main/annotation/vadr/vadr-by-taxid.tsv

Use the taxid from the assembly_id (format: `sample_id-taxid`) to look up the
correct `vadr_opts`, `min_seq_len`, `max_seq_len`, `vadr_model_tar_url`, and
`vadr_model_tar_subdir`.

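The lookup can be sketched like this (assuming the TSV's taxid column is literally named `taxid`; verify against the actual file header):

```python
import csv

FIELDS = ("vadr_opts", "min_seq_len", "max_seq_len",
          "vadr_model_tar_url", "vadr_model_tar_subdir")

def taxid_of(assembly_id):
    """The taxid is the final dash-separated field of 'sample_id-taxid'."""
    return assembly_id.rsplit("-", 1)[1]

def vadr_params(taxid, tsv_path):
    """Find the VADR model parameters for a taxid in vadr-by-taxid.tsv."""
    with open(tsv_path) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if row["taxid"] == str(taxid):
                return {k: row[k] for k in FIELDS}
    raise KeyError(f"no VADR model registered for taxid {taxid}")
```
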
### Step 6: Post Results

Post the report as a PR comment. Before posting:
- **Self-review the proposed comment for confidential information** (sample names,
  internal paths, credentials, etc.). Ask the user if in doubt about what is safe
  to share publicly.
- Include plots as image attachments if the PR is on GitHub
- Attribute the analysis appropriately

## Key Patterns

### Per-Segment Alignment

Multi-segment genomes (e.g., influenza with 8 segments) must be aligned
**per-segment**, not as a single concatenated sequence. Otherwise, terminal
effects at segment boundaries get misclassified as internal indels.

The `compare_sample_pair.py` script handles this automatically: it pairs
segments by FASTA header, aligns each pair independently, analyzes each
alignment separately (so terminal effects stay terminal), and aggregates
the statistics.

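The pairing step can be sketched as follows (a hypothetical helper; the actual implementation in `compare_sample_pair.py` may differ):

```python
def pair_segments(old_seqs, new_seqs):
    """Pair segments by FASTA header.

    Each argument maps header -> sequence. Unmatched headers are returned
    separately so dropped segments are surfaced rather than silently ignored.
    """
    shared = sorted(old_seqs.keys() & new_seqs.keys())
    pairs = [(h, old_seqs[h], new_seqs[h]) for h in shared]
    old_only = sorted(old_seqs.keys() - new_seqs.keys())
    new_only = sorted(new_seqs.keys() - old_seqs.keys())
    return pairs, old_only, new_only
```

Each returned pair is then aligned and analyzed independently, which is what keeps terminal effects terminal.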
### Event Counting vs BP Counting

For indels, both counts matter:
- **BP count**: Total gap positions (e.g., "49 bp of indels")
- **Event count**: Contiguous gap runs (e.g., "13 indel events")

A single 26 bp insertion is 1 event but 26 bp. Event counts better reflect
the number of variant-calling decisions that differ between old and new code.

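Both counts fall out of the same gap-run scan over an aligned pair; a minimal sketch, assuming equal-length mafft output with `-` gap characters:

```python
import itertools

def indel_stats(aln_a, aln_b):
    """Count indel events (maximal gap runs) and gap bp across two aligned
    sequences. A single 26 bp insertion is 1 event and 26 bp."""
    def gap_runs(seq):
        events = bp = 0
        # groupby collapses consecutive gap/non-gap flags into maximal runs
        for is_gap, run in itertools.groupby(ch == "-" for ch in seq):
            n = sum(1 for _ in run)
            if is_gap:
                events += 1
                bp += n
        return events, bp

    ea, ba = gap_runs(aln_a)
    eb, bb = gap_runs(aln_b)
    return {"events": ea + eb, "bp": ba + bb}
```
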
### VADR Frameshift Cascade Detection

A single spurious 1 bp indel in a coding region causes a cascade of VADR alerts:
1. `FRAMESHIFT` -- the indel shifts the reading frame
2. `STOP_CODON` -- premature stop codon in the shifted frame
3. `UNEXPECTED_LENGTH` -- protein length doesn't match the model
4. `PEPTIDE_TRANSLATION_PROBLEM` -- for each downstream mature peptide

When comparing VADR alert counts, a large delta (e.g., 32 -> 1) usually means
one version has frameshift-causing indels that the other avoids. Check the
`.alt.list` files to confirm which genes are affected.

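A quick way to tally alerts per file when chasing such a delta (a sketch; it assumes `#`-prefixed lines in `.alt.list` files are headers, which you should verify against your VADR version's output):

```python
from pathlib import Path

def alert_counts(vadr_dir):
    """Count non-header lines per .alt.list file under vadr_dir."""
    counts = {}
    for p in sorted(Path(vadr_dir).glob("**/*.alt.list")):
        with open(p) as fh:
            counts[p.name] = sum(
                1 for line in fh if line.strip() and not line.startswith("#"))
    return counts
```
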
## Interpreting Results

### What to Look For

1. **Identity distribution**: Most assemblies should be 100% identical. Any
   assembly below 99.9% warrants investigation.
2. **SNP count = 0 for all assemblies**: Pipeline changes that only affect
   indel calling (e.g., swapping variant callers) should produce zero SNPs.
3. **Indel events**: The number and nature of indel differences. Are they in
   coding regions? Do they cause frameshifts?
4. **Coverage correlation**: Low-coverage samples (<10x) are the most likely to
   show differences between variant callers.
5. **VADR alert deltas**: Fewer alerts = more biologically plausible assembly.
   Large improvements (e.g., -31 alerts) strongly favor the new code.

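Checks 1 and 2 are mechanical enough to script over the per-sample JSONs; a sketch (the `identity` and `snps` keys are illustrative, so match them to what `compare_sample_pair.py` actually emits):

```python
def flag_divergent(results, min_identity=99.9):
    """Return {sample: reasons} for assemblies failing the checks above.

    `results` maps sample name -> metrics dict loaded from ./results/*.json.
    """
    flagged = {}
    for sample, m in results.items():
        reasons = []
        if m.get("identity", 100.0) < min_identity:
            reasons.append(f"identity {m['identity']}% below {min_identity}%")
        if m.get("snps", 0) > 0:
            reasons.append(f"{m['snps']} unexpected SNPs")
        if reasons:
            flagged[sample] = reasons
    return flagged
```
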
### Red Flags

- Assemblies present in old but missing in new (or vice versa)
- SNPs introduced where none existed before
- VADR alerts increasing significantly for the new code
- Differences concentrated in specific organisms/taxids