.agents/skills/container-vulns/SKILL.md (new file, 69 additions)

# Container Vulnerability Management

Guidance for scanning, triaging, and mitigating container image vulnerabilities
in the viral-ngs Docker image hierarchy.

## Scanning

Container images are scanned for vulnerabilities using [Trivy](https://aquasecurity.github.io/trivy/):

- **On every PR/push**: `docker.yml` scans each image flavor after build (SARIF -> GitHub Security tab, JSON -> artifact)
- **Weekly schedule**: `container-scan.yml` scans the latest published images
- Scans filter to **CRITICAL/HIGH** severity, **ignore-unfixed**, and apply a Rego policy (`.trivy-ignore-policy.rego`)
- Per-CVE exceptions go in `.trivyignore` with mandatory justification comments
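
A hypothetical `.trivyignore` entry showing the expected justification style (the CVE id and details below are placeholders, not a real exception):

```text
# CVE-2099-0001 (placeholder id): path traversal in commons-compress,
# bundled inside picard.jar. Code path not reachable from our workflows;
# the fixed version conflicts with the ARM64 solver. Revisit at next picard bump.
CVE-2099-0001
```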

## Rego Policy (`.trivy-ignore-policy.rego`)

The Rego policy filters CVEs that are architecturally inapplicable to ephemeral batch containers:

- **AV:P** (Physical access required) -- containers are cloud-hosted
- **AV:A** (Adjacent network required) -- no attacker on same network segment
- **AV:L + UI:R** (Local + user interaction) -- no interactive sessions
- **AV:L + PR:H** (Local + high privileges) -- containers run non-root
- **AV:L + S:U** (Local + scope unchanged) -- attacker already has code execution and impact stays within the ephemeral container

Changes to this policy should be reviewed carefully. The comments in the file explain the rationale and risk for each rule.
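
As a readability aid, the class-level filter can be sketched in Python (the authoritative version is the Rego file; CVSS v3 vector strings are assumed):

```python
def metrics(vector: str) -> dict:
    """Parse a CVSS v3 vector like 'CVSS:3.1/AV:L/AC:L/PR:H/UI:N/S:U/...'."""
    parts = vector.split("/")
    return dict(p.split(":") for p in parts if ":" in p and not p.startswith("CVSS"))

def is_filtered(vector: str) -> bool:
    """Mirror of the Rego rules: True if the CVE class is inapplicable."""
    m = metrics(vector)
    av, ui, pr, s = m.get("AV"), m.get("UI"), m.get("PR"), m.get("S")
    if av in ("P", "A"):        # physical or adjacent-network access required
        return True
    if av == "L" and (ui == "R" or pr == "H" or s == "U"):
        return True             # local vector + UI / high-privilege / scope-unchanged rules
    return False
```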

## Common Vulnerability Sources

**Python transitive deps**: Pin minimum versions in `docker/requirements/*.txt`. Prefer conda packages over pip. Check conda-forge availability before assuming a version exists -- conda-forge often lags PyPI by days/weeks.

**Java fat JARs** (picard, gatk, snpeff, fgbio): Bioinformatics Java tools are distributed as uber JARs with all dependencies bundled inside. Trivy detects vulnerable libraries (log4j, commons-compress, etc.) baked into these JARs. Version bumps can cause ARM64 conda solver conflicts because Java tools pull in openjdk -> harfbuzz -> icu version chains that clash with other packages (r-base, boost-cpp, pyicu). Always check:
1. Whether the tool is actually flagged by Trivy (don't bump versions unnecessarily)
2. Whether the CVE applies (e.g., log4j 1.x is NOT vulnerable to Log4Shell)
3. Whether the desired version resolves on ARM64 before pushing

**Go binaries**: Some conda packages bundle compiled Go binaries (e.g., mafft's `dash_client`, google-cloud-sdk's `gcloud-crc32c`). If the binary is unused, delete it in the Dockerfile. Delete from **both** the installed location and `/opt/conda/pkgs/*/` (conda package cache) -- Trivy scans the full filesystem.
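
A sketch of the deletion pattern in a Dockerfile (paths are illustrative -- locate the real ones with `find /opt/conda -name <binary>` in a built image):

```dockerfile
# Delete in the same RUN layer as the install so the binary never persists
# in any image layer. Remove from both the env and the conda package cache.
RUN conda install -y mafft && \
    rm -f /opt/conda/bin/dash_client /opt/conda/pkgs/mafft-*/bin/dash_client
```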

**Vendored copies**: Packages like google-cloud-sdk and setuptools bundle their own copies of Python libraries that may be older than what's in the conda environment. Trivy flags these vendored copies separately. Options: delete the vendored directory (if not needed at runtime), or accept the risk in `.trivyignore` with justification.

## ARM64 Solver Conflicts

The conda solver on ARM64 (linux-aarch64) is more constrained than amd64 because fewer package builds exist. Common conflict patterns:

- **icu version conflicts**: Many packages (openjdk, r-base, boost-cpp, pyicu) pin specific icu version ranges. Bumping one package can make the entire environment unsolvable.
- **libdeflate/htslib conflicts**: lofreq 2.1.5 pins old htslib/libdeflate versions that conflict with newer pillow/libtiff.
- **openjdk version escalation**: snpeff 5.2+ requires openjdk>=11, 5.3+ requires openjdk>=21. Higher openjdk versions pull in harfbuzz->icu chains that conflict with everything.

When a solver conflict occurs: revert the change, check what version the solver was picking before, and pin to that exact version if it already addresses the CVE.

## Mitigation Decision Process

When triaging a CVE:

1. **Check the CVSS vector** -- does the Rego policy already filter it?
2. **Identify the source package** -- use Trivy JSON output (`PkgName`, `PkgPath`, `InstalledVersion`)
3. **Check if a fix version exists on conda-forge/bioconda** -- not just on PyPI
4. **Test on ARM64** -- solver conflicts are the most common failure mode
5. **If the fix version conflicts**: consider whether the CVE is exploitable in your deployment model. Document the risk assessment in `.trivyignore` or `vulnerability-mitigation-status.md`.
6. **If the vulnerable code is unused**: delete the binary/file inline in the Dockerfile (same RUN layer as install to avoid bloating images)

## Key Files

| File | Purpose |
|------|---------|
| `.trivy-ignore-policy.rego` | Rego policy for class-level CVE filtering |
| `.trivyignore` | Per-CVE exceptions with justifications |
| `.github/workflows/docker.yml` | Build-time scanning (SARIF + JSON) |
| `.github/workflows/container-scan.yml` | Weekly scheduled scanning |
| `vulnerability-mitigation-status.md` | Local-only tracking doc (not committed) |

.agents/skills/dsub-batch-jobs/SKILL.md (new file, 126 additions)

# Running Batch Jobs on GCP via dsub

Use dsub to run one-off compute jobs on Google Cloud when your analysis requires
more compute/memory than the local environment, or needs specific Docker images
that are impractical to run locally.

## When to Use

- Analysis tools need >8GB RAM (e.g., VADR, BLAST, assembly)
- Need to run many independent jobs in parallel (batch processing)
- Need a specific Docker image with pre-installed tools
- Data lives in GCS and is most efficiently processed in-cloud

## Prerequisites

- **dsub** installed (ask the user where their dsub installation or venv is located)
- **gcloud CLI** authenticated with a GCP project that has the relevant dsub backend API enabled (Cloud Life Sciences or Batch, depending on `--provider`)
- **GCS bucket** accessible by the project's default service account

## Generic Invocation

```bash
dsub --provider google-cls-v2 \
--project <gcp-project> \
--regions <region> \
--image <docker-image> \
--machine-type <machine-type> \
--script <script.sh> \
--tasks <tasks.tsv> \
--logging gs://<bucket>/logs/<job-name>/
```

### Key Parameters

| Parameter | Description |
|-----------|-------------|
| `--provider google-cls-v2` | dsub backend (Cloud Life Sciences v2beta; use `--provider google-batch` to target Google Cloud Batch) |
| `--project` | GCP project with the corresponding backend API enabled |
| `--regions` | Compute region (e.g., `us-central1`) |
| `--image` | Docker image to run (e.g., `staphb/vadr:1.6.4`) |
| `--machine-type` | VM type (e.g., `n1-highmem-4` for 26GB RAM) |
| `--script` | Local shell script to execute inside the container |
| `--tasks` | TSV file defining one row per job (batch mode) |
| `--logging` | GCS path for stdout/stderr logs |

## Task TSV Format

The tasks TSV defines inputs, outputs, and environment variables for each job.
Columns are tab-separated; the header row uses column prefixes to declare types:

```
--env VAR1 --env VAR2 --input FASTA --output RESULT --output LOG
value1 value2 gs://bucket/input.fasta gs://bucket/output.txt gs://bucket/log.txt
```

- `--env NAME` -- environment variable passed to the script
- `--input NAME` -- GCS file downloaded to a local path; the env var is set to the local path
- `--output NAME` -- local path; after the script finishes, the file is uploaded to GCS

Each non-header row is one job. All jobs run independently in parallel.
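
A minimal sketch of generating such a file programmatically (column names and GCS paths are illustrative):

```python
import csv

# Hypothetical job list; each entry becomes one row, i.e. one parallel job.
jobs = [
    {"SAMPLE": "s1", "FASTA": "gs://bucket/s1.fasta", "RESULT": "gs://bucket/s1.out"},
    {"SAMPLE": "s2", "FASTA": "gs://bucket/s2.fasta", "RESULT": "gs://bucket/s2.out"},
]

with open("tasks.tsv", "w", newline="") as fh:
    w = csv.writer(fh, delimiter="\t")   # dsub expects tab-separated columns
    w.writerow(["--env SAMPLE", "--input FASTA", "--output RESULT"])
    for j in jobs:
        w.writerow([j["SAMPLE"], j["FASTA"], j["RESULT"]])
```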

## GCP Project and Bucket Scoping

The service account running dsub jobs must have read/write access to all GCS paths
referenced in the tasks TSV. The simplest approach:

1. Use a GCP project whose default service account already has access to your data
2. Use a bucket within that same project for staging intermediate/output files
3. For ephemeral results, use a temp bucket with a lifecycle policy (e.g., 30-day auto-delete)
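
For reference, a 30-day auto-delete lifecycle is a one-rule JSON policy, applied with e.g. `gsutil lifecycle set lifecycle.json gs://<bucket>`:

```json
{
  "rule": [
    {"action": {"type": "Delete"}, "condition": {"age": 30}}
  ]
}
```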

### Broad Viral Genomics Defaults

Most developers on the viral-ngs codebase use:
- **GCP project**: `gcid-viral-seq`
- **Staging bucket**: `gs://viral-temp-30d` (30-day auto-delete lifecycle)

These are not universal -- always confirm with the user before using them.

## Monitoring Jobs

```bash
# Check job status with dstat (dsub submits jobs; dstat queries them)
dstat --provider google-cls-v2 --project <project> --jobs <job-id>

# Include completed/failed tasks, with full details
dstat --provider google-cls-v2 --project <project> --jobs <job-id> --status '*' --full

# View logs
gcloud storage cat gs://<bucket>/logs/<job-name>/<task-id>.log
```

## Tips

- **Batch over single jobs**: Always prefer `--tasks` with a TSV over individual
`dsub` invocations. One TSV row per job is cleaner and easier to track.
- **Machine sizing**: Check your tool's memory requirements. VADR needs ~16GB;
use `n1-highmem-4` (26GB). Most tools work fine with `n1-standard-4` (15GB).
- **Script portability**: Write the `--script` to be self-contained. It receives
inputs/outputs as environment variables. Don't assume any local state.
- **Logging**: Always set `--logging` to a GCS path so you can debug failures.
- **Idempotency**: If re-running, dsub will create new jobs. Check for existing
outputs before re-submitting to avoid redundant computation.
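
One way to sketch that pre-submission check (the `gcloud storage ls` probe is an assumption about your environment; the filtering logic is the point):

```python
import subprocess

def gcs_exists(uri: str) -> bool:
    """True if the GCS object already exists (assumes gcloud is on PATH)."""
    res = subprocess.run(["gcloud", "storage", "ls", uri],
                         capture_output=True, text=True)
    return res.returncode == 0

def pending_tasks(tasks, exists=gcs_exists):
    """Keep only task rows whose output URI is not already present."""
    return [t for t in tasks if not exists(t["output"])]
```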

## Example: VADR Batch Analysis

From the GATK-to-FreeBayes regression testing (PR #1053), we ran VADR on 30 FASTAs
(15 assemblies x old/new) using dsub:

```bash
source ~/venvs/dsub/bin/activate # or wherever dsub is installed

dsub --provider google-cls-v2 \
--project gcid-viral-seq \
--regions us-central1 \
--image staphb/vadr:1.6.4 \
--machine-type n1-highmem-4 \
--script run_vadr.sh \
--tasks vadr_tasks.tsv \
--logging gs://viral-temp-30d/vadr_regression/logs/
```

The tasks TSV had columns for VADR options (`--env VADR_OPTS`), model URL
(`--env MODEL_URL`), input FASTA (`--input FASTA`), and outputs
(`--output NUM_ALERTS`, `--output ALERTS_TSV`, `--output VADR_TGZ`).

All 30 jobs completed in ~15 minutes total (running in parallel on GCP).

.agents/skills/regression-testing/SKILL.md (new file, 167 additions)

# Assembly Regression Testing

End-to-end regression testing for assembly pipeline changes against Terra submissions.

## When to Use

Use this playbook when a PR makes functional changes to the assembly or variant-calling
pipeline (e.g., swapping variant callers, changing alignment parameters, modifying
consensus logic). It compares assembly outputs from old vs new code across hundreds
of real samples to validate equivalence or improvement.

## Prerequisites

- **gcloud CLI** -- authenticated with access to Terra workspace GCS buckets
- **mafft** -- for pairwise sequence alignment
- **Python** with pandas and matplotlib (e.g., a dataviz venv)
- **dsub** -- for running VADR batch jobs on GCP (see the `dsub-batch-jobs` skill)

## Workflow

### Step 1: Set Up Terra Submissions (Manual)

The user must manually launch Terra submissions with old and new code:
1. Run the pipeline on a representative dataset using the **main branch** Docker image
2. Run the same pipeline on the same dataset using the **feature branch** Docker image
3. Note the submission IDs and workspace bucket for both runs

### Step 2: Discover Paired Samples

Use `discover_pairs.py` to find all comparable old/new sample pairs by crawling
GCS Cromwell output directories.

```bash
python discover_pairs.py \
--bucket <workspace-bucket-id> \
--old-sub <old-submission-id> \
--new-sub <new-submission-id> \
--output pairs.json
```

This produces a JSON mapping sample_name -> {old_tsv, new_tsv} for all samples
present in both submissions.
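
The output shape looks roughly like this (sample name and paths illustrative):

```json
{
  "sample_A": {
    "old_tsv": "gs://<bucket>/<old-submission-id>/.../assembly_stats.tsv",
    "new_tsv": "gs://<bucket>/<new-submission-id>/.../assembly_stats.tsv"
  }
}
```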

### Step 3: Compare Assembly Outputs

Use `compare_sample_pair.py` to compare each sample pair. This script:
- Downloads assembly_stats TSVs from GCS
- Compares metrics (coverage, % reference covered, length, etc.)
- Downloads FASTA assemblies and aligns them with mafft
- Reports SNPs, indels (events and bp), ambiguity diffs, and terminal extensions

```bash
python compare_sample_pair.py \
--old-tsv <gcs_uri> --new-tsv <gcs_uri> \
--work-dir ./results/<sample> \
--output-json ./results/<sample>.json
```

For batch processing, iterate over all entries in `pairs.json` and invoke
`compare_sample_pair.py` for each sample pair (e.g., via a small wrapper
script using `concurrent.futures` or `xargs`/GNU `parallel`).
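
One possible shape for that wrapper (a sketch; assumes `pairs.json` from Step 2 and the Step 3 flags):

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_cmd(sample, info):
    """Command line for one sample pair, matching the Step 3 invocation."""
    return ["python", "compare_sample_pair.py",
            "--old-tsv", info["old_tsv"], "--new-tsv", info["new_tsv"],
            "--work-dir", f"./results/{sample}",
            "--output-json", f"./results/{sample}.json"]

def run_all(pairs_path="pairs.json", workers=8):
    with open(pairs_path) as fh:
        pairs = json.load(fh)
    # Threads are fine here: each task just waits on a child process.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        list(ex.map(lambda kv: subprocess.run(build_cmd(*kv), check=True),
                    pairs.items()))
```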

### Step 4: Generate Report

Use `generate_report.py` to aggregate all per-sample JSONs into plots and a markdown report.

```bash
python generate_report.py \
--results-dir ./results/ \
--report-dir ./report/ \
--workspace-name <name>
```

Outputs:
- Summary TSV with per-assembly metrics
- 8 plots (scatter plots, histograms, identity distribution)
- Markdown report with summary tables and divergent assembly details

### Step 5: (Optional) VADR Annotation Quality

For assemblies with internal indel differences, run VADR to assess whether indels
cause frameshifts or other annotation problems. See the `dsub-batch-jobs` skill for
details on running batch jobs via dsub.

Use `run_vadr.sh` with dsub to run VADR on each FASTA:

```bash
dsub --provider google-cls-v2 \
--project <gcp-project> --regions us-central1 \
--image staphb/vadr:1.6.4 \
--machine-type n1-highmem-4 \
--script run_vadr.sh \
--tasks vadr_tasks.tsv \
--logging gs://<bucket>/vadr_logs/
```

VADR model parameters come from the viral-references repo:
https://github.com/broadinstitute/viral-references/blob/main/annotation/vadr/vadr-by-taxid.tsv

Use the taxid from the assembly_id (format: `sample_id-taxid`) to look up the
correct `vadr_opts`, `min_seq_len`, `max_seq_len`, `vadr_model_tar_url`, and
`vadr_model_tar_subdir`.
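
A sketch of that lookup, assuming the TSV's key column is named `taxid` (check the actual header in viral-references):

```python
import csv

def taxid_from_assembly_id(assembly_id: str) -> str:
    """assembly_id has the form 'sample_id-taxid'; sample_id may itself
    contain hyphens, so split from the right."""
    return assembly_id.rsplit("-", 1)[1]

def vadr_params(taxid: str, tsv_path="vadr-by-taxid.tsv") -> dict:
    """Return the row for this taxid from vadr-by-taxid.tsv."""
    with open(tsv_path) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if row.get("taxid") == taxid:
                return row
    raise KeyError(f"no VADR model registered for taxid {taxid}")
```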

### Step 6: Post Results

Post the report as a PR comment. Before posting:
- **Self-review the proposed comment for confidential information** (sample names,
internal paths, credentials, etc.). Ask the user if in doubt about what is safe
to share publicly.
- Include plots as image attachments if the PR is on GitHub
- Attribute the analysis appropriately

## Key Patterns

### Per-Segment Alignment

Multi-segment genomes (e.g., influenza with 8 segments) must be aligned
**per-segment**, not as a single concatenated sequence. Otherwise, terminal
effects at segment boundaries get misclassified as internal indels.

The `compare_sample_pair.py` script handles this automatically: it pairs
segments by FASTA header, aligns each pair independently, analyzes each
alignment separately (so terminal effects stay terminal), and aggregates
the statistics.
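
The pairing step can be sketched as follows (a simplification of what the script does; sequences keyed by FASTA header):

```python
def pair_segments(old_seqs: dict, new_seqs: dict):
    """Pair segments by FASTA header and report headers present on only one side."""
    shared = sorted(old_seqs.keys() & new_seqs.keys())
    pairs = [(h, old_seqs[h], new_seqs[h]) for h in shared]
    unmatched = sorted(old_seqs.keys() ^ new_seqs.keys())
    return pairs, unmatched
```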

### Event Counting vs BP Counting

For indels, both counts matter:
- **BP count**: Total gap positions (e.g., "49 bp of indels")
- **Event count**: Contiguous gap runs (e.g., "13 indel events")

A single 26bp insertion is 1 event but 26 bp. Event counts better reflect
the number of variant-calling decisions that differ between old and new code.
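
Counting both from one gapped, aligned sequence is a short exercise with `itertools.groupby`; a sketch:

```python
import itertools

def indel_counts(aligned: str):
    """Return (events, bp) for one aligned sequence, with gaps as '-':
    events = contiguous gap runs, bp = total gap columns."""
    runs = [len(list(g)) for ch, g in itertools.groupby(aligned) if ch == "-"]
    return len(runs), sum(runs)
```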

### VADR Frameshift Cascade Detection

A single spurious 1bp indel in a coding region causes a cascade of VADR alerts:
1. `FRAMESHIFT` -- the indel shifts the reading frame
2. `STOP_CODON` -- premature stop codon in the shifted frame
3. `UNEXPECTED_LENGTH` -- protein length doesn't match model
4. `PEPTIDE_TRANSLATION_PROBLEM` -- for each downstream mature peptide

When comparing VADR alert counts, a large delta (e.g., 32 -> 1) usually means
one version has frameshift-causing indels that the other avoids. Check the
`.alt.list` files to confirm which genes are affected.

## Interpreting Results

### What to Look For

1. **Identity distribution**: Most assemblies should be 100% identical. Any
below 99.9% warrant investigation.
2. **SNP count = 0 for all assemblies**: Pipeline changes that only affect
indel calling (e.g., swapping variant callers) should produce zero SNPs.
3. **Indel events**: The number and nature of indel differences. Are they in
coding regions? Do they cause frameshifts?
4. **Coverage correlation**: Low-coverage samples (<10x) are most likely to
show differences between variant callers.
5. **VADR alert deltas**: Fewer alerts = more biologically plausible assembly.
Large improvements (e.g., -31 alerts) strongly favor the new code.

### Red Flags

- Assemblies present in old but missing in new (or vice versa)
- SNPs introduced where none existed before
- VADR alerts increasing significantly for the new code
- Differences concentrated in specific organisms/taxids