Commit a7d8b51

Merge pull request #1054 from broadinstitute/agent-skills-playbooks
Add modular agent skills framework
2 parents 3c1e6d6 + 8e200cd

11 files changed: +1579 -61 lines

Lines changed: 69 additions & 0 deletions
# Container Vulnerability Management

Guidance for scanning, triaging, and mitigating container image vulnerabilities
in the viral-ngs Docker image hierarchy.

## Scanning

Container images are scanned for vulnerabilities using [Trivy](https://aquasecurity.github.io/trivy/):

- **On every PR/push**: `docker.yml` scans each image flavor after build (SARIF -> GitHub Security tab, JSON -> artifact)
- **Weekly schedule**: `container-scan.yml` scans the latest published images
- Scans filter to **CRITICAL/HIGH** severity, use **--ignore-unfixed**, and apply a Rego policy (`.trivy-ignore-policy.rego`)
- Per-CVE exceptions go in `.trivyignore` with mandatory justification comments

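
Entries in `.trivyignore` pair a CVE ID with a comment explaining why the risk is accepted; the ID and rationale below are a made-up illustration, not taken from the real file:

```
# CVE-2099-0001: flagged in a vendored copy of libfoo inside
# google-cloud-sdk; code path is not reachable at runtime
# (hypothetical example entry)
CVE-2099-0001
```
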
## Rego Policy (`.trivy-ignore-policy.rego`)

The Rego policy filters CVEs that are architecturally inapplicable to ephemeral batch containers:

- **AV:P** (physical access required) -- containers are cloud-hosted
- **AV:A** (adjacent network required) -- no attacker on the same network segment
- **AV:L + UI:R** (local + user interaction) -- no interactive sessions
- **AV:L + PR:H** (local + high privileges) -- containers run non-root
- **AV:L + S:U** (local + scope unchanged) -- the attacker already has code execution and the impact stays within the ephemeral container

Changes to this policy should be reviewed carefully. The comments in the file explain the rationale and risk for each rule.

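
As a toy restatement (the real filtering happens in Rego), the class rules above amount to a predicate over the CVSS v3 vector string:

```shell
# Succeeds when a CVSS v3 vector falls into one of the filtered
# classes above; a toy shell restatement, not the actual policy.
is_filtered() {
  case "$1" in
    *AV:P*|*AV:A*) return 0 ;;  # physical or adjacent-network access required
  esac
  case "$1" in
    *AV:L*)
      case "$1" in
        *UI:R*|*PR:H*|*S:U*) return 0 ;;  # local + an unmet precondition
      esac ;;
  esac
  return 1
}
```
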
## Common Vulnerability Sources

**Python transitive deps**: Pin minimum versions in `docker/requirements/*.txt`. Prefer conda packages over pip. Check conda-forge availability before assuming a version exists -- conda-forge often lags PyPI by days or weeks.

**Java fat JARs** (picard, gatk, snpeff, fgbio): Bioinformatics Java tools are distributed as uber JARs with all dependencies bundled inside. Trivy detects vulnerable libraries (log4j, commons-compress, etc.) baked into these JARs. Version bumps can cause ARM64 conda solver conflicts because Java tools pull in openjdk -> harfbuzz -> icu version chains that clash with other packages (r-base, boost-cpp, pyicu). Always check:

1. Whether the tool is actually flagged by Trivy (don't bump versions unnecessarily)
2. Whether the CVE applies (e.g., log4j 1.x is NOT vulnerable to Log4Shell)
3. Whether the desired version resolves on ARM64 before pushing

**Go binaries**: Some conda packages bundle compiled Go binaries (e.g., mafft's `dash_client`, google-cloud-sdk's `gcloud-crc32c`). If the binary is unused, delete it in the Dockerfile. Delete from **both** the installed location and `/opt/conda/pkgs/*/` (the conda package cache) -- Trivy scans the full filesystem.

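
A deletion step in that spirit can be factored as a tiny helper (the prefix and binary name are hypothetical; the real Dockerfile does this inline in a `RUN`):

```shell
# Remove an unused bundled binary from both the install location and
# the conda package cache, since Trivy scans the full filesystem.
prune_unused_binary() {
  prefix=$1; name=$2
  rm -f "$prefix/bin/$name"
  rm -f "$prefix"/pkgs/*/bin/"$name"
}
```

In the image itself this would run as e.g. `prune_unused_binary /opt/conda dash_client` inside the same `RUN` layer as the install.
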
**Vendored copies**: Packages like google-cloud-sdk and setuptools bundle their own copies of Python libraries that may be older than what's in the conda environment. Trivy flags these vendored copies separately. Options: delete the vendored directory (if it is not needed at runtime), or accept the risk in `.trivyignore` with a justification.

## ARM64 Solver Conflicts

The conda solver on ARM64 (linux-aarch64) is more constrained than on amd64 because fewer package builds exist. Common conflict patterns:

- **icu version conflicts**: Many packages (openjdk, r-base, boost-cpp, pyicu) pin specific icu version ranges. Bumping one package can make the entire environment unsolvable.
- **libdeflate/htslib conflicts**: lofreq 2.1.5 pins old htslib/libdeflate versions that conflict with newer pillow/libtiff.
- **openjdk version escalation**: snpeff 5.2+ requires openjdk>=11, and 5.3+ requires openjdk>=21. Higher openjdk versions pull in harfbuzz -> icu chains that conflict with much of the rest of the environment.

When a solver conflict occurs: revert the change, check which version the solver was picking before, and pin to that exact version if it already addresses the CVE.

## Mitigation Decision Process

When triaging a CVE:

1. **Check the CVSS vector** -- does the Rego policy already filter it?
2. **Identify the source package** -- use the Trivy JSON output (`PkgName`, `PkgPath`, `InstalledVersion`)
3. **Check whether a fix version exists on conda-forge/bioconda** -- not just on PyPI
4. **Test on ARM64** -- solver conflicts are the most common failure mode
5. **If the fix version conflicts**: consider whether the CVE is exploitable in your deployment model. Document the risk assessment in `.trivyignore` or `vulnerability-mitigation-status.md`.
6. **If the vulnerable code is unused**: delete the binary/file inline in the Dockerfile, in the same `RUN` layer as the install -- deleting in a later layer leaves the file in the earlier layer and bloats the image.

## Key Files

| File | Purpose |
|------|---------|
| `.trivy-ignore-policy.rego` | Rego policy for class-level CVE filtering |
| `.trivyignore` | Per-CVE exceptions with justifications |
| `.github/workflows/docker.yml` | Build-time scanning (SARIF + JSON) |
| `.github/workflows/container-scan.yml` | Weekly scheduled scanning |
| `vulnerability-mitigation-status.md` | Local-only tracking doc (not committed) |

Lines changed: 126 additions & 0 deletions
# Running Batch Jobs on GCP via dsub

Use dsub to run one-off compute jobs on Google Cloud when your analysis requires
more compute/memory than the local environment, or needs specific Docker images
that are impractical to run locally.

## When to Use

- Analysis tools need >8GB RAM (e.g., VADR, BLAST, assembly)
- Need to run many independent jobs in parallel (batch processing)
- Need a specific Docker image with pre-installed tools
- Data lives in GCS and is most efficiently processed in-cloud

## Prerequisites

- **dsub** installed (ask the user where their dsub installation or venv is located)
- **gcloud CLI** authenticated with a GCP project that has the relevant batch API enabled
- **GCS bucket** accessible by the project's default service account

## Generic Invocation

```bash
dsub --provider google-cls-v2 \
  --project <gcp-project> \
  --regions <region> \
  --image <docker-image> \
  --machine-type <machine-type> \
  --script <script.sh> \
  --tasks <tasks.tsv> \
  --logging gs://<bucket>/logs/<job-name>/
```

### Key Parameters

| Parameter | Description |
|-----------|-------------|
| `--provider google-cls-v2` | Backend API; `google-cls-v2` uses Cloud Life Sciences v2beta (dsub's `google-batch` provider targets Google Cloud Batch) |
| `--project` | GCP project with the relevant API enabled |
| `--regions` | Compute region (e.g., `us-central1`) |
| `--image` | Docker image to run (e.g., `staphb/vadr:1.6.4`) |
| `--machine-type` | VM type (e.g., `n1-highmem-4` for 26GB RAM) |
| `--script` | Local shell script to execute inside the container |
| `--tasks` | TSV file defining one row per job (batch mode) |
| `--logging` | GCS path for stdout/stderr logs |

## Task TSV Format

The tasks TSV defines the inputs, outputs, and environment variables for each job.
The header row uses column prefixes to declare the type of each (tab-separated) column:

```
--env VAR1  --env VAR2  --input FASTA            --output RESULT         --output LOG
value1      value2      gs://bucket/input.fasta  gs://bucket/output.txt  gs://bucket/log.txt
```

- `--env NAME` -- environment variable passed to the script
- `--input NAME` -- GCS file downloaded to a local path; the env var is set to that local path
- `--output NAME` -- local path; after the script finishes, the file there is uploaded to GCS

Each non-header row is one job. All jobs run independently in parallel.

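
For instance, a two-job tasks file could be generated like this (bucket paths are placeholders):

```shell
# Write a header row plus two task rows; columns are tab-separated.
printf '%s\t%s\t%s\n' \
  '--env SAMPLE' '--input FASTA' '--output RESULT' \
  'sampleA' 'gs://my-bucket/a.fasta' 'gs://my-bucket/results/a.txt' \
  'sampleB' 'gs://my-bucket/b.fasta' 'gs://my-bucket/results/b.txt' \
  > tasks.tsv
```
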
## GCP Project and Bucket Scoping

The service account running dsub jobs must have read/write access to all GCS paths
referenced in the tasks TSV. The simplest approach:

1. Use a GCP project whose default service account already has access to your data
2. Use a bucket within that same project for staging intermediate/output files
3. For ephemeral results, use a temp bucket with a lifecycle policy (e.g., 30-day auto-delete)

### Broad Viral Genomics Defaults

Most developers on the viral-ngs codebase use:

- **GCP project**: `gcid-viral-seq`
- **Staging bucket**: `gs://viral-temp-30d` (30-day auto-delete lifecycle)

These are not universal -- always confirm with the user before using them.

## Monitoring Jobs

Job status is queried with `dstat` (installed alongside dsub):

```bash
# Check job status
dstat --provider google-cls-v2 --project <project> --jobs <job-id>

# Include completed/failed jobs and full details
dstat --provider google-cls-v2 --project <project> --jobs <job-id> --status '*' --full

# View logs
gcloud storage cat gs://<bucket>/logs/<job-name>/<task-id>.log
```

## Tips

- **Batch over single jobs**: Prefer `--tasks` with a TSV over individual
  `dsub` invocations. One TSV row per job is cleaner and easier to track.
- **Machine sizing**: Check your tool's memory requirements. VADR needs ~16GB;
  use `n1-highmem-4` (26GB). Most tools work fine with `n1-standard-4` (15GB).
- **Script portability**: Write the `--script` to be self-contained. It receives
  inputs/outputs as environment variables. Don't assume any local state.
- **Logging**: Always set `--logging` to a GCS path so you can debug failures.
- **Idempotency**: Re-running submits new jobs; dsub does not deduplicate. Check for
  existing outputs before re-submitting to avoid redundant computation.

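
The script-portability rule can be illustrated with a minimal `--script` body, written here as a function so it is easy to exercise locally; the sequence-counting task and the `FASTA`/`RESULT` names are just an example wiring (dsub exports whatever names the tasks TSV declares):

```shell
# Hypothetical --script body: read only dsub-provided env vars,
# write only to the declared output path, assume no other local state.
count_seqs() {
  # FASTA:  local path of the downloaded --input file
  # RESULT: local path whose contents dsub uploads to the --output URI
  grep -c '^>' "$FASTA" > "$RESULT"
}
```
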
## Example: VADR Batch Analysis

From the GATK-to-FreeBayes regression testing (PR #1053), we ran VADR on 30 FASTAs
(15 assemblies x old/new) using dsub:

```bash
source ~/venvs/dsub/bin/activate  # or wherever dsub is installed

dsub --provider google-cls-v2 \
  --project gcid-viral-seq \
  --regions us-central1 \
  --image staphb/vadr:1.6.4 \
  --machine-type n1-highmem-4 \
  --script run_vadr.sh \
  --tasks vadr_tasks.tsv \
  --logging gs://viral-temp-30d/vadr_regression/logs/
```

The tasks TSV had columns for the VADR options (`--env VADR_OPTS`), the model URL
(`--env MODEL_URL`), the input FASTA (`--input FASTA`), and the outputs
(`--output NUM_ALERTS`, `--output ALERTS_TSV`, `--output VADR_TGZ`).

All 30 jobs completed in ~15 minutes total (running in parallel on GCP).

Lines changed: 167 additions & 0 deletions
# Assembly Regression Testing

End-to-end regression testing for assembly pipeline changes against Terra submissions.

## When to Use

Use this playbook when a PR makes functional changes to the assembly or variant-calling
pipeline (e.g., swapping variant callers, changing alignment parameters, modifying
consensus logic). It compares assembly outputs from old vs. new code across hundreds
of real samples to validate equivalence or improvement.

## Prerequisites

- **gcloud CLI** -- authenticated with access to Terra workspace GCS buckets
- **mafft** -- for pairwise sequence alignment
- **Python** with pandas and matplotlib (e.g., a dataviz venv)
- **dsub** -- for running VADR batch jobs on GCP (see the `dsub-batch-jobs` skill)

## Workflow

### Step 1: Set Up Terra Submissions (Manual)

The user must manually launch Terra submissions with old and new code:

1. Run the pipeline on a representative dataset using the **main branch** Docker image
2. Run the same pipeline on the same dataset using the **feature branch** Docker image
3. Note the submission IDs and workspace bucket for both runs

### Step 2: Discover Paired Samples

Use `discover_pairs.py` to find all comparable old/new sample pairs by crawling
the GCS Cromwell output directories.

```bash
python discover_pairs.py \
  --bucket <workspace-bucket-id> \
  --old-sub <old-submission-id> \
  --new-sub <new-submission-id> \
  --output pairs.json
```

This produces a JSON mapping of sample_name -> {old_tsv, new_tsv} for all samples
present in both submissions.

### Step 3: Compare Assembly Outputs

Use `compare_sample_pair.py` to compare each sample pair. This script:

- Downloads assembly_stats TSVs from GCS
- Compares metrics (coverage, % reference covered, length, etc.)
- Downloads FASTA assemblies and aligns them with mafft
- Reports SNPs, indels (events and bp), ambiguity diffs, and terminal extensions

```bash
python compare_sample_pair.py \
  --old-tsv <gcs_uri> --new-tsv <gcs_uri> \
  --work-dir ./results/<sample> \
  --output-json ./results/<sample>.json
```

For batch processing, iterate over all entries in `pairs.json` and invoke
`compare_sample_pair.py` for each sample pair (e.g., via a small wrapper
script using `concurrent.futures` or `xargs`/GNU `parallel`).

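
One way to sketch that fan-out, assuming the `pairs.json` layout above (the `parallel` invocation in the trailing comment is illustrative):

```shell
# Emit one compare_sample_pair.py command per sample in pairs.json,
# suitable for piping into GNU parallel or xargs -P.
make_compare_jobs() {
  python3 - "$1" <<'EOF'
import json, shlex, sys

pairs = json.load(open(sys.argv[1]))
for name, p in sorted(pairs.items()):
    print("python compare_sample_pair.py"
          f" --old-tsv {shlex.quote(p['old_tsv'])}"
          f" --new-tsv {shlex.quote(p['new_tsv'])}"
          f" --work-dir ./results/{name}"
          f" --output-json ./results/{name}.json")
EOF
}
# e.g.: make_compare_jobs pairs.json | parallel -j 8
```
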
### Step 4: Generate Report

Use `generate_report.py` to aggregate all per-sample JSONs into plots and a markdown report.

```bash
python generate_report.py \
  --results-dir ./results/ \
  --report-dir ./report/ \
  --workspace-name <name>
```

Outputs:

- A summary TSV with per-assembly metrics
- 8 plots (scatter plots, histograms, identity distribution)
- A markdown report with summary tables and divergent-assembly details

### Step 5: (Optional) VADR Annotation Quality

For assemblies with internal indel differences, run VADR to assess whether the indels
cause frameshifts or other annotation problems. See the `dsub-batch-jobs` skill for
details on running batch jobs via dsub.

Use `run_vadr.sh` with dsub to run VADR on each FASTA:

```bash
dsub --provider google-cls-v2 \
  --project <gcp-project> --regions us-central1 \
  --image staphb/vadr:1.6.4 \
  --machine-type n1-highmem-4 \
  --script run_vadr.sh \
  --tasks vadr_tasks.tsv \
  --logging gs://<bucket>/vadr_logs/
```

VADR model parameters come from the viral-references repo:
https://github.com/broadinstitute/viral-references/blob/main/annotation/vadr/vadr-by-taxid.tsv

Use the taxid from the assembly_id (format: `sample_id-taxid`) to look up the
correct `vadr_opts`, `min_seq_len`, `max_seq_len`, `vadr_model_tar_url`, and
`vadr_model_tar_subdir`.

### Step 6: Post Results

Post the report as a PR comment. Before posting:

- **Self-review the proposed comment for confidential information** (sample names,
  internal paths, credentials, etc.). Ask the user if in doubt about what is safe
  to share publicly.
- Include plots as image attachments if the PR is on GitHub
- Attribute the analysis appropriately

## Key Patterns

### Per-Segment Alignment

Multi-segment genomes (e.g., influenza with 8 segments) must be aligned
**per-segment**, not as a single concatenated sequence. Otherwise, terminal
effects at segment boundaries get misclassified as internal indels.

The `compare_sample_pair.py` script handles this automatically: it pairs
segments by FASTA header, aligns each pair independently, analyzes each
alignment separately (so terminal effects stay terminal), and aggregates
the statistics.

### Event Counting vs BP Counting

For indels, both counts matter:

- **BP count**: total gap positions (e.g., "49 bp of indels")
- **Event count**: contiguous gap runs (e.g., "13 indel events")

A single 26 bp insertion is 1 event but 26 bp. Event counts better reflect
the number of variant-calling decisions that differ between the old and new code.

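
As a toy illustration of the two metrics over a single gapped sequence (plain shell; not the comparison script's actual implementation):

```shell
# bp count: total '-' characters; event count: contiguous '-' runs.
indel_bp()     { printf '%s' "$1" | tr -cd '-' | wc -c | tr -d ' '; }
indel_events() { printf '%s' "$1" | grep -oE '[-]+' | wc -l | tr -d ' '; }
```

For `ACGT--A-CC` these report 3 bp across 2 events.
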
### VADR Frameshift Cascade Detection

A single spurious 1 bp indel in a coding region causes a cascade of VADR alerts:

1. `FRAMESHIFT` -- the indel shifts the reading frame
2. `STOP_CODON` -- a premature stop codon appears in the shifted frame
3. `UNEXPECTED_LENGTH` -- the protein length doesn't match the model
4. `PEPTIDE_TRANSLATION_PROBLEM` -- one alert for each downstream mature peptide

When comparing VADR alert counts, a large delta (e.g., 32 -> 1) usually means
one version has frameshift-causing indels that the other avoids. Check the
`.alt.list` files to confirm which genes are affected.

## Interpreting Results

### What to Look For

1. **Identity distribution**: Most assemblies should be 100% identical. Any
   below 99.9% warrant investigation.
2. **SNP count = 0 for all assemblies**: Pipeline changes that only affect
   indel calling (e.g., swapping variant callers) should produce zero SNPs.
3. **Indel events**: The number and nature of indel differences. Are they in
   coding regions? Do they cause frameshifts?
4. **Coverage correlation**: Low-coverage samples (<10x) are the most likely to
   show differences between variant callers.
5. **VADR alert deltas**: Fewer alerts means a more biologically plausible assembly.
   Large improvements (e.g., -31 alerts) strongly favor the new code.

### Red Flags

- Assemblies present in old but missing in new (or vice versa)
- SNPs introduced where none existed before
- VADR alerts increasing significantly for the new code
- Differences concentrated in specific organisms/taxids

