diff --git a/.agents/skills/container-vulns/SKILL.md b/.agents/skills/container-vulns/SKILL.md new file mode 100644 index 000000000..8664b9807 --- /dev/null +++ b/.agents/skills/container-vulns/SKILL.md @@ -0,0 +1,69 @@ +# Container Vulnerability Management + +Guidance for scanning, triaging, and mitigating container image vulnerabilities +in the viral-ngs Docker image hierarchy. + +## Scanning + +Container images are scanned for vulnerabilities using [Trivy](https://aquasecurity.github.io/trivy/): + +- **On every PR/push**: `docker.yml` scans each image flavor after build (SARIF -> GitHub Security tab, JSON -> artifact) +- **Weekly schedule**: `container-scan.yml` scans the latest published images +- Scans filter to **CRITICAL/HIGH** severity, **ignore-unfixed**, and apply a Rego policy (`.trivy-ignore-policy.rego`) +- Per-CVE exceptions go in `.trivyignore` with mandatory justification comments + +## Rego Policy (`.trivy-ignore-policy.rego`) + +The Rego policy filters CVEs that are architecturally inapplicable to ephemeral batch containers: + +- **AV:P** (Physical access required) -- containers are cloud-hosted +- **AV:A** (Adjacent network required) -- no attacker on same network segment +- **AV:L + UI:R** (Local + user interaction) -- no interactive sessions +- **AV:L + PR:H** (Local + high privileges) -- containers run non-root +- **AV:L + S:U** (Local + scope unchanged) -- attacker already has code execution and impact stays within the ephemeral container + +Changes to this policy should be reviewed carefully. The comments in the file explain the rationale and risk for each rule. + +## Common Vulnerability Sources + +**Python transitive deps**: Pin minimum versions in `docker/requirements/*.txt`. Prefer conda packages over pip. Check conda-forge availability before assuming a version exists -- conda-forge often lags PyPI by days/weeks. 
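The class-level rules in the Rego policy section above can be expressed as a predicate over CVSS v3 vector strings. A minimal Python sketch of the same logic (this mirrors, but is not, the actual `.trivy-ignore-policy.rego`; the function name and parsing approach are illustrative only):

```python
def architecturally_inapplicable(vector: str) -> bool:
    """Sketch of the class-level ignore rules, keyed off a CVSS v3 vector string."""
    # "CVSS:3.1/AV:L/AC:L/PR:H/UI:N/S:U/..." -> {"AV": "L", "AC": "L", "PR": "H", ...}
    fields = dict(
        part.split(":", 1)
        for part in vector.split("/")
        if ":" in part and not part.startswith("CVSS")
    )
    if fields.get("AV") in ("P", "A"):
        # Physical or adjacent-network access: not reachable in cloud-hosted containers
        return True
    if fields.get("AV") == "L":
        # Local access variants: no interactive sessions, non-root, ephemeral scope
        return fields.get("UI") == "R" or fields.get("PR") == "H" or fields.get("S") == "U"
    return False
```

This is only a readability aid for triage discussions; the enforced logic lives in the Rego file.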
+ +**Java fat JARs** (picard, gatk, snpeff, fgbio): Bioinformatics Java tools are distributed as uber JARs with all dependencies bundled inside. Trivy detects vulnerable libraries (log4j, commons-compress, etc.) baked into these JARs. Version bumps can cause ARM64 conda solver conflicts because Java tools pull in openjdk -> harfbuzz -> icu version chains that clash with other packages (r-base, boost-cpp, pyicu). Always check: +1. Whether the tool is actually flagged by Trivy (don't bump versions unnecessarily) +2. Whether the CVE applies (e.g., log4j 1.x is NOT vulnerable to Log4Shell) +3. Whether the desired version resolves on ARM64 before pushing + +**Go binaries**: Some conda packages bundle compiled Go binaries (e.g., mafft's `dash_client`, google-cloud-sdk's `gcloud-crc32c`). If the binary is unused, delete it in the Dockerfile. Delete from **both** the installed location and `/opt/conda/pkgs/*/` (conda package cache) -- Trivy scans the full filesystem. + +**Vendored copies**: Packages like google-cloud-sdk and setuptools bundle their own copies of Python libraries that may be older than what's in the conda environment. Trivy flags these vendored copies separately. Options: delete the vendored directory (if not needed at runtime), or accept the risk in `.trivyignore` with justification. + +## ARM64 Solver Conflicts + +The conda solver on ARM64 (linux-aarch64) is more constrained than amd64 because fewer package builds exist. Common conflict patterns: + +- **icu version conflicts**: Many packages (openjdk, r-base, boost-cpp, pyicu) pin specific icu version ranges. Bumping one package can make the entire environment unsolvable. +- **libdeflate/htslib conflicts**: lofreq 2.1.5 pins old htslib/libdeflate versions that conflict with newer pillow/libtiff. +- **openjdk version escalation**: snpeff 5.2+ requires openjdk>=11, 5.3+ requires openjdk>=21. Higher openjdk versions pull in harfbuzz->icu chains that conflict with everything. 
+ +When a solver conflict occurs: revert the change, check what version the solver was picking before, and pin to that exact version if it already addresses the CVE. + +## Mitigation Decision Process + +When triaging a CVE: + +1. **Check the CVSS vector** -- does the Rego policy already filter it? +2. **Identify the source package** -- use Trivy JSON output (`PkgName`, `PkgPath`, `InstalledVersion`) +3. **Check if a fix version exists on conda-forge/bioconda** -- not just on PyPI +4. **Test on ARM64** -- solver conflicts are the most common failure mode +5. **If the fix version conflicts**: consider whether the CVE is exploitable in your deployment model. Document the risk assessment in `.trivyignore` or `vulnerability-mitigation-status.md`. +6. **If the vulnerable code is unused**: delete the binary/file inline in the Dockerfile (same RUN layer as install to avoid bloating images) + +## Key Files + +| File | Purpose | +|------|---------| +| `.trivy-ignore-policy.rego` | Rego policy for class-level CVE filtering | +| `.trivyignore` | Per-CVE exceptions with justifications | +| `.github/workflows/docker.yml` | Build-time scanning (SARIF + JSON) | +| `.github/workflows/container-scan.yml` | Weekly scheduled scanning | +| `vulnerability-mitigation-status.md` | Local-only tracking doc (not committed) | diff --git a/.agents/skills/dsub-batch-jobs/SKILL.md b/.agents/skills/dsub-batch-jobs/SKILL.md new file mode 100644 index 000000000..211bd45bf --- /dev/null +++ b/.agents/skills/dsub-batch-jobs/SKILL.md @@ -0,0 +1,126 @@ +# Running Batch Jobs on GCP via dsub + +Use dsub to run one-off compute jobs on Google Cloud when your analysis requires +more compute/memory than the local environment, or needs specific Docker images +that are impractical to run locally. 
## When to Use

- Analysis tools need >8GB RAM (e.g., VADR, BLAST, assembly)
- Need to run many independent jobs in parallel (batch processing)
- Need a specific Docker image with pre-installed tools
- Data lives in GCS and is most efficiently processed in-cloud

## Prerequisites

- **dsub** installed (ask the user where their dsub installation or venv is located)
- **gcloud CLI** authenticated with a GCP project that has the provider's API enabled (Cloud Life Sciences for `google-cls-v2`, Batch for `google-batch`)
- **GCS bucket** accessible by the project's default service account

## Generic Invocation

```bash
dsub --provider google-cls-v2 \
  --project PROJECT_ID \
  --regions REGION \
  --image IMAGE \
  --machine-type MACHINE_TYPE \
  --script SCRIPT.sh \
  --tasks TASKS.tsv \
  --logging gs://BUCKET/logs/JOB_NAME/
```

### Key Parameters

| Parameter | Description |
|-----------|-------------|
| `--provider google-cls-v2` | Cloud Life Sciences v2 backend (use `google-batch` for the newer Batch API) |
| `--project` | GCP project with the provider's API enabled |
| `--regions` | Compute region (e.g., `us-central1`) |
| `--image` | Docker image to run (e.g., `staphb/vadr:1.6.4`) |
| `--machine-type` | VM type (e.g., `n1-highmem-4` for 26GB RAM) |
| `--script` | Local shell script to execute inside the container |
| `--tasks` | TSV file defining one row per job (batch mode) |
| `--logging` | GCS path for stdout/stderr logs |

## Task TSV Format

The tasks TSV defines inputs, outputs, and environment variables for each job.
Header row uses column prefixes to declare types:

```
--env VAR1	--env VAR2	--input FASTA	--output RESULT	--output LOG
value1	value2	gs://bucket/input.fasta	gs://bucket/output.txt	gs://bucket/log.txt
```

- `--env NAME` -- environment variable passed to the script
- `--input NAME` -- GCS file downloaded to a local path; the env var is set to the local path
- `--output NAME` -- local path; after the script finishes, the file is uploaded to GCS

Each non-header row is one job. All jobs run independently in parallel.
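Because the header cells themselves contain spaces (`--env VAR1`), generating the TSV programmatically avoids delimiter mistakes. A minimal sketch (the column names are invented for illustration):

```python
import csv
import io

def write_tasks_tsv(rows, out):
    """Write a dsub --tasks TSV: the header declares column types, one row per job."""
    w = csv.writer(out, delimiter="\t", lineterminator="\n")
    w.writerow(["--env SAMPLE", "--input FASTA", "--output RESULT"])
    for sample, fasta_uri, result_uri in rows:
        w.writerow([sample, fasta_uri, result_uri])

buf = io.StringIO()
write_tasks_tsv(
    [("sample1", "gs://bucket/sample1.fasta", "gs://bucket/sample1.out")],
    buf,
)
```

Using `csv.writer` with a tab delimiter keeps GCS URIs and spaced header cells intact without manual quoting.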
## GCP Project and Bucket Scoping

The service account running dsub jobs must have read/write access to all GCS paths
referenced in the tasks TSV. The simplest approach:

1. Use a GCP project whose default service account already has access to your data
2. Use a bucket within that same project for staging intermediate/output files
3. For ephemeral results, use a temp bucket with a lifecycle policy (e.g., 30-day auto-delete)

### Broad Viral Genomics Defaults

Most developers on the viral-ngs codebase use:
- **GCP project**: `gcid-viral-seq`
- **Staging bucket**: `gs://viral-temp-30d` (30-day auto-delete lifecycle)

These are not universal -- always confirm with the user before using them.

## Monitoring Jobs

```bash
# Check job status with dstat (dsub's status companion tool)
dstat --provider google-cls-v2 --project PROJECT_ID --jobs JOB_ID

# Include completed/failed jobs, not just running ones
dstat --provider google-cls-v2 --project PROJECT_ID --jobs JOB_ID --status '*'

# View logs
gcloud storage cat gs://BUCKET/logs/JOB_NAME/JOB_ID.log
```

## Tips

- **Batch over single jobs**: Always prefer `--tasks` with a TSV over individual
  `dsub` invocations. One TSV row per job is cleaner and easier to track.
- **Machine sizing**: Check your tool's memory requirements. VADR needs ~16GB;
  use `n1-highmem-4` (26GB). Most tools work fine with `n1-standard-4` (15GB).
- **Script portability**: Write the `--script` to be self-contained. It receives
  inputs/outputs as environment variables. Don't assume any local state.
- **Logging**: Always set `--logging` to a GCS path so you can debug failures.
- **Idempotency**: If re-running, dsub will create new jobs. Check for existing
  outputs before re-submitting to avoid redundant computation.
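The idempotency tip can be enforced with a small pre-submission filter that drops task rows whose outputs already exist. A sketch with the existence check injected as a callable (in practice it would wrap `gcloud storage ls` or a GCS client call; the `--output RESULT` column name is hypothetical):

```python
def pending_tasks(task_rows, output_key, exists):
    """Return only task rows whose output object does not exist yet."""
    return [row for row in task_rows if not exists(row[output_key])]

# Example with a stubbed existence check in place of a real GCS lookup:
already_done = {"gs://bucket/a.out"}
tasks = [
    {"--output RESULT": "gs://bucket/a.out"},
    {"--output RESULT": "gs://bucket/b.out"},
]
todo = pending_tasks(tasks, "--output RESULT", lambda uri: uri in already_done)
```

Writing the filtered rows back out as a fresh tasks TSV makes resubmission a no-op for completed work.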
+ +## Example: VADR Batch Analysis + +From the GATK-to-FreeBayes regression testing (PR #1053), we ran VADR on 30 FASTAs +(15 assemblies x old/new) using dsub: + +```bash +source ~/venvs/dsub/bin/activate # or wherever dsub is installed + +dsub --provider google-cls-v2 \ + --project gcid-viral-seq \ + --regions us-central1 \ + --image staphb/vadr:1.6.4 \ + --machine-type n1-highmem-4 \ + --script run_vadr.sh \ + --tasks vadr_tasks.tsv \ + --logging gs://viral-temp-30d/vadr_regression/logs/ +``` + +The tasks TSV had columns for VADR options (`--env VADR_OPTS`), model URL +(`--env MODEL_URL`), input FASTA (`--input FASTA`), and outputs +(`--output NUM_ALERTS`, `--output ALERTS_TSV`, `--output VADR_TGZ`). + +All 30 jobs completed in ~15 minutes total (running in parallel on GCP). diff --git a/.agents/skills/regression-testing/SKILL.md b/.agents/skills/regression-testing/SKILL.md new file mode 100644 index 000000000..0a11161d2 --- /dev/null +++ b/.agents/skills/regression-testing/SKILL.md @@ -0,0 +1,167 @@ +# Assembly Regression Testing + +End-to-end regression testing for assembly pipeline changes against Terra submissions. + +## When to Use + +Use this playbook when a PR makes functional changes to the assembly or variant-calling +pipeline (e.g., swapping variant callers, changing alignment parameters, modifying +consensus logic). It compares assembly outputs from old vs new code across hundreds +of real samples to validate equivalence or improvement. + +## Prerequisites + +- **gcloud CLI** -- authenticated with access to Terra workspace GCS buckets +- **mafft** -- for pairwise sequence alignment +- **Python** with pandas and matplotlib (e.g., a dataviz venv) +- **dsub** -- for running VADR batch jobs on GCP (see the `dsub-batch-jobs` skill) + +## Workflow + +### Step 1: Set Up Terra Submissions (Manual) + +The user must manually launch Terra submissions with old and new code: +1. Run the pipeline on a representative dataset using the **main branch** Docker image +2. 
Run the same pipeline on the same dataset using the **feature branch** Docker image
3. Note the submission IDs and workspace bucket for both runs

### Step 2: Discover Paired Samples

Use `discover_pairs.py` to find all comparable old/new sample pairs by crawling
GCS Cromwell output directories.

```bash
python discover_pairs.py \
  --bucket WORKSPACE_BUCKET \
  --old-sub OLD_SUBMISSION_ID \
  --new-sub NEW_SUBMISSION_ID \
  --output pairs.json
```

This produces a JSON mapping sample_name -> {old_tsv, new_tsv} for all samples
present in both submissions.

### Step 3: Compare Assembly Outputs

Use `compare_sample_pair.py` to compare each sample pair. This script:
- Downloads assembly_stats TSVs from GCS
- Compares metrics (coverage, % reference covered, length, etc.)
- Downloads FASTA assemblies and aligns them with mafft
- Reports SNPs, indels (events and bp), ambiguity diffs, and terminal extensions

```bash
python compare_sample_pair.py \
  --old-tsv OLD_TSV_URI --new-tsv NEW_TSV_URI \
  --work-dir ./results/ \
  --output-json ./results/SAMPLE.json
```

For batch processing, iterate over all entries in `pairs.json` and invoke
`compare_sample_pair.py` for each sample pair (e.g., via a small wrapper
script using `concurrent.futures` or `xargs`/GNU `parallel`).

### Step 4: Generate Report

Use `generate_report.py` to aggregate all per-sample JSONs into plots and a markdown report.

```bash
python generate_report.py \
  --results-dir ./results/ \
  --report-dir ./report/ \
  --workspace-name WORKSPACE_NAME
```

Outputs:
- Summary TSV with per-assembly metrics
- 8 plots (scatter plots, histograms, identity distribution)
- Markdown report with summary tables and divergent assembly details

### Step 5: (Optional) VADR Annotation Quality

For assemblies with internal indel differences, run VADR to assess whether indels
cause frameshifts or other annotation problems. See the `dsub-batch-jobs` skill for
details on running batch jobs via dsub.
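Since VADR model parameters are keyed by taxid and assembly IDs embed it as a suffix (the `sample_id-taxid` format used by these pipelines), splitting the ID is a common first step. A minimal sketch, assuming the taxid is always the final dash-separated field:

```python
def split_assembly_id(assembly_id: str) -> tuple:
    """Split 'sample_id-taxid' into (sample_id, taxid).

    Sample IDs may themselves contain dashes, so split on the LAST dash.
    """
    sample_id, _, taxid = assembly_id.rpartition("-")
    return sample_id, taxid
```

The taxid can then be used as the lookup key into `vadr-by-taxid.tsv`.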
Use `run_vadr.sh` with dsub to run VADR on each FASTA:

```bash
dsub --provider google-cls-v2 \
  --project PROJECT_ID --regions us-central1 \
  --image staphb/vadr:1.6.4 \
  --machine-type n1-highmem-4 \
  --script run_vadr.sh \
  --tasks vadr_tasks.tsv \
  --logging gs://BUCKET/vadr_logs/
```

VADR model parameters come from the viral-references repo:
https://github.com/broadinstitute/viral-references/blob/main/annotation/vadr/vadr-by-taxid.tsv

Use the taxid from the assembly_id (format: `sample_id-taxid`) to look up the
correct `vadr_opts`, `min_seq_len`, `max_seq_len`, `vadr_model_tar_url`, and
`vadr_model_tar_subdir`.

### Step 6: Post Results

Post the report as a PR comment. Before posting:
- **Self-review the proposed comment for confidential information** (sample names,
  internal paths, credentials, etc.). Ask the user if in doubt about what is safe
  to share publicly.
- Include plots as image attachments if the PR is on GitHub
- Attribute the analysis appropriately

## Key Patterns

### Per-Segment Alignment

Multi-segment genomes (e.g., influenza with 8 segments) must be aligned
**per-segment**, not as a single concatenated sequence. Otherwise, terminal
effects at segment boundaries get misclassified as internal indels.

The `compare_sample_pair.py` script handles this automatically: it pairs
segments by FASTA header, aligns each pair independently, analyzes each
alignment separately (so terminal effects stay terminal), and aggregates
the statistics.

### Event Counting vs BP Counting

For indels, both counts matter:
- **BP count**: Total gap positions (e.g., "49 bp of indels")
- **Event count**: Contiguous gap runs (e.g., "13 indel events")

A single 26bp insertion is 1 event but 26 bp. Event counts better reflect
the number of variant-calling decisions that differ between old and new code.

### VADR Frameshift Cascade Detection

A single spurious 1bp indel in a coding region causes a cascade of VADR alerts:
1.
`FRAMESHIFT` -- the indel shifts the reading frame +2. `STOP_CODON` -- premature stop codon in the shifted frame +3. `UNEXPECTED_LENGTH` -- protein length doesn't match model +4. `PEPTIDE_TRANSLATION_PROBLEM` -- for each downstream mature peptide + +When comparing VADR alert counts, a large delta (e.g., 32 -> 1) usually means +one version has frameshift-causing indels that the other avoids. Check the +`.alt.list` files to confirm which genes are affected. + +## Interpreting Results + +### What to Look For + +1. **Identity distribution**: Most assemblies should be 100% identical. Any + below 99.9% warrant investigation. +2. **SNP count = 0 for all assemblies**: Pipeline changes that only affect + indel calling (e.g., swapping variant callers) should produce zero SNPs. +3. **Indel events**: The number and nature of indel differences. Are they in + coding regions? Do they cause frameshifts? +4. **Coverage correlation**: Low-coverage samples (<10x) are most likely to + show differences between variant callers. +5. **VADR alert deltas**: Fewer alerts = more biologically plausible assembly. + Large improvements (e.g., -31 alerts) strongly favor the new code. + +### Red Flags + +- Assemblies present in old but missing in new (or vice versa) +- SNPs introduced where none existed before +- VADR alerts increasing significantly for the new code +- Differences concentrated in specific organisms/taxids diff --git a/.agents/skills/regression-testing/compare_sample_pair.py b/.agents/skills/regression-testing/compare_sample_pair.py new file mode 100644 index 000000000..5b5000072 --- /dev/null +++ b/.agents/skills/regression-testing/compare_sample_pair.py @@ -0,0 +1,514 @@ +#!/usr/bin/env python3 +"""Compare assembly outputs between old and new code for a single sample pair. + +Takes two GCS URIs pointing at assembly_metadata TSV files (old and new), +downloads them, compares metrics, aligns FASTAs with mafft, and outputs a JSON result. 
+""" +import argparse +import csv +import io +import json +import logging +import os +import subprocess + +logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s') +log = logging.getLogger(__name__) + +METRICS_COLS = [ + 'assembly_length', 'assembly_length_unambiguous', 'reads_aligned', + 'mean_coverage', 'percent_reference_covered', 'reference_length', + 'scaffolding_num_segments_recovered', 'reference_num_segments_required', +] +FLOAT_COLS = {'mean_coverage', 'percent_reference_covered'} +INT_COLS = {'assembly_length', 'assembly_length_unambiguous', 'reads_aligned', + 'reference_length', 'scaffolding_num_segments_recovered', + 'reference_num_segments_required'} + + +def gcloud_cat(gcs_uri): + """Read a GCS file's contents as a string.""" + result = subprocess.run( + ['gcloud', 'storage', 'cat', gcs_uri], + capture_output=True, text=True, timeout=120 + ) + if result.returncode != 0: + raise RuntimeError(f"gcloud storage cat failed for {gcs_uri}: {result.stderr.strip()}") + return result.stdout + + +def gcloud_cp(gcs_uri, local_path): + """Download a GCS file to local path.""" + result = subprocess.run( + ['gcloud', 'storage', 'cp', gcs_uri, local_path], + capture_output=True, text=True, timeout=120 + ) + if result.returncode != 0: + raise RuntimeError(f"gcloud storage cp failed for {gcs_uri}: {result.stderr.strip()}") + + +def parse_tsv(content): + """Parse assembly_metadata TSV content into a dict keyed by assembly_id. 
+ + Returns dict: assembly_id -> {col: value, ...} + """ + reader = csv.DictReader(io.StringIO(content), delimiter='\t') + rows = {} + for row in reader: + # The first column is 'entity:assembly_id' + assembly_id = row.get('entity:assembly_id', '').strip() + if not assembly_id: + continue + # Parse numeric columns + parsed = dict(row) + for col in FLOAT_COLS: + if col in parsed and parsed[col]: + try: + parsed[col] = float(parsed[col]) + except (ValueError, TypeError): + parsed[col] = None + for col in INT_COLS: + if col in parsed and parsed[col]: + try: + parsed[col] = int(parsed[col]) + except (ValueError, TypeError): + parsed[col] = None + rows[assembly_id] = parsed + return rows + + +def parse_fasta(content): + """Parse a FASTA string into list of (header, sequence) tuples.""" + sequences = [] + current_header = None + current_seq = [] + for line in content.strip().split('\n'): + line = line.strip() + if not line: + continue + if line.startswith('>'): + if current_header is not None: + sequences.append((current_header, ''.join(current_seq))) + current_header = line[1:].strip() + current_seq = [] + else: + current_seq.append(line) + if current_header is not None: + sequences.append((current_header, ''.join(current_seq))) + return sequences + + +def run_mafft_pair(old_seq, new_seq, work_dir, pair_id='0'): + """Run mafft on a single pair of sequences. 
Returns aligned (old_seq, new_seq) strings.""" + combined = os.path.join(work_dir, f'combined_{pair_id}.fasta') + with open(combined, 'w') as f: + f.write(f'>old_{pair_id}\n{old_seq}\n>new_{pair_id}\n{new_seq}\n') + + try: + result = subprocess.run( + ['mafft', '--auto', '--preservecase', '--quiet', '--thread', '1', combined], + capture_output=True, text=True, timeout=300 + ) + finally: + os.unlink(combined) + if result.returncode != 0: + raise RuntimeError(f"mafft failed: {result.stderr.strip()}") + + seqs = parse_fasta(result.stdout) + if len(seqs) != 2: + raise RuntimeError(f"Expected 2 sequences from mafft, got {len(seqs)}") + return seqs[0][1], seqs[1][1] + + +def align_and_analyze_fastas(old_fasta_path, new_fasta_path, work_dir): + """Align old vs new FASTAs, handling multi-segment genomes. + + For single-segment: runs one mafft alignment and analyzes it. + For multi-segment: pairs segments by header, aligns each pair separately, + analyzes each independently (so terminal effects stay terminal), and + aggregates the stats. + + Returns (aligned_fasta_path, stats_dict). 
+ """ + with open(old_fasta_path) as f: + old_fasta_text = f.read() + with open(new_fasta_path) as f: + new_fasta_text = f.read() + old_seqs = parse_fasta(old_fasta_text) + new_seqs = parse_fasta(new_fasta_text) + + if len(old_seqs) == 1 and len(new_seqs) == 1: + # Simple case: single segment + aln_old, aln_new = run_mafft_pair(old_seqs[0][1], new_seqs[0][1], work_dir) + aligned = os.path.join(work_dir, 'aligned.fasta') + with open(aligned, 'w') as f: + f.write(f'>old_{old_seqs[0][0]}\n{aln_old}\n>new_{new_seqs[0][0]}\n{aln_new}\n') + stats = analyze_alignment_seqs(aln_old, aln_new) + return aligned, stats + + # Multi-segment: pair by header name + old_dict = {h: s for h, s in old_seqs} + new_dict = {h: s for h, s in new_seqs} + common_headers = [h for h in old_dict if h in new_dict] + + if not common_headers: + if len(old_seqs) == len(new_seqs): + log.info(f" Multi-segment: headers don't match, pairing by position ({len(old_seqs)} segments)") + common_headers = None + else: + raise RuntimeError( + f"Cannot pair segments: {len(old_seqs)} old vs {len(new_seqs)} new, no matching headers") + + # Align each pair and analyze independently + segment_pairs = [] + if common_headers is not None: + for i, h in enumerate(common_headers): + aln_old, aln_new = run_mafft_pair(old_dict[h], new_dict[h], work_dir, pair_id=str(i)) + segment_pairs.append((h, aln_old, aln_new)) + else: + for i in range(len(old_seqs)): + aln_old, aln_new = run_mafft_pair(old_seqs[i][1], new_seqs[i][1], work_dir, pair_id=str(i)) + segment_pairs.append((old_seqs[i][0], aln_old, aln_new)) + + n_segments = len(segment_pairs) + log.info(f" Multi-segment alignment: {n_segments} segments aligned") + + # Analyze each segment independently, then aggregate + agg = { + 'alignment_length': 0, 'internal_length': 0, + 'matches': 0, 'snps': 0, + 'internal_insertions': 0, 'internal_deletions': 0, + 'internal_insertion_events': 0, 'internal_deletion_events': 0, + 'ambiguity_diffs': 0, + 'terminal_old_left': 0, 
'terminal_new_left': 0, + 'terminal_old_right': 0, 'terminal_new_right': 0, + 'terminal_extensions_old': 0, 'terminal_extensions_new': 0, + 'terminal_extension_events_old': 0, 'terminal_extension_events_new': 0, + } + per_segment = [] + for header, aln_old, aln_new in segment_pairs: + seg_stats = analyze_alignment_seqs(aln_old, aln_new) + per_segment.append({'segment': header, **seg_stats}) + for key in agg: + agg[key] += seg_stats.get(key, 0) + + # Compute aggregate identity + total_bases_compared = agg['matches'] + agg['snps'] + agg['ambiguity_diffs'] + agg['identity'] = agg['matches'] / total_bases_compared if total_bases_compared > 0 else 1.0 + agg['n_segments'] = n_segments + agg['per_segment'] = per_segment + + # Write per-segment alignment file (for review) + aligned = os.path.join(work_dir, 'aligned.fasta') + with open(aligned, 'w') as f: + for header, aln_old, aln_new in segment_pairs: + f.write(f'>old_{header}\n{aln_old}\n>new_{header}\n{aln_new}\n') + + return aligned, agg + + +def analyze_alignment_seqs(old_seq_str, new_seq_str): + """Analyze a pairwise alignment given two aligned sequence strings. + + Returns dict with alignment statistics. 
+ """ + old_seq = old_seq_str.upper() + new_seq = new_seq_str.upper() + aln_len = len(old_seq) + + if len(new_seq) != aln_len: + raise RuntimeError(f"Alignment sequences differ in length: {aln_len} vs {len(new_seq)}") + + ACGT = set('ACGT') + + # Find the internal region (where both sequences have bases) + left_bound = 0 + while left_bound < aln_len and (old_seq[left_bound] == '-' or new_seq[left_bound] == '-'): + left_bound += 1 + + right_bound = aln_len - 1 + while right_bound >= 0 and (old_seq[right_bound] == '-' or new_seq[right_bound] == '-'): + right_bound -= 1 + + # Terminal extension counts (bp and events) + terminal_old_left = 0 + terminal_new_left = 0 + terminal_old_right = 0 + terminal_new_right = 0 + # Track events: a contiguous run of gaps on one side = 1 event + terminal_old_left_events = 0 + terminal_new_left_events = 0 + terminal_old_right_events = 0 + terminal_new_right_events = 0 + + prev_old_gap = False + prev_new_gap = False + for i in range(left_bound): + if old_seq[i] != '-' and new_seq[i] == '-': + terminal_old_left += 1 + if not prev_old_gap: + terminal_old_left_events += 1 + prev_old_gap = True + prev_new_gap = False + elif new_seq[i] != '-' and old_seq[i] == '-': + terminal_new_left += 1 + if not prev_new_gap: + terminal_new_left_events += 1 + prev_new_gap = True + prev_old_gap = False + else: + prev_old_gap = False + prev_new_gap = False + + prev_old_gap = False + prev_new_gap = False + for i in range(right_bound + 1, aln_len): + if old_seq[i] != '-' and new_seq[i] == '-': + terminal_old_right += 1 + if not prev_old_gap: + terminal_old_right_events += 1 + prev_old_gap = True + prev_new_gap = False + elif new_seq[i] != '-' and old_seq[i] == '-': + terminal_new_right += 1 + if not prev_new_gap: + terminal_new_right_events += 1 + prev_new_gap = True + prev_old_gap = False + else: + prev_old_gap = False + prev_new_gap = False + + # Analyze internal region (bp counts and event counts) + matches = 0 + snps = 0 + internal_insertions = 0 # bp 
count + internal_deletions = 0 # bp count + internal_insertion_events = 0 + internal_deletion_events = 0 + ambiguity_diffs = 0 + in_insertion = False + in_deletion = False + + for i in range(left_bound, right_bound + 1): + o = old_seq[i] + n = new_seq[i] + + if o == '-' and n != '-': + internal_insertions += 1 + if not in_insertion: + internal_insertion_events += 1 + in_insertion = True + in_deletion = False + elif o != '-' and n == '-': + internal_deletions += 1 + if not in_deletion: + internal_deletion_events += 1 + in_deletion = True + in_insertion = False + else: + in_insertion = False + in_deletion = False + if o == n: + if o != '-': + matches += 1 + elif o in ACGT and n in ACGT: + snps += 1 + else: + ambiguity_diffs += 1 + + total_internal = right_bound - left_bound + 1 if right_bound >= left_bound else 0 + total_bases_compared = matches + snps + ambiguity_diffs + identity = matches / total_bases_compared if total_bases_compared > 0 else 1.0 + + return { + 'alignment_length': aln_len, + 'internal_length': total_internal, + 'matches': matches, + 'snps': snps, + 'internal_insertions': internal_insertions, + 'internal_deletions': internal_deletions, + 'internal_insertion_events': internal_insertion_events, + 'internal_deletion_events': internal_deletion_events, + 'ambiguity_diffs': ambiguity_diffs, + 'terminal_old_left': terminal_old_left, + 'terminal_new_left': terminal_new_left, + 'terminal_old_right': terminal_old_right, + 'terminal_new_right': terminal_new_right, + 'terminal_extensions_old': terminal_old_left + terminal_old_right, + 'terminal_extensions_new': terminal_new_left + terminal_new_right, + 'terminal_extension_events_old': terminal_old_left_events + terminal_old_right_events, + 'terminal_extension_events_new': terminal_new_left_events + terminal_new_right_events, + 'identity': identity, + } + + +def analyze_alignment(aligned_fasta_path): + """Analyze pairwise alignment from mafft output file. 
Thin wrapper around analyze_alignment_seqs.""" + with open(aligned_fasta_path) as f: + seqs = parse_fasta(f.read()) + if len(seqs) != 2: + raise RuntimeError(f"Expected 2 sequences in alignment, got {len(seqs)}") + return analyze_alignment_seqs(seqs[0][1], seqs[1][1]) + + +def compare_assembly(assembly_id, old_row, new_row, work_dir): + """Compare one assembly (old vs new). Downloads FASTAs, runs mafft, returns result dict.""" + result = { + 'assembly_id': assembly_id, + 'taxid': old_row.get('taxid', ''), + 'tax_name': old_row.get('tax_name', ''), + 'old_metrics': {}, + 'new_metrics': {}, + 'deltas': {}, + 'alignment': None, + 'error': None, + } + + # Extract metrics + for col in METRICS_COLS: + old_val = old_row.get(col) + new_val = new_row.get(col) + result['old_metrics'][col] = old_val + result['new_metrics'][col] = new_val + if old_val is not None and new_val is not None: + try: + result['deltas'][col] = float(new_val) - float(old_val) + except (ValueError, TypeError): + result['deltas'][col] = None + + # Get FASTA paths + old_fasta_uri = old_row.get('assembly_fasta', '').strip() + new_fasta_uri = new_row.get('assembly_fasta', '').strip() + + if not old_fasta_uri or not new_fasta_uri: + log.info(f" {assembly_id}: skipping alignment (missing FASTA URI)") + return result + + # Download FASTAs + aln_dir = os.path.join(work_dir, f'aln_{assembly_id}') + os.makedirs(aln_dir, exist_ok=True) + + old_fasta_local = os.path.join(aln_dir, 'old.fasta') + new_fasta_local = os.path.join(aln_dir, 'new.fasta') + + try: + log.debug(f" Downloading old FASTA: {old_fasta_uri}") + gcloud_cp(old_fasta_uri, old_fasta_local) + log.debug(f" Downloading new FASTA: {new_fasta_uri}") + gcloud_cp(new_fasta_uri, new_fasta_local) + + # Check if FASTAs are non-empty + if os.path.getsize(old_fasta_local) == 0 or os.path.getsize(new_fasta_local) == 0: + log.info(f" {assembly_id}: skipping alignment (empty FASTA)") + return result + + # Align and analyze (handles multi-segment genomes 
independently) + log.debug(f" Running alignment for {assembly_id}") + aligned_path, alignment_stats = align_and_analyze_fastas(old_fasta_local, new_fasta_local, aln_dir) + result['alignment'] = alignment_stats + + identity = alignment_stats['identity'] + snps = alignment_stats['snps'] + indels = alignment_stats['internal_insertions'] + alignment_stats['internal_deletions'] + log.info(f" {assembly_id}: identity={identity:.6f}, snps={snps}, indels={indels}, " + f"terminal_ext_old={alignment_stats['terminal_extensions_old']}, " + f"terminal_ext_new={alignment_stats['terminal_extensions_new']}") + + except Exception as e: + log.error(f" {assembly_id}: alignment failed: {e}") + result['error'] = str(e) + finally: + # Cleanup downloaded files + for f in [old_fasta_local, new_fasta_local]: + if os.path.exists(f): + os.unlink(f) + # Keep aligned file only if identity < 99.9% + aligned_file = os.path.join(aln_dir, 'aligned.fasta') + if os.path.exists(aligned_file): + if result['alignment'] and result['alignment']['identity'] >= 0.999: + os.unlink(aligned_file) + else: + log.info(f" Keeping alignment file for review: {aligned_file}") + # Remove empty dir + try: + os.rmdir(aln_dir) + except OSError: + pass # dir not empty (kept alignment file) + + return result + + +def main(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument('--old-tsv', required=True, help='GCS URI of old assembly_metadata TSV') + parser.add_argument('--new-tsv', required=True, help='GCS URI of new assembly_metadata TSV') + parser.add_argument('--work-dir', required=True, help='Working directory for temp files') + parser.add_argument('--output-json', required=True, help='Output JSON file path') + parser.add_argument('--verbose', '-v', action='store_true', help='Enable debug logging') + args = parser.parse_args() + + if args.verbose: + logging.getLogger().setLevel(logging.DEBUG) + + os.makedirs(args.work_dir, exist_ok=True) + + # Download and parse TSVs + log.info(f"Downloading 
old TSV: {args.old_tsv}") + old_content = gcloud_cat(args.old_tsv) + old_rows = parse_tsv(old_content) + + log.info(f"Downloading new TSV: {args.new_tsv}") + new_content = gcloud_cat(args.new_tsv) + new_rows = parse_tsv(new_content) + + # Extract sample_id from the first row of whichever TSV has data + sample_id = 'unknown' + if old_rows: + first_row = next(iter(old_rows.values())) + sample_id = first_row.get('sample_id', 'unknown') + elif new_rows: + first_row = next(iter(new_rows.values())) + sample_id = first_row.get('sample_id', 'unknown') + + log.info(f"Sample: {sample_id}") + log.info(f"Old assemblies: {len(old_rows)}, New assemblies: {len(new_rows)}") + + # Find intersecting assembly_ids + common_ids = sorted(set(old_rows.keys()) & set(new_rows.keys())) + old_only_ids = sorted(set(old_rows.keys()) - set(new_rows.keys())) + new_only_ids = sorted(set(new_rows.keys()) - set(old_rows.keys())) + + if old_only_ids: + log.info(f" Assemblies only in old: {old_only_ids}") + if new_only_ids: + log.info(f" Assemblies only in new: {new_only_ids}") + log.info(f" Assemblies in common: {len(common_ids)}") + + # Compare each intersecting assembly + comparisons = [] + for aid in common_ids: + comp = compare_assembly(aid, old_rows[aid], new_rows[aid], args.work_dir) + comparisons.append(comp) + + # Build output + output = { + 'sample_id': sample_id, + 'old_tsv_uri': args.old_tsv, + 'new_tsv_uri': args.new_tsv, + 'old_assembly_count': len(old_rows), + 'new_assembly_count': len(new_rows), + 'assembly_count_match': len(old_rows) == len(new_rows), + 'assemblies_only_in_old': old_only_ids, + 'assemblies_only_in_new': new_only_ids, + 'comparisons': comparisons, + } + + with open(args.output_json, 'w') as f: + json.dump(output, f, indent=2) + + log.info(f"Wrote results to {args.output_json}") + + +if __name__ == '__main__': + main() diff --git a/.agents/skills/regression-testing/discover_pairs.py b/.agents/skills/regression-testing/discover_pairs.py new file mode 100644 index 
000000000..5fa9c82c9 --- /dev/null +++ b/.agents/skills/regression-testing/discover_pairs.py @@ -0,0 +1,162 @@ +#!/usr/bin/env python3 +"""Discover comparable old/new sample pairs by crawling GCS Cromwell output directories. + +For each submission, finds assembly_metadata TSV files named +``assembly_metadata-<sample_name>.tsv``, extracts sample names from filenames, +and outputs the intersection as a JSON mapping. + +Usage: + python discover_pairs.py \ + --bucket fc-XXXXXXXX-... \ + --old-sub <old-submission-id> \ + --new-sub <new-submission-id> \ + -o pairs.json +""" +import argparse +import json +import logging +import re +import subprocess + +logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s') +log = logging.getLogger(__name__) + + +def gcloud_ls(path): + """List GCS path, return list of URIs. Returns [] on error.""" + try: + result = subprocess.run( + ['gcloud', 'storage', 'ls', path], + capture_output=True, text=True, timeout=60 + ) + if result.returncode != 0: + log.warning(f"gcloud ls non-zero exit for {path}: {result.stderr.strip()}") + return [] + return [line.strip() for line in result.stdout.strip().split('\n') if line.strip()] + except Exception as e: + log.warning(f"gcloud ls failed for {path}: {e}") + return [] + + +def find_tsv_in_call_dir(call_dir_uri): + """Find assembly_metadata TSV in a call directory, handling attempt-N subdirs. + + Returns (sample_name, tsv_uri) or (None, None). 
+ """ + items = gcloud_ls(call_dir_uri) + + # Check for attempt-N subdirectories + def attempt_sort_key(path): + match = re.search(r'/attempt-(\d+)', path) + return int(match.group(1)) if match else 0 + attempt_dirs = sorted([i for i in items if '/attempt-' in i], + key=attempt_sort_key, reverse=True) + tsv_files = [i for i in items if i.endswith('.tsv')] + + # If there are attempt dirs, check the highest attempt first + if attempt_dirs: + for attempt_dir in attempt_dirs: + attempt_items = gcloud_ls(attempt_dir) + attempt_tsvs = [i for i in attempt_items if i.endswith('.tsv')] + if attempt_tsvs: + tsv_files = attempt_tsvs + break + + for tsv in tsv_files: + match = re.search(r'assembly_metadata-(.+)\.tsv$', tsv) + if match: + return match.group(1), tsv + + return None, None + + +def discover_submission_tsvs(bucket, submission_id): + """Find all assembly_metadata TSVs for a submission. + + Returns dict: sample_name -> tsv_gcs_uri + """ + base = f"gs://{bucket}/submissions/{submission_id}/assemble_denovo_metagenomic/" + log.info(f"Listing workflow directories in {base}") + + wf_dirs = gcloud_ls(base) + log.info(f"Found {len(wf_dirs)} workflow directories") + + results = {} + for i, wf_dir in enumerate(wf_dirs): + if i % 20 == 0: + log.info(f" Scanning workflow {i+1}/{len(wf_dirs)}...") + + for call_name in ['call-assembly_stats_non_empty', 'call-assembly_stats_empty']: + call_dir = f"{wf_dir}{call_name}/" + sample_name, tsv_uri = find_tsv_in_call_dir(call_dir) + if sample_name: + if sample_name in results: + log.warning(f"Duplicate sample {sample_name} -- keeping first occurrence") + else: + results[sample_name] = tsv_uri + break + + log.info(f"Found TSVs for {len(results)} samples in submission {submission_id[:8]}") + return results + + +def main(): + parser = argparse.ArgumentParser(description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument('--bucket', required=True, + help='Terra workspace GCS bucket ID (e.g., 
fc-XXXXXXXX-...)') + parser.add_argument('--old-sub', required=True, + help='Old submission ID (main branch)') + parser.add_argument('--new-sub', required=True, + help='New submission ID (feature branch)') + parser.add_argument('--output', '-o', required=True, + help='Output JSON file path') + args = parser.parse_args() + + log.info(f"Old submission: {args.old_sub[:8]}") + log.info(f"New submission: {args.new_sub[:8]}") + + old_tsvs = discover_submission_tsvs(args.bucket, args.old_sub) + new_tsvs = discover_submission_tsvs(args.bucket, args.new_sub) + + # Find intersection + common_samples = sorted(set(old_tsvs.keys()) & set(new_tsvs.keys())) + old_only = sorted(set(old_tsvs.keys()) - set(new_tsvs.keys())) + new_only = sorted(set(new_tsvs.keys()) - set(old_tsvs.keys())) + + log.info(f"Old-only samples: {len(old_only)}") + log.info(f"New-only samples: {len(new_only)}") + log.info(f"Intersecting samples: {len(common_samples)}") + + if old_only: + log.info(f" Old-only: {old_only[:5]}{'...' if len(old_only) > 5 else ''}") + if new_only: + log.info(f" New-only: {new_only[:5]}{'...' 
if len(new_only) > 5 else ''}") + + pairs = {} + for sample in common_samples: + pairs[sample] = { + 'old_tsv': old_tsvs[sample], + 'new_tsv': new_tsvs[sample], + } + + output = { + 'bucket': args.bucket, + 'old_submission': args.old_sub, + 'new_submission': args.new_sub, + 'old_sample_count': len(old_tsvs), + 'new_sample_count': len(new_tsvs), + 'paired_count': len(pairs), + 'old_only': old_only, + 'new_only': new_only, + 'pairs': pairs, + } + + with open(args.output, 'w') as f: + json.dump(output, f, indent=2) + + log.info(f"Wrote {len(pairs)} pairs to {args.output}") + + +if __name__ == '__main__': + main() diff --git a/.agents/skills/regression-testing/generate_report.py b/.agents/skills/regression-testing/generate_report.py new file mode 100644 index 000000000..d6dda318a --- /dev/null +++ b/.agents/skills/regression-testing/generate_report.py @@ -0,0 +1,445 @@ +#!/usr/bin/env python3 +"""Generate regression testing report with plots from per-sample JSON results. + +Aggregates all comparison results, produces summary TSV, plots, and markdown report. 
+""" +import argparse +import glob +import json +import logging +import os + +logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s') +log = logging.getLogger(__name__) + +# Delay imports so script can show usage without these deps +def get_deps(): + import pandas as pd + import matplotlib + matplotlib.use('Agg') + import matplotlib.pyplot as plt + return pd, plt + + +def load_results(results_dir): + """Load all per-sample JSON results.""" + results = [] + for path in sorted(glob.glob(os.path.join(results_dir, '*.json'))): + try: + with open(path) as f: + results.append(json.load(f)) + except Exception as e: + log.warning(f"Failed to load {path}: {e}") + return results + + +def build_comparison_table(results, workspace_name): + """Build a flat table of per-assembly comparisons.""" + rows = [] + for r in results: + sample_id = r.get('sample_id', 'unknown') + for comp in r.get('comparisons', []): + row = { + 'workspace': workspace_name, + 'sample_id': sample_id, + 'assembly_id': comp.get('assembly_id', ''), + 'taxid': comp.get('taxid', ''), + 'tax_name': comp.get('tax_name', ''), + } + # Metrics + for col in ['percent_reference_covered', 'mean_coverage', + 'assembly_length_unambiguous', 'assembly_length', + 'reads_aligned', 'reference_length']: + old_val = comp.get('old_metrics', {}).get(col) + new_val = comp.get('new_metrics', {}).get(col) + delta = comp.get('deltas', {}).get(col) + row[f'old_{col}'] = old_val + row[f'new_{col}'] = new_val + row[f'delta_{col}'] = delta + + # Alignment stats + aln = comp.get('alignment') + if aln: + row['alignment_identity'] = aln.get('identity') + row['snp_count'] = aln.get('snps', 0) + row['internal_insertions'] = aln.get('internal_insertions', 0) + row['internal_deletions'] = aln.get('internal_deletions', 0) + row['indel_count_bp'] = aln.get('internal_insertions', 0) + aln.get('internal_deletions', 0) + row['indel_count_events'] = aln.get('internal_insertion_events', 0) + 
aln.get('internal_deletion_events', 0) + row['terminal_extensions_old'] = aln.get('terminal_extensions_old', 0) + row['terminal_extensions_new'] = aln.get('terminal_extensions_new', 0) + row['terminal_extension_events_old'] = aln.get('terminal_extension_events_old', 0) + row['terminal_extension_events_new'] = aln.get('terminal_extension_events_new', 0) + row['ambiguity_diffs'] = aln.get('ambiguity_diffs', 0) + else: + row['alignment_identity'] = None + row['snp_count'] = None + row['indel_count_bp'] = None + row['indel_count_events'] = None + row['terminal_extensions_old'] = None + row['terminal_extensions_new'] = None + row['terminal_extension_events_old'] = None + row['terminal_extension_events_new'] = None + row['ambiguity_diffs'] = None + + row['error'] = comp.get('error') + rows.append(row) + return rows + + +def build_sample_summary(results): + """Build sample-level summary table.""" + rows = [] + for r in results: + rows.append({ + 'sample_id': r.get('sample_id', 'unknown'), + 'old_assembly_count': r.get('old_assembly_count', 0), + 'new_assembly_count': r.get('new_assembly_count', 0), + 'assembly_count_match': r.get('assembly_count_match', False), + 'assemblies_only_in_old': len(r.get('assemblies_only_in_old', [])), + 'assemblies_only_in_new': len(r.get('assemblies_only_in_new', [])), + 'num_comparisons': len(r.get('comparisons', [])), + }) + return rows + + +def generate_plots(df, plot_dir): + """Generate all plots from the comparison dataframe.""" + _, plt = get_deps() + os.makedirs(plot_dir, exist_ok=True) + + # Filter to rows with actual assemblies + df_asm = df[df['old_percent_reference_covered'].notna() & + df['new_percent_reference_covered'].notna()].copy() + + if len(df_asm) == 0: + log.warning("No assembly comparisons with metrics — skipping plots") + return + + # 1. 
Percent reference covered scatter + fig, ax = plt.subplots(figsize=(8, 8)) + ax.scatter(df_asm['old_percent_reference_covered'] * 100, + df_asm['new_percent_reference_covered'] * 100, + alpha=0.5, s=20) + lims = [0, 105] + ax.plot(lims, lims, 'r--', alpha=0.5, label='y=x') + ax.set_xlabel('Old % Reference Covered') + ax.set_ylabel('New % Reference Covered') + ax.set_title('Percent Reference Covered: Old vs New') + ax.set_xlim(lims) + ax.set_ylim(lims) + ax.legend() + fig.tight_layout() + fig.savefig(os.path.join(plot_dir, 'pct_ref_covered_scatter.png'), dpi=150) + plt.close(fig) + + # 2. Mean coverage scatter (log scale) + df_cov = df_asm[df_asm['old_mean_coverage'].notna() & + df_asm['new_mean_coverage'].notna() & + (df_asm['old_mean_coverage'] > 0) & + (df_asm['new_mean_coverage'] > 0)].copy() + if len(df_cov) > 0: + fig, ax = plt.subplots(figsize=(8, 8)) + ax.scatter(df_cov['old_mean_coverage'], df_cov['new_mean_coverage'], + alpha=0.5, s=20) + min_v = min(df_cov['old_mean_coverage'].min(), df_cov['new_mean_coverage'].min()) * 0.8 + max_v = max(df_cov['old_mean_coverage'].max(), df_cov['new_mean_coverage'].max()) * 1.2 + ax.plot([min_v, max_v], [min_v, max_v], 'r--', alpha=0.5, label='y=x') + ax.set_xscale('log') + ax.set_yscale('log') + ax.set_xlabel('Old Mean Coverage') + ax.set_ylabel('New Mean Coverage') + ax.set_title('Mean Coverage: Old vs New') + ax.legend() + fig.tight_layout() + fig.savefig(os.path.join(plot_dir, 'mean_coverage_scatter.png'), dpi=150) + plt.close(fig) + + # 3. 
Assembly length scatter + df_len = df_asm[df_asm['old_assembly_length_unambiguous'].notna() & + df_asm['new_assembly_length_unambiguous'].notna()].copy() + if len(df_len) > 0: + fig, ax = plt.subplots(figsize=(8, 8)) + ax.scatter(df_len['old_assembly_length_unambiguous'], + df_len['new_assembly_length_unambiguous'], + alpha=0.5, s=20) + min_v = min(df_len['old_assembly_length_unambiguous'].min(), + df_len['new_assembly_length_unambiguous'].min()) * 0.95 + max_v = max(df_len['old_assembly_length_unambiguous'].max(), + df_len['new_assembly_length_unambiguous'].max()) * 1.05 + ax.plot([min_v, max_v], [min_v, max_v], 'r--', alpha=0.5, label='y=x') + ax.set_xlabel('Old Unambiguous Length') + ax.set_ylabel('New Unambiguous Length') + ax.set_title('Assembly Length (Unambiguous): Old vs New') + ax.legend() + fig.tight_layout() + fig.savefig(os.path.join(plot_dir, 'assembly_length_scatter.png'), dpi=150) + plt.close(fig) + + # 4. Delta pct_ref_covered histogram + deltas = df_asm['delta_percent_reference_covered'].dropna() * 100 + if len(deltas) > 0: + fig, ax = plt.subplots(figsize=(8, 5)) + ax.hist(deltas, bins=50, edgecolor='black', alpha=0.7) + ax.axvline(0, color='r', linestyle='--', alpha=0.5) + ax.set_xlabel('Delta % Reference Covered (New - Old)') + ax.set_ylabel('Count') + ax.set_title(f'Distribution of % Reference Covered Changes (n={len(deltas)})') + fig.tight_layout() + fig.savefig(os.path.join(plot_dir, 'delta_pct_ref_covered_hist.png'), dpi=150) + plt.close(fig) + + # 5. 
Alignment identity histogram + df_aln = df_asm[df_asm['alignment_identity'].notna()].copy() + if len(df_aln) > 0: + fig, ax = plt.subplots(figsize=(8, 5)) + identities = df_aln['alignment_identity'] * 100 + ax.hist(identities, bins=50, edgecolor='black', alpha=0.7) + ax.set_xlabel('Pairwise Identity (%)') + ax.set_ylabel('Count') + ax.set_title(f'Assembly Identity Distribution (n={len(identities)})') + ax.axvline(100, color='r', linestyle='--', alpha=0.3) + fig.tight_layout() + fig.savefig(os.path.join(plot_dir, 'alignment_identity_hist.png'), dpi=150) + plt.close(fig) + + # 6. SNP and indel count histograms + if len(df_aln) > 0: + fig, axes = plt.subplots(1, 2, figsize=(14, 5)) + + snps = df_aln['snp_count'].dropna() + if len(snps) > 0: + max_snp = int(snps.max()) + bins_snp = range(0, max(max_snp + 2, 3)) + axes[0].hist(snps, bins=bins_snp, edgecolor='black', alpha=0.7) + axes[0].set_xlabel('SNP Count') + axes[0].set_ylabel('Number of Assemblies') + axes[0].set_title(f'SNPs per Assembly (n={len(snps)})') + + indels = df_aln['indel_count_events'].dropna() + if len(indels) > 0: + max_indel = int(indels.max()) + bins_indel = range(0, max(max_indel + 2, 3)) + axes[1].hist(indels, bins=bins_indel, edgecolor='black', alpha=0.7) + axes[1].set_xlabel('Indel Count (internal)') + axes[1].set_ylabel('Number of Assemblies') + axes[1].set_title(f'Internal Indels per Assembly (n={len(indels)})') + + fig.tight_layout() + fig.savefig(os.path.join(plot_dir, 'snp_indel_counts.png'), dpi=150) + plt.close(fig) + + # 7. 
Terminal extensions histogram + if len(df_aln) > 0: + ext_old = df_aln['terminal_extensions_old'].dropna() + ext_new = df_aln['terminal_extensions_new'].dropna() + has_ext = (ext_old > 0) | (ext_new > 0) + if has_ext.any(): + fig, ax = plt.subplots(figsize=(8, 5)) + ax.hist(ext_new[has_ext], bins=30, alpha=0.6, label='New extends beyond Old', edgecolor='black') + ax.hist(ext_old[has_ext], bins=30, alpha=0.6, label='Old extends beyond New', edgecolor='black') + ax.set_xlabel('Terminal Extension (bp)') + ax.set_ylabel('Count') + ax.set_title('Terminal Extensions') + ax.legend() + fig.tight_layout() + fig.savefig(os.path.join(plot_dir, 'terminal_extensions_hist.png'), dpi=150) + plt.close(fig) + + # 8. Coverage vs identity + if len(df_aln) > 0: + df_ci = df_aln[df_aln['old_mean_coverage'].notna() & (df_aln['old_mean_coverage'] > 0)].copy() + if len(df_ci) > 0: + fig, ax = plt.subplots(figsize=(8, 5)) + ax.scatter(df_ci['old_mean_coverage'], df_ci['alignment_identity'] * 100, + alpha=0.5, s=20) + ax.set_xscale('log') + ax.set_xlabel('Mean Coverage') + ax.set_ylabel('Pairwise Identity (%)') + ax.set_title('Coverage vs Assembly Identity') + fig.tight_layout() + fig.savefig(os.path.join(plot_dir, 'coverage_vs_identity.png'), dpi=150) + plt.close(fig) + + log.info(f"Generated plots in {plot_dir}") + + +def generate_markdown_report(df, sample_df, workspace_name, report_dir, plot_dir): + """Generate markdown report.""" + pd, _ = get_deps() + + total_samples = len(sample_df) + if sample_df.empty or 'old_assembly_count' not in sample_df.columns: + samples_with_assemblies = 0 + samples_count_match = 0 + else: + samples_with_assemblies = len(sample_df[sample_df['old_assembly_count'] > 0]) + samples_count_match = len(sample_df[sample_df['assembly_count_match']]) + samples_count_mismatch = total_samples - samples_count_match + + total_assemblies = len(df) + if df.empty or 'alignment_identity' not in df.columns: + df_aln = pd.DataFrame() + else: + df_aln = 
df[df['alignment_identity'].notna()] + + identical = len(df_aln[df_aln['alignment_identity'] >= 1.0]) if len(df_aln) > 0 else 0 + near_identical = len(df_aln[(df_aln['alignment_identity'] >= 0.999) & (df_aln['alignment_identity'] < 1.0)]) if len(df_aln) > 0 else 0 + minor_diff = len(df_aln[(df_aln['alignment_identity'] >= 0.99) & (df_aln['alignment_identity'] < 0.999)]) if len(df_aln) > 0 else 0 + significant_diff = len(df_aln[df_aln['alignment_identity'] < 0.99]) if len(df_aln) > 0 else 0 + + if len(df_aln) > 0 and 'snp_count' in df_aln.columns: + with_snps = len(df_aln[df_aln['snp_count'] > 0]) + with_indels = len(df_aln[df_aln['indel_count_events'] > 0]) + with_ambig = len(df_aln[df_aln['ambiguity_diffs'] > 0]) + with_terminal = len(df_aln[(df_aln['terminal_extensions_old'] > 0) | (df_aln['terminal_extensions_new'] > 0)]) + total_snps = int(df_aln['snp_count'].sum()) + total_indel_bp = int(df_aln['indel_count_bp'].sum()) + total_indel_events = int(df_aln['indel_count_events'].sum()) + total_ambig = int(df_aln['ambiguity_diffs'].sum()) + total_terminal_bp_old = int(df_aln['terminal_extensions_old'].sum()) + total_terminal_bp_new = int(df_aln['terminal_extensions_new'].sum()) + total_terminal_events_old = int(df_aln['terminal_extension_events_old'].sum()) + total_terminal_events_new = int(df_aln['terminal_extension_events_new'].sum()) + else: + with_snps = with_indels = with_ambig = with_terminal = 0 + total_snps = total_indel_bp = total_indel_events = total_ambig = 0 + total_terminal_bp_old = total_terminal_bp_new = 0 + total_terminal_events_old = total_terminal_events_new = 0 + + report_path = os.path.join(report_dir, f'report_{workspace_name}.md') + with open(report_path, 'w') as f: + f.write(f"# Regression Report: {workspace_name}\n\n") + f.write(f"## Summary\n\n") + f.write(f"| Metric | Value |\n") + f.write(f"|--------|-------|\n") + f.write(f"| Total samples compared | {total_samples} |\n") + f.write(f"| Samples with assemblies (old) | 
{samples_with_assemblies} |\n") + f.write(f"| Samples with matching assembly count | {samples_count_match} |\n") + f.write(f"| Samples with mismatched assembly count | {samples_count_mismatch} |\n") + f.write(f"| Total assembly comparisons | {total_assemblies} |\n") + f.write(f"| Assemblies aligned | {len(df_aln)} |\n\n") + + f.write(f"## Assembly Identity\n\n") + f.write(f"| Category | Count | % |\n") + f.write(f"|----------|-------|---|\n") + if len(df_aln) > 0: + f.write(f"| Identical (100%) | {identical} | {100*identical/len(df_aln):.1f}% |\n") + f.write(f"| Near-identical (99.9-100%) | {near_identical} | {100*near_identical/len(df_aln):.1f}% |\n") + f.write(f"| Minor differences (99-99.9%) | {minor_diff} | {100*minor_diff/len(df_aln):.1f}% |\n") + f.write(f"| Significant differences (<99%) | {significant_diff} | {100*significant_diff/len(df_aln):.1f}% |\n") + f.write(f"\n") + + f.write(f"## Variant Counts\n\n") + f.write(f"| Metric | Assemblies affected | Events | Bases |\n") + f.write(f"|--------|--------------------:|-------:|------:|\n") + f.write(f"| SNPs (A/C/G/T ↔ A/C/G/T) | {with_snps} | {total_snps} | {total_snps} |\n") + f.write(f"| Internal indels | {with_indels} | {total_indel_events} | {total_indel_bp} |\n") + f.write(f"| Ambiguity diffs (N ↔ A/C/G/T) | {with_ambig} | {total_ambig} | {total_ambig} |\n") + f.write(f"| Terminal extensions (old only) | {with_terminal} | {total_terminal_events_old} | {total_terminal_bp_old} |\n") + f.write(f"| Terminal extensions (new only) | {with_terminal} | {total_terminal_events_new} | {total_terminal_bp_new} |\n\n") + + # Metrics summary + if len(df_aln) > 0: + f.write(f"## Metrics Summary\n\n") + f.write(f"| Metric | Median delta | Mean delta | Min | Max |\n") + f.write(f"|--------|-------------|------------|-----|-----|\n") + for col, label in [ + ('delta_percent_reference_covered', '% Ref Covered'), + ('delta_mean_coverage', 'Mean Coverage'), + ('delta_assembly_length_unambiguous', 'Unambig Length'), + ]: + 
vals = df[col].dropna() + if len(vals) > 0: + f.write(f"| {label} | {vals.median():.4f} | {vals.mean():.4f} | {vals.min():.4f} | {vals.max():.4f} |\n") + f.write(f"\n") + + # Divergent assemblies + if len(df_aln) > 0 and 'alignment_identity' in df_aln.columns: + divergent = df_aln[df_aln['alignment_identity'] < 0.999].sort_values('alignment_identity') + else: + divergent = pd.DataFrame() + if len(divergent) > 0: + f.write(f"## Divergent Assemblies (identity < 99.9%)\n\n") + f.write(f"| Assembly ID | Tax Name | Identity | SNPs | Indel events (bp) | Ambig Diffs | Term Ext Old events (bp) | Term Ext New events (bp) |\n") + f.write(f"|-------------|----------|----------|------|-------------------|-------------|--------------------------|---------------------------|\n") + for _, row in divergent.iterrows(): + indel_ev = int(row.get('indel_count_events', 0)) + indel_bp = int(row.get('indel_count_bp', 0)) + term_old_ev = int(row.get('terminal_extension_events_old', 0)) + term_old_bp = int(row.get('terminal_extensions_old', 0)) + term_new_ev = int(row.get('terminal_extension_events_new', 0)) + term_new_bp = int(row.get('terminal_extensions_new', 0)) + f.write(f"| {row['assembly_id']} | {row['tax_name']} | " + f"{row['alignment_identity']*100:.3f}% | {row['snp_count']:.0f} | " + f"{indel_ev} ({indel_bp}) | {row['ambiguity_diffs']:.0f} | " + f"{term_old_ev} ({term_old_bp}) | " + f"{term_new_ev} ({term_new_bp}) |\n") + f.write(f"\n") + + # Assembly count mismatches + if 'assembly_count_match' not in sample_df.columns: + mismatches = pd.DataFrame() + else: + mismatches = sample_df[~sample_df['assembly_count_match']] + if len(mismatches) > 0: + f.write(f"## Assembly Count Mismatches\n\n") + f.write(f"| Sample | Old Count | New Count | Only Old | Only New |\n") + f.write(f"|--------|-----------|-----------|----------|----------|\n") + for _, row in mismatches.iterrows(): + f.write(f"| {row['sample_id']} | {row['old_assembly_count']} | " + f"{row['new_assembly_count']} | 
{row['assemblies_only_in_old']} | " + f"{row['assemblies_only_in_new']} |\n") + f.write(f"\n") + + # Plot references + f.write(f"## Plots\n\n") + plot_files = sorted(os.listdir(plot_dir)) if os.path.isdir(plot_dir) else [] + for pf in plot_files: + if pf.endswith('.png'): + rel_plot_dir = os.path.basename(plot_dir) + f.write(f"![{pf}]({rel_plot_dir}/{pf})\n\n") + + log.info(f"Report written to {report_path}") + return report_path + + +def main(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument('--results-dir', required=True, help='Directory with per-sample JSON results') + parser.add_argument('--report-dir', required=True, help='Output directory for report') + parser.add_argument('--workspace-name', required=True, help='Workspace name for report title') + args = parser.parse_args() + + pd, plt = get_deps() + + results = load_results(args.results_dir) + log.info(f"Loaded {len(results)} sample results") + + # Build comparison table + comp_rows = build_comparison_table(results, args.workspace_name) + df = pd.DataFrame(comp_rows) + log.info(f"Total assembly comparisons: {len(df)}") + + # Build sample summary + sample_rows = build_sample_summary(results) + sample_df = pd.DataFrame(sample_rows) + + # Save summary TSV + os.makedirs(args.report_dir, exist_ok=True) + tsv_path = os.path.join(args.report_dir, f'summary_{args.workspace_name}.tsv') + if len(df) > 0: + df.to_csv(tsv_path, sep='\t', index=False) + log.info(f"Summary TSV written to {tsv_path}") + + # Generate plots + plot_dir = os.path.join(args.report_dir, f'plots_{args.workspace_name}') + if len(df) > 0: + generate_plots(df, plot_dir) + + # Generate markdown report + generate_markdown_report(df, sample_df, args.workspace_name, args.report_dir, plot_dir) + + +if __name__ == '__main__': + main() diff --git a/.agents/skills/regression-testing/run_vadr.sh b/.agents/skills/regression-testing/run_vadr.sh new file mode 100644 index 000000000..abb24ef88 --- /dev/null +++ 
b/.agents/skills/regression-testing/run_vadr.sh @@ -0,0 +1,70 @@ +#!/bin/bash +# Run VADR on a single FASTA file. +# Inputs (env vars set by dsub): +# FASTA - input FASTA file (localized by dsub) +# VADR_OPTS - vadr options string +# MIN_LEN - minimum sequence length (optional) +# MAX_LEN - maximum sequence length (optional) +# MODEL_URL - URL to vadr model tarball +# MODEL_SUB - subdirectory within model tarball (optional) +# Outputs (env vars set by dsub): +# NUM_ALERTS - file to write num_alerts integer +# ALERTS_TSV - file to write alerts TSV +# VADR_TGZ - file to write full vadr output tarball + +set -euo pipefail + +# Default optional env vars to empty +MODEL_URL="${MODEL_URL:-}" +MODEL_SUB="${MODEL_SUB:-}" +MIN_LEN="${MIN_LEN:-}" +MAX_LEN="${MAX_LEN:-}" + +BASENAME=$(basename "${FASTA}" .fasta) + +# Download and unpack VADR models +if [ -n "${MODEL_URL}" ]; then + mkdir -p vadr-untar + curl -fsSL "${MODEL_URL}" | tar -C vadr-untar -xzf - + MODEL_DIR=$(find vadr-untar -mindepth 1 -maxdepth 1 -type d | head -1) + ln -s "${MODEL_DIR}" vadr-models +else + ln -s /opt/vadr/vadr-models vadr-models +fi + +if [ -n "${MODEL_SUB}" ]; then + VADR_MODEL_DIR="vadr-models/${MODEL_SUB}" +else + VADR_MODEL_DIR="vadr-models" +fi + +# Build trim args +TRIM_ARGS="" +if [ -n "${MIN_LEN}" ]; then + TRIM_ARGS="${TRIM_ARGS} --minlen ${MIN_LEN}" +fi +if [ -n "${MAX_LEN}" ]; then + TRIM_ARGS="${TRIM_ARGS} --maxlen ${MAX_LEN}" +fi + +# Remove terminal ambiguous nucleotides +/opt/vadr/vadr/miniscripts/fasta-trim-terminal-ambigs.pl \ + "${FASTA}" ${TRIM_ARGS} > "${BASENAME}.trimmed.fasta" + +# Run VADR +v-annotate.pl \ + ${VADR_OPTS} \ + --split --cpu $(nproc) \ + --mdir "${VADR_MODEL_DIR}" \ + "${BASENAME}.trimmed.fasta" \ + "${BASENAME}" + +# Package outputs +tar -C "${BASENAME}" -czf "${VADR_TGZ}" . 
+ +# Extract alerts +cut -f 5 "${BASENAME}/${BASENAME}.vadr.alt.list" | tail -n +2 > alerts.tsv +cp alerts.tsv "${ALERTS_TSV}" +wc -l < alerts.tsv > "${NUM_ALERTS}" + +echo "VADR complete. Alerts: $(cat "${NUM_ALERTS}")" diff --git a/.claude/rules/container-vulns.md b/.claude/rules/container-vulns.md new file mode 100644 index 000000000..aa7af5192 --- /dev/null +++ b/.claude/rules/container-vulns.md @@ -0,0 +1,12 @@ +--- +paths: + - "docker/**" + - ".trivyignore" + - ".trivy-ignore-policy.rego" + - "vulnerability-mitigation-status.md" + - ".github/workflows/container-scan.yml" + - ".github/workflows/docker.yml" +--- + +For container vulnerability management guidance, see +.agents/skills/container-vulns/SKILL.md diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index d899a92d5..85a464535 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -29,3 +29,4 @@ docker run --rm \ | [docker/](../docker/) | Dockerfiles and requirements | | [src/viral_ngs/](../src/viral_ngs/) | Source code | | [tests/](../tests/) | Test files | +| [.agents/skills/](../.agents/skills/) | Reusable agent playbooks and scripts | diff --git a/AGENTS.md b/AGENTS.md index 88da6ab20..9d8dae00c 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -504,72 +504,23 @@ micromamba search -c bioconda --subdir linux-aarch64 --- -## Container Vulnerability Management - -### Scanning - -Container images are scanned for vulnerabilities using [Trivy](https://aquasecurity.github.io/trivy/): - -- **On every PR/push**: `docker.yml` scans each image flavor after build (SARIF → GitHub Security tab, JSON → artifact) -- **Weekly schedule**: `container-scan.yml` scans the latest published images -- Scans filter to **CRITICAL/HIGH** severity, **ignore-unfixed**, and apply a Rego policy (`.trivy-ignore-policy.rego`) -- Per-CVE exceptions go in `.trivyignore` with mandatory justification comments - -### Rego Policy (`.trivy-ignore-policy.rego`) - -The Rego policy filters 
CVEs that are architecturally inapplicable to ephemeral batch containers: - -- **AV:P** (Physical access required) — containers are cloud-hosted -- **AV:A** (Adjacent network required) — no attacker on same network segment -- **AV:L + UI:R** (Local + user interaction) — no interactive sessions -- **AV:L + PR:H** (Local + high privileges) — containers run non-root -- **AV:L + S:U** (Local + scope unchanged) — attacker already has code execution and impact stays within the ephemeral container - -Changes to this policy should be reviewed carefully. The comments in the file explain the rationale and risk for each rule. - -### Common Vulnerability Sources - -**Python transitive deps**: Pin minimum versions in `docker/requirements/*.txt`. Prefer conda packages over pip. Check conda-forge availability before assuming a version exists — conda-forge often lags PyPI by days/weeks. +## Reusable Agent Skills -**Java fat JARs** (picard, gatk, snpeff, fgbio): Bioinformatics Java tools are distributed as uber JARs with all dependencies bundled inside. Trivy detects vulnerable libraries (log4j, commons-compress, etc.) baked into these JARs. Version bumps can cause ARM64 conda solver conflicts because Java tools pull in openjdk → harfbuzz → icu version chains that clash with other packages (r-base, boost-cpp, pyicu). Always check: -1. Whether the tool is actually flagged by Trivy (don't bump versions unnecessarily) -2. Whether the CVE applies (e.g., log4j 1.x is NOT vulnerable to Log4Shell) -3. Whether the desired version resolves on ARM64 before pushing +Established workflows and playbooks live in `.agents/skills/`. Each skill +directory contains a `SKILL.md` with the playbook and companion scripts. +Check there before building analysis pipelines from scratch. -**Go binaries**: Some conda packages bundle compiled Go binaries (e.g., mafft's `dash_client`, google-cloud-sdk's `gcloud-crc32c`). If the binary is unused, delete it in the Dockerfile. 
Delete from **both** the installed location and `/opt/conda/pkgs/*/` (conda package cache) — Trivy scans the full filesystem. +Available skills: +- **regression-testing** -- End-to-end assembly regression testing against Terra submissions +- **dsub-batch-jobs** -- Running one-off compute jobs on GCP via dsub +- **container-vulns** -- Container vulnerability scanning, triage, and mitigation -**Vendored copies**: Packages like google-cloud-sdk and setuptools bundle their own copies of Python libraries that may be older than what's in the conda environment. Trivy flags these vendored copies separately. Options: delete the vendored directory (if not needed at runtime), or accept the risk in `.trivyignore` with justification. - -### ARM64 Solver Conflicts - -The conda solver on ARM64 (linux-aarch64) is more constrained than amd64 because fewer package builds exist. Common conflict patterns: - -- **icu version conflicts**: Many packages (openjdk, r-base, boost-cpp, pyicu) pin specific icu version ranges. Bumping one package can make the entire environment unsolvable. -- **libdeflate/htslib conflicts**: lofreq 2.1.5 pins old htslib/libdeflate versions that conflict with newer pillow/libtiff. -- **openjdk version escalation**: snpeff 5.2+ requires openjdk>=11, 5.3+ requires openjdk>=21. Higher openjdk versions pull in harfbuzz→icu chains that conflict with everything. - -When a solver conflict occurs: revert the change, check what version the solver was picking before, and pin to that exact version if it already addresses the CVE. - -### Mitigation Decision Process - -When triaging a CVE: - -1. **Check the CVSS vector** — does the Rego policy already filter it? -2. **Identify the source package** — use Trivy JSON output (`PkgName`, `PkgPath`, `InstalledVersion`) -3. **Check if a fix version exists on conda-forge/bioconda** — not just on PyPI -4. **Test on ARM64** — solver conflicts are the most common failure mode -5. 
**If the fix version conflicts**: consider whether the CVE is exploitable in your deployment model. Document the risk assessment in `.trivyignore` or `vulnerability-mitigation-status.md`. -6. **If the vulnerable code is unused**: delete the binary/file inline in the Dockerfile (same RUN layer as install to avoid bloating images) +--- -### Key Files +## Container Vulnerability Management -| File | Purpose | -|------|---------| -| `.trivy-ignore-policy.rego` | Rego policy for class-level CVE filtering | -| `.trivyignore` | Per-CVE exceptions with justifications | -| `.github/workflows/docker.yml` | Build-time scanning (SARIF + JSON) | -| `.github/workflows/container-scan.yml` | Weekly scheduled scanning | -| `vulnerability-mitigation-status.md` | Local-only tracking doc (not committed) | +See `.agents/skills/container-vulns/SKILL.md` for detailed guidance on vulnerability +scanning, triaging CVEs, Rego policy, and ARM64 solver conflicts. --- diff --git a/CLAUDE.md b/CLAUDE.md index 5e0d072cf..71dc1d2e7 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -29,3 +29,4 @@ docker run --rm \ | [docker/](docker/) | Dockerfiles and requirements | | [src/viral_ngs/](src/viral_ngs/) | Source code | | [tests/](tests/) | Test files | +| [.agents/skills/](.agents/skills/) | Reusable agent playbooks and scripts |