31 changes: 18 additions & 13 deletions .claude/skills/prepare-edit-inputs/SKILL.md
@@ -10,13 +10,16 @@ Goal: produce a valid `input_config.json` (+ per-sample expected-edit JSONs) tha

## 0. Preflight: is the reference data present?

Before anything else, confirm `scripts/setup.sh` has been run (see AGENTS.md guardrail 1):
- Check `hifi-wdl-resources-v3.1.0/` exists and the paths inside
Before anything else, confirm the resource bundle has been fetched (see AGENTS.md guardrail 1).
`scripts/setup.sh` (or `scripts/fetch_resources.sh` directly) pulls the frozen Zenodo bundle
into `editing-qc-resources-v3.1.0/` and writes populated map files
`GRCh38.{ref,tertiary,somatic}_map.v3p1p0.template.tsv` at the repo root:
- Check `editing-qc-resources-v3.1.0/` exists and the absolute paths inside
`GRCh38.ref_map.v3p1p0.template.tsv` resolve to real files.
- **`<prefix>` placeholders:** unmodified templates contain `<prefix>/...` paths.
`setup.sh` strips these (`sed "s/<prefix>\///g"`). If you still see `<prefix>` in a map
file, the template was never populated — tell the user to run `scripts/setup.sh`. Do not
hand-edit placeholders into guessed absolute paths.
- **`<prefix>` placeholders:** the in-bundle `*.template.tsv` files contain `<prefix>/...`
paths; `fetch_resources.sh` replaces `<prefix>` with the absolute bundle path when it writes
the populated copies at the repo root. If you still see `<prefix>` in a root-level map file,
the fetch never ran: tell the user to run `scripts/setup.sh`. Do not hand-edit placeholders
into guessed absolute paths.
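
The preflight check above could be scripted roughly as follows. This is a minimal sketch, not a script from the repo: `check_map_file` is a hypothetical helper, and it assumes each map file is a two-column, tab-separated `key<TAB>value` table in which populated paths are absolute. Values that are neither absolute paths nor `<prefix>` placeholders are skipped rather than treated as errors.

```bash
# Hypothetical helper (not part of the repo): report unpopulated or dangling
# entries in a populated map file. Assumes tab-separated key/value rows.
check_map_file() {
  map_file="$1"
  status=0
  tab="$(printf '\t')"
  if [ ! -f "$map_file" ]; then
    echo "MISSING_MAP $map_file"
    return 1
  fi
  while IFS="$tab" read -r key value || [ -n "$key" ]; do
    case "$value" in
      # Template placeholder survived: the fetch step never populated this file.
      *"<prefix>"*) echo "UNPOPULATED $key"; status=1 ;;
      # Absolute path: it must resolve to something on disk.
      /*) [ -e "$value" ] || { echo "MISSING $key $value"; status=1; } ;;
    esac
  done < "$map_file"
  return "$status"
}
```

Calling `check_map_file GRCh38.ref_map.v3p1p0.template.tsv` from the repo root returns non-zero if any entry is unpopulated or missing, which is the signal to stop and ask the user to run `scripts/setup.sh`.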

## Interpreting a sample request into samples

@@ -73,15 +76,17 @@ Resolution rules that recur regardless of format:

## 1. dbNSFP licensing (do this once, flag every time)

`setup.sh` auto-downloads **dbNSFP v4.9a** as a *fallback only*. v4.x omits columns that are
license-gated for commercial use. **Academic users should register at
https://www.dbnsfp.org/download and obtain dbNSFP v5.3+**, then:
- place the indexed file alongside the other references,
- update `DBNSFP_VERSION` in `setup.sh` (or the download step) and
- update the `dbnsfp` entry in `GRCh38.somatic_map.v3p1p0.template.tsv` to point at it.
dbNSFP is **license-gated and deliberately NOT in the frozen bundle** (academic-free /
commercial-licensed). The pinned version lives in `resources/manifest.tsv` (`dbnsfp_ver`).
**Academic users register at https://www.dbnsfp.org/download, obtain the pinned version**,
build the bgzip+tabix `_grch38.gz` file, then re-run the fetch step pointing at it:
- `./scripts/fetch_resources.sh --dbnsfp /path/to/dbNSFP_grch38.gz` (`.tbi` alongside) —
this patches the `dbnsfp_file`/`dbnsfp_file_index` entries in the populated
`GRCh38.somatic_map.v3p1p0.template.tsv`.
- To change the pinned version, edit `dbnsfp_ver` in `resources/manifest.tsv`.

The agent cannot download a registered file — surface this to the user and confirm which
dbNSFP version their somatic map points at.
dbNSFP version their somatic map points at. Until provided, those entries are placeholders.
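
One way to confirm which dbNSFP file the somatic map points at is sketched below. `dbnsfp_status` is a hypothetical helper, not something the repo ships. It assumes the populated somatic map is tab-separated and that the index sits next to the data file as `<file>.tbi`. The exact form of a placeholder entry is not specified here, so anything that does not resolve to a real file plus index is simply reported as not ready.

```bash
# Hypothetical helper (not part of the repo): report whether the dbnsfp_file
# entry in a somatic map resolves to a real bgzip file with a tabix index.
dbnsfp_status() {
  somatic_map="$1"
  dbnsfp_path="$(awk -F '\t' '$1 == "dbnsfp_file" { print $2; exit }' "$somatic_map")"
  if [ -z "$dbnsfp_path" ]; then
    echo "no dbnsfp_file entry in $somatic_map"
    return 1
  fi
  if [ -f "$dbnsfp_path" ] && [ -f "$dbnsfp_path.tbi" ]; then
    echo "dbNSFP present: $dbnsfp_path"
  else
    echo "dbNSFP placeholder or missing: $dbnsfp_path"
    return 1
  fi
}
```

A non-zero return from `dbnsfp_status GRCh38.somatic_map.v3p1p0.template.tsv` is the cue to surface the licensing step to the user rather than to guess a path.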

## 2. Expected-edit JSON from a Benchling GenBank file

2 changes: 2 additions & 0 deletions .gitignore
@@ -45,3 +45,5 @@ workflows/snakemake
.env
.venv
.env
editing-qc-resources-*/
GRCh38.*_map.*.template.tsv
20 changes: 15 additions & 5 deletions AGENTS.md
@@ -16,10 +16,17 @@ cluster**.
## Hard guardrails (must hold even when no skill is invoked)

1. **Verify prerequisite reference data before doing anything that runs the workflow.**
`scripts/setup.sh` downloads ~200 GB into `hifi-wdl-resources-v3.1.0/` and populates the
map templates (`GRCh38.{ref,tertiary,somatic}_map.v3p1p0.template.tsv`). If the bundle dir
or the paths referenced inside the map files are **missing, STOP and tell the user to run
`scripts/setup.sh`** — never fabricate or guess reference paths.
`scripts/setup.sh` (→ `scripts/fetch_resources.sh`) pulls a **single frozen Zenodo bundle**
(resumable, checksum-verified) into `editing-qc-resources-v3.1.0/` and writes populated map
files `GRCh38.{ref,tertiary,somatic}_map.v3p1p0.template.tsv` at the repo root. The filenames
are unchanged from before, but the files now hold substituted absolute paths.
If the bundle dir or the paths referenced inside the map files are **missing, STOP and tell
the user to run `scripts/setup.sh`** — never fabricate or guess reference paths.
- Versions are pinned in `resources/manifest.tsv` (the single source of truth).
- **dbNSFP is license-gated and not in the bundle** — the user supplies it via
`fetch_resources.sh --dbnsfp`; until then its somatic-map entries are placeholders.
- Rebuilding the bundle for a new version is a **maintainer** task via
`scripts/build_resource_bundle.sh` (the slow upstream pull) — never run that during setup.

2. **Never read pipeline logs in full.** `workflow.log` and task stderr are enormous and
highly repetitive. Always `grep -E "ERROR|FAILED|failed"`, `tail`, or scope to one
@@ -89,7 +96,10 @@ cluster**.
later, which gives cleaner variant-filtering statistics for publication. It should have
no effect on the actual final reported variants.
- `genbank_to_crispr_json.py` — Benchling GenBank → expected-edit JSON converter.
- `scripts/setup.sh` — one-time reference + container download.
- `resources/manifest.tsv` — **single source of truth** for resource versions/URLs/checksums.
- `scripts/setup.sh` — one-time setup: fetch frozen bundle (delegates to fetch_resources.sh) + build knock-knock container.
- `scripts/fetch_resources.sh` — user-facing: resumable, checksum-verified pull of the frozen Zenodo bundle.
- `scripts/build_resource_bundle.sh` — **maintainer-only**: slow upstream pull → frozen bundle tarball for Zenodo.
- `scripts/launch.sh` — stages inputs and (optionally `--run`) launches the workflow.
- `scripts/process_input_config.py` — BAM merge/strip/stage, called by launch.sh.
- `scripts/create_image_manifest.sh` / `populate_miniwdl_singularity_cache.sh` — container prepull.
5 changes: 0 additions & 5 deletions GRCh38.somatic_map.v3p1p0.template.tsv
@@ -1,9 +1,4 @@
trf_bed hifisomatic_resources/human_GRCh38_no_alt_analysis_set.trf.bed
ref_bed hifisomatic_resources/chr.bed
ref_gff hifisomatic_resources/Homo_sapiens.GRCh38.112.chr.reformatted.gff3
control_vcf hifisomatic_resources/severus.jasmine.AN10.AC4.nosample.vcf.gz
control_vcf_index hifisomatic_resources/severus.jasmine.AN10.AC4.nosample.vcf.gz.tbi
severus_pon_tsv PoN_1000G_hg38_extended.tsv.gz
vep_cache homo_sapiens_refseq_vep_115_GRCh38.tar.gz
annotsv_cache annotsv_cache.tar.gz
dbnsfp_file dbNSFP5.3.1a_grch38.gz
40 changes: 36 additions & 4 deletions docs/backend-hpc.md
@@ -40,7 +40,7 @@ Cromwell supports a number of different HPC backends; see [Cromwell's documentat

### Filling out workflow inputs

Create an input configuration JSON describing your samples and their relationships. See [example_input_config.json](../example_input_config.json) as a template. After downloading reference data with `./scripts/setup.sh`, the template map files at the repository root will be populated with local paths.
Create an input configuration JSON describing your samples and their relationships. See [example_input_config.json](../example_input_config.json) as a template. After downloading reference data with `./scripts/setup.sh`, the populated map files (`GRCh38.{ref,tertiary,somatic}_map.v3p1p0.template.tsv`) at the repository root will hold absolute local paths into the frozen bundle.

See [family.md](./family.md) for input structure details, or [biohub-setup.md](./biohub-setup.md) for biohub-specific instructions.

@@ -80,12 +80,44 @@ cromwell run workflows/family.wdl --input <inputs_json_file>

## Reference data bundle

[<img src="https://zenodo.org/badge/DOI/10.5281/zenodo.17086906.svg" alt="10.5281/zenodo.17086906">](https://zenodo.org/records/17086906)
[<img src="https://zenodo.org/badge/DOI/10.5281/zenodo.20856447.svg" alt="10.5281/zenodo.20856447">](https://zenodo.org/records/20856447)

Reference data is hosted on Zenodo. Use the provided setup script to download and configure:
Reference data (~46 GB) is hosted on Zenodo. The download is resumable and checksum-verified; it typically takes 15–30 minutes depending on network speed.

Before running setup, collect two prerequisites:

**1. Set `SINGULARITY_CACHEDIR`** to a location with sufficient space (the knock-knock container
is several GB). If unset, `setup.sh` skips the knock-knock build with a warning. A good default
is a directory under the repo root on scratch:

```bash
export SINGULARITY_CACHEDIR="$(pwd)/miniwdl_cache/singularity_cache"
mkdir -p "${SINGULARITY_CACHEDIR}"
```

**2. Obtain dbNSFP** (license-gated, not in bundle). dbNSFP is required for somatic variant
annotation. Request a licensed copy from [https://www.dbnsfp.org/download](https://www.dbnsfp.org/download).
You need the **GRCh38 BGZF files** listed under _"dbNSFP variants in BGZF format for VEP and
SnpEff annotation programs (sorted by GRCh38 and GRCh37 coordinates)"_:

- `dbNSFP5.3.1a_grch38.gz`
- `dbNSFP5.3.1a_grch38.gz.tbi`

Have the path ready before running setup: passing it via `--dbnsfp` avoids running the
full download a second time. If you cannot obtain dbNSFP yet, setup will complete with
placeholders in the somatic map.

Then download the bundle (~46 GB, resumable, typically 15–30 min) and build the knock-knock container:

```bash
# Recommended: pass dbNSFP in the same invocation
./scripts/setup.sh --dbnsfp /path/to/dbNSFP5.3.1a_grch38.gz

# Without dbNSFP (somatic map will have placeholders):
./scripts/setup.sh

# Resources only, no knock-knock build:
./scripts/fetch_resources.sh --dbnsfp /path/to/dbNSFP5.3.1a_grch38.gz
```

This downloads ~200GB of reference files and updates the template map files with local paths. The process takes several hours.
All commands extract the bundle under the repo root and write ready-to-use map files (`GRCh38.{ref,tertiary,somatic}_map.v3p1p0.template.tsv`) with absolute paths. If a download is interrupted, re-run — it resumes from where it stopped.
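
The fetch step already verifies the checksum, so a manual check is normally unnecessary. For debugging, the sketch below shows how the pinned values could be read back from `resources/manifest.tsv` and compared against a tarball on disk. Both functions are hypothetical helpers, not part of the repo. They assume the manifest is tab-separated `key<TAB>value` with `#` comment lines, that GNU `sha256sum` is available, and that you still have a copy of the tarball (whether the fetch step keeps it after extraction is not stated here).

```bash
# Hypothetical helpers (not part of the repo).

# Print the value for one key in a key<TAB>value manifest, skipping comments.
manifest_value() {
  awk -F '\t' -v key="$2" '!/^#/ && $1 == key { print $2; exit }' "$1"
}

# Compare a tarball on disk against the bundle_sha256 pinned in the manifest.
verify_bundle() {
  manifest="$1"
  tarball="$2"
  expected="$(manifest_value "$manifest" bundle_sha256)"
  if [ -z "$expected" ]; then
    echo "bundle_sha256 not found in $manifest"
    return 1
  fi
  # sha256sum -c expects "<hash>  <path>" (two spaces) on stdin.
  printf '%s  %s\n' "$expected" "$tarball" | sha256sum -c -
}
```

For example, `manifest_value resources/manifest.tsv bundle_version` prints the pinned bundle version, and `verify_bundle resources/manifest.tsv /path/to/editing-qc-resources-v3.1.0.tar` returns non-zero on a mismatch.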
42 changes: 38 additions & 4 deletions docs/biohub-setup.md
@@ -29,15 +29,49 @@ conda activate hifi-wdl

Create or edit `~/.config/miniwdl.cfg` for your HPC environment. See [docs/backend-hpc.md](backend-hpc.md) for SLURM-specific configuration.

### 4. Download reference data
### 4. Download reference data and build containers

This downloads ~200GB of reference files and will take several hours:
Before running setup, collect two prerequisites:

#### 4a. Set container cache location

`setup.sh` builds the knock-knock Singularity image into `$SINGULARITY_CACHEDIR`. On Biohub
HPC the default home-directory cache is too small — point it at scratch storage first:

```bash
./scripts/setup.sh
export SINGULARITY_CACHEDIR="$(pwd)/miniwdl_cache/singularity_cache"
mkdir -p "${SINGULARITY_CACHEDIR}"
```

This populates the reference map template files at the repository root with local paths.
If `SINGULARITY_CACHEDIR` is unset, `setup.sh` will skip the knock-knock build with a warning
and you will need to re-run it later with the variable set.
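
To catch the skipped-build case before starting a long setup run, a small guard such as the following could be run first. It is only a sketch: `check_singularity_cache` is a hypothetical helper, not part of `setup.sh`, and it checks only that the variable is set and points at a writable directory. It prints `df -h` output so you can judge free space yourself, since the exact space requirement is not specified here.

```bash
# Hypothetical helper (not part of the repo): confirm SINGULARITY_CACHEDIR is
# set and usable before running setup.
check_singularity_cache() {
  if [ -z "${SINGULARITY_CACHEDIR:-}" ]; then
    echo "SINGULARITY_CACHEDIR is unset"
    return 1
  fi
  if [ ! -d "$SINGULARITY_CACHEDIR" ] || [ ! -w "$SINGULARITY_CACHEDIR" ]; then
    echo "SINGULARITY_CACHEDIR is not a writable directory: $SINGULARITY_CACHEDIR"
    return 1
  fi
  # Show free space on the filesystem holding the cache.
  df -h "$SINGULARITY_CACHEDIR"
}
```

Running `check_singularity_cache && ./scripts/setup.sh` then starts setup only when the cache location looks usable.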

#### 4b. Obtain dbNSFP (license-gated)

dbNSFP is required for somatic variant annotation but is not in the bundle due to licensing.
Request a licensed copy from [https://www.dbnsfp.org/download](https://www.dbnsfp.org/download).
You need the **GRCh38 BGZF files** listed under _"dbNSFP variants in BGZF format for VEP and
SnpEff annotation programs (sorted by GRCh38 and GRCh37 coordinates)"_:

- `dbNSFP5.3.1a_grch38.gz`
- `dbNSFP5.3.1a_grch38.gz.tbi`

Have the path to `dbNSFP5.3.1a_grch38.gz` ready before running setup: passing it via
`--dbnsfp` avoids running the full download a second time. If you cannot obtain
dbNSFP yet, setup will complete with placeholders in the somatic map.

#### 4c. Run setup

Downloads ~46 GB from Zenodo (resumable; typically 15–30 min), writes map files with
absolute paths, and builds the knock-knock container:

```bash
# Recommended: pass dbNSFP in the same invocation
./scripts/setup.sh --dbnsfp /path/to/dbNSFP5.3.1a_grch38.gz

# Without dbNSFP (somatic map will have placeholders — patch later with fetch_resources.sh --dbnsfp):
./scripts/setup.sh
```

## Preparing Input Files

1 change: 0 additions & 1 deletion docs/tools_containers.md
@@ -50,6 +50,5 @@ This fork adds specialized tools for CRISPR editing validation and somatic varia
| --------: | ------------------- | :---: |
| ensembl-vep | <ul><li>VEP (variant annotation)</li></ul> | [ensembl-vep@sha256:e7612ab7c2923f2b9a78592b939e74874cd29f7494d70ee7135c8303841b03a8](https://hub.docker.com/r/ensemblorg/ensembl-vep) |
| annotsv | <ul><li>AnnotSV (SV annotation)</li></ul> | [annotsv@sha256:0c73fef5fa529b11e10bea0355480f01b56d0feb21af54cb9bbbd1f9f4c862a7](https://quay.io/repository/biocontainers/annotsv) |
| severus | <ul><li>Severus (phased SV calling)</li></ul> | [severus@sha256:fb4471e0504d564de78215ae15c081a1bb2022ad51e993eba92bc6fa5052a05d](https://quay.io/repository/biocontainers/severus) |
| somatic_general_tools | <ul><li>Somatic analysis utilities</li></ul> | [somatic_general_tools@sha256:a25a2e62b88c73fa3c18a0297654420a4675224eb0cf39fa4192f8a1e92b30d6](https://quay.io/repository/pacbio/somatic_general_tools) |
| chord | <ul><li>CHORD (HRD prediction)</li></ul> | [chord@sha256:9f6aa44ffefe3f736e66a0e2d7941d4f3e1cc6d848a9a11a17e85a6525e63a77](https://hub.docker.com/r/scwatts/chord) |
42 changes: 42 additions & 0 deletions resources/manifest.tsv
@@ -0,0 +1,42 @@
# manifest.tsv — single source of truth for resource versions/provenance.
#
# This pins every upstream resource that build_resource_bundle.sh pulls and
# freezes into the bundle tarball. Bump a version HERE, re-run
# build_resource_bundle.sh, re-upload to Zenodo, update BUNDLE_* below. Users
# never read this file — they only run fetch_resources.sh against the frozen
# Zenodo record.
#
# Columns: key<TAB>value
#
# --- bundle identity (what fetch_resources.sh downloads) ----------------------
bundle_version v3.1.0
bundle_tar editing-qc-resources-v3.1.0.tar
# Zenodo record holding the frozen bundle.
bundle_zenodo_record 20856447
# sha256 of the bundle tar; written by build_resource_bundle.sh, checked by fetch.
bundle_sha256 2fb5f71f6af9c69cfdf34edc4e75a208fb6db8fdfe48a8bfffa504e938a77763
#
# --- redistributable upstreams (frozen INTO the bundle) -----------------------
# key value
hifi_wdl_resources_ver v3.1.0
hifi_wdl_resources_url https://zenodo.org/records/17086906/files/hifi-wdl-resources-v3.1.0.tar
hg002_fasta_url https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/hg002v1.1.fasta.gz
hg002_chain_url https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/changes/hg002v1.1_to_GRCh38.chain.gz
# hifisomatic suite — only the reformatted Ensembl GFF3 (ref_gff) is extracted into
# the bundle; the rest belonged to the removed Severus somatic-calling path.
hifisomatic_resources_url https://zenodo.org/record/14847828/files/hifisomatic_resources.tar.gz
# AnnotSV: install-script commit AND the (annotation_ver, exomiser_ver) args are
# pinned together so the cache always matches the AnnotSV container in the pipeline.
annotsv_install_commit b270de3f6db45e4c4ad6b32e7fc868f2369b62c3
annotsv_annotation_ver 3.5
annotsv_exomiser_ver 2406
vep_ver 115
vep_cache_url https://ftp.ensembl.org/pub/release-115/variation/indexed_vep_cache/homo_sapiens_refseq_vep_115_GRCh38.tar.gz
clinvar_vcf_url https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
clinvar_tbi_url https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi
#
# --- license-gated, NOT in the bundle (user obtains separately) ---------------
# dbNSFP: academic-free / commercial-licensed. fetch_resources.sh prints
# instructions and patches the somatic map; the file is never re-hosted.
dbnsfp_ver 5.3.1a
dbnsfp_download_page https://www.dbnsfp.org/download
20 changes: 18 additions & 2 deletions scripts/README.md
@@ -186,10 +186,26 @@ ls -la outputs/

4. **Input Files**: Ensure all file paths in your input configuration JSON are accessible

5. **Reference Data**: Download the reference bundle from Zenodo if using local paths
5. **Reference Data**: Download the frozen reference bundle (~46 GB) from Zenodo (resumable,
checksum-verified, typically 15–30 min) and build the knock-knock container.
`setup.sh` reads `$SINGULARITY_CACHEDIR` for the container build — set it to a path with
sufficient space before running (unset = knock-knock build is skipped with a warning):
```bash
./scripts/setup.sh
export SINGULARITY_CACHEDIR="$(pwd)/miniwdl_cache/singularity_cache"
mkdir -p "${SINGULARITY_CACHEDIR}"
./scripts/setup.sh # bundle + knock-knock image
# or just the resources: ./scripts/fetch_resources.sh [--dbnsfp FILE]
```
Versions are pinned in `resources/manifest.tsv`. **dbNSFP is license-gated and not in the
bundle.** Obtain `dbNSFP5.3.1a_grch38.gz` + `dbNSFP5.3.1a_grch38.gz.tbi` from
[https://www.dbnsfp.org/download](https://www.dbnsfp.org/download) (look for the GRCh38
BGZF files under _"dbNSFP variants in BGZF format for VEP and SnpEff"_), then pass
`--dbnsfp` to either `setup.sh` or `fetch_resources.sh`:
```bash
./scripts/setup.sh --dbnsfp /path/to/dbNSFP5.3.1a_grch38.gz
```
Until then, `dbnsfp_file` entries in the somatic map are left as placeholders. Maintainers
rebuild the bundle for a new version with `./scripts/build_resource_bundle.sh`.

### Advanced Usage Examples
