
PacBio WGS Variant Pipeline

HiFi Human WGS CRISPR Edit QC Pipeline

Workflow for analyzing human PacBio whole genome sequencing (WGS) data with CRISPR editing quality control using Workflow Description Language (WDL).

This is a fork of the PacBio HiFi-human-WGS-WDL pipeline customized for CRISPR editing experiments. This fork adds edit-specific QC analysis while maintaining the core variant calling and analysis features.

  • Docker images used by this workflow are defined in the wdl-dockerfiles repo. Images are hosted in PacBio's quay.io repo.
  • Common tasks that may be reused within or between workflows are defined in the wdl-common repo. Note: In this fork, wdl-common is not actively maintained.

Workflow

Only the family workflow is supported in this fork. The family workflow is designed to analyze cohorts of related samples, which is ideal for CRISPR editing experiments where you typically have:

  • Parental (wild-type) samples
  • Edited clones derived from the parent
  • Optional additional family relationships

The workflow analyzes human PacBio HiFi whole genome sequencing data and includes specialized analysis for CRISPR edits when expected_edits are provided in the input configuration.

Workflow entrypoint: workflows/family.wdl

CRISPR Edit QC Features:

  • Detection and validation of expected edits (insertions, deletions, substitutions)
  • Comparison of edited samples against parental baseline
  • Integration with standard germline and somatic variant calling
  • Support for complex edits including knock-ins with payload sequences

Setup

This fork uses git submodules for common tasks. To clone the repository with submodules:

git clone --recurse-submodules --depth=1 \
  https://github.com/yttria-aniseia/HiFi-human-WGS-editing-QC-WDL.git

For biohub-specific setup instructions including conda environment setup, reference data download, and running the pipeline, see docs/biohub-setup.md.

Resource requirements

The most resource-heavy step in the workflow requires 64 CPU cores and 256 GB of RAM. Ensure that the backend environment you're using has enough quota to run the workflow.

On some backends, you may be able to make use of a GPU to accelerate the DeepVariant step. The GPU is not required, but it can significantly speed up the workflow. If you have access to a GPU, you can set the gpu parameter to true in the inputs JSON file.
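As a minimal sketch of toggling that flag programmatically: the text above only names the parameter gpu, so the fully qualified key humanwgs_family.gpu used here is an assumption (it follows the humanwgs_family. prefix that the other workflow inputs in this README use). Check the family documentation for the exact key before relying on it.

```python
import json


def enable_gpu(inputs: dict, key: str = "humanwgs_family.gpu") -> dict:
    """Return a copy of the inputs with the GPU flag set to true.

    The default key name is an assumption, not confirmed by this README.
    """
    updated = dict(inputs)  # shallow copy so the caller's dict is untouched
    updated[key] = True
    return updated


# Illustrative inputs; the path is a placeholder.
inputs = {"humanwgs_family.ref_map_file": "/path/to/GRCh38.ref_map.tsv"}
print(json.dumps(enable_gpu(inputs), indent=2))
```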

Reference datasets and associated workflow files

Reference datasets are hosted publicly for use in the pipeline. For data locations, see the backend-specific documentation and the template input files for each backend, which have the paths to the publicly hosted reference files already filled in.

Setting up and executing the workflow

  1. Select a backend environment
  2. Configure a workflow execution engine in the chosen environment
  3. Fill out the inputs JSON file for your cohort
  4. Run the workflow

Selecting a backend

The workflow can be run on Azure, AWS, GCP, or HPC. Your choice of backend will largely be determined by the location of your data.

For backend-specific configuration, see the documentation for your chosen backend (Azure, AWS, GCP, or HPC).

Configuring a workflow engine and container runtime

An execution engine is required to run workflows. Two popular engines for running WDL-based workflows are miniwdl and Cromwell.

Because workflow dependencies are containerized, a container runtime is required. This workflow has been tested with Docker and Singularity container runtimes.

See the backend-specific documentation for details on setting up an engine.

| Engine | Azure | AWS | GCP | HPC |
| --- | --- | --- | --- | --- |
| miniwdl | Unsupported | Supported via AWS HealthOmics | Unsupported | (SLURM only) Supported via the miniwdl-slurm plugin |
| Cromwell | Supported via Cromwell on Azure | Unsupported | Supported via Google's Pipelines API | Supported - Configuration varies depending on HPC infrastructure |
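The support matrix above can also be kept as a small lookup, for example in a wrapper script that picks an engine for a given backend. This is only an illustrative transcription of the table; the SUPPORT dictionary and engines_for function are hypothetical names, not part of this repository.

```python
# Engine/backend support, transcribed from the table above.
# None means "unsupported"; a string is the note given in the table.
SUPPORT = {
    "miniwdl": {
        "Azure": None,
        "AWS": "via AWS HealthOmics",
        "GCP": None,
        "HPC": "SLURM only, via the miniwdl-slurm plugin",
    },
    "Cromwell": {
        "Azure": "via Cromwell on Azure",
        "AWS": None,
        "GCP": "via Google's Pipelines API",
        "HPC": "configuration varies depending on HPC infrastructure",
    },
}


def engines_for(backend: str) -> list[str]:
    """List the engines the table marks as supported on the given backend."""
    return [
        engine
        for engine, backends in SUPPORT.items()
        if backends.get(backend) is not None
    ]


for backend in ("Azure", "AWS", "GCP", "HPC"):
    print(backend, "->", engines_for(backend))
```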

Filling out the inputs JSON

The input to a workflow run is defined in JSON format. Use example_input_config.json as a template.

Key steps:

  1. Define your samples with their HiFi read BAM files
  2. Specify family relationships (parent/child)
  3. For CRISPR-edited samples, provide an expected_edits JSON file path
  4. Point to your local reference map files (after running ./scripts/setup.sh)

See docs/biohub-setup.md for detailed instructions.
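The key steps above can be sketched as a small script that assembles the configuration for a parent and one edited clone. The field names mirror the example input configuration in this README; the helper functions make_sample and make_inputs are hypothetical, and all paths are placeholders. Treat it as a starting point, not as the fork's own tooling.

```python
import json


def make_sample(sample_id, hifi_reads, sex, affected,
                mother_id=None, expected_edits=None):
    """Build one sample entry; optional fields are added only when given."""
    sample = {
        "sample_id": sample_id,
        "hifi_reads": list(hifi_reads),
        "sex": sex,
        "affected": affected,
    }
    if mother_id is not None:
        sample["mother_id"] = mother_id
    if expected_edits is not None:
        sample["expected_edits"] = expected_edits
    return sample


def make_inputs(family_id, samples, ref_map, tertiary_map, somatic_map):
    """Assemble the top-level inputs dictionary for the family workflow."""
    return {
        "humanwgs_family.family": {
            "family_id": family_id,
            "samples": list(samples),
        },
        "humanwgs_family.ref_map_file": ref_map,
        "humanwgs_family.tertiary_map_file": tertiary_map,
        "humanwgs_family.somatic_map_file": somatic_map,
    }


config = make_inputs(
    "EXAMPLE_FAM",
    [
        make_sample("parent", ["/path/to/parent.bam"], "FEMALE", False),
        make_sample("edited_clone1", ["/path/to/clone1.bam"], "FEMALE", True,
                    mother_id="parent",
                    expected_edits="/path/to/edits.json"),
    ],
    "/path/to/GRCh38.ref_map.tsv",
    "/path/to/GRCh38.tertiary_map.tsv",
    "/path/to/GRCh38.somatic_map.tsv",
)
print(json.dumps(config, indent=2))
```

Writing the result with json.dump to a file such as my_input_config.json would produce a file that can be passed to the launcher or to miniwdl.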

Automated workflow setup: This fork includes a launch.sh script that automates file staging and workflow setup. See scripts/README.md for details.

Running the workflow

Recommended approach using the automated launcher script:

# Set up and stage inputs, then run the workflow automatically
./scripts/launch.sh my_input_config.json --work-dir my_analysis_name --run

# Or set up only and run the workflow manually later
./scripts/launch.sh my_input_config.json --work-dir my_analysis_name
conda activate hifi-wgs
bash my_analysis_name/run_workflow.sh

See scripts/README.md for detailed launcher script documentation.

Run directly using miniwdl (HPC with SLURM)

If not using the launcher script, you can run miniwdl directly:

miniwdl run --verbose \
  --cfg miniwdl.cfg \
  --dir output_directory \
  workflows/family.wdl \
  -i input_config.json
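If you drive runs from a script rather than the shell, the same invocation can be built as an argument list. This is a sketch that only reproduces the flags shown in the command above; the function names miniwdl_command and run_workflow are hypothetical.

```python
import shlex
import subprocess


def miniwdl_command(inputs_json, output_dir,
                    cfg="miniwdl.cfg", wdl="workflows/family.wdl"):
    """Return the miniwdl invocation shown above as an argument list."""
    return [
        "miniwdl", "run", "--verbose",
        "--cfg", cfg,
        "--dir", output_dir,
        wdl,
        "-i", inputs_json,
    ]


def run_workflow(inputs_json, output_dir):
    """Launch miniwdl and raise CalledProcessError if it exits non-zero."""
    subprocess.run(miniwdl_command(inputs_json, output_dir), check=True)


# Print the command for inspection instead of launching it.
print(shlex.join(miniwdl_command("input_config.json", "output_directory")))
```

Passing a list to subprocess.run avoids shell quoting problems when paths contain spaces.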

Note: This fork is primarily tested with miniwdl on HPC/SLURM environments. Support for other backends (Azure, GCP, AWS) may be limited.

Workflow inputs

Workflow inputs for the family entrypoint are described in the family documentation.

At a high level, we have two types of input files:

  • Map files (TSV format) describe reference data and resources used for every workflow execution:

    • ref_map_file: Reference genome FASTA, indices, and core annotation files
    • tertiary_map_file: Population VCFs, SV databases, and tertiary analysis resources
    • somatic_map_file: Somatic variant calling resources
  • Input configuration JSON describes the samples to analyze and their relationships:

    • Sample metadata (ID, sex, affected status, family relationships)
    • Paths to HiFi read BAM files
    • For CRISPR editing experiments: expected_edits field defining anticipated genomic changes

Example input configuration:

{
  "humanwgs_family.family": {
    "family_id": "EXAMPLE_FAM",
    "samples": [
      {
        "sample_id": "parent",
        "hifi_reads": ["/path/to/parent.bam"],
        "sex": "FEMALE",
        "affected": false
      },
      {
        "sample_id": "edited_clone1",
        "hifi_reads": ["/path/to/clone1.bam"],
        "sex": "FEMALE",
        "affected": true,
        "mother_id": "parent",
        "expected_edits": "/path/to/edits.json"
      }
    ]
  },
  "humanwgs_family.ref_map_file": "/path/to/GRCh38.ref_map.tsv",
  "humanwgs_family.tertiary_map_file": "/path/to/GRCh38.tertiary_map.tsv",
  "humanwgs_family.somatic_map_file": "/path/to/GRCh38.somatic_map.tsv"
}
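Before launching a long run, it can help to sanity-check a configuration like the one above. The checks below (unique sample IDs, at least one BAM per sample, mother_id pointing at a defined sample, all three map files present) are assumptions about what is worth catching early; they are not the workflow's own validation, and validate_family is a hypothetical helper.

```python
MAP_KEYS = (
    "humanwgs_family.ref_map_file",
    "humanwgs_family.tertiary_map_file",
    "humanwgs_family.somatic_map_file",
)


def validate_family(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    samples = config.get("humanwgs_family.family", {}).get("samples", [])
    ids = [s.get("sample_id") for s in samples]

    if len(ids) != len(set(ids)):
        problems.append("duplicate sample_id values")

    for sample in samples:
        sid = sample.get("sample_id", "<missing sample_id>")
        if not sample.get("hifi_reads"):
            problems.append(f"{sid}: no hifi_reads listed")
        mother = sample.get("mother_id")
        if mother is not None and mother not in ids:
            problems.append(f"{sid}: mother_id '{mother}' is not a defined sample")

    for key in MAP_KEYS:
        if key not in config:
            problems.append(f"missing input: {key}")
    return problems


# Deliberately broken example: unknown mother and no map files.
broken = {
    "humanwgs_family.family": {
        "family_id": "EXAMPLE_FAM",
        "samples": [
            {"sample_id": "edited_clone1",
             "hifi_reads": ["/path/to/clone1.bam"],
             "mother_id": "parent"},
        ],
    }
}
for problem in validate_family(broken):
    print(problem)
```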

See example_expected_edit.json for the expected edit file format, or use the provided genbank_to_crispr_json.py helper script to generate edit descriptions from GenBank files.

The resource bundle containing the GRCh38 reference and other files used in this workflow can be downloaded from Zenodo:

10.5281/zenodo.17086906

Template map files are provided at the repository root: GRCh38.ref_map.v3p1p0.template.tsv, GRCh38.tertiary_map.v3p1p0.template.tsv, and GRCh38.somatic_map.v3p1p0.template.tsv. After downloading the reference bundle, update the paths in these templates to point to your local copies.
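After editing the templates, a quick check that every referenced path exists can catch typos early. The sketch below assumes each map file row is a tab-separated name and value, with file paths given as absolute paths; that layout is an assumption, so compare it against the templates before use. The missing_paths helper is hypothetical.

```python
import csv
import os


def missing_paths(lines):
    """Return (name, path) pairs for absolute paths that do not exist.

    `lines` is any iterable of TSV lines, such as an open file object.
    Rows with fewer than two columns and non-path values are skipped.
    """
    missing = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue
        name, value = row[0], row[1]
        if value.startswith("/") and not os.path.exists(value):
            missing.append((name, value))
    return missing


# Illustrative rows only; real map files come from the provided templates.
example_rows = [
    "name\tGRCh38",
    "fasta\t/nonexistent/path/to/reference.fasta",
]
for name, path in missing_paths(example_rows):
    print(f"{name}: {path} not found")
```

To check a real file, open it and pass the file object, for example with open("GRCh38.ref_map.tsv") as handle: missing_paths(handle).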

Tool versions and Docker images

Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio's quay.io repo. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.

The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task. Images can be looked up by the name that appears after the final / character and before @sha256:.... For example, the image referred to here is "pb_wdl_base":

~{runtime_attributes.container_registry}/pb_wdl_base@sha256:4b889a1f ... b70a8e87
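The naming rule described above (the text after the final / and before @sha256:) is easy to apply in code. The helper name image_name is hypothetical, and the digest in the example is a placeholder rather than a real value.

```python
def image_name(reference: str) -> str:
    """Extract the image name from a digest-pinned container reference."""
    last_component = reference.rsplit("/", 1)[-1]    # drop registry and path
    return last_component.split("@sha256:", 1)[0]    # drop the digest suffix


ref = "~{runtime_attributes.container_registry}/pb_wdl_base@sha256:<digest>"
print(image_name(ref))
```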

Tool versions and Docker images used in these workflows can be found in the tools and containers documentation.


DISCLAIMER

TO THE GREATEST EXTENT PERMITTED BY APPLICABLE LAW, THIS WEBSITE AND ITS CONTENT, INCLUDING ALL SOFTWARE, SOFTWARE CODE, SITE-RELATED SERVICES, AND DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. ALL WARRANTIES ARE REJECTED AND DISCLAIMED. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THE FOREGOING. PACBIO IS NOT OBLIGATED TO PROVIDE ANY SUPPORT FOR ANY OF THE FOREGOING, AND ANY SUPPORT PACBIO DOES PROVIDE IS SIMILARLY PROVIDED WITHOUT REPRESENTATION OR WARRANTY OF ANY KIND. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A REPRESENTATION OR WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACBIO.
