CPG-Flow - Long Read Sequencing Annotation for seqr

Version 0.2.0

A CPG workflow for creating annotated callsets from long-read data, using the cpg-flow pipeline framework.

Purpose

This workflow is designed to process long-read sequencing data and create callsets compatible with seqr. It automates the steps required to query, reformat, annotate, and export VCF files derived from long-read sequencing. It also supports the conversion of BAM files to CRAMs.

It is intended for SNPs/Indels VCFs and SVs VCFs produced by long-read sequencing technologies such as PacBio and Oxford Nanopore. The inputs are configured by the user, which gives flexibility in the types of long-read data queried and processed.

Workflow Overview

Annotation

Query Metamist for long-read sequencing groups (SGs) and their VCF analyses, based on the filters specified in the configuration. If the SGs from the input cohorts do not have VCFs matching the filter criteria, the workflow will fail.
Perform the necessary reformatting, reheadering, and normalization of the VCFs
Merge the VCFs and annotate the merged VCF with VEP (for SNPs/Indels) or StrVCTVRE (for SVs)
Write the annotated VCF to a Hail MatrixTable
Export the MatrixTable to an Elasticsearch index
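
The fail-fast behaviour in the first step can be sketched as below. This is a minimal illustration, not the workflow's actual code: the function names, the sequencing group IDs, and the bucket paths are hypothetical, and it assumes the workflow fails as soon as any one sequencing group lacks a matching VCF.

```python
def find_missing_vcfs(sequencing_groups, vcfs_by_sg):
    """Return the IDs of sequencing groups with no VCF matching the filters."""
    return sorted(sg for sg in sequencing_groups if not vcfs_by_sg.get(sg))


def require_vcfs(sequencing_groups, vcfs_by_sg):
    """Raise ValueError if any sequencing group lacks a matching VCF."""
    missing = find_missing_vcfs(sequencing_groups, vcfs_by_sg)
    if missing:
        raise ValueError(
            f"No matching VCF for sequencing groups: {', '.join(missing)}"
        )
    return {sg: vcfs_by_sg[sg] for sg in sequencing_groups}


# Hypothetical IDs and paths, for illustration only.
sgs = ["SG01", "SG02"]
vcfs = {"SG01": ["gs://example-bucket/SG01.vcf.gz"], "SG02": []}
print(find_missing_vcfs(sgs, vcfs))
```

Checking every sequencing group before any job is queued means a misconfigured filter surfaces immediately rather than midway through a merge.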

Conversion

Query Metamist for long-read sequencing groups and their BAM assays
Convert BAM files to CRAM files using samtools
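
As a rough sketch of the conversion step, the snippet below builds a `samtools view` command that writes CRAM output. It assumes an aligned BAM and a reference FASTA; the helper name, file names, and thread count are hypothetical, and the workflow's real job may pass different options.

```python
import shlex


def bam_to_cram_command(bam_path, cram_path, reference_fasta, threads=4):
    """Build a samtools command that converts a BAM file to CRAM."""
    return [
        "samtools", "view",
        "-C",                   # write CRAM output
        "-T", reference_fasta,  # reference FASTA used for CRAM encoding
        "-@", str(threads),     # additional compression threads
        "-o", cram_path,        # output file
        bam_path,               # input BAM
    ]


# Hypothetical file names, for illustration only.
cmd = bam_to_cram_command("sample.bam", "sample.cram", "ref.fasta")
print(shlex.join(cmd))
```

Building the command as a list and joining it with `shlex.join` keeps paths containing spaces or shell metacharacters safely quoted.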

Directory Structure

src
├── lrs_annotation
│   ├── __init__.py
│   ├── run_workflow.py
│   ├── lrs_annotation.toml
│   ├── bam_to_cram_stages.py
│   ├── bam_to_cram_stages.toml
│   ├── snps_indels_annotation_stages.py
│   ├── snps_indels_annotation.toml
│   ├── svs_annotation_stages.py
│   ├── svs_annotation.toml
│   ├── inputs.py
│   ├── utils.py
│   ├── jobs
│   │   ├── snps_indels
│   │   │   ├── AnnotateCohortMatrixtable.py
│   │   │   ├── AnnotateDatasetMatrixtable.py
│   │   │   └── ...
│   │   └── svs
│   │       ├── AnnotateCohortMatrixtable.py
│   │       ├── AnnotateDatasetMatrixtable.py
│   │       └── ...
│   ├── scripts
│   │   ├── snps_indels
│   │   │   ├── __init__.py
│   │   │   ├── annotate_cohort_snps_indels.py
│   │   │   └── ...
│   │   └── svs
│   │       ├── __init__.py
│   │       ├── annotate_cohort_svs.py
│   │       └── ...
│   ├── hail_scripts/computed_fields
│   │   ├── __init__.py
│   │   ├── variant_id.py
│   │   └── vep.py

Key Components

lrs_annotation.toml contains the main configuration for the workflow, including a number of mandatory options shared between the SNPs/Indels and SVs workflows.

snps_indels_annotation_stages.py and svs_annotation_stages.py define the Stages for each workflow; the job logic itself is imported from the modules under jobs/.

snps_indels_annotation.toml and svs_annotation.toml are config files to be submitted with the workflow via analysis-runner. They hold the workflow settings, such as filters for fetching input VCFs, resource allocation for specific jobs, and references.

scripts/ contains the scripts required by the workflows, often run as Query on Batch jobs.

hail_scripts/computed_fields/ contains computed fields required for formatting the callsets for seqr, borrowed from the seqr-loading-pipelines repository.
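
As an example of what a computed field looks like, the sketch below builds a variant identifier in plain Python. It assumes the `contig-position-ref-alt` convention with the `chr` prefix removed; the real variant_id.py operates on Hail expressions rather than Python strings, and its exact behaviour should be checked against the source.

```python
def variant_id(contig, position, ref, alt):
    """Build a 'contig-position-ref-alt' identifier (assumed convention)."""
    contig = contig.removeprefix("chr")  # drop the 'chr' prefix if present
    return f"{contig}-{position}-{ref}-{alt}"


# Hypothetical variant, for illustration only.
print(variant_id("chr1", 12345, "A", "T"))
```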

inputs.py contains functions to query Metamist for long-read sequencing groups and their VCF analyses, as well as functions to fetch the necessary VCFs based on the configuration.

utils.py contains utility functions used across the workflows, such as parsing command-line arguments, submitting Cromwell jobs, and reading the configuration.

Repository: populationgenomics/cpg-flow-lrs-annotation
