GitHub - veg/CAPHEINE: Automated evolutionary annotation of existing datasets using Nextflow

Introduction

CAPHEINE is a bioinformatics pipeline designed for comparative analysis of protein-coding genes using the HyPhy software suite. The pipeline ingests FASTA files containing raw DNA sequences along with FASTA files containing reference gene sequences, and performs multiple sequence alignment, phylogenetic tree construction, and various selection analyses. Key outputs include statistical tests for positive selection (BUSTED, FEL, MEME), branch-site models, and comprehensive quality control reports, all presented in an easy-to-interpret MultiQC report.

The pipeline runs the following steps:

- Ambiguous sequence removal
- Multiple sequence alignment (cawlign)
- Sequence deduplication and cleaning (HyPhy CLN)
- Phylogenetic tree construction (IQ-TREE)
- Selection analyses using HyPhy:
  - FEL (Fixed Effects Likelihood)
  - MEME (Mixed Effects Model of Evolution)
  - PRIME (Probabilistic Inference of Molecular Evolution)
  - BUSTED (Branch-Site Unrestricted Statistical Test for Episodic Diversification)
- Optional branch-specific analyses when foreground branches are specified:
  - Contrast-FEL
  - RELAX
- Collate and summarize results for all genes and analyses (DRHIP)

While CAPHEINE is not an official nf-core pipeline, it benefits from the nf-core ecosystem in several ways:

Development Standards: Built using the nf-core template and follows nf-core best practices
Modular Design: Uses the nf-core module system for maintainability
Containerization: Leverages BioContainers for reproducible analyses
Testing Framework: Utilizes nf-test for pipeline validation

This relationship lets CAPHEINE maintain high standards of code quality and interoperability while keeping the pipeline's development roadmap and scope flexible. The current scope is viral non-recombinant data, but the pipeline is designed so that it can be extended to other types of data.

Usage

First, ensure that you have Nextflow (version 25.10.0 or later) installed on your system. You can follow the Nextflow installation guide to get started.
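
As a quick sanity check of the version requirement above, a shell sketch along these lines may help. It assumes that `sort -V` (version sort) is available and that `nextflow -version` prints a banner containing an `X.Y.Z` version string; the helper name `version_ge` is hypothetical and not part of CAPHEINE.

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Relies on `sort -V` (version sort), found in GNU coreutils and recent BSD sort.
version_ge() {
    lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
    [ "$lowest" = "$2" ]
}

required="25.10.0"   # minimum Nextflow version stated above

if command -v nextflow >/dev/null 2>&1; then
    # Take the first X.Y.Z token from the version banner (assumed output format).
    installed="$(nextflow -version 2>&1 | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)"
    if version_ge "$installed" "$required"; then
        echo "Nextflow $installed is recent enough"
    else
        echo "Nextflow $installed is older than $required; please upgrade"
    fi
else
    echo "nextflow not found on PATH"
fi
```

The comparison is done on version strings rather than plain text so that, for example, 25.4.0 is correctly treated as older than 25.10.0.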

You will also need to set up a container environment, such as Docker or Singularity, to select with the -profile option.

Once Nextflow and your chosen container environment are installed, you can run CAPHEINE directly via Nextflow. Nextflow downloads the pipeline automatically at runtime when you use the following command:

nextflow run veg/CAPHEINE \
    <args>

Where <args> are the arguments you want to pass to the pipeline. For example, to run the pipeline with the default parameters, you can use:

nextflow run veg/CAPHEINE \
    --reference_genes <reference_genes.fasta> \
    --unaligned_seqs <unaligned_seqs.fasta> \
    --outdir <OUTDIR>

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

Input Parameters

The main input parameters for the CAPHEINE pipeline are:

| Parameter | Description | Required |
| --- | --- | --- |
| `--reference_genes` | Path to FASTA file of gene reference sequences | Yes |
| `--unaligned_seqs` | Path to FASTA file of unaligned DNA sequences | Yes |
| `--outdir` | Output directory for results | Yes |
| `--test_branches` | Branches to test for HyPhy analyses, either 'internal' or 'all'. Usually set to 'internal' for viral non-recombinant data, to avoid testing non-fixed substitutions in leaf nodes. If used with `--foreground_list` or `--foreground_regexp`, HyPhy will test foreground and background internal branches. If unset, HyPhy defaults to all branches for all analyses. | No |
| `--use_mpi` | Boolean. Run MPI-enabled HyPhy analyses (FEL, MEME, PRIME, and Contrast-FEL when applicable). BUSTED and RELAX run without MPI. Default: false. | No |
| `--foreground_list` | Path to a text file with a newline-separated list of foreground taxa. Only one of `foreground_list` or `foreground_regexp` should be provided per run. | No |
| `--foreground_regexp` | Regular expression to match foreground taxa. Only one of `foreground_list` or `foreground_regexp` should be provided per run. | No |
| `--email` | Email address for completion summary | No |
| `--multiqc_title` | Title for the MultiQC report | No |
| `--validate_params` | Boolean. Validate parameters against the schema at runtime. Default: true. | No |
| `--monochrome_logs` | Boolean. Do not use colored log outputs. | No |
| `--hook_url` | URL for notification hooks (if used) | No |
| `-params-file` | YAML/JSON file specifying parameters (recommended for reproducibility) | No |

Additional advanced and institutional config parameters are available; see the documentation for details.

In general, you can run the pipeline with:

nextflow run veg/CAPHEINE \
   -profile <docker/singularity/.../institute> \
   --reference_genes <reference_genes.fasta> \
   --unaligned_seqs <unaligned_seqs.fasta> \
   --outdir <OUTDIR>

Where:

reference_genes: Path to FASTA file of gene reference sequences.
unaligned_seqs: Path to FASTA file of unaligned DNA sequences.
outdir: Output directory for results.

You can also provide additional parameters:

test_branches: (Optional) Branch selection for HyPhy analyses. Use internal to test only internal branches, or all to test all branches. We suggest setting this to internal for viral non-recombinant data, to avoid testing non-fixed substitutions in leaf nodes. If used with --foreground_list or --foreground_regexp HyPhy will test foreground and background internal branches. If unset, no flag is passed and HyPhy defaults to all branches.
use_mpi: (Optional) Enable MPI-enabled HyPhy analyses for faster runs on multi-core nodes. FEL, MEME, PRIME, and Contrast-FEL (when foreground branches are supplied) are embarrassingly parallel and use MPI. BUSTED and RELAX are not easily parallelized and do not use MPI; when use_mpi is enabled, both analyses still run as usual and produce their normal output files. Requires a container runtime or environment with MPI. The HYPHYMPI binary is bundled with the Docker containers and the conda packages, and should be available by default. Default: false.
foreground_list: (Optional) Path to a text file with a newline-separated list of foreground taxa.
foreground_regexp: (Optional) Regular expression to match foreground taxa.

Only one of foreground_list or foreground_regexp should be provided per run.
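
To illustrate the two ways of naming foreground taxa, here is a minimal sketch that builds a newline-separated list file and checks that a regular expression selects the same names. The taxon names and file names are made up for the example, and `grep -E` only approximates whatever regular expression engine the pipeline applies, so treat the result as a rough preview rather than a guarantee.

```shell
workdir="$(mktemp -d)"

# Option 1: a newline-separated list of foreground taxa (names are made up).
cat > "$workdir/foreground_taxa.txt" <<'EOF'
Homo_sapiens_1
Homo_sapiens_2
EOF

# Sequence names as they might appear in the input FASTA (also made up).
printf '%s\n' Homo_sapiens_1 Homo_sapiens_2 Pan_troglodytes_1 > "$workdir/all_taxa.txt"

# Option 2: a regular expression intended to select the same taxa.
foreground_regexp='^Homo.*'
matched="$(grep -E "$foreground_regexp" "$workdir/all_taxa.txt")"

# Compare the regexp matches against the explicit list.
if [ "$matched" = "$(cat "$workdir/foreground_taxa.txt")" ]; then
    echo "List and regexp select the same foreground taxa"
else
    echo "List and regexp disagree"
fi
```

Either the list file would be passed with `--foreground_list` or the expression with `--foreground_regexp`, not both.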

You can also run CAPHEINE using a parameter file (recommended for reproducibility):

nextflow run veg/CAPHEINE \
   -profile <docker/singularity/.../institute> \
   -params-file params.yaml

Where params.yaml might contain:

reference_genes: "./reference_genes.fasta"
unaligned_seqs: "./unaligned_seqs.fasta"
outdir: "./results/"
# Optional parameters
# test_branches: internal   # or 'all'; if unset, HyPhy runs on all branches by default
# foreground_list: "./foreground_taxa.txt"
# foreground_regexp: "^Homo.*"
# use_mpi: true              # enable MPI-enabled HyPhy (default: false)
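
Before launching a run, it can be handy to confirm that a parameter file sets the three required parameters. The sketch below is a hypothetical pre-check, not part of CAPHEINE: the function name `missing_params` is an assumption, and it only looks for uncommented top-level `key: value` lines in a simple YAML file. The pipeline's own schema validation (`--validate_params`) remains the authoritative check.

```shell
# Hypothetical helper: print each required key that is absent from a params file.
missing_params() {
    params_file="$1"
    for key in reference_genes unaligned_seqs outdir; do
        # A key counts as present only if it starts an uncommented line
        # and is followed by a non-empty value.
        if ! grep -Eq "^${key}:[[:space:]]*[^[:space:]#]" "$params_file"; then
            printf '%s\n' "$key"
        fi
    done
}

# Demonstration with a deliberately incomplete file (outdir is commented out).
demo_dir="$(mktemp -d)"
cat > "$demo_dir/params.yaml" <<'EOF'
reference_genes: "./reference_genes.fasta"
unaligned_seqs: "./unaligned_seqs.fasta"
# outdir: "./results/"
EOF

missing="$(missing_params "$demo_dir/params.yaml")"
echo "Missing required parameters: ${missing:-none}"
```

For the demonstration file this reports `outdir` as missing; a complete file would report `none`.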

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters.

For more details and further functionality, please refer to the CAPHEINE usage documentation and the nf-core custom configuration documentation.

Testing the pipeline

To test the pipeline, you can run it with the -profile test option. This will run the pipeline with a minimal test dataset to check that it completes without any syntax errors.

nextflow run veg/CAPHEINE \
-profile test,docker \
--outdir <OUTDIR>

Pipeline output

For more details about the output files and reports, please refer to the output documentation.

Specifying resources

HyPhy analyses are light on memory, but can be heavy on CPU usage for longer alignments. We have set sensible defaults for alignments with roughly 800 sites and 1500 branches in the hyphy profile. You may wish to customize that profile, or write your own cluster-specific configuration file, if your data or environment require different resources.

For non-HyPhy analysis modules, the pipeline uses nf-core standard process labels (e.g., process_single, process_low, process_medium, process_high) wherever possible to improve compatibility and interpretability with existing nf-core infrastructure. These labels define default CPU and memory allocations that can be easily customized in your configuration files. See the nf-core documentation on process labels for more information.
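
As a sketch of how such a customization might look, the snippet below writes a small Nextflow config file that overrides the resources for one nf-core label. The file name and the CPU and memory values are placeholders chosen for illustration, and whether a given CAPHEINE module carries the `process_medium` label is an assumption to verify against the pipeline's own configuration.

```shell
# Write a hypothetical override file; the CPU and memory values are placeholders.
config_file="$(mktemp -d)/custom_resources.config"
cat > "$config_file" <<'EOF'
process {
    // Applies to every process carrying the nf-core 'process_medium' label.
    withLabel: 'process_medium' {
        cpus   = 4
        memory = 8.GB
    }
}
EOF
echo "Wrote $config_file"

# To use it (not executed here), pass the file with Nextflow's -c option:
#   nextflow run veg/CAPHEINE -profile docker -c "$config_file" -params-file params.yaml
```

Resource settings like these belong in a config file passed with -c, while pipeline parameters stay on the command line or in the -params-file.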

You can specify the resources to be used by the pipeline using the -profile option. For example, to run the pipeline with the default 16 CPUs and 6 GB of memory per HyPhy process, you can use:

nextflow run veg/CAPHEINE \
-profile hyphy,<docker/singularity/.../institute> \
--outdir <OUTDIR>

When running with --use_mpi, each MPI job uses the number of CPUs configured for the process (mpirun -np $task.cpus).

See usage documentation for more information about running CAPHEINE on your own system and best practices for writing your own profiles.

Credits

CAPHEINE was originally written by Hannah Verdonk and Danielle Callan.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch by creating a GitHub issue!

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
