CAPHEINE is a bioinformatics pipeline designed for comparative analysis of protein-coding genes using the HyPhy software suite. The pipeline ingests FASTA files containing raw DNA sequences along with FASTA files containing reference gene sequences, and performs multiple sequence alignment, phylogenetic tree construction, and various selection analyses. Key outputs include statistical tests for positive selection (BUSTED, FEL, MEME), branch-site models, and comprehensive quality control reports, all presented in an easy-to-interpret MultiQC report.
- Ambiguous sequence removal
- Multiple sequence alignment (cawlign)
- Sequence deduplication and cleaning (HyPhy CLN)
- Phylogenetic tree construction (IQ-TREE)
- Selection analyses using HyPhy:
- Optional branch-specific analyses when foreground branches are specified:
- Collate and summarize results for all genes and analyses (DRHIP)
While CAPHEINE is not an official nf-core pipeline, it benefits from the nf-core ecosystem in several ways:
- Development Standards: Built using the nf-core template and follows nf-core best practices
- Modular Design: Uses the nf-core module system for maintainability
- Containerization: Leverages BioContainers for reproducible analyses
- Testing Framework: Utilizes nf-test for pipeline validation
This relationship lets CAPHEINE maintain high standards of code quality and interoperability while keeping the pipeline's development roadmap and scope flexible. The current scope is viral non-recombinant data, but the pipeline is designed so that it can be extended to other types of data.
First, ensure that you have Nextflow (version 25.10.0 or later) installed on your system. You can follow the Nextflow installation guide to get started.
You will also need to set up one of the following container environments:
Once Nextflow and your chosen container environment are installed, you can run CAPHEINE directly via Nextflow. The pipeline will be automatically downloaded at runtime using the following command:
```bash
nextflow run veg/CAPHEINE \
  <args>
```

Where `<args>` are the arguments you want to pass to the pipeline. For example, to run the pipeline with the default parameters, you can use:
```bash
nextflow run veg/CAPHEINE \
  --reference_genes <reference_genes.fasta> \
  --unaligned_seqs <unaligned_seqs.fasta> \
  --outdir <OUTDIR>
```

> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with `-profile test` before running the workflow on actual data.
The main input parameters for the CAPHEINE pipeline are:
| Parameter | Description | Required |
|---|---|---|
| `--reference_genes` | Path to FASTA file of gene reference sequences | Yes |
| `--unaligned_seqs` | Path to FASTA file of unaligned DNA sequences | Yes |
| `--outdir` | Output directory for results | Yes |
| `--test_branches` | Branches to test for HyPhy analyses, either 'internal' or 'all'. Usually set to 'internal' for viral non-recombinant data, to avoid testing non-fixed substitutions in leaf nodes. If used with `--foreground_list` or `--foreground_regexp`, HyPhy will test foreground and background internal branches. If unset, HyPhy defaults to all branches for all analyses. | No |
| `--use_mpi` | Boolean. Run MPI-enabled HyPhy analyses (FEL, MEME, PRIME, and Contrast-FEL when applicable). BUSTED and RELAX run without MPI. Default: false. | No |
| `--foreground_list` | Path to a text file with a newline-separated list of foreground taxa. Provide either `--foreground_list` or `--foreground_regexp`, not both. | No |
| `--foreground_regexp` | Regular expression to match foreground taxa. Provide either `--foreground_list` or `--foreground_regexp`, not both. | No |
| `--email` | Email address for completion summary | No |
| `--multiqc_title` | Title for the MultiQC report | No |
| `--validate_params` | Boolean. Validate parameters against the schema at runtime. Default: true. | No |
| `--monochrome_logs` | Boolean. Do not use colored log outputs. | No |
| `--hook_url` | URL for notification hooks (if used) | No |
| `-params-file` | YAML/JSON file specifying parameters (recommended for reproducibility) | No |
Additional advanced and institutional config parameters are available; see the documentation for details.
In general, you can run the pipeline with:
```bash
nextflow run veg/CAPHEINE \
  -profile <docker/singularity/.../institute> \
  --reference_genes <reference_genes.fasta> \
  --unaligned_seqs <unaligned_seqs.fasta> \
  --outdir <OUTDIR>
```

Where:

- `reference_genes`: Path to FASTA file of gene reference sequences.
- `unaligned_seqs`: Path to FASTA file of unaligned DNA sequences.
- `outdir`: Output directory for results.
You can also provide additional parameters:
- `test_branches`: (Optional) Branch selection for HyPhy analyses. Use `internal` to test only internal branches, or `all` to test all branches. We suggest setting this to `internal` for viral non-recombinant data, to avoid testing non-fixed substitutions in leaf nodes. If used with `--foreground_list` or `--foreground_regexp`, HyPhy will test foreground and background internal branches. If unset, no flag is passed and HyPhy defaults to all branches.
- `use_mpi`: (Optional) Enable MPI-enabled HyPhy analyses for faster runs on multi-core nodes. FEL, MEME, PRIME, and Contrast-FEL (when foreground branches are supplied) are embarrassingly parallel and use MPI. BUSTED and RELAX are not easily parallelized and do not use MPI; when `use_mpi` is enabled, both analyses still run as usual and produce their normal output files. Requires a container runtime or environment with MPI. The `HYPHYMPI` binary is bundled with the Docker containers and the conda packages, and should be available by default. Default: false.
- `foreground_list`: (Optional) Path to a text file with a newline-separated list of foreground taxa.
- `foreground_regexp`: (Optional) Regular expression to match foreground taxa.

Provide either `foreground_list` or `foreground_regexp`, not both.
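As an illustration of the newline-separated format expected by `foreground_list`, the sketch below builds such a file from FASTA headers. It is a minimal example under stated assumptions, not part of the pipeline: the file names `example_seqs.fasta` and `foreground_taxa.txt` and the taxon names are hypothetical, and it assumes that foreground taxon names correspond to the first whitespace-delimited token of each FASTA header. The `^Homo.*` pattern is the example regular expression used elsewhere in this README.

```shell
# Hypothetical input: a tiny FASTA file standing in for your unaligned sequences.
cat > example_seqs.fasta <<'EOF'
>Homo_sapiens sample A
ATGGCC
>Homo_neanderthalensis sample B
ATGGCT
>Pan_troglodytes sample C
ATGGCA
EOF

# Take the header lines, strip the leading '>' and anything after the first
# whitespace (assumption: the remaining token is the taxon name), then keep
# only the names matching the example pattern.
grep '^>' example_seqs.fasta \
  | sed -e 's/^>//' -e 's/[[:space:]].*$//' \
  | grep -E '^Homo.*' > foreground_taxa.txt

# One taxon name per line, ready to pass via --foreground_list.
cat foreground_taxa.txt
```

Inspecting the generated file before a run is a quick way to confirm that a pattern selects the taxa you intend, whether you then pass the file with `--foreground_list` or the pattern itself with `--foreground_regexp`.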
You can also run CAPHEINE using a parameter file (recommended for reproducibility):
```bash
nextflow run veg/CAPHEINE \
  -profile <docker/singularity/.../institute> \
  -params-file params.yaml
```

Where `params.yaml` might contain:
```yaml
reference_genes: "./reference_genes.fasta"
unaligned_seqs: "./unaligned_seqs.fasta"
outdir: "./results/"
# Optional parameters
# test_branches: internal  # or 'all'; if unset, HyPhy runs on all branches by default
# foreground_list: "./foreground_taxa.txt"
# foreground_regexp: "^Homo.*"
# use_mpi: true  # enable MPI-enabled HyPhy (default: false)
```

> [!WARNING]
> Please provide pipeline parameters via the CLI or the Nextflow `-params-file` option. Custom config files, including those provided by the `-c` Nextflow option, can be used to provide any configuration except for parameters.
For more details and further functionality, please refer to the CAPHEINE usage documentation and the nf-core custom configuration documentation.
To test the pipeline, you can run it with the -profile test option. This will run the pipeline with a minimal test dataset to check that it completes without any syntax errors.
```bash
nextflow run veg/CAPHEINE \
  -profile test,docker \
  --outdir <OUTDIR>
```

For more details about the output files and reports, please refer to the output documentation.
HyPhy analyses are light on memory, but can be heavy on CPU usage for longer alignments. We have set sensible defaults for alignments with roughly 800 sites and 1500 branches in the hyphy profile. You may wish to customize that profile, or write your own cluster-specific configuration file, if your data or environment require different resources.
For non-HyPhy analysis modules, the pipeline uses nf-core standard process labels (e.g., process_single, process_low, process_medium, process_high) wherever possible to improve compatibility and interpretability with existing nf-core infrastructure. These labels define default CPU and memory allocations that can be easily customized in your configuration files. See the nf-core documentation on process labels for more information.
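A custom configuration that overrides one of these labels could look like the sketch below. It uses the standard Nextflow `withLabel` process selector; the file name `custom_resources.config` and the CPU and memory values are hypothetical placeholders, not recommendations from the CAPHEINE authors.

```shell
# Write a hypothetical config file that overrides the resources for every
# process carrying the nf-core 'process_medium' label.
cat > custom_resources.config <<'EOF'
process {
    withLabel: 'process_medium' {
        cpus   = 8       // placeholder value
        memory = 16.GB   // placeholder value
    }
}
EOF

# Print (rather than execute) a command that would pick up the override
# through the Nextflow -c option, with parameters still supplied separately.
echo "nextflow run veg/CAPHEINE -profile docker -c custom_resources.config -params-file params.yaml"
```

Keeping resource overrides in a separate file passed with `-c` leaves the pipeline parameters in `params.yaml`, which matches the split between configuration and parameters described above.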
You can specify the resources to be used by the pipeline using the -profile option. For example, to run the pipeline with the default 16 CPUs and 6 GB of memory per HyPhy process, you can use:
```bash
nextflow run veg/CAPHEINE \
  -profile hyphy,<docker/singularity/.../institute> \
  --outdir <OUTDIR>
```

When running with `--use_mpi`, each MPI job uses the number of CPUs configured for the process (`mpirun -np $task.cpus`).
See usage documentation for more information about running CAPHEINE on your own system and best practices for writing your own profiles.
CAPHEINE was originally written by Hannah Verdonk and Danielle Callan.
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don't hesitate to get in touch by creating a GitHub issue!
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.