This repository was archived by the owner on Nov 29, 2021. It is now read-only.

Configuration_files

tdayris-perso edited this page Mar 31, 2020 · 1 revision

Configuration

Two configuration files are required for this pipeline:

- `config.yaml` contains command line arguments, reference paths, and system options. This is a [yaml](https://en.wikipedia.org/wiki/YAML) file.
- `design.tsv` contains sample identifiers and paths.

We suggest that you use the provided script to build the configuration files, and then modify them if needed. Most of the time, this script will be enough. Just look at:

- `prepare_pipeline.py`

However, if you want to, you can build them manually: every part of these files is described below.

Automatic configuration building

prepare_pipeline.py

The script prepare_pipeline.py is your friend during the tedious step of pipeline customization: it builds both the config file and the design file. By default, this script will not overwrite any existing file.

You can test prepare_pipeline.py by running make all-unit-test. See the "Testing" section of this documentation for more information.

You can list all the arguments of prepare_pipeline.py with its --help option:

# Activate conda environment
conda activate vcf-annotate-snpeff-snpsift

# Read help
python3.8 prepare_pipeline.py --help

Please find some usage examples below:

# In case I want all default parameters, and my VCF files are in vcf_dir:
python3.8 prepare_pipeline.py vcf_dir path/GWASCat.tsv path/GeneSets.gmt path/dbNSFP.tsv

# Same case as above, but
# - I want SnpEff not to run with pre-installed genomes
# - I want to search recursively in vcf_dir for VCF files
python3.8 prepare_pipeline.py vcf_dir \
          path/GWASCat.tsv \
          path/GeneSets.gmt \
          path/dbNSFP.tsv \
          --snpeff-extra '-no-genome' \
          --recursive
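
To make the effect of --recursive more concrete, here is a minimal sketch of how a directory could be scanned for VCF files and turned into design rows. This is not the code of prepare_pipeline.py: the helper names `find_vcf_files`, `sample_id_from` and `build_design_rows`, the accepted suffixes, and the choice of the file name as `Sample_id` are all assumptions made for illustration.

```python
from pathlib import Path

# Assumed suffixes; the real script may accept a different set.
VCF_SUFFIXES = (".vcf", ".vcf.gz")


def find_vcf_files(vcf_dir, recursive=False):
    """Return the sorted VCF paths found in vcf_dir (hypothetical helper)."""
    root = Path(vcf_dir)
    # rglob("*") walks sub-directories, glob("*") only looks at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.name.endswith(VCF_SUFFIXES)
    )


def sample_id_from(path):
    """Derive a sample identifier by stripping the VCF suffix (an assumption)."""
    name = path.name
    # Try the longest suffix first so ".vcf.gz" is removed as a whole.
    for suffix in sorted(VCF_SUFFIXES, key=len, reverse=True):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def build_design_rows(vcf_paths):
    """Build one design row per VCF file, using the two mandatory columns."""
    return [
        {"Sample_id": sample_id_from(path), "VCF_File": str(path)}
        for path in vcf_paths
    ]
```

With this sketch, `find_vcf_files("vcf_dir")` only looks at the top level of vcf_dir, while `find_vcf_files("vcf_dir", recursive=True)` also descends into its sub-directories.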

Detailed content of the config.yaml

This is a yaml file. The following keys are required (in any order):

# As simple key: value
design: /path/to/design_file.tsv (string)
workdir: /path/to/workdir (string)
threads: maximum number of threads (integer)
singularity_docker_image: name of a docker/singularity image (string)
# As key: list of values
cold_storage:
  - /path/to/cold_storage1 (string)
  - /path/to/cold_storage2 (string)
  ...
# As nested key: key: value
ref:
  GWASCat: /path/to/gwascat.tsv
  GeneSets: /path/to/GeneSets.gmt
  dbNSFP: /path/to/dbNSFP.tsv
params:
  snpeff_extra: Extra parameters (string) for SnpEff
  snpsift_varType_extra: Extra parameters (string) for SnpSift
  snpsift_GWASCat_extra: Extra parameters (string) for SnpSift
  snpsift_GeneSets_extra: Extra parameters (string) for SnpSift
  snpsift_dbNSFP_extra: Extra parameters (string) for SnpSift
workflow:
  multiqc: whether to run multiqc or not (boolean)

A complete config.yaml file would look like this:

design: design.tsv
workdir: .
threads: 1
singularity_docker_image: docker://continuumio/miniconda3:4.4.10
cold_storage:
  - /media
ref:
  GWASCat: /path/to/gwascat.tsv
  GeneSets: /path/to/GeneSets.gmt
  dbNSFP: /path/to/dbNSFP.tsv
workflow:
  multiqc: true
params:
  copy_extra: --parents --verbose
  snpeff_extra: -v
  snpsift_varType_extra: ""
  snpsift_GWASCat_extra: ""
  snpsift_GeneSets_extra: ""
  snpsift_dbNSFP_extra: "-v"
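
If you write config.yaml by hand, a quick sanity check of the required keys can catch typos early. The sketch below is only an illustration and not part of the pipeline: it assumes the file has already been parsed into a Python dictionary (for instance with PyYAML's `yaml.safe_load`), and the function name `check_config` and its error messages are hypothetical.

```python
# Required top-level keys and the Python type each one is expected to have,
# following the key list given in this page.
REQUIRED_KEYS = {
    "design": str,
    "workdir": str,
    "threads": int,
    "singularity_docker_image": str,
    "cold_storage": list,
    "ref": dict,
    "params": dict,
    "workflow": dict,
}
REQUIRED_REF_KEYS = ("GWASCat", "GeneSets", "dbNSFP")


def check_config(config):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected):
            problems.append(f"{key} should be of type {expected.__name__}")
    ref = config.get("ref")
    if isinstance(ref, dict):
        for key in REQUIRED_REF_KEYS:
            if key not in ref:
                problems.append(f"missing key: ref.{key}")
    return problems


# The complete example from this page, written as a dictionary.
config = {
    "design": "design.tsv",
    "workdir": ".",
    "threads": 1,
    "singularity_docker_image": "docker://continuumio/miniconda3:4.4.10",
    "cold_storage": ["/media"],
    "ref": {
        "GWASCat": "/path/to/gwascat.tsv",
        "GeneSets": "/path/to/GeneSets.gmt",
        "dbNSFP": "/path/to/dbNSFP.tsv",
    },
    "workflow": {"multiqc": True},
    "params": {"copy_extra": "--parents --verbose", "snpeff_extra": "-v"},
}

for problem in check_config(config):
    print(problem)
```

The check only looks at key presence and basic types; it does not verify that the referenced files exist.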

Detailed content of the design.tsv

This is a TSV file describing our analysis. The column order is not relevant. If you want to build it manually, use your favorite tabular-file editor.

It must contain the following columns:

* Sample_id: the name of each sample
* VCF_File: path to the upstream VCF file

The optional columns are:

* VCF_Index: path to the tbi index file of each VCF file
* Any other information

A minimal example would be:

| Sample_id | VCF_File |
| --- | --- |
| Sample 1 | /path/to/file1.vcf |
| Sample 2 | /path/to/file2.vcf |
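
For a manually written design file, the sketch below shows one way to read it and to check that the two mandatory columns are present. It only relies on Python's standard `csv` module; the function name `read_design` and the error message are hypothetical and do not come from the pipeline itself.

```python
import csv
import io

REQUIRED_COLUMNS = ("Sample_id", "VCF_File")


def read_design(handle):
    """Parse a tab-separated design and return its rows as dictionaries."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Column order does not matter, only the presence of the required names.
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"design is missing column(s): {', '.join(missing)}")
    return list(reader)


# Same content as the minimal example above, with tab-separated columns.
example = (
    "Sample_id\tVCF_File\n"
    "Sample 1\t/path/to/file1.vcf\n"
    "Sample 2\t/path/to/file2.vcf\n"
)
rows = read_design(io.StringIO(example))
for row in rows:
    print(row["Sample_id"], row["VCF_File"])
```

To read a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("design.tsv", newline="")`.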