3. Tutorial
The directory SAFFARI/resources/ includes example files you can use to test and run the fine-mapping pipeline.
The top loci file Mullins_2021_toploci.csv contains the top GWAS loci with genome-wide significance (P < 5 × 10⁻⁸), while the file Mullins_2021_loci_ranges.tsv is the output of the fetch_UKB_LD_names module (i.e., the formatted top loci file).
Make sure to also download the following into the resources/ directory:
(i) the baseline-LF2.2 UKB annotations
(ii) the UKB LD matrices and/or plink bfiles for LD estimation (depending on the LD format of your data).
To simplify running the pipeline, name the files/folders as follows:
(i) baseline-LF2.2 UKB annotations folder as UKBB_priors
(ii) the UKB LD panel as UKBB_LD
(iii) the plink bfiles for LD estimations as genotype_ref_panel
Links to download these files can be found in the Dependencies Wiki page.
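The naming convention above can be checked before launching the pipeline. The following is a minimal sketch, not part of SAFFARI itself: the function name `missing_resource_folders` is hypothetical, and it assumes the three folders sit directly inside resources/. Depending on your LD format you may only need some of them, so a reported folder is not necessarily an error.

```python
from pathlib import Path

# Folder names taken from the naming convention in this tutorial.
EXPECTED_FOLDERS = ("UKBB_priors", "UKBB_LD", "genotype_ref_panel")


def missing_resource_folders(resources_dir, expected=EXPECTED_FOLDERS):
    """Return the expected folder names that are not present in resources_dir."""
    root = Path(resources_dir)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumption: the script is run from the SAFFARI working directory.
    missing = missing_resource_folders("resources")
    if missing:
        print("Not found in resources/:", ", ".join(missing))
    else:
        print("All expected resource folders are present.")
```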
All data were retrieved from
Mullins, N., Forstner, A.J., O’Connell, K.S. et al.
Genome-wide association study of more than 40,000 bipolar disorder cases provides new insights into the underlying biology.
Nat Genet 53, 817–829 (2021).
Downloaded from the PGC website.
Step-by-step guide using GWAS summary statistics from Mullins et al., 2021
- Install Snakemake following the instructions as provided here.
- Set up a Snakemake job profile for HPC clusters using instructions provided here
- Activate your conda Snakemake environment by running
conda activate snakemake
- Download all Dependencies and the SAFFARI Github repository.
- On your HPC cluster, create a directory that will serve as your SAFFARI working directory (e.g., named
sandbox). All files, scripts, Snakefiles, and resources will be stored here.
- The first Snakefile to execute is
fetch_UKB_LD_names_multiple. This workflow reformats your toploci.csv file and adds 3 Mb windows as an additional option for fine-mapping; the chromosomal coordinates of these windows match those of the UKBB_LD files. The config file to use is config_alt.yaml.
To do this, run
snakemake --profile lsf --configfile config_alt.yaml --use-conda
- The next Snakefile to execute is the fine-mapping workflow. If you wish to use the UKB LD panel, you can opt for
finemapping_multiple. Alternatively, you can opt for the HRC LD panel and use finemapping_HRC_multiple. In that case, you will need to use the config.yaml file. To do this, run
snakemake --profile lsf --configfile config.yaml --use-conda
- Please note that if you don't set up an HPC Snakemake profile, you can define the path to the Snakefile in your command, e.g.:
snakemake -s workflow/Snakefile --configfile config.yaml --use-conda --cores 3
If you do set one up, each Snakefile has to be renamed to Snakefile before executing the workflow.
- You can define any range for fine-mapping via the
--start and --end flags of the fine-mapping rules within the Snakefile.
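The two ways of invoking Snakemake described in the steps above (with an HPC profile, or with an explicit Snakefile path and core count) can be summarized in a small helper. This is only an illustrative sketch: `build_snakemake_command` is a hypothetical name, it uses only the flags that appear in this tutorial (`-s` being Snakemake's short form of `--snakefile`), and it prints the commands rather than running them.

```python
import shlex


def build_snakemake_command(configfile, profile=None, snakefile=None, cores=None):
    """Assemble a snakemake call from the flags used in the steps above."""
    cmd = ["snakemake"]
    if profile is not None:
        # Profile mode: the workflow file must already be named "Snakefile".
        cmd += ["--profile", profile]
    else:
        if snakefile is None or cores is None:
            raise ValueError("without a profile, pass both snakefile and cores")
        cmd += ["-s", snakefile]
    cmd += ["--configfile", configfile, "--use-conda"]
    if profile is None:
        cmd += ["--cores", str(cores)]
    return cmd


# Reformatting step with the lsf profile, then fine-mapping without a profile.
print(shlex.join(build_snakemake_command("config_alt.yaml", profile="lsf")))
print(shlex.join(build_snakemake_command("config.yaml", snakefile="workflow/Snakefile", cores=3)))
```

Returning a list of arguments keeps the helper usable with `subprocess.run` if you later want to launch the workflow from Python.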
- To run this Snakemake pipeline with the different "modules", you will need two main inputs: (i) formatted and cleaned GWAS summary statistics (in .gz format), and (ii) a list of top genome-wide significant (GWS) loci for fine-mapping (stored as a .csv file).
Both the top loci file and the GWAS sumstats should include the respective columns as output by RICOPILI. Please check the files within the resources folder in this GitHub repository to get a better overview of the essential columns needed for both inputs.
- GWAS summary statistics should ideally be QC'ed, and any duplicate SNPs should be removed beforehand.
In this workflow, SNPs can be excluded only based on the INFO and MAF columns. If you need to exclude additional SNPs according to MAF, the RICOPILI FRQ columns must be renamed before running the pipeline.
An example of the columns in RICOPILI-based GWAS summary statistics:
CHR SNP BP A1 A2 FRQ_A_41917 FRQ_U_371549 INFO OR SE P Nca Nco Neff_half
The script tries to be flexible and accommodate multiple file formats and column names. The minimum requirements are a sample size parameter (n) and a whitespace-delimited input file with SNP rsIDs, chromosome and base pair information, and at least one of the following: a p-value, an effect size estimate with its standard error, or a Z-score.
Some acceptable GWAS summary statistics column names:
- chromosome column = ['CHR', 'CHROMOSOME', 'CHROM']
- bp column = ['BP', 'POS', 'POSITION', 'COORDINATE', 'BASEPAIR']
- snp column = ['SNP', 'RSID', 'RS', 'NAME', 'MarkerName']
- A1 frequency column = ['A1FREQ', 'freq', 'MAF', 'FRQ']
- info column = ['INFO']
- beta column = ['BETA', 'EFF', 'EFFECT', 'EFFECT_SIZE', 'OR']
- se column = ['SE']
- pvalue column = ['P_BOLT_LMM', 'P', 'PVALUE', 'P-VALUE', 'P_value', 'PVAL']
- z column = ['Z', 'ZSCORE', 'Z_SCORE']
- n column = ['N', 'sample_size']
- ncase column = ['N_cases', 'Ncase', 'Nca','Total_NCase']
- ncontrol column = ['N_controls', 'Ncontrol', 'Nco', 'Total_NControl']
- allele1 column = ['ALLELE1', 'A1', 'a1', 'a_1']
- allele2 column = ['ALLELE2', 'A2', 'a2', 'a_2']
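To see how a given header lines up with the accepted names listed above, the lookup can be sketched as follows. This is an illustration rather than the pipeline's actual code: `resolve_columns` and the field keys are hypothetical, and case-insensitive matching is an assumption. Applied to the RICOPILI header shown earlier, it also illustrates why the FRQ columns need renaming before MAF-based filtering.

```python
# Accepted column names, copied from the list above.
COLUMN_ALIASES = {
    "chromosome": ["CHR", "CHROMOSOME", "CHROM"],
    "bp": ["BP", "POS", "POSITION", "COORDINATE", "BASEPAIR"],
    "snp": ["SNP", "RSID", "RS", "NAME", "MarkerName"],
    "a1_frequency": ["A1FREQ", "freq", "MAF", "FRQ"],
    "info": ["INFO"],
    "beta": ["BETA", "EFF", "EFFECT", "EFFECT_SIZE", "OR"],
    "se": ["SE"],
    "pvalue": ["P_BOLT_LMM", "P", "PVALUE", "P-VALUE", "P_value", "PVAL"],
    "z": ["Z", "ZSCORE", "Z_SCORE"],
    "n": ["N", "sample_size"],
    "ncase": ["N_cases", "Ncase", "Nca", "Total_NCase"],
    "ncontrol": ["N_controls", "Ncontrol", "Nco", "Total_NControl"],
    "allele1": ["ALLELE1", "A1", "a1", "a_1"],
    "allele2": ["ALLELE2", "A2", "a2", "a_2"],
}


def resolve_columns(header, aliases=COLUMN_ALIASES):
    """Map each field to the header column that matches one of its accepted names."""
    # Assumption: matching ignores upper/lower case.
    by_upper = {col.upper(): col for col in header}
    resolved = {}
    for field, names in aliases.items():
        for name in names:
            if name.upper() in by_upper:
                resolved[field] = by_upper[name.upper()]
                break
    return resolved


# RICOPILI-style header from the example earlier in this tutorial.
header = "CHR SNP BP A1 A2 FRQ_A_41917 FRQ_U_371549 INFO OR SE P Nca Nco Neff_half".split()

resolved = resolve_columns(header)
missing = [field for field in COLUMN_ALIASES if field not in resolved]
print("unmatched fields:", missing)

# Renaming one FRQ column (here the control frequency, an arbitrary choice)
# makes the frequency field resolvable.
renamed = ["FRQ" if col == "FRQ_U_371549" else col for col in header]
resolved_after = resolve_columns(renamed)
print("frequency column after renaming:", resolved_after.get("a1_frequency"))
```

Which of the two FRQ columns to rename depends on your analysis; the choice above is only for demonstration.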
- The top loci file is derived from the RICOPILI clumping procedure and should include at least the fields shown below. Make sure you store it as a
_toploci.csv file. Here, range.left and range.right are customized fine-mapping ranges.
SNP CHR BP range.left range.right
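A quick check of a top loci file against the minimum fields above might look like the sketch below. It rests on a few assumptions that are not confirmed by the pipeline: the file is comma-separated, the coordinates are integers, and the lead SNP is expected to lie inside its own range. The function name `read_top_loci` and the toy row are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("SNP", "CHR", "BP", "range.left", "range.right")


def read_top_loci(handle):
    """Parse a comma-separated top loci file and run basic sanity checks."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in fieldnames]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    loci = []
    for row in reader:
        bp = int(row["BP"])
        left = int(row["range.left"])
        right = int(row["range.right"])
        # Assumed sanity check: the lead SNP lies inside its own range.
        if not left <= bp <= right:
            raise ValueError(f"{row['SNP']}: BP {bp} is outside {left}-{right}")
        loci.append({"SNP": row["SNP"], "CHR": row["CHR"], "BP": bp,
                     "range.left": left, "range.right": right})
    return loci


# Toy input with a made-up SNP name and coordinates.
example = io.StringIO(
    "SNP,CHR,BP,range.left,range.right\n"
    "rs_example,1,1500000,1000000,2000000\n"
)
loci = read_top_loci(example)
print(loci)
```

For a real file, pass an open handle instead of the `io.StringIO` object, for example one obtained with `open(path, newline="")`.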
The outputs are split per assessed GWAS and include the following folders:
priors, followed by the results of the four fine-mapping methods, organized by LD panel + method combination. For example, if you used UKB as the reference panel, you will find these folders: polyfun_finemap_UKB_finemap, only_finemap_UKB_finemap, only_susie_UKB_finemap, polyfun_susie_UKB_finemap.
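The folder names in the UKB example appear to follow a prefix_method_panel_finemap pattern. The sketch below reproduces the four UKB names from that pattern; the helper name is hypothetical, and both the reading of "only" as a run without PolyFun priors and the extension to other panel labels are assumptions.

```python
from itertools import product


def finemapping_output_folders(ld_panel):
    """List the result folder names expected for one LD panel label."""
    # Assumed meaning: "polyfun" = with functional priors, "only" = without.
    prefixes = ("polyfun", "only")
    methods = ("finemap", "susie")
    return [f"{prefix}_{method}_{ld_panel}_finemap"
            for prefix, method in product(prefixes, methods)]


print(finemapping_output_folders("UKB"))
```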
The priors folder contains the functional priors which will be used for the fine-mapping. Please see also the Polyfun Wiki.
The other four folders contain the fine-mapping results per LD panel + method combination for every selected locus (denoted with a specific rsID based on your top loci file). The folder also contains a concatenated/merged version of all these results with the extension _all.txt.gz. Each output file per fine-mapped locus contains SNPs ranked according to their posterior inclusion probability and their inclusion in a 95% credible set.
Detailed information about the columns contained in both the per-locus fine-mapping result files and the concatenated/merged versions can be found in the Polyfun Wiki.
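As a starting point for inspecting the results, the sketch below lists the SNPs assigned to a credible set, ordered by posterior inclusion probability. It is written under assumptions that should be checked against the Polyfun Wiki: that the files are tab-delimited, that the relevant columns are named PIP and CREDIBLE_SET, and that CREDIBLE_SET is 0 for SNPs outside every credible set. The function names and the toy rows are hypothetical.

```python
import csv
import gzip


def load_results(path):
    """Read a tab-delimited, gzip-compressed results file into a list of dicts."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def credible_set_snps(rows, pip_col="PIP", cs_col="CREDIBLE_SET"):
    """Return SNPs assigned to a credible set, highest PIP first."""
    # Assumption: cs_col is 0 for SNPs outside every credible set.
    in_set = [row for row in rows if int(float(row[cs_col])) > 0]
    return sorted(in_set, key=lambda row: float(row[pip_col]), reverse=True)


# Toy rows with made-up values, standing in for the output of load_results().
rows = [
    {"SNP": "rs_a", "PIP": "0.12", "CREDIBLE_SET": "1"},
    {"SNP": "rs_b", "PIP": "0.80", "CREDIBLE_SET": "1"},
    {"SNP": "rs_c", "PIP": "0.01", "CREDIBLE_SET": "0"},
]
top = credible_set_snps(rows)
print([row["SNP"] for row in top])
```

If the column names in your output differ, pass them through the `pip_col` and `cs_col` parameters.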