Commit eaa9b7c: README reformatting

Author: James Hawley
Parent: 52f5c72

1 file changed

+77
-124
lines changed

README.md

Lines changed: 77 additions & 124 deletions
@@ -1,163 +1,116 @@
 # VSE
-VSE is a Perl/Rscript command line tool to calculate the enrichment of associated variant set (AVS) for an array of genomic regions.

-**Program requirement:**
-```
-- Perl (5.18 or higher)
-- R (3.1.1 or higher)
-- bedtools (2.2.4 or higher) and must be globally executable
-```
-**Perl module required:**
-```
-- File::Basename
-```
-**R package required:**
-```
-- ggplot2
-- reshape
-- car
-```
-####Installation:
-**Step 1: Install Perl**
-
-You can get the Perl software from [their website](https://www.perl.org/get.html) and install. Version 5.x is required.
-You must also install the Perl module ```File::Basename``` (module's CPAN page is [here]( http://search.cpan.org/~nwclark/perl-5.8.6/lib/File/Basename.pm)).
+**V**ariant **S**et **E**nrichment

-In most cases, the following steps will work.
-
-1. First, install cpanMinus if not installed:
+## Description

-```sudo cpan App::cpanminus```
+VSE is a Perl/R command line tool to calculate the enrichment of an associated variant set (AVS) for an array of genomic regions.

-2. Then install File::Basename by the following command:
+## Installation

-```cpanm File::Basename```
+### Environment

-If you have issues, you can refer to [CPAN help page](http://www.cpan.org/modules/INSTALL.html) for more details.
+* [Perl](https://www.perl.org) (5.18 or higher)
+* [R](https://cran.r-project.org) (3.1.1 or higher)
+* [bedtools](http://bedtools.readthedocs.io/en/latest/) (2.2.4 or higher) and must be globally executable

-**Step 2: Install R**
+#### Perl modules

-R is also required for calculating the statistics and generating the plots. You can download R at [www.r-project.org](http://www.r-project.org). Version 3.1.1 or greater is required.
+* File::Basename

-The following R packages are required: ```ggplot2```,```reshape```,```car``` , all of which can be downloaded and installed from CRAN.
+#### R packages

-In most cases, you can install the packages by:
+* ggplot2
+* reshape
+* car

-1. Go to R command line by typing ```R``` in your terminal
-2. type
+### Download VSE

-```
-install.packages("ggplot2")
-install.packages("reshape")
-install.packages("car")
-```
+Download VSE from this repo via `git clone`.
+Alternatively, you can just download and run `VSE.pl`. No other installation is required if you have the required programs already installed.
+The directory structure must be intact, however. i.e., `lib` and `data` directories must reside in the same directory as `VSE.pl`.

-3. Exit R environment by typing ```q()```
+## Usage

-```Rscript``` should be an executable command from any location.
+### Example

-**Step 3: Install bedtools**
-
-You can download and install the latest version of bedtools from [their github repository](https://github.com/arq5x/bedtools2).
-
-In most cases, you can install by this:
-
-1. In terminal, go to downloaded bedtools folder:
-
-```
-cd bedtools-2.25.0
-```
+```
+perl VSE.pl -f example.SNPs/NHGRI-BCa.bed \
+-l example.SNPs/ld_BCa.bed \
+-s run1 -d example.beds \
+-v
+```

-2. Compile the tool by typing
+### Parameters

-```
-make
-```
+| Parameter | Description |
+|-----------|-------------|
+| `-f` | BED file containing tagSNPs. Columns: `chr`, `start`, `end`, and `SNP name`. |
+| `-l` | BED File containing LD SNPs. Columns: `chr`, `start`, `end`, `LDSNPid`, `tagSNPid`, `other optional columns`. The LD SNPs must be calculated based on EUR population of 1000 Genome Project Phase III genotype data (Release 2013/05). It's important to use the updated genotype information for calculating LD because the newer releases are more accurate than the older ones because of inclusion of more individuals. However, for older studies, VSE will soon support 1000 Genome Project Phase I data as well. |
+| `-d` | Genomic ranges of interest. **Either** path to directory containing BED files of genomic regions of interest (the name of the file is used as labels for the regions) **or** path to a BED file (the tallies will be printed on screen and the enrichment analysis will not be run). |

-3. Copy the binary programs from ```bin/``` to your executable directory, which is usually ```/usr/bin/``` or ```/usr/local/bin```
+### Options

-```
-sudo cp bin/* /usr/bin/
-```
+| Option | Description |
+|--------|-------------|
+| `-s` | Output directory suffix. Output files will be saved in `suffix.output` directory. |
+| `-r` | `R^2` value; default 0.8. |
+| `-p [all/AVS/MRV/xml/R]` | Modular run; default `all`. |
+| `-A` | Suffix for existing `AVS`/`MRV` files. Only functional when `-p` is `xml`. |
+| `-h` | Help |
+| `-v` | Verbose |

-If this does not work, further installation instruction can be found [in their website](http://bedtools.readthedocs.org/en/latest/content/installation.html)
+## Output

-The ```intersectBed``` should be an executable command from any location.
+VSE produces multiple output files in `suffix.output` directory.

-**Step 4: Download VSE**
+| File | Description |
+|------|-------------|
+| `suffix.density.pdf` | Density of overlapping tallies from the null |
+| `suffix.VSE.stat.txt` | Statistics table |
+| `suffix.final_boxplot.pdf` | Visualizes the null distribution and enrichment of `AVS` |
+| `suffix.matrix.pdf` | Binary representation of overlapping between each locus and annotation. Overlapping is defined as at least one SNP (associated or linked) is within the annotation. |
+| `suffix.VSE.txt` | Matrix of all overlapping tallies by `AVS` and `MRVS`. The first column is the `AVS` tally and the rest are `MRVS`. |

-You can just download and run ```VSE.pl```, no other installation required if you have the required programs already installed. The directory structure must be intact; i.e., ```lib``` and ```data``` directories must reside in the same directory as ```VSE.pl```.
+## Running VSE

+VSE can be run in different parts using `-p` parameter. For the first run, `-p all` or no `-p` is recommended.
+However, once you create the `AVS` and `MRVS` for a set of SNPs, you can use `-p xml` and `-p R` for checking the enrichment of the SNPs over new genomic ranges.
+Below are examples of certain situations and command lines that to be used:

-###Using VSE
-#####Example command:
-```
-perl VSE.pl -f example.SNPs/NHGRI-BCa.bed \
--l example.SNPs/ld_BCa.bed \
--s run1 -d example.beds \
--v
-```
+### Running for the first time for a set of variants

-#####Options:
-```
--f bed file for tagSNPs
--l file containing LD SNPs information
--d full path to directory containing all bed files or path to a single bed file
--s output directory suffix. Output files will be saved in suffix.output directory
--r [0.6/0.7/0.8/0.9/1] r2 value; default 0.8
--p [all/AVS/MRV/xml/R] modular run; default all
--A Suffix for existing AVS/MRV files. Only functional when -p is xml.
--h help
--v verbose
-```
-####Input
-VSE requires three input files.
-######The tagSNPs:
-It should be a standard tab delimited bed file and provided with ```-f``` parameter. There should be no header line.
-Example:
-```
-chr1 1000 1001 rs00001
-```
-######The LD SNPs
-The list of LD SNPs should be a **tab delimited** bed file in the following format and should be provided with ```-l``` parameter.
+```shell
+perl VSE.pl -f tagSNPs.bed -l LDSNPs.bed -d /path/histone_marks/ -s run1 -v
 ```
-chr1 900 901 LDSNPid tagSNPid other_optional_columns
-```
-There should be no header line. The LD SNPs must be calculated based on EUR population of 1000 Genome Project Phase III genotyping data (Release 2013/05). It's important to use the updated genotyping information for calculating LD because the newer releases are more accurant than the older ones because of inclusion of more individuals. However, for older studies, VSE will soon support 1000 Genome Project Phase I data as well.
-
-######The genomic ranges
-The genomic regions should be a single directory and the directory path should be given in ```-d``` parameter. The genomics regions should be bed files. The name of the file is used as labels for the regions. For examples, ```DHS.bed``` will be denoted as DHS in final figures and table.
-```-d``` can also be a path to a bed file. In that case, the tallies will be printed on screen and the enrichment analysis will not be run.

-####Output
-VSE produces multiple output files in suffix.output directory.
+### Running the same set of SNPs but for a different batch of genomic regions

-```suffix.density.pdf``` shows the density of overlapping tallies from the null
-
-```suffix.VSE.stat.txt``` contains the statistics table
+```shell
+perl VSE.pl -f tagSNPs.bed -l LDSNPs.bed -d /path/TF_binding/ -p xml -A run1 -s run2 -v
+```

-```suffix.final_boxplot.pdf``` visualizes the null distribution and enrichment of AVS
+This command line will use the `AVS` and `MRVS` outputted from run1 and will produce new matrix file in `run2.output` directory. Then you can run `-p R` to compute enrichment and generate the plots.

-```suffix.matrix.pdf``` binary representation of overlapping between each locus and annotation. Overlapping is defined as at least one SNP (associated or linked) is within the annotation.
+```shell
+perl VSE.pl -f tagSNPs.bed -l LDSNPs.bed -d /path/TF_binding/ -p R -A run1 -s run2 -v
+```

-```suffix.VSE.txt``` is a matrix of all overlapping tallies by AVS and MRVS. The first column is the AVS tally and the rest are MRVS.
+### Running for just one genomic region file

-####Running VSE
-VSE can be run in different parts using ```-p``` parameter. For the first run, ```-p all``` or no ```-p``` is recommended. However, once you create the AVS and MRVS for a set of SNPs, you can use ```-p xml``` and ```-p R``` for checking the enrichment of the SNPs over new genomic ranges. Below are examples of certain situations and command lines that to be used:
-######Running for the first time for a set of variants:
-```perl VSE.pl -f tagSNPs.bed -l LDSNPs.bed -d /path/histone_marks/ -s run1 -v```
-######Running the same set of SNPs but for a different batch of genomic regions:
-```perl VSE.pl -f tagSNPs.bed -l LDSNPs.bed -d /path/TF_binding/ -p xml -A run1 -s run2 -v```
+```shell
+perl VSE.pl -f tagSNPs.bed -l LDSNPs.bed -d /path/POL2_binding.bed -s run-pol2 -v
+```

-This command line will use the AVS and MRVS outputted from run1 and will produce new matrix file in ```run2.output``` directory. Then you can run ```-p R``` to compute enrichment and generate the plots:
+This will output the overlapping tallies for `AVS` and `MRVS` for POL2 on the screen. You can copy this line to any other `*.VSE.txt` file from other experiments.
+For example, you can add the line to `run2.output/run2.VSE.txt` from the `TF_binding` analysis (run2). You can then run `-p R` to redo the enrichment analysis, now including POL2.

-```perl VSE.pl -f tagSNPs.bed -l LDSNPs.bed -d /path/TF_binding/ -p R -A run1 -s run2 -v```
-######Running for just one genomic region file:
-```perl VSE.pl -f tagSNPs.bed -l LDSNPs.bed -d /path/POL2_binding.bed -s run-pol2 -v```
+```shell
+perl VSE.pl -f tagSNPs.bed -l LDSNPs.bed -d /path/TF_binding/ -p R -A run2 -s run2_with_pol2 -v
+```

-This will output the overlapping tallies for AVS and MRVS for POL2 on the screen. You can copy this line to any other *.VSE.txt file from other experiments. For example, you can add the line to ```run2.output/run2.VSE.txt``` from the TF_binding analysis (run2). You can then run ```-p R``` to redo the enrichment analysis, now including POL2: ```perl VSE.pl -f tagSNPs.bed -l LDSNPs.bed -d /path/TF_binding/ -p R -A run2 -s run2_with_pol2 -v```
+## Factors to consider

-#####There are several factors to consider:
 1. VSE is sensitive to the number of tagSNPs. From our trial and error tests, too low number of tagSNPs (below 15) provide imprecise result.
-2. The quality of ChIP-seq data is very important. We recommend users to confirm the quality of the ChIP-seq data and to only use data that are of good quality to avoid false enrichment. There are tools like ChIPQC or Chillin for quality control of ChIP-seq data.
-3. Make sure that you use the same r2 cutoff that you used to determine your LD SNPs. Also, the LD SNPs must be calculated using 1000 Genome Project Phase III (May 2013) release.
+1. The quality of ChIP-seq data is very important. We recommend users to confirm the quality of the ChIP-seq data and to only use data that are of good quality to avoid false enrichment. There are tools like [ChIPQC](http://bioconductor.org/packages/release/bioc/html/ChIPQC.html) or [Chillin](https://doi.org/10.1186/s12859-016-1274-4) for quality control of ChIP-seq data.
+1. Make sure that you use the same `R^2` cutoff that you used to determine your LD SNPs. Also, the LD SNPs must be calculated using 1000 Genome Project Phase III (May 2013) release.
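
The README in this diff asks for a tab-delimited, four-column tagSNP BED file (`chr`, `start`, `end`, SNP name) and warns that fewer than 15 tagSNPs give imprecise results. Neither condition is checked in the example commands, so the following is a minimal sketch of a pre-flight check one might run before calling `VSE.pl`. The function name `check_tagsnps` is hypothetical and is not part of VSE. The sketch assumes the SNP name sits in column 4 and counts unique names there.

```shell
# Hypothetical helper, not shipped with VSE: sanity-check a tagSNP BED file
# against the requirements stated in the README (tab-delimited, 4 columns,
# at least 15 tagSNPs).
check_tagsnps() {
    bed="$1"

    # Count lines with fewer than 4 tab-delimited fields.
    bad=$(awk -F'\t' 'NF < 4 { n++ } END { print n + 0 }' "$bed")

    # Count unique SNP names in column 4 (assumed to hold the SNP name).
    total=$(awk -F'\t' 'NF >= 4 { seen[$4] = 1 }
                        END { n = 0; for (k in seen) n++; print n }' "$bed")

    if [ "$bad" -gt 0 ]; then
        echo "ERROR: $bad line(s) with fewer than 4 tab-delimited columns"
        return 1
    fi

    if [ "$total" -lt 15 ]; then
        echo "WARNING: only $total tagSNPs; VSE results may be imprecise below 15"
        return 0
    fi

    echo "OK: $total tagSNPs"
}
```

For example, `check_tagsnps tagSNPs.bed` could be run before the first `perl VSE.pl ... -s run1` invocation. The check treats a low tagSNP count as a warning rather than an error, since the README describes it as a precision concern rather than a hard requirement.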
