Allele-specific Binding from ChIP-Seq
ABC identifies potential allele-specific binding events at known heterozygous positions within the aligned reads of a ChIP-Seq experiment.
ABC requires at a minimum two (2) files:
- A sorted BAM/SAM file of a ChIP-Seq experiment
- A file containing the position, strand and allele information of heterozygous single nucleotide variants (SNVs) (SNPs and/or mutations)
ABC calls allele-specific binding by identifying a bias in the distribution of the SNV alleles while attempting to control for potential false positives. If you have genomic sequence data you can use the allele ratio in the DNA as the expected frequency to control for chromosome copy number.
Care should be taken in the selection of SNVs. Homozygous SNVs or SNVs within duplications will appear to have strong allele-specific effects. It is recommended that you filter SNPs prior to running ABC (i.e., those mapping to a motif). In addition, ABC does not filter duplicated reads, therefore the user may wish to remove duplicated reads prior to running ABC.
Perl and R should be installed on the system; instructions regarding their installation can be found from http://www.perl.org/get.html and http://cran.r-project.org, respectively.
Additionally the perl module Statistics::R should be installed using the following command:
cpan Statistics::RFurther information on how to install Perl modules can be found here: http://www.cpan.org/modules/INSTALL.html.
ABC can be downloaded from this repo, using the git clone command.
perl ABC.pl --align-file <input.sam> --snv-file <snv filename> --out <output filename>| Parameter | Description |
|---|---|
--align-file |
Specify the ChIP-Seq BAM/SAM file. Note: the BAM/SAM file must be sorted. If a BAM file is supplied ABC will create a temporary SAM file. |
--snv-file |
A tab-delimited text file containing the list of heterozygous SNPs. Important: the CHR column should match the build used for the alignments (ie. chr# for hg19 or just the # for b37). It is the responsibility of the user to verify the quality of the SNVs (i.e., do they pass Hardy-Weinberg equilibrium, etc...). |
The format of the SNV file is as follows (Do not include the header):
SNV_ID CHR POSITION STRAND REF_ALLELE OBSERVED_ALLELES ALLELE_RATIO_DNA
rs111 chr1 1111 + A A/C 0.5
SNV1 chr10 10234 - C C/G 0.4
| Option | Description |
|---|---|
--bg |
Specify a .bedgraph file capturing the ChIP-Seq signal (shifted read pileups) of the BAM/SAM file. This can be useful to prioritize SNPs with low coverage, since they may fall within centre of the +ve and -ve strand peaks. This phenomenon is caused by short reads and is not necessary for longer reads. |
--out |
Specify the output file prefix (default ABC; i.e., ABC will create two output files ABC.dist and ABC.align) |
--min-reads |
The minimum number of reads covering the a SNVs (default: 25; ie. ABC will report only those SNPs/mutations with # reads overlapping them). |
--d |
Divide chromosomes into segments until reaching d lines for faster retrieval (default: 2000). A large BAM/SAM file can take a long time to process. You may try to increase the value of d. However, very large numbers will not necessarily increase speed. Consider splitting the alignment file by chromosome. |
--mw-thres |
P-value threshold for the Mann-Whitney U test used to test a bias in the read position between the SNV alleles. (default: 0.05). To report all SNVs set this parameter to 0. |
--f-thres |
P-value threshold for the Fisher's exact test used to test for a bias between the strand distribution of the SNV alleles. (default: 0.05). To report all SNVs set this parameter to 0. |
--min-mapq |
Set the minimum allowed MAPQ score for each read (default: 0). |
--verbose |
Print progress and SNV results summary to screen. |
--help |
Prints command line options. |
This step requires that the ABC has finished running. Once ABC is finished a figure can be generated by specifying the output file prefix used in the initial run and the SNV ID as follows:
Usage: perl ABC.pl --out --visualize-snv
A table of the results can be found in the output file with the .dist extension.
| Column Name | Description |
|---|---|
SNV |
SNV Identifier |
CHR |
Chromosome |
BP |
Position on chromosome |
REF |
Reference allele |
OBS |
Observed alleles |
A1 |
Allele 1 |
N_A1 |
Number of reads containing A1 |
F_A1 |
Frequency of A1 |
N_A1_POS |
Number of reads aligned to the +ve strand containing A1 |
N_A1_NEG |
Number of reads aligned to the -ve strand containing A1 |
A2 |
Allele 2 |
N_A2 |
Number of reads containing A2 |
F_A2 |
Frequency of A2 |
N_A2_POS |
Number of reads aligned to the +ve strand containing A2 |
N_A2_NEG |
Number of reads aligned to the -ve strand containing A2 |
N_TOTAL |
Total number of reads overlapping a SNV, including: N_ERRORS, N_OMITTED, MISSING_N |
N_ERRORS |
Number of reads containing a nucleotide other than A1 or A2 |
N_OMITTED |
Number of reads where the SNV falls within a clipped or deleted region |
MISSING_N |
Number of reads containing an ambigous nucleotide N |
MAX |
The maximum value of from the .bedgraph file (if specified) within the 2x read length window centered on the SNV |
P_BINOM |
P-value used to call allele-specific binding |
P_MANN_WHIT |
P-value used to call a bias in read position between alleles |
P_FISHER |
P-value used to call a bias in the strand distribution between alleles |
P_CHISQ |
Same as P_FISHER. Greater number of reads required |
P_STRAND |
A binomial p-value of the differences in strand abundance |
A1_Position |
Position of A1 within reads |
A1_Strand |
Strand of reads containing A1 |
A2_Position |
Position of A2 within reads |
A2_strand |
Strand of reads containing A2 |
The alignments separated by the alleles of each SNV can be viewed in the output file with the .align extension.
To test ABC, users should download the test SAM file (766MB compressed using gzip, MD5sum bbbac4164553231c77622e5bd8b40fa0). The file is a sorted SAM file of the aligned ChIP-Seq reads for the FOXA1 in MCF7 cells.
- Decompress
ERR022033.sorted.sam.gzusinggunzip. - Download FOXA1 ChIP-Seq data, GSM631471.
- Convert
ERR022033.srato a.fastqfile with the sratoolkit usingfastq-dump ERR022033.sra. - Align the resulting
ERR022033.fastqfile to the reference genome using Bowtie2 (Langmead and Salzberg, 2012) and converted to a.bamfile withsamtools(Li, et al., 2009).
bowtie2 –x /path to index/hg19 –U ERR022033.fastq | samtools view –bS - > ERR022033.bam- Sort the
.bamfile withsamtoolsand converted back to a .sam file.
samtools sort ERR022033.bam ERR022033.sorted
samtools view -h -o ERR022033.sorted.sam ERR022033.sorted.bam- Create a tab-delimited text file containing the SNV information. In this case study, we have created
ABC_SNPs.txt, available in this repo, which contains the following line once decompressed usinggunzip.
rs4784227 chr16 52599188 + C C/T 0.5- Run ABC using the following command in the terminal.
perl ABC.pl --snv-file ABC_SNPs.txt -–align-file ERR022033.sorted.sam --out ABC_SNPs- Visualize the results (limited to one SNP at a time) with the following command.
perl ABC.pl --out ABC_SNPs --visualize-snv rs4784227ABC will output a .pdf file (rs4784227.ABC.SNPs.align.pdf) of the read distributions for the specified SNV