#Assessing k-mer Enrichment of FASTA Sequences Using Empirical p-values
CBB752 Final Project 2.6, R card, by Julian Q Zhou
Given a FASTA file and a k-mer of interest, assess enrichment of the k-mer in the sequence(s) contained in the FASTA file.
Available here
r_kmer_enrich.R: main script
- A FASTA file that looks like below (e.g.
sample_input_mapk3.fa):
>gi|568815582|ref|NC_000016.10|:30113184-30124230 Homo sapiens chromosome 16, GRCh38.p2 Primary Assembly
CCAGGCTGCGCCAGCATCTTCTTCCTCGTCCACACCCTGCCCTGCCACTTCGCTCTCCTTCTCTCTTGGT
CCCTGCCCCGTTTCTAGCATGCCCCCTTGGACCTACCCCTCTGTGCTGTCCACTTTGGCACCTGTTCTCA
CCCCTACCCGGCTCACCTCCTCGGTGGGCCCCCAGGCGGATGCGGAAGGTGGGAGCCCTGGGCGTGTGCA
GCAGATGAGGCCGGCGCAGGAAGAAGATGGAGAGCATGGCATAGCTGCCCAGGGCAGGGAGGGCATAGTA
- A comma-separated .csv file that looks like below (e.g.
sample_output_cd47_brca2.csv):
| kmer.hits | kmer.p | count.min | count.q1 | count.mean | count.median | count.q3 | count.max | kmer.p.adj | significant |
|---|---|---|---|---|---|---|---|---|---|
| 44 | 0.390625 | 1 | 26 | 56.2412109375 | 53 | 76 | 534 | 0.4921875 | 0 |
| 87 | 0.4921875 | 2 | 49 | 95.1396484375 | 87.5 | 128 | 1128 | 0.4921875 | 0 |
is.enriched.wrapper(input.fasta, target.kmer, output.csv, significance = 0.05, multiple.testing = NULL)
input.fasta: FASTA file containing (possibly multiple) sequences; case-insensitivetarget.kmer: a string of k-mer of interest; case-insensitiveoutput.csv: filename of output csv filesignificance: significance level; between 0 and 1multiple.testing: ifNULL, no correction for multiple testing; otherwise, one of the possible arguments formethodinp.adjust()(i.e.bonferroni,holm,hochberg,hommel,BH,fdr, orBY)
The output csv file contains the following columns for each input sequence:
kmer.hits: number of times the k-mer of interest appears in the sequencekmer.p: empirical p-value for enrichment of k-mer of interestkmer.p.adj: if applicable, adjusted p-value for enrichment of the k-mer of interestcount.min/q1/mean/median/q3/max: summary statistics of counts of all k-mers presentsignificnt:0/1;1ifkmer.p(.adj) <significance
For a given sequence, r_kmer_enrich.R computes the frequency of all the k-mers present in that sequence. Based on this distribution, the empirical (one-tailed) p-value for enrichment of the k-mer of interest is computed. If input contains multiple sequences, correction for multiple testing can be performed using a method of the user's choice (multple.testing). The empirical p-value (adjusted p-value in the case of correcting for multiple testing) is then compared to a user-defined significance level, and a value of 1 indicates significance (p-value < significance).
is.enriched.wrapper(input.fasta = 'sample_input_mapk3.fa', target.kmer = 'GTC', output.csv = 'sample_output_mapk3.csv', significance = 0.01, multiple.testing = NULL)
See sample_output_mapk3.csv for output.
is.enriched.wrapper(input.fasta = 'sample_input_cd47_brca1.fa', target.kmer = 'ATTGC', output.csv = 'sample_output_cd47_brca1.csv', significance = 0.01, multiple.testing = 'fdr')
See sample_output_cd47_brca1.csv for output.