Description
Hello @ArtRand
Thank you for such a nice package. I ran this for my analysis (ignoring positions with low valid coverage (<5X) & considering only effect sizes greater than 10%). I am investigating both 5mC and 5hmC and choose to not combine strands :
modkit dmr pair \
-a /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivA2_pileup.bed.gz \
-a /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivB2_pileup.bed.gz \
-a /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivC2_pileup.bed.gz \
-a /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivD2_pileup.bed.gz \
-b /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivA1_pileup.bed.gz \
-b /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivB1_pileup.bed.gz \
-b /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivC1_pileup.bed.gz \
-b /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivD1_pileup.bed.gz \
-o /Users/MC/Desktop/UNP_aftervsbefore_4ind_single-base_5X_delta10.txt \
--header \
--ref /Users/MC/Downloads/ncbi_dataset/ncbi_dataset/data/GCA_000001405.29/GCA_000001405.29_GRCh38.p14_genomic.fna \
--delta 0.1 \
--base C \
--threads 16 \
--min-valid-coverage 5 \
--log-filepath /Users/MC/Desktop/DMC_UNP_aftervsbefore_4ind_single-base_5X_delta10.log
which returned
> reading reference FASTA at "/Users/MC/Downloads/ncbi_dataset/ncbi_dataset/data/GCA_000001405.29/GCA_000001405.29_GRCh38.p14_genomic.fna"
> running single-site analysis
> using default prior, Beta(α: 0.55, β: 0.55)
> estimating max coverages from data
> sampled 1667716 a records and 1346810 b records, calculating max coverages for 95th percentile
> calculated max coverage for a: 17 and b: 16
> min valid coverage set to 5
> running with replicates and matched samples
> finished, processed 52301679 sites successfully, 0 failed
I further filtered out p-values pv<0.05 and required 75% of samples to be used in statistical test (which corresponds to 3 out of 4 individuals in my case). I end up with 458,258 "significant" CpGs, which is huge I find (I historically worked with WGBS data and the number of significant DMC is in the magnitude of hundreds). I should add that setting up pv<=0.01 returns only 73,964 significant CpGs. My volcano plot looks normal (please see below) and when I look at the CpGs individually in IGV, they are indeed significant.
- Should I consider adding extra filters to reduce the amount of significant sites?
- Should I consider a greater effect size and a pv<0.01 (represented by the vertical and horizontal lines)?
- Should I increase the coverage per cytosine during dmr pair? What do you think?

I read here that you found that the Benjamini-Hochberg procedure works pretty well with single site analysis. So I calculated FDR on the 458,258 significant pvalues by doing UNP$FDR <- p.adjust(UNP$map_pvalue, method = "BH")
and ended up with the exact same FDR for all of them = 0.04993573. Since I did 458,258 tests and 0.04993573*458,258 = 22,883
does this mean that I should expect 22,883 false discoveries - which corresponds to 1/20th of the sites. Again, is this something I should expect with Nanopore data? I would be most grateful if you could share your best practices for FDR correction.