Single sites analysis - too many significant sites and FDR correction

Hello @ArtRand 

Thank you for such a nice package. I ran this for my analysis (ignoring positions with low valid coverage (<5X) & considering only effect sizes greater than 10%). I am investigating both 5mC and 5hmC and choose to not combine strands :
```
modkit dmr pair \
  -a /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivA2_pileup.bed.gz \
  -a /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivB2_pileup.bed.gz \
  -a /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivC2_pileup.bed.gz \
  -a /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivD2_pileup.bed.gz \
  -b /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivA1_pileup.bed.gz \
  -b /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivB1_pileup.bed.gz \
  -b /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivC1_pileup.bed.gz \
  -b /Volumes/Nicole/pileup_filtering_mod_base_calls/bothmarks/indivD1_pileup.bed.gz \
  -o /Users/MC/Desktop/UNP_aftervsbefore_4ind_single-base_5X_delta10.txt \
  --header \
  --ref /Users/MC/Downloads/ncbi_dataset/ncbi_dataset/data/GCA_000001405.29/GCA_000001405.29_GRCh38.p14_genomic.fna \
  --delta 0.1 \
  --base C \
  --threads 16 \
  --min-valid-coverage 5 \
  --log-filepath /Users/MC/Desktop/DMC_UNP_aftervsbefore_4ind_single-base_5X_delta10.log
```
 
which returned 
```
> reading reference FASTA at "/Users/MC/Downloads/ncbi_dataset/ncbi_dataset/data/GCA_000001405.29/GCA_000001405.29_GRCh38.p14_genomic.fna"
> running single-site analysis
> using default prior, Beta(α: 0.55, β: 0.55)
> estimating max coverages from data
> sampled 1667716 a records and 1346810 b records, calculating max coverages for 95th percentile
> calculated max coverage for a: 17 and b: 16
> min valid coverage set to 5
> running with replicates and matched samples
> finished, processed 52301679 sites successfully, 0 failed

```

I further filtered out p-values pv<0.05 and required 75% of samples to be used in statistical test (which corresponds to 3 out of 4 individuals in my case). I end up with 458,258 "significant" CpGs, which is huge I find (I historically worked with WGBS data and the number of significant DMC is in the magnitude of hundreds). I should add that setting up pv<=0.01 returns only 73,964 significant CpGs. My volcano plot looks normal (please see below) and when I look at the CpGs individually in IGV, they are indeed significant. 

1. Should I consider adding extra filters to reduce the amount of significant sites? 
2. Should I consider a greater effect size and a pv<0.01 (represented by the vertical and horizontal lines)? 
3. Should I increase the coverage per cytosine during dmr pair? What do you think? 

<img width="560" alt="Image" src="https://github.com/user-attachments/assets/ea879841-217e-433a-9dea-27456eaa25ab" />

I read [here](https://github.com/nanoporetech/modkit/issues/191) that you found that the Benjamini-Hochberg procedure works pretty well with single site analysis.  So I calculated FDR on the 458,258 significant pvalues by doing `UNP$FDR <- p.adjust(UNP$map_pvalue, method = "BH")` and ended up with the exact same FDR for all of them = 0.04993573. Since I did 458,258 tests and `0.04993573*458,258 = 22,883` does this mean that I should expect 22,883 false discoveries - which corresponds to 1/20th of the sites. Again, is this something I should expect with Nanopore data? I would be most grateful if you could share your best practices for FDR correction.  


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Single sites analysis - too many significant sites and FDR correction #397

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Single sites analysis - too many significant sites and FDR correction #397

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions