Redundant motifs from `modkit motif search`

Hello! 

I'm using modkit for a bacterial (epi)genomics project. I have a large, GC-rich bacterial genome with several methyltransferases which may or may not be expressed under culture conditions. I'd like to find out the recognition sequences of these enzymes and how many of them are active (how many unique methylated motifs are there?), and `motif search` seems like the tool for that job.

My only issue is that when I search for motifs exhaustively (using the defaults for min-sites, log-odds and fraction in the high modified), I get a few confident (log_odds > 10), parsimonious motifs followed by a number of redundant entries. These either contain another motif as a substring (e.g. AGCT is highly methylated, GNNNNSAGCTSNNG gets trawled up as its own motif), or a permutation of a motif with a canonical nucleotide replaced by an IUPAC ambiguity (e.g. GCCGGC turns into GCCGGNG), or a bunch of motifs split off from a more parsimonious motif with degenerate bases (SAGCTS becomes GAGCTC, CAGCTC, ...)

For example, here's the first 12 results from my attempt:
`modkit pileup -t 16 --filter-threshold 0.66 bam/alignment_sorted.bam pileup/mods.bed` 
`modkit motif search -t 16 -i pileup/mods.bed -r alignment/ref.fasta -o motifs/motifs.tsv # all defaults`
![image](https://github.com/user-attachments/assets/e7d2b076-28fa-4f77-ac6b-ea741deec207)

Modkit confidently discovers the GCCGGC motif, but it also a bunch of redundant entries containing the GCCGGC motif:
m **GCCGGC** 2
m GNNNGNNGG**GSCGGN**G 11 -> GNNNGNNGG**GCCGGC**G | two substitutions, S->C, N->C
m GGG**GCCGGN**G 5                    -> GGG**GCCGGC**G | one substitution, N->C
m GNNGGG**GCCGG** 8                 -> GNNGGG**GCCGGC** | insert one C at the end
m GNNNGGGGCCGG 9                       -> GNNNGGG**GCCGGC** | insert one C at the end
seem to just be permutations of each other in different genome contexts. 

I guess my question is how to approach reporting unique motifs. Is there a principled way to filter out redundant entries or align and cluster them? Can I adjust the parameters of `motif search` or `motif refine` to be more permissive about merging sequences? 

I am in no way a Bioinformatics Expert so if I'm missing something obvious or making a mistake processing my data or setting parameters, feel free to let me know. Also happy to send over any data or set up a reproducible example.

Thanks,
Galen


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Redundant motifs from `modkit motif search` #272

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Redundant motifs from modkit motif search #272

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Redundant motifs from `modkit motif search` #272