Develop a cross-species enhancer classifier #96

@gonzalobenegas

Goals

  1. Predict enhancer locations in non-model organisms from arbitrary genome FASTA files, without requiring species-specific experimental data (e.g., ChIP-seq, ATAC-seq).

  2. Identify enhancers specifically, not other functional elements. We already have tools for coding regions, UTRs, promoters, and repeats. The classifier should isolate enhancers from all other element types.

  3. Prioritize functional enhancers over biochemically defined ones. Databases like ENCODE SCREEN catalogue enhancers by biochemical activity (e.g., chromatin marks), but statistical genetics shows that enhancers as a whole are only modestly enriched for heritability of complex traits (Mammalian evolution of human cis-regulatory elements and transcription factor binding sites; Leveraging base-pair mammalian constraint to understand genetic variation and human disease). It is the conserved enhancers that matter most: conservation at the primate scale (even more so than mammalian) tends to be the most enriched signal for human genetics. The goal is to combine biochemical and conservation information to curate the highest-quality training data under the strongest functional constraints.

  4. The ultimate evaluation is downstream model performance. The end goal is to train genomic language models on the curated enhancer data and evaluate how well they perform on variant effect prediction in enhancers. Classification metrics (AUROC, AUPRC) are useful proxies, but they can be misleading: a high AUPRC may simply reflect an easy task (e.g., poorly chosen negatives or trivial splits) rather than a meaningful signal. Whether the classifier is actually useful can only be validated downstream, by training gLMs on the predicted enhancers and measuring their performance.

  5. Scalable inference. The method must be fast enough to run on hundreds of whole genomes.

  6. Prioritize precision over recall in enhancer curation. Our working hypothesis is that for training genomic language models, false positives (non-functional sequences labeled as enhancers) might be more damaging than false negatives (missed real enhancers). The intuition is that non-functional DNA in the training set directly corrupts the learned sequence patterns, while missing some real enhancers only reduces training set size, a much more tolerable cost. If this holds, we should err on the side of aggressive filtering: it is better to discard some real enhancers than to include junk. This principle would guide threshold choices for the conservation, repeat, and functional element filters.
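As an illustration of goal 6, the sketch below picks the lowest score cutoff that still meets a target precision, which maximizes recall subject to that precision. The function name `threshold_for_precision` and the toy scores are hypothetical; this assumes a classifier that emits one score per candidate and a labeled validation set, and it is not a committed design.

```python
from typing import List, Optional, Tuple


def threshold_for_precision(
    scores: List[float], labels: List[int], target_precision: float
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for the lowest cutoff whose
    precision is at least target_precision, or None if no cutoff qualifies.
    A candidate is predicted positive when score >= threshold."""
    total_pos = sum(labels)
    if total_pos == 0:
        return None
    # Walk candidates from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    best = None
    tp = 0
    for i, (score, label) in enumerate(ranked, start=1):
        tp += label
        # Evaluate only at the end of a run of tied scores, since a
        # threshold cannot separate candidates with identical scores.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        precision = tp / i
        if precision >= target_precision:
            # Later (lower) qualifying cutoffs overwrite earlier ones,
            # so the final value has the highest recall.
            best = (score, precision, tp / total_pos)
    return best


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
    toy_labels = [1, 1, 0, 1, 0, 0]
    print(threshold_for_precision(toy_scores, toy_labels, 0.75))
```

Raising `target_precision` moves the cutoff up and discards more real enhancers, which is the trade-off the working hypothesis accepts.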

Assumptions

  1. High-quality enhancer databases exist for human. ENCODE SCREEN provides well-curated cCREs, and well-calibrated conservation scores (phastCons, phyloP) allow reliable identification of functional enhancers in the human genome.

  2. Other species have weaker or uncertain annotations. ENCODE SCREEN also covers mouse, and conservation scores are available, so we can probably identify functional enhancers there, but with less confidence than in human. For more distant species, reliable enhancer databases may not exist at all.

  3. Genome annotations are available. We assume access to gene annotations (e.g., GTF files) providing coding regions, UTRs, promoters, etc., as well as RepeatMasker annotations for repeat elements. We don't need to predict these, and we can use this information to help predict enhancers specifically.

Related work

Resources

Database | URL | Paper
ENCODE SCREEN (cCREs) | https://screen.wenglab.org/ | Expanded encyclopaedias of DNA elements in the human and mouse genomes (Nature, 2020); An expanded registry of candidate cis-regulatory elements (Nature, 2026)
EnhancerAtlas 2.0 | http://enhanceratlas.org/ | EnhancerAtlas 2.0: an updated resource with enhancer annotation in 586 tissue/cell types across nine species (NAR, 2020)
SEA (Super-Enhancer Archive) v4.0 | http://sea4.edbc.org/ | SEA version 4.0: a major expansion and update of the Super-Enhancer Archive (NAR, 2026)

Design questions

Positive set

Base positives are ENCODE SCREEN cCREs classified as having enhancer-like signatures, either distal (dELS) or proximal (pELS). Optional filters to increase quality:

  • Conservation filter: require a minimum number of conserved bases (e.g., primate-level phastCons) within the element's core region.
  • Repeat filter: remove elements overlapping repetitive regions (RepeatMasker).
  • Functional element filter: remove elements overlapping other known functional regions (CDS, UTRs, promoters) to isolate enhancer-specific signal. This is particularly important when using conservation filtering, since conserved regions will include many coding sequences.
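The three filters above could be combined roughly as follows. This is a minimal sketch under stated assumptions: intervals are BED-style `(chrom, start, end)` tuples (0-based, half-open), conservation is a per-base score list per chromosome, and the parameter values (`core_size`, `min_score`, `min_conserved`) are placeholders rather than chosen thresholds. All function names are hypothetical, and the linear overlap scan is only for illustration; at genome scale an interval index or a tool such as bedtools would be the practical choice.

```python
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def overlaps_any(query: Interval, others: Sequence[Interval]) -> bool:
    """True if query shares at least one base with any interval in others."""
    chrom, start, end = query
    return any(c == chrom and s < end and start < e for c, s, e in others)


def core_region(interval: Interval, core_size: int) -> Interval:
    """Window of up to core_size bases centered on the element midpoint."""
    chrom, start, end = interval
    mid = (start + end) // 2
    half = core_size // 2
    return (chrom, max(start, mid - half), min(end, mid + half))


def n_conserved(
    core: Interval, scores: Dict[str, Sequence[float]], min_score: float
) -> int:
    """Count bases in the core whose conservation score is >= min_score."""
    chrom, start, end = core
    return sum(1 for x in scores[chrom][start:end] if x >= min_score)


def filter_positives(
    candidates: Sequence[Interval],
    scores: Dict[str, Sequence[float]],
    repeats: Sequence[Interval],
    functional: Sequence[Interval],
    core_size: int = 100,      # placeholder value
    min_score: float = 0.9,    # placeholder value
    min_conserved: int = 20,   # placeholder value
) -> List[Interval]:
    """Keep candidates that pass the repeat, functional element,
    and conservation filters."""
    kept = []
    for iv in candidates:
        # Repeat filter and functional element filter (CDS, UTR, promoter).
        if overlaps_any(iv, repeats) or overlaps_any(iv, functional):
            continue
        # Conservation filter on the core region only.
        core = core_region(iv, core_size)
        if n_conserved(core, scores, min_score) < min_conserved:
            continue
        kept.append(iv)
    return kept
```

Applying the overlap filters before the conservation filter keeps conserved coding sequence from entering the positive set.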

Negative set

Key decisions:

  • Ratio: a balanced 1:1 ratio, a fixed over-representation of negatives (e.g., 4:1 or 9:1 negatives to positives), or the natural genomic complement.
  • Sampling strategy: random genomic windows, or matched on properties like GC content and repeat content.
  • Composition: whether to enrich negatives with non-conserved enhancers (biochemically active but not conserved), so the classifier learns to distinguish conserved/functional enhancers from non-conserved ones.
  • Train vs. eval: the negative set can differ between training and evaluation. Training may use a balanced set, while evaluation could use the whole genome (natural class proportions) for more realistic performance estimates.
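One way the ratio and GC-matching options could look in practice is sketched below. It assumes positives and a pool of candidate negative windows are already available as sequence strings; the function names, the binning scheme (`n_bins`), and the sampling-without-replacement behavior are illustrative assumptions, not decisions made in this issue.

```python
import random
from typing import Dict, List, Sequence


def gc_fraction(seq: str) -> float:
    """GC fraction over unambiguous bases (N and other symbols ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def gc_bin(seq: str, n_bins: int = 20) -> int:
    """Map a sequence to one of n_bins equal-width GC bins."""
    return min(int(gc_fraction(seq) * n_bins), n_bins - 1)


def sample_gc_matched_negatives(
    positives: Sequence[str],
    pool: Sequence[str],
    ratio: int = 1,
    n_bins: int = 20,
    seed: int = 0,
) -> List[str]:
    """For each positive, draw up to `ratio` negatives from the same GC bin,
    without replacement. Returns fewer if a bin runs out."""
    rng = random.Random(seed)
    by_bin: Dict[int, List[str]] = {}
    for seq in pool:
        by_bin.setdefault(gc_bin(seq, n_bins), []).append(seq)
    for seqs in by_bin.values():
        rng.shuffle(seqs)
    negatives: List[str] = []
    for pos in positives:
        bucket = by_bin.get(gc_bin(pos, n_bins), [])
        for _ in range(ratio):
            if bucket:
                negatives.append(bucket.pop())
    return negatives
```

Setting `ratio=4` or `ratio=9` gives the over-represented negative sets; matching on repeat content would need an extra binning key alongside GC.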

Splits

  • Single-species: hold out one chromosome (e.g., human chr19 for validation).
  • Multi-species: hold out entire species or specific chromosomes from other species (e.g., leave out all mouse data, or leave out specific mouse chromosomes).
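Both split options can be expressed with one rule: a record is held out if its species is held out, or if its (species, chromosome) pair is held out. The sketch below assumes records are `(species, chrom, start, end)` tuples; the names `split_records`, `holdout_species`, and `holdout_chroms` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Record = Tuple[str, str, int, int]  # (species, chrom, start, end)


def split_records(
    records: Iterable[Record],
    holdout_species: Set[str],
    holdout_chroms: Set[Tuple[str, str]],
) -> Dict[str, List[Record]]:
    """Assign each record to 'train' or 'heldout' by species and chromosome."""
    splits: Dict[str, List[Record]] = {"train": [], "heldout": []}
    for rec in records:
        species, chrom = rec[0], rec[1]
        if species in holdout_species or (species, chrom) in holdout_chroms:
            splits["heldout"].append(rec)
        else:
            splits["train"].append(rec)
    return splits


if __name__ == "__main__":
    toy = [
        ("human", "chr1", 100, 200),
        ("human", "chr19", 100, 200),
        ("mouse", "chr1", 100, 200),
    ]
    # Single-species style: hold out human chr19 only.
    print(split_records(toy, set(), {("human", "chr19")}))
    # Multi-species style: hold out all mouse data.
    print(split_records(toy, {"mouse"}, set()))
```

Splitting by whole chromosome or whole species, rather than by random element, avoids placing near-identical neighboring windows on both sides of the split.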

Modeling approaches

  • Train from scratch (e.g., gkmSVM)
  • Fine-tune a genomic language model (gLM)
  • Fine-tune a sequence-to-function (S2F) model (e.g., AlphaGenome)
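For the train-from-scratch option, the feature space behind gapped k-mer methods can be illustrated by counting, for every window of length `l`, each pattern that keeps `k` informative positions and masks the rest. This is only a toy featurizer to show the idea: gkmSVM itself computes a kernel without enumerating features explicitly, and the function name and the default `l` and `k` below are assumptions.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq: str, l: int = 6, k: int = 4) -> Counter:
    """Count gapped k-mers: for each length-l window, every choice of k
    informative positions, with the remaining l - k positions shown as '-'."""
    seq = seq.upper()
    counts: Counter = Counter()
    masks = [set(m) for m in combinations(range(l), k)]
    for i in range(len(seq) - l + 1):
        window = seq[i : i + l]
        # Skip windows containing N or other ambiguous symbols.
        if any(base not in "ACGT" for base in window):
            continue
        for keep in masks:
            pattern = "".join(
                window[j] if j in keep else "-" for j in range(l)
            )
            counts[pattern] += 1
    return counts


if __name__ == "__main__":
    print(gapped_kmer_counts("ACGT", l=3, k=2))
```

The resulting counts could feed any linear classifier as a baseline, although explicit enumeration grows quickly with `l` and would not be the efficient route for whole genomes.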

Results

In progress. Iterations will be documented in issue comments; this section will be kept as a live summary of findings.
