exp232 follow-up: why doesn't distal (enhancer) VEP improve with training? — hypotheses to investigate #283

@gonzalobenegas

Observation (#232)

The ccre_non_promoter (enhancer) arm is the lone exception to the diagonal-wins + LL-gap↔AUPRC story. Its distal AUPRC sits at ~0.10–0.13 across the entire training trajectory with no learning trend, and distal is the only subset with a ~0 / negative LL-gap↔AUPRC correlation (Pearson −0.10).

Two clues that shape the question

  • The constraint signal grows; the AUPRC doesn't. The arm's own val_enhancer LL-gap rises 0.037 → 0.098 over 5000 steps — so it is learning some enhancer constraint — yet distal AUPRC stays flat. The bottleneck is the constraint → distal-VEP transfer, not a total failure to learn. (For scale: cds's val_cds gap goes 0.007 → 0.41.)
  • The ccre arm does better off-diagonal on coding than on its own target. From the diagonal it scores splicing 0.238 and synonymous 0.179 — both higher than its own distal 0.127. It is a better splicing predictor than enhancer predictor.
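A minimal sketch of how the first clue could be checked checkpoint by checkpoint: compute the Pearson correlation between the arm's LL-gap and its distal AUPRC over the whole trajectory, then again inside sliding windows of checkpoints. The function names, the window size, and the toy numbers are all assumptions made for illustration; the toy series are not measurements from exp232.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns None when either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a flat series
    return sxy / sqrt(sxx * syy)


def windowed_pearson(gaps, auprcs, window):
    """Correlation inside each sliding window of consecutive checkpoints."""
    return [
        pearson(gaps[i:i + window], auprcs[i:i + window])
        for i in range(len(gaps) - window + 1)
    ]


# Toy values for illustration only, NOT results from exp232.
toy_gap = [0.04, 0.05, 0.06, 0.07, 0.08, 0.10]
toy_auprc = [0.11, 0.12, 0.10, 0.13, 0.11, 0.12]

overall = pearson(toy_gap, toy_auprc)
per_window = windowed_pearson(toy_gap, toy_auprc, window=4)
```

If any window shows a clearly positive correlation, there may be a training regime where the constraint signal does transfer to distal VEP; if every window hovers near zero, the flat-throughout reading holds.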

Hypotheses (not committing — full list to investigate)

  • H1 — training-set contamination via conservation filtering (lead hypothesis). After conservation filtering, the ccre_non_promoter partition may be enriched for conserved distal cCREs that overlap or border CDS exons (i.e. sit right at the CDS edge), so the arm partly trains on coding sequence. This would directly explain its strong splicing/synonymous off-diagonal and its weak own-distal signal. → Diagnostic: intersect the v4 ccre_non_promoter training intervals with CDS/exon annotations; quantify CDS overlap and CDS-edge proximity. If confirmed, it spins off a v4 data-curation fix.
  • H2 — heterogeneous target. ccre_non_promoter mixes enhancers + insulators/CTCF + other distal cCREs → a diffuse training target with no sharp enhancer-specific constraint. → Diagnostic: break the partition down by cCRE subtype; consider an enhancer-only arm.
  • H3 — weak / fundamental signal. Enhancers are intrinsically less constrained than coding/splice sites, so there's simply less for a gLM to exploit (gap 0.098 vs cds 0.41). May be a ceiling, not a fixable bug.
  • H4 — model too small. 0.25B may lack the capacity for enhancer grammar. → Caveat: #279 ("Missense VEP degrades with model scale (uniquely among variant classes)") shows missense AUPRC degrading with scale, so scale isn't uniformly helpful. Check enhancers specifically (e.g., exp187's 1B distal arm).
  • H5 — train longer. The LL-gap is still rising at step 5000. Would the gap keep rising, and would the AUPRC start to improve, past the current budget? → Suspect, since the gap already climbs while AUPRC stays flat.
  • H6 — eval ceiling / ground truth. The distal mendelian positive set is small (n=58) and effects are diffuse; the ceiling for distal mendelian VEP may just be low. → Diagnostic: compare to exp187 1B distal, a supervised enhancer baseline, or shuffle controls before blaming the gLM.
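The shuffle control mentioned under H6 could look roughly like the sketch below: score the real labels, then rescore many random permutations of the labels to see where chance-level AUPRC sits for a positive set this small. It assumes AUPRC is computed as step-wise average precision, which may differ from the metric used in #232; `average_precision`, `shuffle_null`, and `empirical_p` are hypothetical helper names, and ties in the scores are broken by input order.

```python
import random


def average_precision(labels, scores):
    """Mean of the precision at the rank of each positive (ties by input order)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / n_pos


def shuffle_null(labels, scores, n_perm=1000, seed=0):
    """Average precision under random label permutations (scores held fixed)."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(average_precision(shuffled, scores))
    return null


def empirical_p(observed, null):
    """One-sided permutation p-value with the usual +1 correction."""
    return (1 + sum(v >= observed for v in null)) / (1 + len(null))
```

For a random scorer the expected AUPRC is roughly the positive prevalence, so comparing the observed ~0.10 to 0.13 against this null (and against its spread with only 58 positives) indicates how much headroom the benchmark leaves before the gLM is blamed.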

Cheap first diagnostics when revisited

  1. H1 CDS-overlap intersection — most directly testable and, if true, reframes everything.
  2. exp187 1B distal — does scale help distal at all? (informs H4.)
  3. The ccre arm's own LL-gap-vs-distal-AUPRC scatter — is there any positive regime, or is it flat throughout?
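The first diagnostic above (the H1 CDS-overlap intersection) can be sketched in plain Python. This is an illustration under stated assumptions, not the v4 pipeline: intervals are half-open `(start, end)` tuples grouped by chromosome, training intervals are assumed not to overlap each other, the 50 bp edge threshold is arbitrary, and loading the actual training intervals and CDS annotations is left out. The linear scans are fine for a sketch; a genome-wide run would want a dedicated interval tool.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def overlap_bp(interval, merged_cds):
    """Number of bases of `interval` covered by the merged CDS intervals."""
    start, end = interval
    return sum(max(0, min(end, e) - max(start, s)) for s, e in merged_cds)


def distance_to_cds(interval, merged_cds):
    """Gap in bp to the nearest CDS (0 if overlapping or abutting); None if no CDS."""
    start, end = interval
    best = None
    for s, e in merged_cds:
        if s < end and start < e:
            return 0
        gap = s - end if s >= end else start - e
        if best is None or gap < best:
            best = gap
    return best


def summarize(train, cds, edge_bp=50):
    """train, cds: dict mapping chromosome -> list of (start, end)."""
    n = total_bp = cds_bp = n_overlap = n_near = 0
    for chrom, intervals in train.items():
        merged = merge_intervals(cds.get(chrom, []))
        for iv in intervals:
            n += 1
            total_bp += iv[1] - iv[0]
            ov = overlap_bp(iv, merged)
            cds_bp += ov
            dist = distance_to_cds(iv, merged)
            if ov > 0:
                n_overlap += 1
            elif dist is not None and dist <= edge_bp:
                n_near += 1  # no overlap, but within edge_bp of a CDS edge
    if n == 0 or total_bp == 0:
        raise ValueError("no training intervals given")
    return {
        "frac_bp_in_cds": cds_bp / total_bp,
        "frac_intervals_overlapping": n_overlap / n,
        "frac_intervals_near_edge": n_near / n,
    }
```

A high `frac_bp_in_cds` or `frac_intervals_near_edge` for the ccre_non_promoter partition would support H1; values near zero would push attention toward H2 through H6.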

Status

Parking lot — lots of follow-up potential; not committing to a hypothesis. Revisit later.

References: #232 (diagonal + trajectory), #8 (LL-gap↔AUPRC framing), #279 (scale-axis VEP anomaly), #227/#228 (v4 dataset build).
