exp232 follow-up: why doesn't distal (enhancer) VEP improve with training? — hypotheses to investigate #283

### Observation (#232)

The `ccre_non_promoter` (enhancer) arm is the lone exception to the diagonal-wins + LL-gap↔AUPRC story. Its **distal** AUPRC sits at ~0.10–0.13 across the *entire* training trajectory with no learning trend, and distal is the only subset with a ~0 / negative LL-gap↔AUPRC correlation (Pearson −0.10).

### Two clues that shape the question

- **The constraint signal grows; the AUPRC doesn't.** The arm's own `val_enhancer` LL-gap rises **0.037 → 0.098** over 5000 steps — so it *is* learning some enhancer constraint — yet distal AUPRC stays flat. The bottleneck is the **constraint → distal-VEP transfer**, not a total failure to learn. (For scale: cds's `val_cds` gap goes 0.007 → 0.41.)
- **The ccre arm does *better* off-diagonal on coding than on its own target.** Off the diagonal it scores splicing **0.238** and synonymous **0.179**, both *higher* than its own distal **0.127**. It is a better splicing predictor than enhancer predictor.

### Hypotheses (not committing — full list to investigate)

- **H1 — training-set contamination via conservation filtering (lead hypothesis).** After conservation filtering, the `ccre_non_promoter` partition may be enriched for conserved distal cCREs that **overlap or border CDS exons** (i.e. sit right at the CDS edge), so the arm partly trains on coding sequence. This would directly explain its strong splicing/synonymous off-diagonal *and* its weak own-distal signal. → *Diagnostic: intersect the v4 `ccre_non_promoter` training intervals with CDS/exon annotations; quantify CDS overlap and CDS-edge proximity. If confirmed, it spins off a v4 data-curation fix.*
- **H2 — heterogeneous target.** `ccre_non_promoter` mixes enhancers + insulators/CTCF + other distal cCREs → a diffuse training target with no sharp enhancer-specific constraint. → *Diagnostic: break the partition down by cCRE subtype; consider an enhancer-only arm.*
- **H3 — weak / fundamental signal.** Enhancers are intrinsically less constrained than coding/splice sites, so there's simply less for a gLM to exploit (gap 0.098 vs cds 0.41). May be a ceiling, not a fixable bug.
- **H4 — model too small.** 0.25B may lack the capacity for enhancer grammar. → *Caveat: #279 shows missense AUPRC **degrades** with scale, so scale isn't uniformly helpful — check enhancers specifically (e.g., exp187's 1B distal arm).*
- **H5 — train longer.** The LL-gap is still rising at step 5000, so would the gap and the AUPRC both keep improving past the 5000-step budget? → *Doubtful, since the gap already climbs while AUPRC stays flat.*
- **H6 — eval ceiling / ground truth.** The distal mendelian positive set is small (n=58) and effects are diffuse; the ceiling for distal mendelian VEP may just be low. → *Diagnostic: compare to exp187 1B distal, a supervised enhancer baseline, or shuffle controls before blaming the gLM.*
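The H1 diagnostic (intersect training intervals with CDS annotations, then quantify overlap and edge proximity) could be sketched roughly as below. This is a minimal standard-library sketch, not the project's pipeline: the function names, the `edge_window` threshold of 50 bp, the half-open `(chrom, start, end)` tuple format, and the toy coordinates are all assumptions for illustration. In practice the inputs would be read from the v4 `ccre_non_promoter` interval file and a CDS/exon annotation, and a dedicated interval tool would likely be more convenient.

```python
from bisect import bisect_left
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def cds_overlap_stats(train, cds, edge_window=50):
    """Summarise how much of `train` overlaps or sits near CDS.

    Both inputs are iterables of (chrom, start, end), half-open.
    `edge_window` (hypothetical default) is the max gap in bp for a
    non-overlapping interval to count as CDS-edge proximal.
    """
    by_chrom = defaultdict(list)
    for chrom, s, e in cds:
        by_chrom[chrom].append((s, e))
    merged = {c: merge_intervals(iv) for c, iv in by_chrom.items()}
    starts = {c: [s for s, _ in iv] for c, iv in merged.items()}

    total_bp = overlap_bp = n_overlapping = n_near_edge = 0
    for chrom, s, e in train:
        total_bp += e - s
        iv = merged.get(chrom, [])
        # Index of the first merged CDS block starting at or after `e`.
        idx = bisect_left(starts.get(chrom, []), e)
        ov = 0
        k = idx - 1
        # Walk left over blocks that end after `s`; these overlap [s, e).
        while k >= 0 and iv[k][1] > s:
            ov += min(e, iv[k][1]) - max(s, iv[k][0])
            k -= 1
        if ov > 0:
            overlap_bp += ov
            n_overlapping += 1
            continue
        # No overlap: gap to the nearest block on the right and on the left.
        gaps = []
        if idx < len(iv):
            gaps.append(iv[idx][0] - e)
        if k >= 0:
            gaps.append(s - iv[k][1])
        if gaps and min(gaps) <= edge_window:
            n_near_edge += 1

    return {
        "total_bp": total_bp,
        "overlap_bp": overlap_bp,
        "overlap_frac": overlap_bp / total_bp if total_bp else 0.0,
        "n_overlapping": n_overlapping,
        "n_near_edge": n_near_edge,
    }


# Toy coordinates only; not taken from the v4 dataset.
toy_cds = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 1000, 1100)]
toy_train = [
    ("chr1", 250, 350),    # overlaps the merged chr1 CDS block
    ("chr1", 310, 400),    # 10 bp past the CDS edge
    ("chr1", 1000, 1100),  # far from any CDS
    ("chr3", 0, 100),      # chromosome with no CDS at all
]
stats = cds_overlap_stats(toy_train, toy_cds)
print(stats)
```

A high `overlap_frac` or a large share of edge-proximal intervals would support H1; values near zero would argue against it.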

### Cheap first diagnostics when revisited

1. **H1 CDS-overlap intersection** — most directly testable and, if true, reframes everything.
2. **exp187 1B distal** — does scale help distal at all? (informs H4.)
3. The ccre arm's **own LL-gap-vs-distal-AUPRC scatter** — is there *any* positive regime, or is it flat throughout?
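For the third diagnostic, one way to ask whether there is *any* positive regime is to compute the Pearson correlation between LL-gap and distal AUPRC inside a sliding window of checkpoints, rather than once over the whole trajectory. The sketch below is an assumption-laden illustration: `pearson` and `windowed_pearson` are hypothetical helper names, the window size is arbitrary, and the numbers are made-up placeholders in arbitrary units, not exp232 results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def windowed_pearson(gap, auprc, window):
    """Correlation of gap vs AUPRC within each run of `window` consecutive checkpoints."""
    if len(gap) != len(auprc):
        raise ValueError("trajectories must have the same number of checkpoints")
    return [
        pearson(gap[i:i + window], auprc[i:i + window])
        for i in range(len(gap) - window + 1)
    ]


# Placeholder trajectories (arbitrary units), ordered by training step.
toy_gap = [1, 2, 3, 4, 5, 6]
toy_auprc = [1, 2, 3, 3, 2, 1]

rs = windowed_pearson(toy_gap, toy_auprc, window=3)
for i, r in enumerate(rs):
    print(f"checkpoints {i}..{i + 2}: r = {r:+.2f}")
```

With only a handful of checkpoints per window the estimates are noisy, so this is best read alongside the raw scatter rather than instead of it.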

### Status

Parking lot — lots of follow-up potential; not committing to a hypothesis. Revisit later.

References: #232 (diagonal + trajectory), #8 (LL-gap↔AUPRC framing), #279 (scale-axis VEP anomaly), #227/#228 (v4 dataset build).

