Observation (#232)
The ccre_non_promoter (enhancer) arm is the lone exception to the diagonal-wins + LL-gap↔AUPRC story. Its distal AUPRC sits at ~0.10–0.13 across the entire training trajectory with no learning trend, and distal is the only subset whose LL-gap↔AUPRC correlation is near zero and slightly negative (Pearson −0.10).
Two clues that shape the question
The constraint signal grows; the AUPRC doesn't. The arm's own val_enhancer LL-gap rises 0.037 → 0.098 over 5000 steps — so it is learning some enhancer constraint — yet distal AUPRC stays flat. The bottleneck is the constraint → distal-VEP transfer, not a total failure to learn. (For scale: cds's val_cds gap goes 0.007 → 0.41.)
The ccre arm does better off-diagonal on coding than on its own target. From the diagonal it scores splicing 0.238 and synonymous 0.179 — both higher than its own distal 0.127. It is a better splicing predictor than enhancer predictor.
Hypotheses (not committing — full list to investigate)
H1 — training-set contamination via conservation filtering (lead hypothesis). After conservation filtering, the ccre_non_promoter partition may be enriched for conserved distal cCREs that overlap or border CDS exons (i.e. sit right at the CDS edge), so the arm partly trains on coding sequence. This would directly explain its strong splicing/synonymous off-diagonal and its weak own-distal signal. → Diagnostic: intersect the v4 ccre_non_promoter training intervals with CDS/exon annotations; quantify CDS overlap and CDS-edge proximity. If confirmed, it spins off a v4 data-curation fix.
H2 — heterogeneous target. ccre_non_promoter mixes enhancers + insulators/CTCF + other distal cCREs → a diffuse training target with no sharp enhancer-specific constraint. → Diagnostic: break the partition down by cCRE subtype; consider an enhancer-only arm.
H3 — weak / fundamental signal. Enhancers are intrinsically less constrained than coding/splice sites, so there's simply less for a gLM to exploit (gap 0.098 vs cds 0.41). May be a ceiling, not a fixable bug.
H5 — train longer. The LL-gap is still rising at step 5000. Would it and the AUPRC keep improving past the current training budget? → Doubtful, since the gap already climbs while AUPRC stays flat.
H6 — eval ceiling / ground truth. The distal mendelian positive set is small (n=58) and effects are diffuse; the ceiling for distal mendelian VEP may just be low. → Diagnostic: compare to exp187 1B distal, a supervised enhancer baseline, or shuffle controls before blaming the gLM.
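A minimal sketch of the H1 diagnostic, assuming both the training intervals and the CDS annotations are available as BED-style `(chrom, start, end)` tuples (0-based, half-open) and that training intervals do not overlap each other. The function name `cds_overlap_stats`, the `edge_window` parameter, and the toy coordinates are hypothetical, not part of the v4 pipeline. On real data a tool such as bedtools would likely be the practical choice; this only pins down what "CDS overlap" and "CDS-edge proximity" could mean.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def cds_overlap_stats(train, cds, edge_window=50):
    """Quantify how much of the training set touches coding sequence.

    train, cds: lists of (chrom, start, end), 0-based half-open.
    Returns:
      cds_bp_fraction: fraction of training bases that fall inside CDS.
      near_cds_interval_fraction: fraction of training intervals that
        overlap a CDS interval or lie less than edge_window bp from one.
    """
    by_chrom = defaultdict(list)
    for chrom, s, e in cds:
        by_chrom[chrom].append((s, e))
    merged_by_chrom = {c: merge_intervals(v) for c, v in by_chrom.items()}
    ends_by_chrom = {c: [e for _, e in v] for c, v in merged_by_chrom.items()}

    total_bp = overlap_bp = near = 0
    for chrom, s, e in train:
        total_bp += e - s
        merged = merged_by_chrom.get(chrom, [])
        ends = ends_by_chrom.get(chrom, [])
        # First merged CDS interval whose end lies beyond (s - edge_window).
        i = bisect_right(ends, s - edge_window)
        is_near = False
        while i < len(merged) and merged[i][0] < e + edge_window:
            cs, ce = merged[i]
            is_near = True
            overlap_bp += max(0, min(e, ce) - max(s, cs))
            i += 1
        near += is_near

    return {
        "cds_bp_fraction": overlap_bp / total_bp if total_bp else 0.0,
        "near_cds_interval_fraction": near / len(train) if train else 0.0,
    }


# Toy coordinates, made up for illustration only.
toy_train = [("chr1", 100, 200), ("chr1", 1000, 1100), ("chr2", 0, 50)]
toy_cds = [("chr1", 150, 300), ("chr1", 180, 250), ("chr1", 1120, 1200)]
toy_stats = cds_overlap_stats(toy_train, toy_cds, edge_window=50)
```

Running the same function on a size-matched random distal background would give a reference point for judging whether the partition is enriched for CDS-adjacent sequence.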
Cheap first diagnostics when revisited
H1 CDS-overlap intersection — most directly testable and, if true, reframes everything.
exp187 1B distal — does scale help distal at all? (informs H4.)
The ccre arm's own LL-gap-vs-distal-AUPRC scatter — is there any positive regime, or is it flat throughout?
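One way to probe the "any positive regime" question is a sliding-window Pearson correlation over consecutive checkpoints. This is a sketch only: it assumes the per-checkpoint val_enhancer LL-gap and distal AUPRC are available as two equal-length lists, and the names `windowed_pearson` and `window` are hypothetical. The numbers in the example are synthetic, not the ccre arm's trajectory.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def windowed_pearson(ll_gap, auprc, window=5):
    """Pearson r within each sliding window of consecutive checkpoints."""
    if len(ll_gap) != len(auprc):
        raise ValueError("ll_gap and auprc must have the same length")
    if window < 2:
        raise ValueError("window must cover at least two checkpoints")
    return [
        pearson(ll_gap[i:i + window], auprc[i:i + window])
        for i in range(len(ll_gap) - window + 1)
    ]


# Synthetic example: AUPRC tracks the gap early, then decouples and falls.
synthetic_gap = [1, 2, 3, 4, 5, 6]
synthetic_auprc = [1, 2, 3, 3, 2, 1]
rs = windowed_pearson(synthetic_gap, synthetic_auprc, window=3)
```

A run of clearly positive windows early in training followed by flat or negative ones would point to a transient regime. With few checkpoints per window each r is noisy, so the windows are better read as a trend than as individual estimates.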
Status
Parking lot — lots of follow-up potential; not committing to a hypothesis. Revisit later.
References: #232 (diagonal + trajectory), #8 (LL-gap↔AUPRC framing), #279 (scale-axis VEP anomaly), #227/#228 (v4 dataset build).