Hatch Experiments: v1 → v7.1

This document indexes the full experimental record of the Hatch protein disorder classifier. The series progressed through two distinct architectural phases: a sliding window state machine (v1–v6) and a global feature classifier (v7–v7.1). Each version represents a distinct hypothesis, tested against a held-out evaluation set.

Summary Table

Version	Architecture	Accuracy	Mean F1	F1 Folded	F1 Disordered	Test Set	Key Finding
v1	Full-Forgiveness state machine	—	35.2%	34.9%	35.5%	400 seqs	Single passing window resets cup to zero — "Entropy Eraser"
v2	Consecutive Window Consensus Drain	—	46.9%	31.8%	61.9%	400 seqs	+26.4% disordered F1; N=2 is the biophysical sweet spot
v3	Hysteresis Asymmetric Thresholds	—	65.2%	64.3%	66.2%	400 seqs	Hydraulic phase transition at T_fill=3; biggest single leap
v4	AUC-Weighted Feature Scoring	—	70.0%	73.9%	66.1%	400 seqs	Bulky Hydrophobic (AUC=0.838) is the dominant window-level sensor
v5	Coordinate Descent Per-Feature Thresholds	~75%	75.4%	77.0%	73.8%	400 seqs	Proline threshold miscalibrated by 50%; CD converged in 4 passes
v6	Multi-Scale Ensemble + Positional Weighting	~75%	75.4%	77.0%	73.8%	400 seqs	W=30 is the Biophysical Information Horizon; ensemble adds noise
v7	Global classifier, 7 features, k≥5	89.47%	89.41%	90.21%	88.62%	5,680 seqs	+14% jump from architecture change alone; global scale is correct
v7.1	Global classifier, 4 features, unanimous vote	89.70%	89.65%	89.84%	89.43%	5,680 seqs	Dropping 3 weakest features improves accuracy; entropy wins globally

Phase 1: The Sliding Window Series (v1–v6)

The first six versions tested a biophysically grounded sliding-window state machine: a "cup" fills each time a window scores below threshold, and the sequence is predicted DISORDERED when the cup overflows. The series exhausted every available lever within this paradigm.
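The cup mechanism can be sketched in a few lines. This is a minimal illustration, not the Hatch implementation: the function name `cup_predict`, the `capacity` and `drain_after` parameters, and the choice to drain the cup fully to zero are all assumptions made for the example. Setting `drain_after=1` mimics the v1 behavior in which a single passing window resets the cup; `drain_after=2` mimics the v2 requirement of two consecutive passing windows.

```python
from typing import Iterable


def cup_predict(window_ordered: Iterable[bool],
                capacity: int = 3,
                drain_after: int = 1) -> str:
    """Toy cup state machine (hypothetical parameters, not Hatch's own).

    window_ordered: one flag per window, True when the window passes
                    the order threshold, False when it scores below it.
    capacity:       number of failing windows that makes the cup overflow.
    drain_after:    consecutive passing windows needed to empty the cup.
    """
    cup = 0   # accumulated evidence of disorder
    run = 0   # current streak of passing (ordered) windows
    for ordered in window_ordered:
        if ordered:
            run += 1
            if run >= drain_after:
                cup = 0          # assumption: a drain empties the cup fully
        else:
            run = 0
            cup += 1
            if cup >= capacity:
                return "DISORDERED"
    return "FOLDED"


# One isolated ordered window inside a mostly failing stretch.
windows = [False, False, True, False, False]
print(cup_predict(windows, capacity=3, drain_after=1))  # v1-style reset
print(cup_predict(windows, capacity=3, drain_after=2))  # v2-style consensus drain
```

With the v1-style setting the lone ordered window erases the two failures before it, so the cup never overflows; with the v2-style setting the evidence survives and the third failure triggers the DISORDERED call.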

The ceiling was 75.4% mean F1, and it did not move regardless of parameter tuning, ensemble composition, or feature engineering. The plateau is architectural, not a tuning failure.

The 7 biophysical features used by Hatch are global thermodynamic properties. They describe the average character of a protein, not its local structure. When averaged over 30-residue windows, they are noisy samples of a global quantity. When averaged over the full sequence, they are the actual quantity. The global classifier achieves 89.7% accuracy because it measures these features at the right scale.
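To make the scale argument concrete, the sketch below computes one such feature, Shannon entropy of the amino acid composition, both over the full sequence and over 30-residue windows. The helper names `shannon_entropy` and `window_entropies`, the stride of 1, and the synthetic test sequence are assumptions for illustration; the actual Hatch feature definitions may differ.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits) of the residue composition of seq."""
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def window_entropies(seq: str, width: int = 30, stride: int = 1) -> list[float]:
    """Entropy of every full-width window (stride is an assumed parameter)."""
    return [shannon_entropy(seq[i:i + width])
            for i in range(0, len(seq) - width + 1, stride)]


# Synthetic 60-residue sequence using all 20 amino acids equally often.
seq = "ACDEFGHIKLMNPQRSTVWY" * 3
global_h = shannon_entropy(seq)
local_h = window_entropies(seq, width=30)

print(f"global entropy: {global_h:.3f} bits")
print(f"window entropy range: {min(local_h):.3f} to {max(local_h):.3f} bits")
```

In this synthetic case every 30-residue window reports a lower entropy than the full sequence, because a 30-residue sample cannot represent 20 residue types evenly. The window value is an estimate of the global quantity, not the quantity itself.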

See EXP_V6.md for the full analysis of the plateau and the Uversky Scale Mismatch finding.

Phase 2: The Global Classifier (v7–v7.1)

Computing the 7 features over the entire protein sequence instead of 30-residue windows produced a jump of roughly 14 percentage points, from the ~75% plateau to 89.47% accuracy, in a single step. No new features. No new data. The same 7 features and the same threshold heuristic, just measured at the correct scale.

The v7.1 ablation study then showed that 3 of the 7 features are net noise contributors at the global scale. Dropping them and requiring unanimous agreement among the 4 strong features improved accuracy to 89.70% with a simpler, more interpretable model.
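The two voting rules in the summary table (k≥5 of 7 features in v7, unanimous agreement of 4 features in v7.1) are both instances of a k-of-n vote. The sketch below shows that rule under stated assumptions: the function name `vote`, the placeholder feature names, and the convention that each `True` is a vote for DISORDERED are all hypothetical, since this index does not say which class the votes count toward.

```python
def vote(feature_votes: dict[str, bool], min_agree: int) -> str:
    """k-of-n vote over per-feature threshold decisions.

    feature_votes: feature name -> True if that feature votes DISORDERED
                   (vote direction is an assumption for this sketch).
    min_agree:     number of agreeing votes required.
    """
    agree = sum(feature_votes.values())
    return "DISORDERED" if agree >= min_agree else "FOLDED"


# v7.1-style rule: 4 features, all must agree (placeholder feature names).
v71_votes = {"feature_1": True, "feature_2": True,
             "feature_3": True, "feature_4": False}
print(vote(v71_votes, min_agree=len(v71_votes)))

# v7-style rule: 7 features, at least 5 must agree.
v7_votes = {f"feature_{i}": i <= 5 for i in range(1, 8)}
print(vote(v7_votes, min_agree=5))
```

A unanimous rule is the strictest setting of `min_agree`; a single dissenting feature is enough to flip the call, which is why dropping noisy features matters more under this rule than under k≥5.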

The v7.1 classifier runs in 0.052 ms/sequence, 9,600× faster than IUPred, with no GPU, no external database, and no sequence alignment.
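For a sense of scale, the quoted figures can be turned into throughput numbers with simple arithmetic. The values below are derived only from the numbers stated in this document (0.052 ms/sequence, the 9,600× ratio, and the 5,680-sequence test set); the implied per-sequence baseline time is a consequence of that ratio, not a separate measurement, and the variable names are illustrative.

```python
MS_PER_SEQ = 0.052      # quoted v7.1 runtime per sequence
SPEEDUP = 9_600         # quoted ratio versus IUPred
N_TEST = 5_680          # size of the v7/v7.1 test set

seqs_per_second = 1_000 / MS_PER_SEQ            # throughput
test_set_seconds = N_TEST * MS_PER_SEQ / 1_000  # time for the whole test set
implied_baseline_ms = MS_PER_SEQ * SPEEDUP      # baseline time implied by the ratio

print(f"{seqs_per_second:,.0f} sequences per second")
print(f"{test_set_seconds:.3f} s for the full test set")
print(f"{implied_baseline_ms:.1f} ms per sequence implied for the baseline")
```

That works out to roughly 19,000 sequences per second, about 0.3 s for the full test set, and an implied baseline of about half a second per sequence.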

Key Biophysical Findings

The following findings emerged from the experimental record and are independent of the classification performance. They describe properties of the protein sequence-disorder relationship, not of the classifier.

1. The Biophysical Information Horizon is W=30. Below 30 residues, the hydrophobic core signal (AUC=0.838) is too noisy to be discriminative. At W=30, the signal stabilizes. W=40 adds marginal noise by blurring short ordered domains. The multi-scale ensemble experiment (v6) confirmed this by showing that W=20 is a noise injector.

2. N=2 is the minimum sustained order signal. Two consecutive ordered windows at W=30 span roughly 31–35 residues, about the length of one alpha-helix or a beta-strand pair. The grid search found N=2 independently across multiple parameter sweeps, which suggests that the minimum meaningful "sustained order" signal corresponds to one secondary structure element.

3. Bulky Hydrophobic Frequency is the dominant window-level predictor. AUC=0.838, Cohen's d=1.39. Disordered proteins cannot maintain a dense hydrophobic core — it requires stable tertiary structure. Shannon Entropy is second at the window level (AUC=0.785).

4. Shannon Entropy is the dominant global-level predictor. At the full-sequence scale, Shannon Entropy alone achieves 85.51% accuracy, outperforming Bulky Hydrophobic alone (84.24%). The ranking inverts because sequence complexity is the information-theoretic prerequisite for folding: a protein cannot encode a unique 3D structure without sufficient amino acid diversity. See SCIENCE.md for the full analysis.

5. The Uversky Charge-Hydrophobicity Ratio is a macroscopic property. The published AUC of 0.85–0.90 for the Uversky ratio is measured on full-sequence averages. At W=30, a single charged residue dominates the net charge of the window, making the ratio unstable (measured AUC=0.5708). This is a fundamental scale mismatch.

6. The "Full-Forgiveness" failure mode. v1's instant cup reset on a single passing window was destroying disorder memory. A single ordered window in a disordered protein (e.g., a short hydrophobic patch) would erase all accumulated evidence of disorder. The Consecutive Window Consensus Drain (v2) fixed this.

7. The Hydraulic Phase Transition at T_fill=3. Lowering the fill threshold from 4 to 3 produced a discontinuous jump in performance — not a gradual improvement. The model was penalizing "marginal order" windows that represent breathing loops and flexible linkers in folded proteins. This threshold corresponds to approximately 34–43% of maximum fold evidence, a structural invariant that appeared consistently across all parameter sweeps.

8. The dominant feature needs no calibration. The Bulky Hydrophobic Frequency threshold did not move during Coordinate Descent optimization at either the window level (v5) or the global level (v7). The midpoint between the folded and disordered medians was already the optimal decision boundary for the most discriminative feature.

9. The 4-feature unanimous vote is robust. Removing any single Tier-1 feature (one of the four strong features retained in v7.1) from the unanimous vote costs at most 2.55% accuracy. The four features measure the same underlying physical phenomenon (the capacity to form and maintain a hydrophobic core) from different angles. They are correlated, not independent, which is why the vote tolerates losing any one of them.
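The starting threshold described in finding 8, the midpoint between the folded and disordered medians, is simple to compute. The sketch below assumes per-class lists of feature values; the function name `midpoint_threshold` and the sample numbers are made up for illustration and are not Hatch data.

```python
from statistics import median


def midpoint_threshold(folded_values, disordered_values) -> float:
    """Decision boundary halfway between the two class medians."""
    return (median(folded_values) + median(disordered_values)) / 2


# Illustrative (invented) bulky-hydrophobic frequencies per class.
folded = [0.30, 0.34, 0.38]
disordered = [0.18, 0.20, 0.26]

threshold = midpoint_threshold(folded, disordered)
print(f"threshold = {threshold:.2f}")
```

A coordinate descent pass would then nudge this value up or down and keep a move only if it improved the evaluation metric; according to finding 8, no such move helped for this feature.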

Document Index

Document	Description
EXP_V1.md	v1: Full-Forgiveness baseline — the "Entropy Eraser" failure mode
EXP_V2.md	v2: Consecutive Window Consensus Drain — +26.4% disordered F1
EXP_V3.md	v3: Hysteresis Asymmetric Thresholds — the Hydraulic Phase Transition
EXP_V4.md	v4: AUC-Weighted Feature Scoring — Bulky Hydrophobic is dominant
EXP_V5.md	v5: Coordinate Descent Calibration — plateau entry at 75.4%
EXP_V6.md	v6: Multi-Scale Ensemble + Uversky Ratio — plateau confirmed
EXP_V7.md	v7 + v7.1: Global Classifier + CD + Ablation — 89.70% accuracy
SCIENCE.md	Deep-dive: entropy vs hydrophobicity, scale mismatch, robustness
ROADMAP.md	v8+ plan: Hybrid architecture and proteome scanner
