This document indexes the full experimental record of the Hatch protein disorder classifier. The series progressed through two distinct architectural phases: a sliding window state machine (v1–v6) and a global feature classifier (v7–v7.1). Each version represents a distinct hypothesis, tested against a held-out evaluation set.
| Version | Architecture | Accuracy | Mean F1 | F1 Folded | F1 Disordered | Test Set | Key Finding |
|---|---|---|---|---|---|---|---|
| v1 | Full-Forgiveness state machine | — | 35.2% | 34.9% | 35.5% | 400 seqs | Single passing window resets cup to zero — "Entropy Eraser" |
| v2 | Consecutive Window Consensus Drain | — | 46.9% | 31.8% | 61.9% | 400 seqs | +26.4 points disordered F1 over v1; N=2 is the biophysical sweet spot |
| v3 | Hysteresis Asymmetric Thresholds | — | 65.2% | 64.3% | 66.2% | 400 seqs | Hydraulic phase transition at T_fill=3; biggest single leap |
| v4 | AUC-Weighted Feature Scoring | — | 70.0% | 73.9% | 66.1% | 400 seqs | Bulky Hydrophobic (AUC=0.838) is the dominant window-level sensor |
| v5 | Coordinate Descent Per-Feature Thresholds | ~75% | 75.4% | 77.0% | 73.8% | 400 seqs | Proline threshold miscalibrated by 50%; CD converged in 4 passes |
| v6 | Multi-Scale Ensemble + Positional Weighting | ~75% | 75.4% | 77.0% | 73.8% | 400 seqs | W=30 is the Biophysical Information Horizon; ensemble adds noise |
| v7 | Global classifier, 7 features, k≥5 | 89.47% | 89.41% | 90.21% | 88.62% | 5,680 seqs | +14-point jump from architecture change alone; global scale is correct |
| v7.1 | Global classifier, 4 features, unanimous vote | 89.70% | 89.65% | 89.84% | 89.43% | 5,680 seqs | Dropping 3 weakest features improves accuracy; entropy wins globally |
The first six versions tested a biophysically-grounded sliding window state machine: a "cup" fills when a window scores below threshold, and overflows to predict DISORDERED. The series exhausted every available lever within this paradigm.
The ceiling was 75.4% mean F1 — and it did not move regardless of parameter tuning, ensemble composition, or feature engineering. The plateau is architectural, not a tuning failure.
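For readers who want the cup mechanics in concrete form, here is a minimal sketch. It is not the Hatch implementation: the function name `cup_predict`, the `capacity` and `drain_after` parameters, their default values, and the choice to empty the cup completely on a drain are all assumptions made for illustration. Setting `drain_after=1` imitates v1's instant reset, and `drain_after=2` imitates the N=2 consensus drain from v2.

```python
from typing import List


def cup_predict(window_ordered: List[bool], capacity: int = 3, drain_after: int = 2) -> str:
    """Toy cup state machine over per-window order calls (illustrative only).

    window_ordered[i] is True when window i passes as ordered.
    capacity and drain_after are hypothetical parameters, not Hatch's values.
    """
    cup = 0          # accumulated evidence of disorder
    ordered_run = 0  # length of the current run of ordered windows
    for ordered in window_ordered:
        if ordered:
            ordered_run += 1
            if ordered_run >= drain_after:
                cup = 0  # assumed drain rule: empty the cup completely
        else:
            ordered_run = 0
            cup += 1
            if cup >= capacity:
                return "DISORDERED"  # the cup overflowed
    return "FOLDED"


calls = [False, False, True, False, False]
print(cup_predict(calls, drain_after=1))  # FOLDED
print(cup_predict(calls, drain_after=2))  # DISORDERED
```

On the same five windows, the v1-style rule discards the accumulated evidence after a single ordered window, while the N=2 rule keeps it and the cup overflows.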
The 7 biophysical features used by Hatch are global thermodynamic properties. They describe the average character of a protein, not its local structure. When averaged over 30-residue windows, they are noisy samples of a global quantity. When averaged over the full sequence, they are the actual quantity. The global classifier achieves 89.7% accuracy because it measures these features at the right scale.
See EXP_V6.md for the full analysis of the plateau and the Uversky Scale Mismatch finding.
Computing the 7 features over the entire protein sequence instead of 30-residue windows produced a jump of roughly 14 percentage points, to 89.47% accuracy, in a single step. No new features. No new data. The same 7 features and the same threshold heuristic, just measured at the correct scale.
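The difference between the two measurement scales can be illustrated with a single feature. The sketch below computes one residue-frequency feature per 30-residue window and once over the whole sequence. The residue set `FILMVWY` and the helper names `fraction` and `window_fractions` are assumptions for illustration; the actual Hatch feature definitions are in the experiment documents.

```python
from typing import List

# Assumed residue set for "bulky hydrophobic"; not taken from the Hatch source.
BULKY_HYDROPHOBIC = frozenset("FILMVWY")


def fraction(seq: str, residues: frozenset = BULKY_HYDROPHOBIC) -> float:
    """Fraction of residues in seq that belong to the given set."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(aa in residues for aa in seq) / len(seq)


def window_fractions(seq: str, w: int = 30) -> List[float]:
    """Same feature measured on every length-w window (stride 1)."""
    return [fraction(seq[i:i + w]) for i in range(max(1, len(seq) - w + 1))]


seq = "A" * 30 + "L" * 30
local = window_fractions(seq)
print(fraction(seq))           # 0.5
print(min(local), max(local))  # 0.0 1.0
```

In this toy sequence the global value is 0.5, but individual windows range from 0.0 to 1.0, so any single window can be a poor estimate of the whole-protein quantity.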
The v7.1 ablation study then showed that 3 of the 7 features are net noise contributors at the global scale. Dropping them and requiring unanimous agreement among the 4 strong features improved accuracy to 89.70% with a simpler, more interpretable model.
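A k-of-n threshold vote of this kind can be sketched as follows. This is an illustration under stated assumptions, not the Hatch code: the function name `classify`, the placeholder feature names `feature_c` and `feature_d`, every threshold value, and the convention that a value at or above its threshold counts as a vote for FOLDED are all hypothetical. Setting `k` to the number of features gives a unanimous vote.

```python
from typing import Mapping


def classify(values: Mapping[str, float], thresholds: Mapping[str, float], k: int) -> str:
    """Predict FOLDED when at least k features clear their thresholds."""
    votes = sum(values[name] >= thresholds[name] for name in thresholds)
    return "FOLDED" if votes >= k else "DISORDERED"


# Hypothetical thresholds and feature values, for illustration only.
THRESHOLDS = {"shannon_entropy": 4.0, "bulky_hydrophobic": 0.25,
              "feature_c": 0.5, "feature_d": 0.5}
protein = {"shannon_entropy": 4.1, "bulky_hydrophobic": 0.30,
           "feature_c": 0.6, "feature_d": 0.4}

print(classify(protein, THRESHOLDS, k=len(THRESHOLDS)))  # DISORDERED (unanimous vote)
print(classify(protein, THRESHOLDS, k=3))                # FOLDED (3-of-4 vote)
```

The example protein clears three of the four thresholds, so it is called FOLDED under a 3-of-4 rule and DISORDERED under the unanimous rule.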
The v7.1 classifier runs in 0.052 ms/sequence — 9,600× faster than IUPred — with no GPU, no external database, and no sequence alignment.
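As a quick worked check of what these figures imply, the arithmetic below uses only the numbers quoted in this document. The per-sequence baseline time is derived from the stated speedup ratio, not measured independently, and the variable names are illustrative.

```python
MS_PER_SEQ = 0.052   # v7.1 runtime per sequence, from this document
SPEEDUP = 9_600      # stated speedup relative to IUPred
N_TEST = 5_680       # size of the v7/v7.1 test set

implied_baseline_ms = MS_PER_SEQ * SPEEDUP  # about 499.2 ms per sequence
test_set_ms = MS_PER_SEQ * N_TEST           # about 295.4 ms for the full test set
seqs_per_second = 1000 / MS_PER_SEQ         # about 19,231 sequences per second

print(f"{implied_baseline_ms:.1f} ms, {test_set_ms:.1f} ms, {seqs_per_second:.0f} seq/s")
```

At this rate the whole 5,680-sequence test set takes about a third of a second of classifier time.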
The following findings emerged from the experimental record and are independent of the classification performance. They describe properties of the protein sequence-disorder relationship, not of the classifier.
1. The Biophysical Information Horizon is W=30. Below 30 residues, the hydrophobic core signal (AUC=0.838) is too noisy to be discriminative. At W=30, the signal stabilizes. W=40 adds marginal noise by blurring short ordered domains. The multi-scale ensemble experiment (v6) confirmed this by showing that W=20 is a noise injector.
2. N=2 is the minimum sustained order signal. Two consecutive ordered windows at W=30 span roughly 31–35 residues, about the length of one alpha-helix or a beta-strand pair. The grid search found N=2 independently across multiple parameter sweeps, which suggests that the minimum meaningful "sustained order" signal corresponds to one secondary structure element.
3. Bulky Hydrophobic Frequency is the dominant window-level predictor. AUC=0.838, Cohen's d=1.39. Disordered proteins cannot maintain a dense hydrophobic core — it requires stable tertiary structure. Shannon Entropy is second at the window level (AUC=0.785).
4. Shannon Entropy is the dominant global-level predictor. At the full-sequence scale, Shannon Entropy alone achieves 85.51% accuracy, outperforming Bulky Hydrophobic alone (84.24%). The ranking inverts because sequence complexity is the information-theoretic prerequisite for folding: a protein cannot encode a unique 3D structure without sufficient amino acid diversity. See SCIENCE.md for the full analysis.
5. The Uversky Charge-Hydrophobicity Ratio is a macroscopic property. The published AUC of 0.85–0.90 for the Uversky ratio is measured on full-sequence averages. At W=30, a single charged residue dominates the net charge of the window, making the ratio unstable (measured AUC=0.5708). This is a fundamental scale mismatch.
6. The "Full-Forgiveness" failure mode. v1's instant cup reset on a single passing window destroyed disorder memory: a single ordered window in a disordered protein (e.g., a short hydrophobic patch) erased all accumulated evidence of disorder. The Consecutive Window Consensus Drain (v2) fixed this.
7. The Hydraulic Phase Transition at T_fill=3. Lowering the fill threshold from 4 to 3 produced a discontinuous jump in performance — not a gradual improvement. The model was penalizing "marginal order" windows that represent breathing loops and flexible linkers in folded proteins. This threshold corresponds to approximately 34–43% of maximum fold evidence, a structural invariant that appeared consistently across all parameter sweeps.
8. The dominant feature needs no calibration. The Bulky Hydrophobic Frequency threshold did not move during Coordinate Descent optimization at either the window level (v5) or the global level (v7). The midpoint between the folded and disordered medians was already the optimal decision boundary for the most discriminative feature.
9. The 4-feature unanimous vote is robust. Removing any single Tier-1 feature from the unanimous vote costs at most 2.55% accuracy. The four features are correlated rather than independent: each measures the same underlying physical phenomenon (the capacity to form and maintain a hydrophobic core) from a different angle, which is why the vote tolerates losing any one of them.
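Finding 4 rests on the Shannon entropy of the amino acid composition, which can be computed directly from residue counts. The sketch below uses the standard definition in bits (log base 2); whether Hatch uses bits or another base, and the function name `shannon_entropy`, are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy of the residue composition of seq, in bits."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    n = len(seq)
    # Sum of -p * log2(p) over the observed residue frequencies p.
    return sum(-(c / n) * math.log2(c / n) for c in Counter(seq).values())


print(shannon_entropy("AAAAAAAA"))  # 0.0, a single residue type carries no diversity
print(shannon_entropy("ACDE"))      # 2.0, four equally frequent residue types
print(round(shannon_entropy("ACDEFGHIKLMNPQRSTVWY"), 4))  # 4.3219, the maximum for 20 amino acids
```

A low-complexity sequence scores near 0 bits, while a sequence using all 20 amino acids equally reaches the maximum of log2(20), about 4.32 bits.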
| Document | Description |
|---|---|
| EXP_V1.md | v1: Full-Forgiveness baseline — the "Entropy Eraser" failure mode |
| EXP_V2.md | v2: Consecutive Window Consensus Drain — +26.4% disordered F1 |
| EXP_V3.md | v3: Hysteresis Asymmetric Thresholds — the Hydraulic Phase Transition |
| EXP_V4.md | v4: AUC-Weighted Feature Scoring — Bulky Hydrophobic is dominant |
| EXP_V5.md | v5: Coordinate Descent Calibration — plateau entry at 75.4% |
| EXP_V6.md | v6: Multi-Scale Ensemble + Uversky Ratio — plateau confirmed |
| EXP_V7.md | v7 + v7.1: Global Classifier + CD + Ablation — 89.70% accuracy |
| SCIENCE.md | Deep-dive: entropy vs hydrophobicity, scale mismatch, robustness |
| ROADMAP.md | v8+ plan: Hybrid architecture and proteome scanner |