kmesiab/hatch

Hatch — Biophysical Protein Disorder Classifier

Disorder signals accumulate like water in a cup. Sustained, uncancelled disorder overflows it. Pure biophysics — no neural network, no sequence alignment, no training database.

89.70% accuracy. 0.052 ms per sequence. Four features. Pure arithmetic.

Forked from kmesiab/concept-model-experiment-protein-folding as an experimental implementation of the Concept Model (4-Layer Matrix) framework.


Performance

| Version | Architecture | Mean F1 | F1 Folded | F1 Disordered | Test Set |
| --- | --- | --- | --- | --- | --- |
| v1 | Sliding window, full forgiveness | 35.2% | 34.9% | 35.5% | 400 seqs |
| v2 | Consensus drain | 46.9% | 31.8% | 61.9% | 400 seqs |
| v3 | Hysteresis thresholds | 65.2% | 64.3% | 66.2% | 400 seqs |
| v4 | AUC-weighted scoring | 70.0% | 73.9% | 66.1% | 400 seqs |
| v5 | Coordinate Descent calibration | 75.4% | 77.0% | 73.8% | 400 seqs |
| v6 | Multi-scale ensemble | 75.4% | 77.0% | 73.8% | 400 seqs |
| v7 | Global classifier + CD (7 features, k≥5) | 89.4% | 90.2% | 88.6% | 5,680 seqs |
| v7.1 | 4-feature unanimous vote | 89.7% | 89.8% | 89.6% | 5,680 seqs |

The sliding window approach plateaued at 75.4% mean F1 after six versions of architectural improvement. A single architectural change, computing features over the full sequence instead of 30-residue windows, raised mean F1 by 14.0 percentage points to 89.4%. A subsequent ablation study showed that the 3 weakest features add noise rather than signal: dropping them and requiring unanimous agreement among the 4 strong features gives the best result, 89.70% accuracy, with a simpler model.

[Figure: HATCH v1→v7.1 Progression]


Quick Start

# v7.1 — recommended (4 features, unanimous vote, 89.70% accuracy)
from inference_engine_v7_1 import load_v71_classifier

clf = load_v71_classifier()

# Hemoglobin alpha (folded)
result = clf.classify("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPH...")
# {'prediction': 'FOLDED', 'conditions_met': 4, 'confidence': 1.0, 'elapsed_ms': 0.052}

# p53 TAD (disordered)
result = clf.classify("MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLML...")
# {'prediction': 'DISORDERED', 'conditions_met': 2, 'confidence': 0.50, 'elapsed_ms': 0.041}

The Features

Hatch classifies proteins using biophysical properties computed as whole-protein averages. An ablation study identified a clean two-tier structure:

Tier 1 — The Four Signal Features (v7.1)

A protein is predicted FOLDED if all 4 conditions are met (unanimous vote). These 4 features alone achieve 89.70% accuracy.

| Feature | Window AUC | Direction | CD Threshold | Biological Rationale |
| --- | --- | --- | --- | --- |
| Bulky Hydrophobic Freq | 0.838 | high=folded | 0.2607 | Hydrophobic core packing requires W, C, F, Y, I, V, L |
| Shannon Entropy | 0.785 | high=folded | 3.8419 | Folded proteins require diverse amino acid composition |
| Flexibility | 0.778 | low=folded | 0.8345 | Rigid backbone is a prerequisite for stable tertiary structure |
| Hydrophobicity | 0.764 | high=folded | 0.4030 | Hydrophobic collapse drives folding |
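
As a rough illustration of what "whole-protein averages" means for two of these features, here is a minimal sketch. The residue set comes from the table above; the function names are hypothetical, and measuring entropy in bits is an assumption (it is at least consistent with the 3.8419 threshold sitting below the 20-letter maximum of log2(20) ≈ 4.32). The flexibility and hydrophobicity scales are not specified in this README, so they are left out.

```python
import math
from collections import Counter

# Residue set taken from the Tier 1 table: W, C, F, Y, I, V, L.
BULKY_HYDROPHOBIC = set("WCFYIVL")


def bulky_hydrophobic_freq(seq: str) -> float:
    """Fraction of residues that belong to the bulky hydrophobic set."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(res in BULKY_HYDROPHOBIC for res in seq) / len(seq)


def shannon_entropy(seq: str) -> float:
    """Shannon entropy of the amino acid composition (bits: an assumed unit)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Both functions run in a single pass over the sequence, which matches the O(n) cost the README claims for the full classifier.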

Tier 2 — Noise Features (dropped in v7.1)

These 3 features have AUC ≤ 0.62 at the window level. Including them in the vote dilutes the signal from Tier 1 and slightly reduces overall accuracy (89.47% for the full 7-feature vote with k≥5, versus 89.70% for the 4-feature unanimous vote).

| Feature | Window AUC | Why Dropped |
| --- | --- | --- |
| Proline Frequency | 0.617 | Marginal discriminative power; adds noise to the vote |
| Net Charge | 0.599 | Single charged residues dominate at short sequence lengths |
| H-Bond Potential | 0.598 | Barely above random; 3.45× weaker than Bulky Hydrophobic |

Ablation Results

| Configuration | Accuracy | Mean F1 |
| --- | --- | --- |
| Bulky Hydrophobic alone | 84.24% | 84.24% |
| Top 4 features, k≥3 | 86.46% | 86.15% |
| Top 4 features, k≥4 (unanimous) | 89.70% | 89.70% |
| Full 7 features, k≥5 (v7) | 89.47% | 89.41% |

Bulky Hydrophobic Frequency alone (the Hydrophobic Core signal), a single threshold on a single feature, reaches 84.24% accuracy, above the figures quoted here for PONDR-FIT (81%), IUPred (80%), and DISOPRED3 (82%). The other three Tier 1 features add the remaining 5.46 percentage points (84.24% to 89.70%) by catching edge cases where hydrophobic composition is atypical.

AUC values measured on 794,870 training windows (646,623 folded, 148,247 disordered).
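
For readers unfamiliar with the metric, a per-feature AUC can be read as the probability that a randomly chosen folded window has a higher feature value than a randomly chosen disordered one. The sketch below computes it by brute force; it is illustrative only and not the scan in feature_importance_scan.py. Reporting `max(auc, 1 - auc)` so that "low=folded" features also land above 0.5 is an assumption about how the tables were produced.

```python
def pairwise_auc(folded_values, disordered_values):
    """P(folded value > disordered value), counting ties as one half."""
    wins = 0.0
    for f in folded_values:
        for d in disordered_values:
            if f > d:
                wins += 1.0
            elif f == d:
                wins += 0.5
    return wins / (len(folded_values) * len(disordered_values))


def direction_free_auc(folded_values, disordered_values):
    """AUC folded onto [0.5, 1.0], so low=folded features are comparable."""
    auc = pairwise_auc(folded_values, disordered_values)
    return max(auc, 1.0 - auc)
```

The pairwise loop is quadratic, so it only suits small samples. At the scale of 794,870 windows a rank-based routine such as scikit-learn's `roc_auc_score` would be the practical choice.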

[Figure: Single-Feature Performance and Window AUC vs Global Accuracy]

The Scale Inversion: Shannon Entropy wins globally (85.51% solo) while Bulky Hydrophobic wins locally (AUC=0.838 at W=30). The same two features, the same proteins — the ranking inverts depending on the scale of measurement. Entropy is the global blueprint. Hydrophobicity is the local engine. See docs/SCIENCE.md for the full analysis.

[Figure: Feature Robustness: Remove One Tier-1 Feature]


The v7.1 Architecture

def classify(seq):
    # Whole-protein global feature averages, keyed by feature name
    features = compute_features(seq)   # 4 float values

    # Unanimous vote: all 4 conditions must be met
    conditions_met = sum(
        features[feat] >= threshold[feat] if folded_is_high[feat]
        else features[feat] <= threshold[feat]
        for feat in FEATURE_NAMES
    )

    return 'FOLDED' if conditions_met == 4 else 'DISORDERED'

O(n) time. No GPU. No external database. No sequence alignment. The simplest correct version of HATCH.
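
To make the vote concrete, here is a self-contained sketch that plugs in the thresholds and directions from the Tier 1 table. The dictionary keys and the `unanimous_vote` name are hypothetical, and the feature values are assumed to be precomputed; the real inference_engine_v7_1.py may organise this differently.

```python
# (threshold, folded_is_high) pairs copied from the Tier 1 table.
# The key names are illustrative, not the repository's identifiers.
TIER1 = {
    "bulky_hydrophobic_freq": (0.2607, True),
    "shannon_entropy": (3.8419, True),
    "flexibility": (0.8345, False),
    "hydrophobicity": (0.4030, True),
}


def unanimous_vote(features: dict) -> dict:
    """Predict FOLDED only if every Tier 1 condition is met."""
    conditions_met = 0
    for name, (threshold, folded_is_high) in TIER1.items():
        value = features[name]
        passed = value >= threshold if folded_is_high else value <= threshold
        conditions_met += int(passed)
    prediction = "FOLDED" if conditions_met == len(TIER1) else "DISORDERED"
    return {"prediction": prediction, "conditions_met": conditions_met}
```

For example, a protein that clears three thresholds but has flexibility 0.90 (above the 0.8345 cutoff) is still called DISORDERED, which is what distinguishes the unanimous k≥4 rule from the k≥3 variant in the ablation table.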


The Experiment: v1 → v7

This classifier was not designed from scratch. It was discovered through seven iterations of controlled experimentation, each testing a specific hypothesis about what was limiting performance.

v1: The "Entropy Eraser" (35.2% Mean F1)

The first classifier used a sliding window state machine: a "cup" fills when a window scores below threshold, and overflows to predict DISORDERED. The critical flaw was Full Forgiveness — a single passing window reset the cup to zero instantly, destroying all accumulated disorder evidence the moment the sequence encountered one ordered patch.

v2: Consecutive Window Consensus Drain (+11.7%)

Hypothesis: A single passing window should not erase disorder evidence. Sustained order requires multiple consecutive passing windows.

The cup now drains only after N=2 consecutive passing windows. Mean F1 rose from 35.2% to 46.9%, and Disordered F1 rose by 26.4 points (35.5% to 61.9%). The N=2 optimum is biologically meaningful: two consecutive W=30 windows represent ~31–35 residues, roughly the length of a stable alpha-helix or beta-strand pair. The data independently confirmed that the minimum "sustained order" signal is approximately one secondary structure element.
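
The difference between v1 and v2 can be sketched with a single state machine whose drain rule is a parameter. This is a simplified reconstruction from the descriptions above, not the original code: the input is assumed to be one pass/fail flag per window, and `capacity` (the fill level at which the cup overflows) is a hypothetical parameter the README does not specify.

```python
def cup_classifier(window_passes, capacity, drain_after):
    """Water-in-the-cup state machine.

    window_passes: iterable of bools, True if the window looks ordered.
    capacity:      failing windows needed to overflow (hypothetical value).
    drain_after:   consecutive passing windows required to empty the cup.
                   1 reproduces v1 "Full Forgiveness"; 2 is the v2 rule.
    """
    cup = 0
    passing_run = 0
    for passed in window_passes:
        if passed:
            passing_run += 1
            if passing_run >= drain_after:
                cup = 0  # sustained order: discard accumulated evidence
        else:
            passing_run = 0
            cup += 1
            if cup >= capacity:
                return "DISORDERED"  # the cup overflowed
    return "FOLDED"
```

With the window pattern fail, fail, pass, fail, fail and a capacity of 4, `drain_after=1` forgets the first two failures and returns FOLDED, while `drain_after=2` keeps them and returns DISORDERED.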

v3: Hysteresis Asymmetric Thresholds (+18.3%)

Hypothesis: Marginal order (score=3/7) should stop the cup from filling even if it cannot drain it.

Decoupling fill and drain thresholds (T_fill=3, T_drain=2) prevents the system from "chattering" at the decision boundary — the engineering concept of hysteresis applied to biophysical classification. Mean F1 jumped from 46.9% to 65.2%. The dominant driver was T_fill=3 — a phase transition, not a gradual improvement.
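
The dead band that hysteresis introduces can be sketched as follows. Scores are assumed to be the number of features (out of 7) that vote "ordered" for a window. How T_fill=3 and T_drain=2 map onto that scale is not spelled out in this README, so `fill_below`, `drain_at`, and `capacity` are hypothetical stand-ins chosen only to show the mechanism.

```python
def hysteresis_cup(scores, fill_below, drain_at, capacity, drain_after=2):
    """Cup state machine with separate fill and drain thresholds.

    score <  fill_below              -> the cup fills by one
    score >= drain_at                -> counts toward draining
    fill_below <= score < drain_at   -> dead band: no fill, no drain
    """
    cup = 0
    passing_run = 0
    for score in scores:
        if score < fill_below:
            cup += 1
            passing_run = 0
            if cup >= capacity:
                return "DISORDERED"
        elif score >= drain_at:
            passing_run += 1
            if passing_run >= drain_after:
                cup = 0
        else:
            # Marginal order: stops the fill but breaks any draining run.
            passing_run = 0
    return "FOLDED"
```

Resetting the passing run inside the dead band is a design choice of this sketch; a variant could leave the run untouched instead.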

v4: AUC-Weighted Feature Scoring (+4.8%)

Hypothesis: H-Bond Potential (AUC=0.598) should not have the same vote as Bulky Hydrophobic Frequency (AUC=0.838).

Replacing the binary feature sum with AUC-weighted scores improved Mean F1 from 65.2% to 70.0%. The unexpected finding: Bulky Hydrophobic Frequency, not Shannon Entropy, is the single most discriminative feature. The Hydrophobic Core signal dominates because disordered proteins structurally cannot maintain a dense hydrophobic core.
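
One way to turn the window AUCs into vote weights is shown below. The AUC values are the ones reported in this README, but the weighting rule (AUC minus 0.5, normalised to sum to 1) is an assumption for illustration; the source does not say how v4 derived its weights. The dictionary keys are hypothetical names.

```python
# Window-level AUCs as reported in the feature tables.
WINDOW_AUC = {
    "bulky_hydrophobic_freq": 0.838,
    "shannon_entropy": 0.785,
    "flexibility": 0.778,
    "hydrophobicity": 0.764,
    "proline_freq": 0.617,
    "net_charge": 0.599,
    "hbond_potential": 0.598,
}


def auc_weights(aucs):
    """Weight each feature by its AUC margin over chance (assumed scheme)."""
    lift = {name: auc - 0.5 for name, auc in aucs.items()}
    total = sum(lift.values())
    return {name: value / total for name, value in lift.items()}


def weighted_score(passed, weights):
    """Sum the weights of the features whose condition was met."""
    return sum(weights[name] for name, ok in passed.items() if ok)
```

Under this scheme Bulky Hydrophobic Frequency carries about 3.45 times the weight of H-Bond Potential (0.338 / 0.098), and on its own it outweighs the three weakest features combined (0.338 versus 0.314 before normalisation).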

v5: Coordinate Descent Per-Feature Calibration (+5.4%)

Coordinate Descent over 7 per-feature thresholds (25 candidates each, ±30% of distribution overlap) improved Mean F1 from 70.0% to 75.4%. The dominant shift: proline_freq threshold dropped 50% — the midpoint was treating proteins with 3–5% proline as "disordered" when low proline is actually a folded signal.
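
The calibration loop can be sketched as below: tune one threshold at a time over a grid of candidates, keep any change that improves the score, and repeat until a full pass changes nothing. This is a simplified stand-in for feature_threshold_optimizer.py. In particular, the candidate grid here spans ±30% of the current threshold value, which is a simplification of the "±30% of distribution overlap" range described above, and the objective is plain accuracy of a k-of-n vote rather than the cup classifier's mean F1.

```python
def vote_accuracy(thresholds, X, y, folded_is_high, k):
    """Fraction of rows whose k-of-n vote matches the label (1 = folded)."""
    correct = 0
    for row, label in zip(X, y):
        votes = sum(
            (v >= t) if hi else (v <= t)
            for v, t, hi in zip(row, thresholds, folded_is_high)
        )
        correct += int((votes >= k) == bool(label))
    return correct / len(y)


def coordinate_descent(X, y, init, folded_is_high, k,
                       n_candidates=25, span=0.30, max_passes=10):
    """Greedy per-feature threshold search; returns (thresholds, score, passes)."""
    thresholds = list(init)
    best = vote_accuracy(thresholds, X, y, folded_is_high, k)
    passes = 0
    for _ in range(max_passes):
        passes += 1
        improved = False
        for j in range(len(thresholds)):
            center = thresholds[j]
            for i in range(n_candidates):
                # Evenly spaced candidates across center * (1 ± span).
                cand = center * (1 - span + 2 * span * i / (n_candidates - 1))
                trial = thresholds[:j] + [cand] + thresholds[j + 1:]
                score = vote_accuracy(trial, X, y, folded_is_high, k)
                if score > best:
                    best, improved = score, True
                    thresholds[j] = cand
        if not improved:
            break  # a full pass with no gain: converged
    return thresholds, best, passes
```

The returned pass count includes the final pass that finds no improvement, so a run that improves once and then stalls reports 2 passes.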

v6: The Wall

Multi-scale ensembles (W=20/30/40), positional terminal de-weighting, and the Uversky charge-hydrophobicity ratio all failed to improve performance. The Uversky ratio — which achieves AUC=0.85–0.90 at the full-sequence level in the literature — scored only AUC=0.57 at W=30. The signal is macroscopic.

The architectural conclusion: The sliding window ceiling is not a tuning failure. The 7 biophysical features are global thermodynamic properties. At W=30, a single charged residue can dominate the net_charge of a window. The sliding window approach was treating local fluctuations as global disorder evidence.
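
The scale argument is easy to check with arithmetic. The sketch below assumes a simple net-charge definition (K and R count +1, D and E count −1, divided by length; histidine ignored) and a hypothetical 300-residue protein. Neither the definition nor the length comes from this README.

```python
def net_charge_per_residue(seq: str) -> float:
    """(K + R - D - E) / length; an assumed definition for illustration."""
    positive = sum(seq.count(res) for res in "KR")
    negative = sum(seq.count(res) for res in "DE")
    return (positive - negative) / len(seq)


def shift_from_one_charge(length: int) -> float:
    """Change in net charge per residue when one neutral residue becomes K."""
    neutral = "A" * length
    mutated = "K" + neutral[1:]
    return net_charge_per_residue(mutated) - net_charge_per_residue(neutral)


window_shift = shift_from_one_charge(30)    # one W=30 window
protein_shift = shift_from_one_charge(300)  # hypothetical 300-residue protein
```

A single lysine moves the window-level value by 1/30 ≈ 0.033 but the full-sequence value by only 1/300 ≈ 0.0033, a tenfold difference in sensitivity to one residue.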

v7: Global Classifier + Coordinate Descent (+14.0%)

Computing the 7 features over the entire protein sequence instead of 30-residue windows raised mean F1 by 14.0 percentage points, from 75.4% to 89.4%. The CD converged in just 2 passes (vs 4 passes for the window classifier) because full-sequence averaging already removes the inter-feature correlations that required multiple passes to resolve at the window level.

Key CD finding: Bulky Hydrophobic Frequency threshold did not move at all. The dominant feature was already perfectly calibrated at the midpoint. All gains came from secondary features where the midpoints were systematically biased — H-Bond Potential (−0.0575) and Shannon Entropy (+0.0780).


Key Scientific Findings

1. The Biophysical Information Horizon. The sliding window classifier has a hard ceiling at ~75% mean F1 for this feature set. The ceiling exists because the 7 features are macroscopic thermodynamic properties that require full-sequence averaging to be discriminative. This is the "Scale Mismatch": the Uversky ratio achieves AUC=0.85–0.90 at the protein level in the literature but only AUC=0.57 at W=30.

2. The Hydrophobic Core is the dominant signal. Bulky Hydrophobic Frequency (AUC=0.838, Cohen's d=1.39) is the single most discriminative feature. Disordered proteins cannot maintain a dense hydrophobic core because it requires stable tertiary structure. This is a first-principles measurement of a known biochemical principle.

3. The N=2 structural invariant. The optimal consecutive window requirement is N=2, corresponding to ~31–35 residues — the length of one secondary structure element. The grid search found this independently across multiple parameter sweeps.

4. Coordinate Descent converges faster on global features. 4 passes for window-level calibration vs 2 passes for global calibration. The full-sequence averaging removes the inter-feature correlations that required multiple passes to resolve at the window level.

5. The dominant feature needs no calibration. Bulky Hydrophobic Frequency threshold did not move during CD optimization. The midpoint was already the optimal decision boundary for the most discriminative feature. The physics was right; we just needed to measure it carefully enough to see it.

6. Entropy is the global blueprint; hydrophobicity is the local engine. Shannon Entropy alone achieves 85.51% global accuracy, outperforming Bulky Hydrophobic alone (84.24%) — despite Bulky Hydrophobic having higher window-level AUC (0.838 vs 0.785). The ranking inverts at the global scale because sequence complexity is the information-theoretic prerequisite for folding: a protein cannot encode a unique 3D structure without sufficient amino acid diversity. See docs/SCIENCE.md.


Experiment Documentation

Full per-version experiment documentation is in docs/EXPERIMENTS.md.

| Document | Description |
| --- | --- |
| docs/EXPERIMENTS.md | Full experiment index with summary table and key findings |
| docs/EXP_V1.md | v1: Full-Forgiveness baseline, the "Entropy Eraser" failure mode |
| docs/EXP_V2.md | v2: Consecutive Window Consensus Drain, +26.4% disordered F1 |
| docs/EXP_V3.md | v3: Hysteresis Asymmetric Thresholds, the Hydraulic Phase Transition |
| docs/EXP_V4.md | v4: AUC-Weighted Feature Scoring, Bulky Hydrophobic is dominant |
| docs/EXP_V5.md | v5: Coordinate Descent Calibration, plateau entry at 75.4% |
| docs/EXP_V6.md | v6: Multi-Scale Ensemble + Uversky Ratio, plateau confirmed |
| docs/EXP_V7.md | v7: Global Classifier + CD, 89.47% accuracy |
| docs/SCIENCE.md | Deep-dive: entropy vs hydrophobicity, scale mismatch, robustness analysis |
| docs/ROADMAP.md | v8+ plan: Hybrid architecture targeting 91%+ |

Repository Structure

hatch/
├── inference_engine_v7_1.py     ← Production classifier (use this) ✓
├── inference_engine_v7.py       ← 7-feature global classifier
├── inference_engine_v5.py       ← Best sliding window classifier
├── global_classifier.py         ← Global classifier training + evaluation
├── global_cd_optimizer.py       ← Coordinate Descent optimizer (global)
├── training_engine.py           ← Feature computation + windowed training
├── feature_importance_scan.py   ← Per-feature AUC scan on 794,870 windows
├── feature_threshold_optimizer.py ← Coordinate Descent optimizer (windows)
├── global_thresholds_v7_cd.json ← CD-optimized thresholds (v7)
├── optimized_thresholds_v5.json ← CD-optimized thresholds (v5)
├── hatch_v7_final.png           ← Final experiment summary visualization
└── docs/
    ├── EXPERIMENTS.md           ← Experiment index + summary table
    ├── SCIENCE.md               ← Deep-dive: entropy vs hydrophobicity finding
    ├── ROADMAP.md               ← v8+ directions
    └── EXP_V1.md → EXP_V7.md   ← Per-version experiment docs

Install & Run

pip install numpy pandas scikit-learn matplotlib requests

# Classify a sequence (v7.1 — production, recommended)
python3 inference_engine_v7_1.py

# Run the full v7 pipeline (downloads data, trains, evaluates)
python3 global_classifier.py
python3 global_cd_optimizer.py

# Reproduce sliding window experiments
python3 grid_search_v5.py

Data Sources

  • Folded: RCSB PDB — representative non-redundant chains
  • Disordered: DisProt — experimentally validated intrinsically disordered proteins

Reference

If you use this code, please credit this repository and the original Concept Model framework:

Mesiab, K. (2024). Emergent Concept Modeling: A Paradigm Shift in AI. Substack


Built with first principles, not black boxes.

About

Hatch: Scale-Aware Hybrid Protein Disorder Classifier with Water-in-the-Cup state machine
