# Chapter 14: Machine Learning for Price Prediction

## 💥 The 95.9% Performance Gap: When the Same ML Fails Spectacularly

**2020, Renaissance Technologies.** The most successful quantitative hedge fund in history runs two funds using machine learning. Same founders. Same PhDs. Same data infrastructure. Same ML techniques.

**Result:**
- **Medallion Fund (internal, employees only):** **+76%** in 2020 (one of its best years ever)
- **RIEF Fund (external investors):** **-19.9%** in 2020 (a crushing loss)

**Performance gap: 95.9 percentage points.**

How is this possible?

**The Timeline:**

```mermaid
timeline
    title Renaissance Technologies: The Medallion vs. RIEF Divergence
    section Early Success (1988-2005)
        1988: Medallion launches (employees only)
        1988-2004: Medallion averages 66%+ annually
        2005: RIEF launches (external investors, "give others access to our genius")
    section Growing Divergence (2005-2019)
        2005-2019: Medallion continues 50-70% annually
        2005-2019: RIEF returns "relatively mundane" (8-10% annually)
        2018: Medallion +76%, RIEF +8.5% (a 68-point gap)
    section The COVID Crash Reveals All (2020)
        March 2020: Market crashes, VIX hits 82
        Medallion: Adapts in real time, ends the year +76%
        RIEF: Models break, ends the year -19.9%
        Gap: 95.9 percentage points in the same year
    section Cumulative Damage (2005-2020)
        Dec 2020: RIEF cumulative return -22.62% over 15 years
        Dec 2020: Medallion maintains 66%+ annualized
```

**Figure 14.0**: The Renaissance paradox. Same company, same ML approach, completely opposite results. The 95.9 percentage point gap in 2020 revealed the critical flaw: **prediction horizon**.

**The Key Difference:**

| Metric | Medallion (Works) | RIEF (Fails) |
|--------|-------------------|--------------|
| **Holding period** | Seconds to minutes | 6-12 months |
| **Predictions per day** | Thousands | 1-2 |
| **Retraining frequency** | Continuous | Monthly |
| **2020 performance** | **+76%** | **-19.9%** |
| **Strategy capacity** | $10B max | $100B+ |

**What Went Wrong with RIEF?**

1. **Long-horizon overfitting:**
   - Beyond ~1 day, ML models predict noise, not signal
   - 6-12 month predictions are pure curve-fitting
   - March 2020: all the historical patterns broke instantly

2. **Factor-based risk models:**
   - RIEF hedged using Fama-French factors
   - In the COVID crash, all factors correlated, making the risk model useless
   - Medallion: no factor hedging, pure statistical edge

3. **Model decay ignored:**
   - RIEF retrained monthly
   - Medallion retrains continuously (its models decay in hours)
   - By the time RIEF retrained, the market had already changed

**The Math of Prediction Decay:**

Renaissance founder Jim Simons (d. 2024) never published an exact formula, but empirical evidence suggests:

$$P(\text{Accurate Prediction}) \propto \frac{1}{\sqrt{t}}$$

where $t$ is the prediction horizon.

**Implications (relative to a 1-minute baseline):**
- **1 minute ahead:** High accuracy (Medallion trades here)
- **1 hour ahead:** Accuracy drops ~8x
- **1 day ahead:** Accuracy drops ~38x
- **1 month ahead:** Accuracy drops ~208x (RIEF trades here)
- **6 months ahead:** Essentially random
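
Under a strict square-root law, the decay multiplier for any horizon follows directly from the baseline: $\sqrt{60} \approx 8$ for an hour, $\sqrt{1440} \approx 38$ for a day, $\sqrt{43200} \approx 208$ for a month. A minimal sketch (the function name and 1-minute baseline are illustrative, and the exponent itself is an empirical conjecture, not a published Renaissance result):

```python
import math

def decay_multiplier(horizon_minutes: float, base_minutes: float = 1.0) -> float:
    """Relative accuracy loss versus a 1-minute baseline, assuming P ∝ 1/sqrt(t)."""
    return math.sqrt(horizon_minutes / base_minutes)

for label, minutes in [("1 hour", 60), ("1 day", 60 * 24), ("1 month", 60 * 24 * 30)]:
    print(f"{label}: accuracy drops ~{decay_multiplier(minutes):.0f}x")
# 1 hour: ~8x, 1 day: ~38x, 1 month: ~208x
```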

**The Lesson:**

> **⚠️ ML Prediction Accuracy Decays Rapidly with Horizon**
>
> - **Medallion's secret:** Trade so fast that predictions don't have time to decay
> - **RIEF's failure:** Hold so long that predictions become noise
> - **Your choice:** Can you execute in milliseconds? If not, ML price prediction likely won't work.
>
> **The brutal equation:**
> $$\text{Profit} = \text{Prediction Accuracy} \times \text{Position Size} - \text{Transaction Costs}$$
>
> For daily-or-longer predictions, accuracy → 0.51 (barely better than random). Even with huge size, transaction costs dominate.

**Why This Matters for Chapter 14:**

Most academic ML trading papers test **daily or weekly predictions**. They report Sharpe ratios of 1.5-2.5. But:

1. **They're overfitting:** Trained on historical patterns that won't repeat
2. **They ignore decay:** Assume accuracy persists for months or years
3. **They skip costs:** Transaction costs often exceed the edge
4. **They fail live:** RIEF is the proof—the world's best ML team lost 19.9% in 2020

This chapter will teach you:
1. **Feature engineering** (time-aware, no leakage)
2. **Walk-forward validation** (out-of-sample, always)
3. **Model ensembles** (diversify predictions)
4. **Risk management** (short horizons only, detect regime changes)

But more importantly, it will teach you **why most ML trading research is fairy tales**.

The models behind RIEF's 2020 collapse had:
- ✅ State-of-the-art ML (random forests, gradient boosting, neural networks)
- ✅ Massive data (decades of tick data)
- ✅ World-class researchers (Jim Simons, Fields Medal-winning mathematicians)
- ❌ **The wrong time horizon**

You will learn to build ML systems that:
- ✅ Trade intraday only (< 1 day holding periods)
- ✅ Retrain continuously (models decay fast)
- ✅ Detect regime changes (the COVID scenario)
- ✅ Validate walk-forward (never trust in-sample results)
- ✅ Correct for multiple testing (feature selection bias)

The ML is powerful. The data is vast. But without respecting prediction decay, you're Renaissance RIEF: -19.9% while your competitors make +76%.

Let's dive in.

---

## Introduction

Machine learning is not a silver bullet—it's a power tool that, like any tool, …
8. Bailey, D.H., et al. (2014). "Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting." *Notices of the AMS*, 61(5), 458-471.
9. Krauss, C., Do, X.A., & Huck, N. (2017). "Deep Neural Networks, Gradient-Boosted Trees, Random Forests: Statistical Arbitrage on the S&P 500." *European Journal of Operational Research*, 259(2), 689-702.
10. Moody, J., & Saffell, M. (2001). "Learning to Trade via Direct Reinforcement." *IEEE Transactions on Neural Networks*, 12(4), 875-889.
---

## 14.8 Machine Learning Disasters and Lessons

Beyond Renaissance RIEF's failure, ML trading has a graveyard of disasters. Understanding these prevents repeating them.

### 14.8.1 The Replication Crisis: 95% of Papers Don't Work

**The Problem:**
- Only **5% of AI papers** share both code and data
- Fewer than **33% of papers** are reproducible
- **Data leakage** is everywhere (look-ahead bias, target leakage, train/test contamination)

**Impact:** When leakage is fixed, **MSE increases by ~70%**. Academic papers report Sharpe ratios 2-3x higher than reality.

**Common Leakage Patterns:**
1. **Normalizing on the full dataset** (future leaks into the past)
2. **Feature selection on test data** (selection bias)
3. **Target variable in the features** (perfect in-sample prediction, zero out-of-sample)
4. **Train/test temporal overlap** (tomorrow's data in today's model)
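
Pattern 1 is the most common and the easiest to demonstrate. A minimal sketch contrasting a leaky full-sample z-score with an expanding-window version (the function names are illustrative, not from any specific library):

```python
import numpy as np

def zscore_leaky(x: np.ndarray) -> np.ndarray:
    # WRONG: mean/std are computed over the FULL sample,
    # so the future leaks into every historical row.
    return (x - x.mean()) / x.std()

def zscore_expanding(x: np.ndarray, min_obs: int = 20) -> np.ndarray:
    # Leak-free: row t is normalized using only data available at time t.
    out = np.full(len(x), np.nan)
    for t in range(min_obs, len(x)):
        past = x[: t + 1]
        out[t] = (x[t] - past.mean()) / past.std()
    return out
```

Changing the last observation shifts every leaky z-score, but leaves all earlier expanding-window values untouched — exactly the property a walk-forward pipeline requires.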

**The Lesson:**
> **💡 95% of Academic ML Trading Papers Are Fairy Tales**
>
> Trust nothing without:
> - Shared code (GitHub)
> - Walk-forward validation (strict temporal separation)
> - Transaction costs modeled
> - An out-of-sample period longer than 2 years

### 14.8.2 Feature Selection Bias: 1,000 Features → 0 That Work

**The Pattern:**
1. Generate 1,000 technical indicators
2. Test each one's correlation with returns
3. Keep the top 20 "predictive" features
4. Train a model on those 20
5. Backtest: Sharpe 2.0! (in-sample)
6. Trade live: Sharpe 0.1 (out-of-sample)

**Why It Fails:**
With 1,000 random features and α = 0.05, you expect ~50 false positives by chance alone. Those 20 "best" features worked on historical data **by luck**, not signal.

**Fix: Bonferroni Correction**
- Testing 1,000 features? → α_adj = 0.05 / 1,000 = 0.00005
- Most "predictive" features disappear at the corrected threshold

**The Lesson:**
> **⚠️ Multiple Testing Correction Is NOT Optional**
>
> If you test N features, divide the significance threshold by N.
> Expect 95% of your "predictive" features to vanish.

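The arithmetic behind this lesson is easy to check by simulation. A sketch, assuming pure-noise features and a normal approximation to the correlation test (`count_false_positives` is a hypothetical helper, not a library function):

```python
import numpy as np
from statistics import NormalDist

def count_false_positives(n_features: int, n_obs: int, alpha: float, seed: int = 42) -> int:
    """Count pure-noise features whose correlation with pure-noise returns
    passes a two-sided significance test at level alpha."""
    rng = np.random.default_rng(seed)
    returns = rng.normal(size=n_obs)
    features = rng.normal(size=(n_features, n_obs))
    # Under the null, r * sqrt(n_obs) is approximately standard normal
    r_crit = NormalDist().inv_cdf(1 - alpha / 2) / np.sqrt(n_obs)
    r = np.array([np.corrcoef(f, returns)[0, 1] for f in features])
    return int((np.abs(r) > r_crit).sum())

naive = count_false_positives(1000, 500, alpha=0.05)             # ~50 "discoveries"
corrected = count_false_positives(1000, 500, alpha=0.05 / 1000)  # ~0 survive
```

With no signal at all, the naive test still "discovers" dozens of features; the Bonferroni threshold eliminates essentially all of them.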
### 14.8.3 COVID-19: When Training Data Becomes Obsolete

**March 2020:**
- VIX spikes from 15 to 82 (vs. a peak near 80 in 2008)
- Correlations break (all assets move together)
- Volatility-targeting strategies lose 20-40%

**The Problem:**
Models trained on 2010-2019 data assumed:
- VIX stays below 30
- Correlations are stable
- Liquidity is always available

March 2020 violated ALL of these assumptions simultaneously.

**The Lesson:**
> **💡 Regime Changes Invalidate Historical Patterns Instantly**
>
> Defense:
> - Online learning (retrain daily)
> - Regime detection (HMM, change-point detection)
> - Reduce size when volatility spikes
> - Have a "shut down" mode
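
The "monitor and retrain" defense can be prototyped in a few lines. A sketch of a simple error-drift alarm (the 3-sigma threshold and 50-observation window are illustrative, not tuned; a production system would use a proper change-point method such as CUSUM or an HMM):

```python
import numpy as np

def drift_alarm(abs_errors: np.ndarray, window: int = 50, k: float = 3.0) -> bool:
    """True when the mean prediction error over the last `window` observations
    exceeds the historical mean by k historical standard deviations."""
    history, recent = abs_errors[:-window], abs_errors[-window:]
    return bool(recent.mean() > history.mean() + k * history.std())
```

When the alarm fires, cut position size or stop trading entirely until the model has been retrained on post-break data.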
| 907 | +

---

## 14.9 Summary and Key Takeaways

ML for price prediction is powerful but fragile. Success requires understanding its severe limitations.

### What Works:

✅ **Short horizons:** < 1 day (Medallion +76%), not months (RIEF -19.9%)
✅ **Ensembles:** RF + GBM + LASSO beats any single model
✅ **Walk-forward validation:** Always out-of-sample; retrain frequently
✅ **Bonferroni correction:** For feature selection with N tests
✅ **Regime detection:** Detect when the model breaks; reduce or stop trading

### What Fails:

❌ **Long horizons:** RIEF -19.9% while Medallion +76% (same company!)
❌ **Static models:** COVID killed every pre-2020 model
❌ **Data leakage:** 95% of papers unreproducible; MSE rises ~70% when leaks are fixed
❌ **Feature mining:** 1,000 features → 20 "work" in-sample → 0 work out-of-sample
❌ **Academic optimism:** Papers report Sharpe ratios 2-3x higher than reality

### Disaster Prevention Checklist:

1. **Short horizons only:** Max 1-day hold (preferably < 1 hour)
2. **Walk-forward always:** NEVER optimize on test data
3. **Expanding-window preprocessing:** Normalize only on past data
4. **Bonferroni correction:** α = 0.05 / num_features_tested
5. **Regime detection:** Monitor prediction error; retrain when it drifts
6. **Ensemble models:** Never rely on a single model
7. **Position limits:** 3% max, scaled by prediction confidence
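
Items 1-3 of this checklist hinge on strict temporal separation, which is straightforward to encode. A sketch of expanding-window splits (`walk_forward_splits` is a hypothetical helper, not scikit-learn's API; the fold count and sizes are illustrative):

```python
import numpy as np

def walk_forward_splits(n_obs: int, n_folds: int = 5, min_train: int = 100):
    """Yield (train, test) index arrays: each fold trains on everything up to a
    cutoff and tests on the next slice, so no fold ever sees the future."""
    test_size = (n_obs - min_train) // n_folds
    for k in range(n_folds):
        cut = min_train + k * test_size
        yield np.arange(cut), np.arange(cut, cut + test_size)

for train, test in walk_forward_splits(600):
    assert train.max() < test.min()  # strict temporal separation in every fold
```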

**Cost:** $500-2,000/month (compute, data, retraining)
**Benefit:** Avoid -19.9% (RIEF), -40% (COVID), and Sharpe collapse (leakage)

### Realistic Expectations (2024):

- **Sharpe ratio:** 0.6-1.2 (intraday ML), 0.2-0.5 (daily-or-longer ML)
- **Degradation:** Expect a 50-60% in-sample → out-of-sample Sharpe drop
- **Win rate:** 52-58% (barely better than random)
- **Decay speed:** Retrain monthly at minimum, weekly preferred
- **Capital required:** $25k+ (diversification, transaction costs)

---

## 14.10 Exercises

**1. Walk-Forward Validation:** Implement expanding-window backtesting and measure the Sharpe degradation.

**2. Data Leakage Detection:** Find the look-ahead bias in a given normalization routine.

**3. Bonferroni Correction:** Test 100 random features and apply the correction—how many survive?

**4. Regime Detection:** Implement an HMM to detect when model accuracy degrades.

**5. Renaissance Simulation:** Compare 1-minute vs. 1-month holding periods—does accuracy decay as predicted?

---

## 14.11 References (Expanded)

**Disasters:**
- Renaissance Technologies RIEF vs. Medallion performance (2005-2020)
- Kapoor, S., & Narayanan, A. (2023). "Leakage and the Reproducibility Crisis in ML-based Science."

**Academic Foundations:**
- Gu, S., Kelly, B., & Xiu, D. (2020). "Empirical Asset Pricing via Machine Learning." *Review of Financial Studies*
- Fischer, T., & Krauss, C. (2018). "Deep Learning with Long Short-Term Memory Networks for Financial Market Predictions." *European Journal of Operational Research*
- Bailey, D.H., et al. (2014). "Pseudo-Mathematics and Financial Charlatanism." *Notices of the AMS*

**Replication Crisis:**
- Harvey, C.R., Liu, Y., & Zhu, H. (2016). "...and the Cross-Section of Expected Returns." *Review of Financial Studies* (multiple testing)

**Practitioner:**
- "Machine Learning Volatility Forecasting: Avoiding the Look-Ahead Trap" (2024)
- "Overfitting and Its Impact on the Investor" (Man Group, 2021)

---

**End of Chapter 14**