A machine learning project that predicts SPY (S&P 500 ETF) price movements over the next 3 days using gradient boosting models (XGBoost, LightGBM, CatBoost) trained on 115 engineered features from technical indicators, volume patterns, and cross-asset relationships.
This project implements a comprehensive machine learning pipeline for predicting whether SPY will move up or down over the next 3 trading days. The pipeline includes:
- Data Collection: Automated download of OHLCV data for SPY, VIX, TLT, DXY, and GLD
- Feature Engineering: 115 technical indicators across 5 feature modules
- Model Training: Three gradient boosting algorithms with 2-phase training (all features → selected features)
- Model Evaluation: Comprehensive performance analysis and comparison
- Production Ready: Model loading and prediction examples
| Model | Test ROC AUC | Test Accuracy | Status |
|---|---|---|---|
| LightGBM (40 features) | 0.5546 | 60.54% | ✓ Best |
| CatBoost (40 features) | 0.4953 | 59.77% | |
| XGBoost (40 features) | 0.4750 | 58.62% | |
| Ensemble (soft voting) | 0.4857 | 55.17% | ✗ Worse |
Winner: LightGBM with 40 selected features achieves the best performance.
```
stockprediction/
├── README.md                        # This file
├── download_data.py                 # Data download script
├── download_data.md                 # Data download documentation
├── feature_engineering.py           # Main feature engineering pipeline
├── FEATURE_MODULES_README.md        # Feature engineering documentation
├── spy_xgboost_features.md          # Feature specifications
│
├── Feature Modules/
│   ├── price_based_features.py      # Price, momentum, MA features
│   ├── volume_features.py           # Volume-based features
│   ├── volatility_features.py       # Volatility and technical indicators
│   ├── cross_asset_features.py      # VIX, TLT, DXY, GLD features
│   └── regime_dependent_features.py # Market regime features
│
├── Model Training Scripts/
│   ├── modelling_xgboost.py         # XGBoost training
│   ├── modelling_lightgbm.py        # LightGBM training (BEST)
│   ├── modelling_catboost.py        # CatBoost training
│   └── modelling_ensemble.py        # Ensemble model
│
├── modelling.md                     # Model results & usage guide
│
└── output/
    ├── spy_features_full.csv        # Generated features
    └── models/
        ├── xgboost/                 # XGBoost models & metrics
        ├── lightgbm/                # LightGBM models & metrics ★
        ├── catboost/                # CatBoost models & metrics
        └── ensemble/                # Ensemble model & metrics
```
- download_data.md - Data collection and preparation
- FEATURE_MODULES_README.md - Feature engineering details
- spy_xgboost_features.md - Feature specifications
- modelling.md - Model comparison, results, and usage guide
```bash
# Clone the repository
git clone https://github.com/yourusername/stockprediction.git
cd stockprediction

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install pandas numpy yfinance ta-lib
pip install xgboost lightgbm catboost
pip install scikit-learn matplotlib seaborn
```

```bash
# Download OHLCV data for SPY, VIX, TLT, DXY, GLD (2015-present)
python download_data.py
```

See download_data.md for details.
```bash
# Run feature engineering pipeline (creates 115 features)
python feature_engineering.py
```

Output: output/spy_features_full.csv

See FEATURE_MODULES_README.md for feature details.
```bash
# Train LightGBM (recommended)
python modelling_lightgbm.py

# Or train all models
python modelling_xgboost.py
python modelling_catboost.py
python modelling_ensemble.py
```

Models are saved to output/models/{model_name}/
```python
import lightgbm as lgb
import pandas as pd

# Load best model
model = lgb.Booster(model_file='output/models/lightgbm/lightgbm_model_selected_features.json')

# Load required features
with open('output/models/lightgbm/selected_features.txt', 'r') as f:
    required_features = [line.strip() for line in f]

# Load your data
df = pd.read_csv('output/spy_features_full.csv')
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# Get latest data
latest = df[required_features].iloc[-1:]

# Predict
prob_up = model.predict(latest)[0]
prediction = "Up" if prob_up > 0.5 else "Down"
print(f"Prediction: {prediction}")
print(f"Probability: {prob_up:.2%}")
```

See modelling.md for the complete usage guide.
- **Price-Based Features** (19 features)
  - Momentum indicators (3d, 5d, 10d, 20d, 50d)
  - Moving averages (20, 50, 100, 200-day)
  - Price extremes and gaps
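As an illustration of how this family of features is typically computed, here is a minimal pandas sketch. The column names are assumptions that mirror the naming pattern used elsewhere in this README, not the pipeline's exact identifiers:

```python
import pandas as pd

def price_features(close: pd.Series) -> pd.DataFrame:
    # Hypothetical column names; the real module defines its own identifiers.
    out = pd.DataFrame(index=close.index)
    for n in (3, 5, 10, 20, 50):
        # n-day momentum as the percentage change over n bars
        out[f"spy_momentum_{n}d"] = close.pct_change(n)
    for n in (20, 50, 100, 200):
        # price relative to its n-day simple moving average
        out[f"spy_ma_{n}d_ratio"] = close / close.rolling(n).mean()
    return out
```

Expressing momentum and MA features as ratios (rather than raw prices) keeps them scale-free, which matters when training spans a decade of price levels.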
- **Volume Features** (16 features)
  - Volume ratios and z-scores
  - Volume trends and spikes
  - Accumulation/distribution patterns
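A sketch of the ratio/z-score/spike family, under the same caveat that the names are hypothetical:

```python
import pandas as pd

def volume_features(volume: pd.Series) -> pd.DataFrame:
    # Hypothetical names mirroring the feature list; a sketch, not the module itself.
    out = pd.DataFrame(index=volume.index)
    for n in (20, 50):
        roll = volume.rolling(n)
        # ratio of today's volume to its n-day average
        out[f"spy_volume_ratio_{n}d"] = volume / roll.mean()
        # z-score of today's volume within the n-day window
        out[f"spy_volume_zscore_{n}d"] = (volume - roll.mean()) / roll.std()
    # flag days with more than twice the 20-day average volume
    out["spy_volume_spike"] = (out["spy_volume_ratio_20d"] > 2).astype(int)
    return out
```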
- **Volatility Features** (30 features)
  - ATR, Bollinger Bands, RSI, MACD
  - Volatility regimes
  - Technical indicator combinations
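For instance, RSI can be sketched as below. Note this simplified version uses plain rolling means, whereas TA-Lib follows Wilder's exponentially smoothed averages, so the values will differ slightly:

```python
import pandas as pd

def rsi(close: pd.Series, n: int = 14) -> pd.Series:
    # Simplified RSI built from simple rolling means of gains and losses.
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(n).mean()
    loss = (-delta.clip(upper=0)).rolling(n).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)
```

By construction the result is bounded in [0, 100]: a run of pure gains drives it to 100, a run of pure losses to 0.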
- **Cross-Asset Features** (35 features)
  - VIX (volatility index) features
  - TLT (bonds), DXY (dollar), GLD (gold)
  - Risk-on/risk-off regimes
- **Regime-Dependent Features** (15 features)
  - Market regime detection
  - Hurst exponent (trend persistence)
  - Regime-specific indicators
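The Hurst exponent mentioned above can be estimated, for example, from how the spread of lagged differences scales with the lag. This is one common estimator, not necessarily the one the module uses:

```python
import numpy as np

def hurst(series: np.ndarray, max_lag: int = 20) -> float:
    # std(x[t+lag] - x[t]) scales roughly like lag**H for a self-similar process;
    # fit H as the slope in log-log space.  H > 0.5 suggests trend persistence,
    # H < 0.5 mean reversion, H ≈ 0.5 a random walk.
    lags = np.arange(2, max_lag)
    tau = [np.std(series[lag:] - series[:-lag]) for lag in lags]
    slope, _ = np.polyfit(np.log(lags), np.log(tau), 1)
    return slope
```

On a simulated random walk this estimator should land near 0.5; deviations on real price series are what the regime features try to capture.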
- `spy_volume_ratio_50d` - 50-day volume ratio
- `spy_rsi_14` - 14-day RSI
- `spy_macd_pct` - MACD percentage
- `spy_volume_ratio_20d` - 20-day volume ratio
- `spy_momentum_3d` - 3-day momentum
- `vix_spy_ratio` - VIX to SPY ratio
Phase 1: All Features (115)
- Train with all engineered features
- Calculate permutation importance on test set
- Identify truly predictive features
Phase 2: Selected Features (40)
- Retrain with top 40 features from permutation importance
- Reduce overfitting and improve generalization
- Compare performance vs full model
- XGBoost: Level-wise tree growth with aggressive regularization
- LightGBM: Leaf-wise tree growth with default parameters ✓
- CatBoost: Ordered boosting with symmetric trees
- Ensemble: Soft voting (probability averaging) of all three
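Soft voting itself is just probability averaging; a minimal sketch:

```python
import numpy as np

def soft_vote(prob_lists):
    # Average each model's predicted P(up) per sample, then threshold at 0.5.
    avg = np.mean(np.column_stack(prob_lists), axis=1)
    return avg, (avg > 0.5).astype(int)
```

Equal-weight averaging is the simplest variant; weighted voting or stacking (see Contributing below) would let the stronger LightGBM model dominate the blend.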
Metrics (Test Set: 2025-01-02 to 2026-01-16):
- ROC AUC: 0.5546
- Accuracy: 60.54%
- Precision (Up): 61%
- Recall (Up): 96%
- F1 Score: 0.7469
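As a sanity check, the reported F1 is consistent with the precision/recall pair above, since F1 is their harmonic mean; the small gap comes from the displayed precision and recall being rounded:

```python
p, r = 0.61, 0.96            # rounded precision and recall for the "Up" class
f1 = 2 * p * r / (p + r)     # F1 = harmonic mean of precision and recall
print(round(f1, 3))          # 0.746, matching the reported 0.7469 up to rounding
```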
Characteristics:
- High recall for "Up" predictions (catches most upward movements)
- Low precision for "Down" predictions (many false alarms)
- Stopped at iteration 3 (aggressive early stopping)
- Best generalization with feature selection
- Volume features are most predictive across all models
- Feature selection significantly improved LightGBM (+7.46% ROC AUC)
- Early stopping was critical for preventing overfitting
- Market regime shifts impact model performance (train: 2015-2023, test: 2024-2026)
- Limited predictive power for 3-day movements (best AUC ~0.55)
- Not recommended for live trading without additional signals
- Performance may degrade with market regime changes
- Best used as one signal among many
- 3-day prediction horizon is inherently challenging
- Retrain periodically with updated data
- Monitor performance metrics over time
- Implement proper risk management
- Consider ensemble with other strategies
```bash
# 1. Update data
python download_data.py

# 2. Regenerate features
python feature_engineering.py

# 3. Retrain models
python modelling_lightgbm.py
python modelling_xgboost.py
python modelling_catboost.py

# 4. Compare results
# Check output/models/{model_name}/metrics.json
```

Each model generates:

- `{model}_model.json` - Full model (115 features)
- `{model}_model_selected_features.json` - Selected model (40 features)
- `permutation_importance.csv` - Feature importance scores
- `selected_features.txt` - Top 40 features list
- `feature_importance_top30.png` - Importance visualization
- `permutation_importance_top40.png` - Permutation importance plot
- `training_curves.png` - AUC curves during training
- `metrics.json` - Performance metrics
- Python: 3.11+
- Data: pandas, numpy, yfinance
- ML: xgboost, lightgbm, catboost, scikit-learn
- Visualization: matplotlib, seaborn
- Technical Analysis: ta-lib
- Feature Engineering Guide - Detailed feature descriptions
- Model Comparison & Usage - Complete model analysis and usage examples
- Data Collection - Data sources and preparation
Contributions are welcome! Areas for improvement:
- Feature Engineering: Add sentiment data, options flow, market microstructure
- Model Improvements: Extend prediction horizon, add regime detection
- Alternative Targets: Predict magnitude of moves, not just direction
- Ensemble Methods: Weighted voting, stacking, blending
This project is for educational purposes. Not financial advice.
- Data provided by Yahoo Finance (yfinance)
- Technical indicators from TA-Lib
- Gradient boosting frameworks: XGBoost, LightGBM, CatBoost
Last Updated: 2026-01-17
Best Model: output/models/lightgbm/lightgbm_model_selected_features.json