- Clone this repository
- Install dependencies: `pip install -r requirements.txt`
- Set up Kaggle API credentials
- Download the dataset to the `./data/` folder
- Navigate to the project directory
- Run: `streamlit run streamlit_app.py`
- Open your browser to the local URL shown in the terminal (usually http://localhost:8501)
Alternatively, on Windows you can run the batch file: `run_streamlit.bat`
Run the Jupyter notebooks in the `notebooks/` folder for detailed analysis and model training.
The linear regression model lives in `featureengineering.ipynb`. The LightGBM model, in `MLModel_Optiver.ipynb`, uses the training table produced by `featureengineering.ipynb`. To skip the table creation step, download the table in CSV or Parquet format here: https://drive.google.com/drive/folders/1qbmvTMvfCfUSKyeAJ88yG2WZsUqrCfB4?usp=sharing
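Once downloaded, the table can be loaded directly with pandas. The filename below is a placeholder for whichever file you pull from the Drive folder:

```python
import pandas as pd

# Placeholder path: point this at the CSV or Parquet file downloaded from the Drive link.
train_features = pd.read_parquet("data/training_table.parquet")
# train_features = pd.read_csv("data/training_table.csv")  # CSV alternative

print(train_features.shape)
print(train_features.columns.tolist()[:10])  # peek at the first few engineered features
```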
An ML model used to predict realized volatility over 10-minute intervals.
This project was developed out of interest in working with real-world competition data from the Optiver Realized Volatility Prediction Kaggle challenge.
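The competition defines realized volatility as the square root of the sum of squared log returns of the stock's weighted average price (WAP) over the interval:

$$\sigma = \sqrt{\sum_{t} r_t^2}, \qquad r_t = \log\left(\frac{\mathrm{WAP}_t}{\mathrm{WAP}_{t-1}}\right)$$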
- Cross-Validation RMSE: 0.001397 (not directly comparable to leaderboard RMSPE)
- R² Score: ~0.7705
- Training RMSPE: 0.286029
- Validation RMSPE: 0.298207
- Validation R² Score: 0.782160
Feature analysis revealed spread-related features (`price_spread_avg_mean`, `price_spread_1_mean`) as the strongest predictors.
Merged and processed Kaggle-provided order book and trade data.
Engineered 46 statistical and volatility-based features per stock-time combination.
Selected top features using correlation with the target value (see the sketch below).
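A minimal sketch of this kind of feature computation and correlation-based selection, assuming top-of-book columns as provided by the competition data (`bid_price1`, `ask_price1`, `bid_size1`, `ask_size1`); the function and column names here are illustrative rather than the exact ones used in the notebooks:

```python
import numpy as np
import pandas as pd

def realized_volatility(log_returns: pd.Series) -> float:
    """Square root of the sum of squared log returns."""
    return float(np.sqrt(np.sum(log_returns ** 2)))

def book_features(book: pd.DataFrame) -> pd.Series:
    """A few illustrative features for one stock_id / time_id bucket."""
    # Weighted average price from the top of the order book.
    wap = (book["bid_price1"] * book["ask_size1"] + book["ask_price1"] * book["bid_size1"]) / (
        book["bid_size1"] + book["ask_size1"]
    )
    log_ret = np.log(wap).diff().dropna()
    spread = book["ask_price1"] - book["bid_price1"]
    return pd.Series({
        "realized_vol": realized_volatility(log_ret),
        "price_spread_1_mean": spread.mean(),
        "wap_mean": wap.mean(),
    })

def select_top_features(features: pd.DataFrame, target_col: str = "target", k: int = 10) -> list:
    """Rank engineered columns by absolute correlation with the target and keep the top k."""
    corr = features.corr(numeric_only=True)[target_col].drop(target_col)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()
```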
## Validation Strategy
Sequential split: train on earlier periods, validate on later periods to mimic real-world forecasting.
Avoided random splits to preserve time-series structure.
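A minimal sketch of such a sequential split, assuming the engineered table carries a `time_id` column whose order reflects chronology and a `target` column (the path and names are illustrative):

```python
import pandas as pd

# Engineered training table (see the Drive link above); the path is a placeholder.
features = pd.read_parquet("data/training_table.parquet")

# Hold out the latest 20% of time buckets for validation instead of splitting randomly.
features = features.sort_values("time_id")
cutoff = features["time_id"].quantile(0.8)

train_df = features[features["time_id"] <= cutoff]
valid_df = features[features["time_id"] > cutoff]

X_train, y_train = train_df.drop(columns=["target"]), train_df["target"]
X_valid, y_valid = valid_df.drop(columns=["target"]), valid_df["target"]
```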
## Models Used
- Linear Regression: simple baseline using the top-10 correlated features, scaled with StandardScaler.
- LightGBM: gradient boosting model using the full feature set for improved accuracy.
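A sketch of how these two models can be fit on the split above; the LightGBM hyperparameters are illustrative defaults, not the tuned values from `MLModel_Optiver.ipynb`:

```python
import lightgbm as lgb
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Baseline: pick the 10 features most correlated with the target, scale, and fit OLS.
top_features = X_train.corrwith(y_train, numeric_only=True).abs().nlargest(10).index.tolist()
linreg = make_pipeline(StandardScaler(), LinearRegression())
linreg.fit(X_train[top_features], y_train)

# LightGBM on the full feature set, with early stopping on the validation split.
lgbm = lgb.LGBMRegressor(n_estimators=2000, learning_rate=0.05, num_leaves=63)
lgbm.fit(
    X_train,
    y_train,
    eval_set=[(X_valid, y_valid)],
    callbacks=[lgb.early_stopping(100), lgb.log_evaluation(200)],
)
```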
- RMSE was used for internal Linear Regression evaluation.
- RMSPE (the competition metric) was used for LightGBM to match leaderboard scoring.
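For reference, RMSPE can be computed as follows (the standard formula; the notebook's implementation may differ in details):

```python
import numpy as np

def rmspe(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared percentage error, the competition's leaderboard metric."""
    return float(np.sqrt(np.mean(np.square((y_true - y_pred) / y_true))))

# e.g. rmspe(y_valid.to_numpy(), lgbm.predict(X_valid))
```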
Kaggle Dataset: https://www.kaggle.com/competitions/optiver-realized-volatility-prediction/overview
- Python
- pandas, NumPy (data manipulation)
- scikit-learn (Linear Regression, scaling, cross-validation)
- LightGBM (gradient boosting model)
- matplotlib, seaborn (visualization)
This project was completed in collaboration with:
- Benjamin Silva
- Juan Buitrago
- Praise Olatide