- Clone this repository
- Install dependencies: `pip install -r requirements.txt`
- Set up Kaggle API credentials
- Download the dataset to the `./data/` folder
- Navigate to the project directory
- Run: `streamlit run streamlit_app.py`
- Open your browser to the local URL shown in the terminal (usually http://localhost:8501)
Alternatively, on Windows you can run the batch file: `run_streamlit.bat`
Run the Jupyter notebooks in the `notebooks/` folder for detailed analysis and model training.
The linear regression model lives in `featureengineering.ipynb`. The LightGBM model, in `MLModel_Optiver.ipynb`, uses the training table produced by `featureengineering.ipynb`. To skip the table creation step, download the table in CSV or Parquet format here: https://drive.google.com/drive/folders/1qbmvTMvfCfUSKyeAJ88yG2WZsUqrCfB4?usp=sharing
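Once downloaded, the table can be loaded directly with pandas. The filename below is a placeholder for whichever file you pull from the Drive folder:

```python
import pandas as pd

# Placeholder path: point this at the CSV or Parquet file downloaded from the Drive link.
train_features = pd.read_parquet("data/training_table.parquet")
# train_features = pd.read_csv("data/training_table.csv")  # CSV alternative

print(train_features.shape)
print(train_features.columns.tolist()[:10])  # peek at the first few engineered features
```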
An ML model used to predict realized volatility over 10-minute intervals.
This project was developed out of interest in working with real-world competition data from the Optiver Realized Volatility Prediction Kaggle challenge.
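The competition defines realized volatility as the square root of the sum of squared log returns of the stock's weighted average price (WAP) over the interval:

$$\sigma = \sqrt{\sum_{t} r_t^2}, \qquad r_t = \log\left(\frac{\mathrm{WAP}_t}{\mathrm{WAP}_{t-1}}\right)$$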
- Cross-Validation RMSE: 0.001397 (not directly comparable to leaderboard RMSPE)
- R² Score: ~0.7705
- Training RMSPE: 0.286029
- Validation RMSPE: 0.298207
- Validation R² Score: 0.782160
Feature analysis revealed spread-related features (`price_spread_avg_mean`, `price_spread_1_mean`) as the strongest predictors.
Merged and processed Kaggle-provided order book and trade data.
Engineered 46 statistical and volatility-based features per stock-time combination.
Selected top features using correlation with the target value (see the sketch below).
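A minimal sketch of this kind of feature computation and correlation-based selection, assuming top-of-book columns as provided by the competition data (`bid_price1`, `ask_price1`, `bid_size1`, `ask_size1`); the function and column names here are illustrative rather than the exact ones used in the notebooks:

```python
import numpy as np
import pandas as pd

def realized_volatility(log_returns: pd.Series) -> float:
    """Square root of the sum of squared log returns."""
    return float(np.sqrt(np.sum(log_returns ** 2)))

def book_features(book: pd.DataFrame) -> pd.Series:
    """A few illustrative features for one stock_id / time_id bucket."""
    # Weighted average price from the top of the order book.
    wap = (book["bid_price1"] * book["ask_size1"] + book["ask_price1"] * book["bid_size1"]) / (
        book["bid_size1"] + book["ask_size1"]
    )
    log_ret = np.log(wap).diff().dropna()
    spread = book["ask_price1"] - book["bid_price1"]
    return pd.Series({
        "realized_vol": realized_volatility(log_ret),
        "price_spread_1_mean": spread.mean(),
        "wap_mean": wap.mean(),
    })

def select_top_features(features: pd.DataFrame, target_col: str = "target", k: int = 10) -> list:
    """Rank engineered columns by absolute correlation with the target and keep the top k."""
    corr = features.corr(numeric_only=True)[target_col].drop(target_col)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()
```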
## Validation Strategy
Sequential split: train on earlier periods, validate on later periods to mimic real-world forecasting.
Avoided random splits to preserve time-series structure.
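A minimal sketch of such a sequential split, assuming the engineered table carries a `time_id` column whose order reflects chronology and a `target` column (the path and names are illustrative):

```python
import pandas as pd

# Engineered training table (see the Drive link above); the path is a placeholder.
features = pd.read_parquet("data/training_table.parquet")

# Hold out the latest 20% of time buckets for validation instead of splitting randomly.
features = features.sort_values("time_id")
cutoff = features["time_id"].quantile(0.8)

train_df = features[features["time_id"] <= cutoff]
valid_df = features[features["time_id"] > cutoff]

X_train, y_train = train_df.drop(columns=["target"]), train_df["target"]
X_valid, y_valid = valid_df.drop(columns=["target"]), valid_df["target"]
```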
## Models Used
- Linear Regression: simple baseline using the top-10 correlated features, scaled with StandardScaler.
- LightGBM: gradient boosting model using the full feature set for improved accuracy.
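A sketch of how these two models can be fit on the split above; the LightGBM hyperparameters are illustrative defaults, not the tuned values from `MLModel_Optiver.ipynb`:

```python
import lightgbm as lgb
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Baseline: pick the 10 features most correlated with the target, scale, and fit OLS.
top_features = X_train.corrwith(y_train, numeric_only=True).abs().nlargest(10).index.tolist()
linreg = make_pipeline(StandardScaler(), LinearRegression())
linreg.fit(X_train[top_features], y_train)

# LightGBM on the full feature set, with early stopping on the validation split.
lgbm = lgb.LGBMRegressor(n_estimators=2000, learning_rate=0.05, num_leaves=63)
lgbm.fit(
    X_train,
    y_train,
    eval_set=[(X_valid, y_valid)],
    callbacks=[lgb.early_stopping(100), lgb.log_evaluation(200)],
)
```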
- RMSE was used for internal Linear Regression evaluation.
- RMSPE (the competition metric) was used for LightGBM to match leaderboard scoring.
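For reference, RMSPE can be computed as follows (the standard formula; the notebook's implementation may differ in details):

```python
import numpy as np

def rmspe(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared percentage error, the competition's leaderboard metric."""
    return float(np.sqrt(np.mean(np.square((y_true - y_pred) / y_true))))

# e.g. rmspe(y_valid.to_numpy(), lgbm.predict(X_valid))
```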
Kaggle Dataset: https://www.kaggle.com/competitions/optiver-realized-volatility-prediction/overview
- Python
- pandas, NumPy (data manipulation)
- scikit-learn (Linear Regression, scaling, cross-validation)
- LightGBM (gradient boosting model)
- matplotlib, seaborn (visualization)
This project was completed in collaboration with:
- Benjamin Silva
- Juan Buitrago
- Praise Olatide