This project is designed to predict stock market trends using traditional ML, deep learning, and a hybrid LSTM-CNN architecture. Below is the step-by-step progress with brief descriptions.
BullBearAI/
│
├── data/                               # Raw and processed stock data
│   ├── raw/                            # Untouched downloaded data
│   ├── interim/                        # Intermediate transformation outputs
│   └── processed/                      # Cleaned and final datasets
│
├── notebooks/                          # Jupyter notebooks for EDA, modeling, evaluation
│   ├── 01_eda.ipynb                    # Exploratory Data Analysis
│   ├── 02_feature_engineering.ipynb    # Feature engineering techniques
│   ├── 03_ml_baselines.ipynb           # Traditional ML models: SVM, RF, LR, Gradient Boosting
│   ├── 04_time_series_models.ipynb     # Time series statistical models: ARIMA, SARIMA, GARCH
│   ├── 05_cnn_model.ipynb              # CNN-based deep learning model
│   ├── 06_lstm_model.ipynb             # LSTM (RNN) based sequence model
│   ├── 07_hybrid_cnn_lstm_model.ipynb  # Hybrid CNN-LSTM deep model
│   └── 08_model_comparison.ipynb       # Evaluation & performance comparison
│
├── src/                                # All source code
│   ├── config/                         # Configuration files and parameters
│   │   └── config.yaml
│   ├── data_loader/                    # Data loading and preprocessing scripts
│   │   └── load_data.py
│   ├── features/                       # Feature engineering functions
│   │   └── technical_indicators.py
│   ├── models/                         # ML & DL model definitions
│   │   ├── arima_model.py
│   │   ├── svm_model.py
│   │   ├── cnn_model.py
│   │   ├── lstm_model.py
│   │   └── hybrid_model.py
│   ├── training/                       # Training and validation loops
│   │   └── train_model.py
│   ├── evaluation/                     # Metrics and model comparisons
│   │   └── evaluate.py
│   └── visualization/                  # Custom plotting functions
│       └── plot_utils.py
│
├── saved_models/                       # Checkpoints and final models (.h5 or .pth)
│
├── reports/                            # Analysis reports, result plots, performance graphs
│   ├── figures/
│   └── model_comparison.md
│
├── cli/                                # Command-line tools for automation
│   └── run_train.py
│
├── tests/                              # Unit tests for various components
│   ├── test_models.py
│   ├── test_utils.py
│   └── test_data_loader.py
│
├── requirements.txt                    # Python dependencies
├── README.md                           # Project overview, setup, and usage
├── LICENSE                             # License info
└── .gitignore                          # Files to ignore in version control
- Loaded the raw stock market data (Netflix stock) from the `data/raw/` directory.
- Verified file integrity, parsed dates correctly, and ensured data types were appropriate.
- Saved a clean version in `data/processed/netflix_cleaned.csv`.
- Removed duplicates and handled any missing/null values.
- Renamed columns for consistency and usability (`Close/Last` instead of `Close*`).
- Converted all date fields to `datetime` format.
- Ensured data is sorted chronologically.
- Exported cleaned dataset to `data/processed/`.
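The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual `load_data.py`: the `Date`, `Open`, and `Volume` column names, the helper name `clean_prices`, the choice to drop (rather than impute) missing rows, and the sample values are all assumptions made for the example.

```python
import io

import pandas as pd


def clean_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rename, parse dates, de-duplicate, sort."""
    # Rename the raw close column to the name used in the rest of the project.
    out = df.rename(columns={"Close*": "Close/Last"})
    # Parse the (assumed) Date column into datetime values.
    out["Date"] = pd.to_datetime(out["Date"])
    # Assumption: missing values are dropped; imputation would also be valid.
    out = out.drop_duplicates().dropna()
    # Sort chronologically and rebuild a clean 0..n-1 index.
    return out.sort_values("Date").reset_index(drop=True)


# Synthetic rows standing in for a raw download (values are made up).
raw_csv = io.StringIO(
    "Date,Open,Close*,Volume\n"
    "2023-01-04,301.0,309.4,9345100\n"
    "2023-01-03,298.1,294.9,6764000\n"
    "2023-01-03,298.1,294.9,6764000\n"
    "2023-01-05,307.0,,8328400\n"
)
cleaned = clean_prices(pd.read_csv(raw_csv))
print(cleaned)
```

In a real run the frame would be read from `data/raw/` and written with `cleaned.to_csv(...)` into `data/processed/`.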
- Visualized time-series trends of `Close`, `Volume`, and `Open`.
- Used Seaborn and Matplotlib for:
  - Moving averages
  - Seasonal decomposition
  - Daily/Monthly return distributions
- Checked for trends, volatility, and patterns.
- Identified data gaps, outliers, or anomalies.
- All EDA work is saved in `notebooks/01_eda.ipynb`.
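The quantities behind the moving-average and return plots can be computed along these lines. This is a sketch on synthetic data; the 7-day window, the linear price series, and the variable names are illustrative assumptions rather than the notebook's exact code.

```python
import numpy as np
import pandas as pd

# Synthetic business-day close prices (assumption: stands in for the real series).
dates = pd.date_range("2023-01-02", periods=60, freq="B")
close = pd.Series(np.linspace(300.0, 359.0, 60), index=dates, name="Close")

# Daily return: percentage change from the previous trading day.
daily_return = close.pct_change()

# 7-day simple moving average; the first 6 values are NaN by construction.
ma_7 = close.rolling(window=7).mean()

# Monthly return: percentage change between the last closes of each month.
month_end_close = close.groupby(close.index.to_period("M")).last()
monthly_return = month_end_close.pct_change()

print(daily_return.describe())
print(monthly_return)
```

The resulting series can be passed directly to Matplotlib or Seaborn (for example as histogram input) to reproduce the distribution plots.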
Performed a comprehensive set of transformations to prepare predictive features:
- Extracted: `Year`, `Month`, `Day`, `DayOfWeek`, and `IsWeekend`.
- Created lagged versions of `Close/Last` and `Volume` (lags: 1, 2, 3 days).
- Computed rolling means, medians, stds, max, min for 7, 14, and 30-day windows.
- Daily percentage change, return, and rolling return metrics.
- Simple & Exponential Moving Averages (SMA, EMA)
- RSI (Relative Strength Index)
- MACD (Moving Average Convergence Divergence)
- Bollinger Bands
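
The technical indicators listed above could be computed with pandas as in the sketch below. The window lengths (14 for SMA/EMA/RSI, 12/26/9 for MACD, 20 and 2 standard deviations for Bollinger Bands) are common defaults assumed here, not values confirmed by the project, and the RSI uses simple rolling averages instead of Wilder's smoothing. The function and column names are hypothetical.

```python
import numpy as np
import pandas as pd


def add_indicators(close: pd.Series) -> pd.DataFrame:
    """Hypothetical helper returning a frame of indicator columns."""
    out = pd.DataFrame({"Close": close})

    # Simple and exponential moving averages.
    out["SMA_14"] = close.rolling(14).mean()
    out["EMA_14"] = close.ewm(span=14, adjust=False).mean()

    # RSI from simple rolling means of gains and losses.
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(14).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(14).mean()
    rs = avg_gain / avg_loss
    out["RSI_14"] = 100 - 100 / (1 + rs)

    # MACD line (fast EMA minus slow EMA) and its signal line.
    ema_12 = close.ewm(span=12, adjust=False).mean()
    ema_26 = close.ewm(span=26, adjust=False).mean()
    out["MACD"] = ema_12 - ema_26
    out["MACD_Signal"] = out["MACD"].ewm(span=9, adjust=False).mean()

    # Bollinger Bands: rolling mean plus/minus two rolling standard deviations.
    mid = close.rolling(20).mean()
    std = close.rolling(20).std()
    out["BB_Upper"] = mid + 2 * std
    out["BB_Lower"] = mid - 2 * std
    return out


# Synthetic random-walk prices used only to exercise the function.
rng = np.random.default_rng(0)
close = pd.Series(300 + rng.normal(0, 2, 120).cumsum())
ind = add_indicators(close)
print(ind.tail())
```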
Target variables:
- `Target_Close_Next_Day`: Next day's close price
- `Target_UpDown`: Binary classification target (1 = price goes up, 0 = down)

Engineered dataset saved to: `data/interim/engineered_features.csv`.
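As a rough illustration of the lag, rolling-window, and target construction described in this section, consider the sketch below. Only `Close/Last`, `Volume`, `Target_Close_Next_Day`, and `Target_UpDown` come from the text; the other column names and the helper name are hypothetical, and only mean and standard deviation are shown for the rolling statistics.

```python
import pandas as pd


def add_lag_and_target_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: lags, rolling stats, returns, and targets."""
    out = df.copy()

    # Lagged copies of price and volume (1, 2, and 3 days back).
    for lag in (1, 2, 3):
        out[f"Close_Lag_{lag}"] = out["Close/Last"].shift(lag)
        out[f"Volume_Lag_{lag}"] = out["Volume"].shift(lag)

    # Rolling statistics over 7-, 14-, and 30-day windows.
    for window in (7, 14, 30):
        roll = out["Close/Last"].rolling(window)
        out[f"Close_RollMean_{window}"] = roll.mean()
        out[f"Close_RollStd_{window}"] = roll.std()

    # Daily percentage change of the close.
    out["Daily_Return"] = out["Close/Last"].pct_change()

    # Targets: tomorrow's close, and whether it is strictly higher than today's.
    out["Target_Close_Next_Day"] = out["Close/Last"].shift(-1)
    out["Target_UpDown"] = (
        out["Target_Close_Next_Day"] > out["Close/Last"]
    ).astype(int)

    # The final row has no next-day close, so it cannot be used for training.
    return out.iloc[:-1]


df = pd.DataFrame(
    {
        "Close/Last": [10.0, 11.0, 10.5, 12.0, 12.0],
        "Volume": [100, 120, 90, 150, 130],
    }
)
features = add_lag_and_target_features(df)
print(features[["Close/Last", "Target_Close_Next_Day", "Target_UpDown"]])
```

Note that an unchanged price is labelled 0 under this definition, and that rows whose lag or rolling columns are still NaN would normally be dropped before modeling.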
This notebook builds baseline regression models to predict `Target_Close_Next_Day`, the actual next-day closing price of the stock.
Implemented Models:
- Linear Regression
- Support Vector Regression (SVR)
- Random Forest Regressor
- Gradient Boosting Regressor
Highlights:
- Models trained on engineered features including lag features, rolling window stats, and technical indicators (e.g., RSI, MACD, Bollinger Bands).
- Evaluation metrics include:
  - MAE (Mean Absolute Error)
  - RMSE (Root Mean Squared Error)
  - R² Score
- Model performance metrics:

  Model | MAE | RMSE | R² Score |
  ---|---|---|---|
  LR | 19.25 | 22.43 | -0.30 |
  SVR | 27.82 | 34.50 | -2.08 |
  RF | 9.04 | 11.88 | 0.63 |
  GB | 8.72 | 11.40 | 0.66 |

- Visualizations:
  - Actual vs Predicted Prices (line plot)
  - Residual Plot (errors)
  - MAE & RMSE comparison bar charts
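
For reference, the three metrics can be computed from first principles as in the sketch below, which also shows why R² can be negative, as it is for LR and SVR in the table: the score drops below zero whenever the model's squared error exceeds that of always predicting the mean of the true values. The sample numbers are made up for illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))

    # R^2 compares the model's squared error with that of a mean-only baseline.
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


y_true = [100.0, 102.0, 104.0, 106.0]
good = regression_metrics(y_true, [101.0, 101.0, 105.0, 105.0])
bad = regression_metrics(y_true, [90.0, 90.0, 90.0, 90.0])
print(good)  # small errors, R2 close to 1
print(bad)   # constant, biased prediction gives a negative R2
```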
This section covers three time series models:
- ARIMA: Captures trend using autoregressive and moving average components.
- SARIMA: Extends ARIMA by modeling seasonality.
- GARCH: Models time-varying volatility (useful for financial series).
Model | MAE | RMSE |
---|---|---|
ARIMA | 6.134887 | 15.929801 |
SARIMA | 19.205966 | 21.711764 |
- MAE (Mean Absolute Error): Measures average absolute errors.
- RMSE (Root Mean Squared Error): Penalizes large errors more.
- ARIMA works well for capturing trend but may struggle with seasonality.
- SARIMA can improve results when seasonality is present; in the table above, however, it scores worse than ARIMA on both MAE and RMSE.
- GARCH is useful for understanding and forecasting volatility, especially in financial data such as stock prices.
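
To make the idea of time-varying volatility concrete, the sketch below implements only the GARCH(1,1) conditional-variance recursion, σ²ₜ = ω + α·ε²ₜ₋₁ + β·σ²ₜ₋₁, in NumPy. It is not a fitting routine: the parameters `omega`, `alpha`, and `beta` are illustrative assumptions (in practice they are estimated by a dedicated library), and the returns are assumed to be mean-zero.

```python
import numpy as np


def garch11_variance(returns, omega, alpha, beta):
    """Conditional variance path of a GARCH(1,1) model with given parameters."""
    returns = np.asarray(returns, dtype=float)
    sigma2 = np.empty(len(returns))
    # Assumption: start the recursion from the sample variance of the returns.
    sigma2[0] = returns.var()
    for t in range(1, len(returns)):
        # Yesterday's squared shock and yesterday's variance drive today's variance.
        sigma2[t] = omega + alpha * returns[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2


# Made-up daily returns with one large shock in the middle.
returns = np.array([0.001, -0.002, 0.001, 0.050, -0.001, 0.002, 0.000])
sigma2 = garch11_variance(returns, omega=1e-6, alpha=0.1, beta=0.85)
print(np.sqrt(sigma2))  # conditional volatility rises after the shock, then decays
```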
Use deep learning (CNN) to model patterns in stock price sequences and predict future values with better local feature extraction than traditional models.
Step | Description |
---|---|
Scaling | Applies MinMaxScaler to normalize prices between 0 and 1. |
Sequence Generation | Converts time series into sequences using sliding windows. |
CNN Architecture | 1D Convolution + MaxPooling + Dense layers. |
Training | Compiled with adam optimizer and mse loss. |
Evaluation | MAE, RMSE, and future price predictions plotted. |
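
The scaling and sequence-generation steps in the table can be sketched in plain NumPy as follows. The min-max formula is the same transformation `MinMaxScaler` applies by default, written out by hand here; the helper names and the window length of 3 are illustrative assumptions. In practice the scaling bounds would usually be taken from the training portion only, to avoid leaking future information.

```python
import numpy as np


def min_max_scale(values):
    """Scale values linearly into [0, 1]; also return the bounds for inversion."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo), lo, hi


def make_windows(series, window):
    """Turn a 1-D series into (samples, window, 1) inputs and next-step targets."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i : i + window] for i in range(len(series) - window)])
    y = series[window:]
    # Trailing axis of size 1 is the feature/channel dimension expected by Conv1D-style layers.
    return X[..., np.newaxis], y


prices = np.arange(10.0, 20.0)           # ten made-up closing prices
scaled, lo, hi = min_max_scale(prices)
X, y = make_windows(scaled, window=3)
print(X.shape, y.shape)

# Predictions on the scaled axis can be mapped back to prices with the stored bounds.
restored = y * (hi - lo) + lo
```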
Metric | Value |
---|---|
MAE | 9.66 |
RMSE | 11.93 |
Leverage LSTM (a variant of RNN) for time series forecasting of stock prices using historical closing data. LSTMs are well suited to sequential data because their gated memory cells retain long-term information and mitigate the vanishing gradient problem of vanilla RNNs.
- `Close/Last`: Normalized closing price.
- `Target_Close_Next_Day`: Target value to predict (next day's closing price).
- LSTM Architecture:
  - Contains memory cells with gates (input, forget, and output).
  - Capable of learning both short-term and long-term temporal patterns.
- Sliding Window: We use 60-day historical windows to predict the next day's price.
- EarlyStopping: To avoid overfitting (patience = 10)
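
The patience rule behind early stopping can be illustrated without any deep learning framework. The sketch below is a framework-independent approximation of the idea, not the callback used in the notebook: training stops once the validation loss has failed to improve for `patience` consecutive epochs. The function name, the `min_delta` parameter, and the loss values are assumptions for the example.

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement: remember it and reset the patience counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs in a row: stop here.
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted.
    return len(val_losses) - 1, best_epoch


# Made-up losses: improvement for three epochs, then a plateau.
losses = [1.0, 0.8, 0.7] + [0.75] * 15
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=10)
print(stop_epoch, best_epoch)
```

With these values the best loss occurs at epoch 2 and training halts ten epochs later; the weights from the best epoch are the ones worth keeping.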
Metric | Value |
---|---|
MAE | 8.3355 |
RMSE | 10.2783 |