corumto/ML_stock_prediction

🧠 Stock Market Buy/Sell Prediction using Time Series & Sentiment Analysis

This project aims to predict buy, sell, or hold signals for a given stock using a combination of time series forecasting, financial indicators, and sentiment analysis from both structured and unstructured data sources.

It uses:

  • TensorFlow and Keras for deep learning (LSTM models with attention mechanisms)
  • MLflow for experiment tracking and model versioning
  • scikit-learn, pandas, NumPy for data processing and time series analysis
  • Yahoo Finance for historical stock price data
  • SEC EDGAR for financial filings (10-K, 10-Q) sentiment analysis
  • NewsAPI for financial news sentiment
  • Bluesky for social media sentiment
  • FinBERT (HuggingFace Transformers) for financial sentiment analysis

📊 Project Overview

The model combines:

  • Financial Time Series Analysis – Historical stock price data with technical indicators (RSI, Bollinger Bands, moving averages)
  • Textual Sentiment Analysis – Extracted from SEC filings, news articles, and social media using FinBERT
  • Feature Fusion and Deep Learning – LSTM-based model with optional attention mechanism to output a three-class classification: Hold (0), Buy (1), or Sell (2)

πŸ—‚οΈ Data Sources

Source          Description
Yahoo Finance   Historical stock prices (Open, High, Low, Close, Volume)
SEC EDGAR       10-K and 10-Q filings with FinBERT sentiment analysis
NewsAPI         Financial news articles with sentiment scores
Bluesky         Social media posts with sentiment analysis

βš™οΈ Tech Stack

Category               Tools
Programming Language   Python 3.11+
Deep Learning          TensorFlow 2.20+, Keras 3.12+
Experiment Tracking    MLflow
Data Processing        Pandas, NumPy, scikit-learn
Stock Data             yfinance
Text Analysis          HuggingFace Transformers (FinBERT), NLTK
APIs                   NewsAPI, sec-edgar-downloader
Visualization          Matplotlib, Seaborn

🧱 Project Structure

Final Project/
│
├── main.ipynb              # Main Jupyter notebook with complete pipeline
├── combined_data_aapl.csv  # Processed combined dataset
├── bsky_posts.csv          # Processed Bluesky posts
│
├── models/                 # Saved model files
│   └── lstm_stock_prediction.h5
│
├── sec-edgar-filings/      # Downloaded SEC filings
│
├── mlruns/                 # MLflow experiment logs
│
├── Readme.md
└── requirements.txt

Main Notebook Structure (main.ipynb)

The notebook is organized into the following sections:

  1. Setup & Imports

    • TensorFlow/Keras configuration
    • Library imports and logging setup
  2. Sentiment Analysis

    • FinBERT model loading and sentiment calculation functions
  3. Data Collection

    • fetch_stock_data_yfinance() - Yahoo Finance stock data
    • fetch_sec_filings() - SEC EDGAR filings with sentiment
    • fetch_news_articles() - NewsAPI articles with sentiment
  4. Feature Engineering

    • create_buy_sell_labels() - Three-class label creation (Hold/Buy/Sell)
    • calculate_technical_indicators() - RSI, Bollinger Bands, moving averages
    • prepare_features_for_lstm() - Sequence preparation for LSTM
    • aggregate_sentiment_by_date() - Daily sentiment aggregation
  5. Data Collection Examples

    • SEC filings processing pipeline
    • News article collection examples
    • Bluesky sentiment integration
  6. Model Architecture

    • create_lstm_model() - Standard LSTM model builder
    • create_attention_lstm_model() - LSTM with attention mechanism
    • train_model_with_callbacks() - Training with early stopping, checkpointing
    • evaluate_model_tf() - Comprehensive model evaluation
  7. Training Pipeline

    • Complete end-to-end training workflow
    • MLflow experiment tracking
    • Train/validation/test split
  8. Visualization

    • ROC curves and confusion matrices
    • Training history plots
  9. Backtesting

    • Trading strategy evaluation
    • Profit/loss calculation
    • Performance visualization

🚀 Setup & Installation

# Create a virtual environment
python -m venv venv
source venv/bin/activate    # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set up environment variables (optional)
# Set NEWS_API_KEY if using news articles
# Set BLUESKY_HANDLE and BLUESKY_PASSWORD if using Bluesky posts

Environment Variables

  • NEWS_API_KEY: API key for NewsAPI, used to fetch financial news articles. Keys are issued at newsapi.org.
  • BLUESKY_HANDLE: Handle of the Bluesky account used to fetch posts. Accounts are created at bsky.app.
  • BLUESKY_PASSWORD: Password for that Bluesky account.

🧠 Model Workflow

1. Data Collection

The notebook provides functions to collect data from multiple sources:

  • Stock Prices: fetch_stock_data_yfinance(symbol, start_date, end_date)

    • Fetches historical OHLCV data from Yahoo Finance
  • SEC Filings: fetch_sec_filings(symbol, start_date, end_date)

    • Downloads 10-K and 10-Q filings from SEC EDGAR
    • Extracts text and calculates FinBERT sentiment scores
  • News Articles: fetch_news_articles(symbol, start_date, end_date)

    • Fetches financial news from NewsAPI
    • Calculates sentiment for each article

2. Feature Engineering

  • Technical Indicators:

    • Moving Averages (SMA_5, SMA_10, SMA_20)
    • RSI (Relative Strength Index)
    • Bollinger Bands (upper, lower, width, position)
    • Volume indicators (Volume_SMA, Volume_ratio)
    • Price change indicators
  • Label Creation: create_buy_sell_labels()

    • Creates three-class labels based on future price movements
    • Hold (0): Price change within threshold (-2% to +2%)
    • Buy (1): Future price increase > 2%
    • Sell (2): Future price change < -2%
  • Sentiment Aggregation:

    • Aggregates daily sentiment scores from news, SEC filings, and social media
    • Merges sentiment features with stock price data
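
The indicator and label steps above can be sketched as follows. This is a minimal illustration, assuming a pandas DataFrame with `Close` and `Volume` columns. The function names mirror the notebook's, but the bodies, the 14-day RSI window, the 20-day Bollinger window, and the 5-day label horizon are assumptions and may differ from what main.ipynb actually does.

```python
import numpy as np
import pandas as pd


def calculate_technical_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Add some of the listed indicators (window lengths are assumed)."""
    out = df.copy()
    close = out["Close"]

    # Simple moving averages over 5, 10, and 20 trading days.
    for window in (5, 10, 20):
        out[f"SMA_{window}"] = close.rolling(window).mean()

    # RSI from average gains and losses (simple-average variant, 14 days).
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(14).mean()
    avg_loss = delta.clip(upper=0).abs().rolling(14).mean()
    out["RSI"] = 100 - 100 / (1 + avg_gain / avg_loss)

    # Bollinger Bands: 20-day mean plus/minus two standard deviations.
    std_20 = close.rolling(20).std()
    out["BB_upper"] = out["SMA_20"] + 2 * std_20
    out["BB_lower"] = out["SMA_20"] - 2 * std_20
    out["BB_width"] = out["BB_upper"] - out["BB_lower"]
    out["BB_position"] = (close - out["BB_lower"]) / out["BB_width"]

    # Volume relative to its own 20-day average.
    out["Volume_SMA"] = out["Volume"].rolling(20).mean()
    out["Volume_ratio"] = out["Volume"] / out["Volume_SMA"]
    return out


def create_buy_sell_labels(df: pd.DataFrame, horizon: int = 5,
                           threshold: float = 0.02) -> pd.DataFrame:
    """Label each row Hold (0), Buy (1), or Sell (2) from the future return."""
    out = df.copy()
    # Return from today's close to the close `horizon` rows ahead.
    future_return = out["Close"].shift(-horizon) / out["Close"] - 1
    out["label"] = np.select(
        [future_return > threshold, future_return < -threshold],
        [1, 2],
        default=0,
    )
    # The last `horizon` rows have no future price, so drop them.
    return out.iloc[:-horizon]
```

Because the labels look `horizon` rows into the future, the trailing rows are dropped rather than labeled Hold, which would otherwise add spurious Hold examples at the end of the series.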

3. Model Architecture

The notebook supports two LSTM architectures:

  • Standard LSTM: Multi-layer LSTM with dropout and batch normalization
  • Attention LSTM: LSTM with attention mechanism for better sequence modeling

Default architecture:

  • Input: 60-day lookback window with 32 features
  • LSTM layers: [128, 64] units
  • Dense layers: [32] units
  • Output: 3 classes (Hold/Buy/Sell) using softmax activation
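
For orientation, the default architecture could be assembled in Keras roughly as below. This is a sketch, not the notebook's `create_lstm_model()`: the function name `build_lstm_classifier`, the layer ordering, the ReLU activations, the Adam optimizer, and the sparse categorical cross-entropy loss are all assumptions.

```python
import keras
from keras import layers


def build_lstm_classifier(
    input_shape=(60, 32),   # 60 timesteps, 32 features
    lstm_units=(128, 64),
    dense_units=(32,),
    dropout_rate=0.2,
    num_outputs=3,          # Hold / Buy / Sell
):
    """Hypothetical stacked-LSTM classifier matching the listed defaults."""
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for i, units in enumerate(lstm_units):
        # Every LSTM layer except the last passes the full sequence onward.
        is_last = i == len(lstm_units) - 1
        x = layers.LSTM(units, return_sequences=not is_last)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(dropout_rate)(x)
    for units in dense_units:
        x = layers.Dense(units, activation="relu")(x)
        x = layers.Dropout(dropout_rate)(x)
    # Softmax yields one probability per class.
    outputs = layers.Dense(num_outputs, activation="softmax")(x)

    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # integer labels 0/1/2
        metrics=["accuracy"],
    )
    return model
```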

4. Training

  • Data Split: Chronological split (70% train, 15% validation, 15% test)
  • Callbacks:
    • Early stopping (patience=10)
    • Model checkpointing
    • Learning rate reduction on plateau
  • MLflow Tracking: Automatic logging of hyperparameters, metrics, and model artifacts
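
A chronological split keeps the rows in time order so that validation and test data always come after the training data. A minimal sketch, assuming `X` and `y` are already sorted by date (the helper name `chronological_split` is hypothetical):

```python
import numpy as np


def chronological_split(X, y, train_frac=0.70, val_frac=0.15):
    """Split time-ordered arrays into train/validation/test without shuffling."""
    n = len(X)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return (
        (X[:train_end], y[:train_end]),                # oldest 70%
        (X[train_end:val_end], y[train_end:val_end]),  # next 15%
        (X[val_end:], y[val_end:]),                    # most recent 15%
    )
```

Shuffling before splitting would let windows from the future leak into training, which tends to inflate validation scores for time series.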

5. Evaluation

  • Metrics: Accuracy, Precision, Recall, F1-score, ROC-AUC
  • Visualizations:
    • Confusion matrix
    • ROC curves (multi-class one-vs-rest)
    • Training history (loss and accuracy)
  • Backtesting: Trading strategy evaluation with profit/loss calculation
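
The listed metrics can be computed from predicted class probabilities with scikit-learn. The sketch below is not the notebook's `evaluate_model_tf()`; the helper name `summarize_predictions` and the choice of macro averaging are assumptions.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)


def summarize_predictions(y_true, y_proba):
    """y_true: integer labels (0/1/2); y_proba: array of shape (n, 3)."""
    y_pred = np.argmax(y_proba, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # One-vs-rest ROC-AUC needs every class to appear in y_true.
        "roc_auc_ovr": roc_auc_score(y_true, y_proba, multi_class="ovr"),
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1, 2]),
    }
```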

6. Usage Example

# In main.ipynb, run cells sequentially:

# 1. Setup (Cell 1)
# 2. Load functions (Cells 2-5)
# 3. Collect data (Cells 6-12)
# 4. Train model (Cell 40)
# 5. Visualize results (Cell 41)
# 6. Backtest strategy (Cells 42-47)

📈 MLflow Experiment Tracking

The notebook automatically logs experiments to MLflow:

import mlflow

# Experiments are logged in Cell 40 during training
mlflow.set_experiment('stock-prediction-lstm')

with mlflow.start_run():
    # Hyperparameters, metrics, and model are automatically logged
    history = train_model_with_callbacks(...)

View results:

mlflow ui
# Then visit http://localhost:5000

🧪 Key Functions

Sentiment Analysis

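# calculate_sentiment_finbert is defined in main.ipynb; run its defining cell first
# (a notebook cannot be imported with a plain import statement).

sentiment = calculate_sentiment_finbert("Apple reports strong earnings")
# Example return value: {'positive': 0.85, 'negative': 0.10, 'neutral': 0.05, 'score': 0.75}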
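
The example return value is consistent with a composite score of positive minus negative (0.85 - 0.10 = 0.75). Assuming that is how `score` is defined, which the README does not confirm, per-item scores could be rolled up to one value per day as sketched here. The helper names `composite_score` and `daily_sentiment` and the output column names are hypothetical.

```python
import pandas as pd


def composite_score(probs: dict) -> float:
    """Assumed definition: positive probability minus negative probability."""
    return probs["positive"] - probs["negative"]


def daily_sentiment(records: list) -> pd.DataFrame:
    """Average the composite score per calendar day.

    Each record is a dict with a 'date' plus FinBERT-style
    'positive' and 'negative' probabilities.
    """
    df = pd.DataFrame(records)
    df["score"] = df["positive"] - df["negative"]
    # Drop the time of day so all items from one day share a key.
    df["date"] = pd.to_datetime(df["date"]).dt.normalize()
    daily = df.groupby("date")["score"].agg(["mean", "count"])
    return daily.rename(columns={"mean": "sentiment_mean", "count": "n_items"})
```

The resulting frame is indexed by date, so it can be joined onto a daily price table; days without any text items would need a fill strategy of their own.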

Model Creation

# create_lstm_model is defined in main.ipynb; run its defining cell first.

model = create_lstm_model(
    input_shape=(60, 32),  # 60 timesteps, 32 features
    lstm_units=[128, 64],
    dense_units=[32],
    dropout_rate=0.2,
    num_outputs=3  # Hold/Buy/Sell
)

Feature Preparation

# prepare_features_for_lstm is defined in main.ipynb; run its defining cell first.

X, y, scaler, dates = prepare_features_for_lstm(
    df_labeled,
    feature_columns=feature_cols,
    lookback=60
)
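
To illustrate what a 60-step lookback means, the sketch below turns a 2-D feature matrix into overlapping windows of shape (samples, lookback, features). It is a simplified stand-in for `prepare_features_for_lstm`, not its actual body: the name `make_sequences` is hypothetical, and scaling and date tracking are left out.

```python
import numpy as np


def make_sequences(features, labels, lookback=60):
    """Build sliding windows; each window is paired with its last row's label."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    if len(features) < lookback:
        raise ValueError("need at least `lookback` rows to build one window")

    X, y = [], []
    for end in range(lookback, len(features) + 1):
        X.append(features[end - lookback:end])  # rows end-lookback .. end-1
        y.append(labels[end - 1])               # label of the window's final day
    return np.stack(X), np.array(y)
```

With 100 rows and a lookback of 60 this yields 41 windows, since the first 59 rows cannot end a full window.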

📋 Features

  • Three-Class Classification: Hold (0), Buy (1), Sell (2) predictions
  • Multi-Source Sentiment: Combines sentiment from SEC filings, news articles, and social media
  • Technical Indicators: Comprehensive set of 20+ technical indicators
  • LSTM with Attention: Optional attention mechanism for improved sequence modeling
  • MLflow Integration: Automatic experiment tracking and model versioning
  • Backtesting: Evaluate trading strategies on historical data
  • Visualization: ROC curves, confusion matrices, and training history plots

🧭 Future Improvements

  • Integrate Reinforcement Learning for trading strategy optimization
  • Add Transformer-based architectures for sequence modeling
  • Real-time streaming data ingestion
  • Automated hyperparameter tuning with Optuna or KerasTuner
  • Model explainability with SHAP or LIME
  • Integration with FRED API for macroeconomic indicators
  • Additional social media sources (Reddit, Twitter)

πŸ›‘οΈ Disclaimer

This project is for educational and research purposes only. It is not financial advice. Investing in financial markets involves risk.
