This project aims to predict buy, sell, or hold signals for a given stock using a combination of time series forecasting, financial indicators, and sentiment analysis from both structured and unstructured data sources.
It uses:
- TensorFlow and Keras for deep learning (LSTM models with attention mechanisms)
- MLflow for experiment tracking and model versioning
- scikit-learn, pandas, NumPy for data processing and time series analysis
- Yahoo Finance for historical stock price data
- SEC EDGAR for financial filings (10-K, 10-Q) sentiment analysis
- NewsAPI for financial news sentiment
- Bluesky for social media sentiment
- FinBERT (HuggingFace Transformers) for financial sentiment analysis
The model combines:
- Financial Time Series Analysis: Historical stock price data with technical indicators (RSI, Bollinger Bands, moving averages)
- Textual Sentiment Analysis: Extracted from SEC filings, news articles, and social media using FinBERT
- Feature Fusion and Deep Learning: LSTM-based model with optional attention mechanism to output a three-class classification: Hold (0), Buy (1), or Sell (2)
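As a rough illustration of the fusion step, the sketch below averages per-document sentiment scores by day and left-joins them onto daily price rows. The column names (`date`, `score`, `Close`, `sentiment_mean`), the helper name `fuse_price_and_sentiment`, and the choice to treat days without any text as neutral (0.0) are assumptions made for this example; the notebook may handle these details differently.

```python
import pandas as pd


def fuse_price_and_sentiment(prices: pd.DataFrame, docs: pd.DataFrame) -> pd.DataFrame:
    """Attach a daily mean sentiment score to each price row.

    Assumes `prices` has a `date` column and `docs` has `date` and `score`
    columns (hypothetical names chosen for this sketch).
    """
    # Collapse all documents published on the same day into one mean score.
    daily = (
        docs.groupby("date", as_index=False)["score"]
        .mean()
        .rename(columns={"score": "sentiment_mean"})
    )
    # Left join keeps every trading day, even those with no documents.
    fused = prices.merge(daily, on="date", how="left")
    # Assumption: a day without text is treated as neutral sentiment.
    fused["sentiment_mean"] = fused["sentiment_mean"].fillna(0.0)
    return fused


# Made-up sample values, for illustration only.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
    "Close": [185.0, 184.0, 182.0],
})
docs = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-02", "2024-01-04"]),
    "score": [0.5, 0.25, -0.5],
})
fused = fuse_price_and_sentiment(prices, docs)
print(fused)
```

A left join is used so that the price series stays complete; an inner join would silently drop trading days that have no news, filings, or posts.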
| Source | Description |
|---|---|
| Yahoo Finance | Historical stock prices (Open, High, Low, Close, Volume) |
| SEC EDGAR | 10-K and 10-Q filings with FinBERT sentiment analysis |
| NewsAPI | Financial news articles with sentiment scores |
| Bluesky | Social media posts with sentiment analysis |
| Category | Tools |
|---|---|
| Programming Language | Python 3.11+ |
| Deep Learning | TensorFlow 2.20+, Keras 3.12+ |
| Experiment Tracking | MLflow |
| Data Processing | Pandas, NumPy, scikit-learn |
| Stock Data | yfinance |
| Text Analysis | HuggingFace Transformers (FinBERT), NLTK |
| APIs | NewsAPI, sec-edgar-downloader |
| Visualization | Matplotlib, Seaborn |
```
Final Project/
│
├── main.ipynb                  # Main Jupyter notebook with complete pipeline
├── combined_data_aapl.csv      # Processed combined dataset
├── bsky_posts.csv              # Processed Bluesky posts
│
├── models/                     # Saved model files
│   └── lstm_stock_prediction.h5
│
├── sec-edgar-filings/          # Downloaded SEC filings
│
├── mlruns/                     # MLflow experiment logs
│
├── Readme.md
└── requirements.txt
```
The notebook is organized into the following sections:
- Setup & Imports
  - TensorFlow/Keras configuration
  - Library imports and logging setup
- Sentiment Analysis
  - FinBERT model loading and sentiment calculation functions
- Data Collection
  - `fetch_stock_data_yfinance()`: Yahoo Finance stock data
  - `fetch_sec_filings()`: SEC EDGAR filings with sentiment
  - `fetch_news_articles()`: NewsAPI articles with sentiment
- Feature Engineering
  - `create_buy_sell_labels()`: Three-class label creation (Hold/Buy/Sell)
  - `calculate_technical_indicators()`: RSI, Bollinger Bands, moving averages
  - `prepare_features_for_lstm()`: Sequence preparation for LSTM
  - `aggregate_sentiment_by_date()`: Daily sentiment aggregation
- Data Collection Examples
  - SEC filings processing pipeline
  - News article collection examples
  - Bluesky sentiment integration
- Model Architecture
  - `create_lstm_model()`: Standard LSTM model builder
  - `create_attention_lstm_model()`: LSTM with attention mechanism
  - `train_model_with_callbacks()`: Training with early stopping, checkpointing
  - `evaluate_model_tf()`: Comprehensive model evaluation
- Training Pipeline
  - Complete end-to-end training workflow
  - MLflow experiment tracking
  - Train/validation/test split
- Visualization
  - ROC curves and confusion matrices
  - Training history plots
- Backtesting
  - Trading strategy evaluation
  - Profit/loss calculation
  - Performance visualization
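To make the backtesting step more concrete, here is a minimal sketch of a profit/loss calculation driven by Hold/Buy/Sell signals. It assumes an all-in, long-only strategy with no transaction costs, trades executed at the same day's price, and a hypothetical helper name `backtest_long_only`; the notebook's actual strategy rules are not specified here and may differ.

```python
import numpy as np

HOLD, BUY, SELL = 0, 1, 2  # class encoding used by the model


def backtest_long_only(prices, signals, initial_cash=10_000.0):
    """Simulate an all-in/all-out long-only strategy.

    Returns the portfolio value (equity) after each day.
    Assumptions: no fees, no slippage, fractional shares allowed.
    """
    prices = np.asarray(prices, dtype=float)
    signals = np.asarray(signals, dtype=int)
    cash, shares = initial_cash, 0.0
    equity = np.empty(len(prices))
    for i, (price, signal) in enumerate(zip(prices, signals)):
        if signal == BUY and shares == 0.0:
            # Enter: spend all cash on shares.
            shares, cash = cash / price, 0.0
        elif signal == SELL and shares > 0.0:
            # Exit: convert the whole position back to cash.
            cash, shares = shares * price, 0.0
        # HOLD (or a redundant BUY/SELL) leaves the position unchanged.
        equity[i] = cash + shares * price
    return equity


# Made-up prices and signals, for illustration only.
prices = [100.0, 110.0, 121.0, 100.0]
signals = [BUY, HOLD, SELL, HOLD]
equity = backtest_long_only(prices, signals)
profit = equity[-1] - 10_000.0
print(equity, profit)
```

In this toy run the position is opened at 100 and closed at 121, so the later drop back to 100 does not affect the final equity.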
```bash
# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set up environment variables (optional)
# Set NEWS_API_KEY if using news articles
# Set BLUESKY_HANDLE and BLUESKY_PASSWORD if using Bluesky posts
```

- `NEWS_API_KEY`: API key for NewsAPI (from newsapi.org), used to fetch financial news articles.
- `BLUESKY_HANDLE`: Handle of the Bluesky account (from bsky.app) used to fetch posts.
- `BLUESKY_PASSWORD`: Password for that Bluesky account.
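Because these variables are optional, code that consumes them has to cope with their absence. The sketch below shows one way to read them and decide which sentiment sources are usable; the helper names `load_credentials` and `enabled_sources` are hypothetical and not taken from the notebook.

```python
import os


def load_credentials(env=os.environ):
    """Return the optional credentials; missing values come back as None."""
    return {
        "news_api_key": env.get("NEWS_API_KEY"),
        "bluesky_handle": env.get("BLUESKY_HANDLE"),
        "bluesky_password": env.get("BLUESKY_PASSWORD"),
    }


def enabled_sources(creds):
    """List the optional sentiment sources that have complete credentials."""
    sources = []
    if creds["news_api_key"]:
        sources.append("news")
    # Bluesky needs both the handle and the password.
    if creds["bluesky_handle"] and creds["bluesky_password"]:
        sources.append("bluesky")
    return sources


creds = load_credentials()
print("Optional sources available:", enabled_sources(creds))
```

Passing the environment as a parameter keeps the helpers easy to test with a plain dictionary.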
The notebook provides functions to collect data from multiple sources:
- Stock Prices: `fetch_stock_data_yfinance(symbol, start_date, end_date)`
  - Fetches historical OHLCV data from Yahoo Finance
- SEC Filings: `fetch_sec_filings(symbol, start_date, end_date)`
  - Downloads 10-K and 10-Q filings from SEC EDGAR
  - Extracts text and calculates FinBERT sentiment scores
- News Articles: `fetch_news_articles(symbol, start_date, end_date)`
  - Fetches financial news from NewsAPI
  - Calculates sentiment for each article
- Technical Indicators:
  - Moving Averages (SMA_5, SMA_10, SMA_20)
  - RSI (Relative Strength Index)
  - Bollinger Bands (upper, lower, width, position)
  - Volume indicators (Volume_SMA, Volume_ratio)
  - Price change indicators
- Label Creation: `create_buy_sell_labels()`
  - Creates three-class labels based on future price movements
  - Hold (0): Future price change within the threshold (-2% to +2%)
  - Buy (1): Future price change above +2%
  - Sell (2): Future price change below -2%
- Sentiment Aggregation:
  - Aggregates daily sentiment scores from news, SEC filings, and social media
  - Merges sentiment features with stock price data
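The indicator and label definitions above can be sketched in a few lines of pandas. This is an illustrative version, not the notebook's `calculate_technical_indicators()` or `create_buy_sell_labels()`: the function names, the `Close` column, the 20-day Bollinger window with 2 standard deviations, the 14-day RSI based on simple moving averages, and the one-day label horizon are all assumptions (the look-ahead horizon in particular is not stated above).

```python
import numpy as np
import pandas as pd

HOLD, BUY, SELL = 0, 1, 2


def add_indicators(df: pd.DataFrame, window: int = 20, rsi_period: int = 14) -> pd.DataFrame:
    """Add SMA, Bollinger Bands, and RSI columns computed from `Close`."""
    out = df.copy()
    close = out["Close"]

    # Simple moving average and Bollinger Bands (mean +/- 2 rolling std).
    sma = close.rolling(window).mean()
    std = close.rolling(window).std()
    out[f"SMA_{window}"] = sma
    out["BB_upper"] = sma + 2 * std
    out["BB_lower"] = sma - 2 * std

    # RSI from simple averages of gains and losses (not Wilder's smoothing).
    # 100 * g / (g + l) is algebraically the same as 100 - 100 / (1 + g / l).
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(rsi_period).mean()
    avg_loss = (-delta).clip(lower=0).rolling(rsi_period).mean()
    out["RSI"] = 100 * avg_gain / (avg_gain + avg_loss)
    return out


def make_labels(close: pd.Series, horizon: int = 1, threshold: float = 0.02) -> pd.Series:
    """Label each day by the return `horizon` days ahead (assumed horizon)."""
    future_return = close.shift(-horizon) / close - 1.0
    labels = np.select(
        [future_return > threshold, future_return < -threshold],
        [BUY, SELL],
        default=HOLD,
    )
    out = pd.Series(labels, index=close.index, name="label")
    # The last `horizon` rows have no future price, so they are dropped.
    return out.iloc[:-horizon] if horizon > 0 else out


# Synthetic random-walk prices, for illustration only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Close": 100 + rng.normal(0, 1, size=120).cumsum()})
demo = add_indicators(demo)
demo_labels = make_labels(demo["Close"])
print(demo.tail(3))
print(demo_labels.value_counts())
```

The first rows of each indicator are `NaN` until a full rolling window is available, so they would need to be dropped or filled before building model inputs.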
The notebook supports two LSTM architectures:
- Standard LSTM: Multi-layer LSTM with dropout and batch normalization
- Attention LSTM: LSTM with attention mechanism for better sequence modeling
Default architecture:
- Input: 60-day lookback window with 32 features
- LSTM layers: [128, 64] units
- Dense layers: [32] units
- Output: 3 classes (Hold/Buy/Sell) using softmax activation
- Data Split: Chronological split (70% train, 15% validation, 15% test)
- Callbacks:
- Early stopping (patience=10)
- Model checkpointing
- Learning rate reduction on plateau
- MLflow Tracking: Automatic logging of hyperparameters, metrics, and model artifacts
- Metrics: Accuracy, Precision, Recall, F1-score, ROC-AUC
- Visualizations:
- Confusion matrix
- ROC curves (multi-class one-vs-rest)
- Training history (loss and accuracy)
- Backtesting: Trading strategy evaluation with profit/loss calculation
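The 60-day lookback window and the chronological 70/15/15 split listed above can be sketched with NumPy alone. The helper names `make_sequences` and `chronological_split` are hypothetical, the data is random, and labeling each window with the label of its final day is an assumption; `prepare_features_for_lstm()` in the notebook may align and scale things differently.

```python
import numpy as np


def make_sequences(features, labels, lookback=60):
    """Build windows of shape (samples, lookback, n_features).

    Assumption: each window is paired with the label of its last day.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    n_windows = len(features) - lookback + 1
    X = np.stack([features[i : i + lookback] for i in range(n_windows)])
    y = labels[lookback - 1 :]
    return X, y


def chronological_split(X, y, train_frac=0.70, val_frac=0.15):
    """Split in time order without shuffling; the remainder is the test set."""
    n = len(X)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return (
        (X[:train_end], y[:train_end]),
        (X[train_end:val_end], y[train_end:val_end]),
        (X[val_end:], y[val_end:]),
    )


# Random stand-in data: 259 days, 32 features, 3 classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(259, 32))
labels = rng.integers(0, 3, size=259)

X, y = make_sequences(features, labels, lookback=60)
train, val, test = chronological_split(X, y)
print(X.shape, len(train[0]), len(val[0]), len(test[0]))
```

Keeping the split chronological means the validation and test windows always come later in time than the training windows, which avoids evaluating on data that precedes what the model was trained on.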
```python
# In main.ipynb, run cells sequentially:
# 1. Setup (Cell 1)
# 2. Load functions (Cells 2-5)
# 3. Collect data (Cells 6-12)
# 4. Train model (Cell 40)
# 5. Visualize results (Cell 41)
# 6. Backtest strategy (Cells 42-47)
```

The notebook automatically logs experiments to MLflow:
```python
import mlflow

# Experiments are logged in Cell 40 during training
mlflow.set_experiment('stock-prediction-lstm')
with mlflow.start_run():
    # Hyperparameters, metrics, and model are automatically logged
    history = train_model_with_callbacks(...)
```

View results:
```bash
mlflow ui
# Then visit http://localhost:5000
```

```python
# Requires the cell defining calculate_sentiment_finbert in main.ipynb to have been run
sentiment = calculate_sentiment_finbert("Apple reports strong earnings")
# Example return value: {'positive': 0.85, 'negative': 0.10, 'neutral': 0.05, 'score': 0.75}
```

```python
# Requires the cell defining create_lstm_model in main.ipynb to have been run
model = create_lstm_model(
    input_shape=(60, 32),  # 60 timesteps, 32 features
    lstm_units=[128, 64],
    dense_units=[32],
    dropout_rate=0.2,
    num_outputs=3  # Hold/Buy/Sell
)
```

```python
# Requires the cell defining prepare_features_for_lstm in main.ipynb to have been run
X, y, scaler, dates = prepare_features_for_lstm(
    df_labeled,
    feature_columns=feature_cols,
    lookback=60
)
```

- Three-Class Classification: Hold (0), Buy (1), Sell (2) predictions
- Multi-Source Sentiment: Combines sentiment from SEC filings, news articles, and social media
- Technical Indicators: Comprehensive set of 20+ technical indicators
- LSTM with Attention: Optional attention mechanism for improved sequence modeling
- MLflow Integration: Automatic experiment tracking and model versioning
- Backtesting: Evaluate trading strategies on historical data
- Visualization: ROC curves, confusion matrices, and training history plots
- Integrate Reinforcement Learning for trading strategy optimization
- Add Transformer-based architectures for sequence modeling
- Real-time streaming data ingestion
- Automated hyperparameter tuning with Optuna or KerasTuner
- Model explainability with SHAP or LIME
- Integration with FRED API for macroeconomic indicators
- Additional social media sources (Reddit, Twitter)
- TensorFlow Documentation
- Keras API Reference
- MLflow Documentation
- Yahoo Finance (yfinance)
- SEC EDGAR
- FinBERT Model
- NewsAPI
This project is for educational and research purposes only. It is not financial advice. Investing in financial markets involves risk.