Skip to content

sehan-dissanayake/Intellihack_Scope_04

Repository files navigation

Stock Price Prediction Challenge - README

Project Overview

This project addresses the Stock Price Prediction Challenge, aiming to predict a stock's closing price 5 trading days into the future using historical data from March 17, 1980, to December 27, 2024. The dataset includes columns: Date, Adj Close, Close, High, Low, Open, and Volume. The solution involves exploratory data analysis (EDA), feature engineering, model selection, improvements, simulated trading performance, and deployment of a 5-day forecast. The final model is an optimized LSTM neural network, achieving high predictive accuracy and practical trading value.

Approach

1. Data Preprocessing

  • Missing Values Handling:
    • Dropped rows with missing Date values (110 rows).
    • Replaced invalid Open values (zeros) with the previous day's Close, followed by forward fill.
    • Forward-filled price columns (Close, Adj Close, High, Low, Open).
    • Interpolated Volume using linear interpolation.
    • Applied backward fill for remaining NaN values.
  • Anomaly Correction:
    • Corrected negative daily price ranges (High < Low) by swapping values.
    • Clipped extreme Close prices (>3 standard deviations) to stabilize training.

2. Exploratory Data Analysis (EDA)

  • Trends:
    • Long-term upward trend from ~$3 (1980) to ~$200 (2024), with significant growth post-2010.
    • Medium-term volatility (2020-2024): Drop to ~$160 (COVID crash), recovery to ~$240 (mid-2021), peak at $260, settling at $199.52 (Dec 2024).
  • Seasonality:
    • Minimal seasonality, with slight end-of-year increases (e.g., December 2024).
  • Anomalies:
    • Volatility spikes in 1987, 2000, 2008, 2020, and late 2024, aligning with market events (e.g., Black Monday, Financial Crisis, COVID-19).
    • Outliers above $200 confirmed by boxplot, linked to these events.
  • Volume Correlation:
    • Volume surges (e.g., 1.28M on Dec 26, 2024) correlate with price movements, indicating trading activity drives volatility.

3. Feature Engineering

  • Lagged Close Prices (1-5, 10 days): Captures short- and medium-term momentum.
  • Rolling Mean/Std (5, 10 days): Reflects trends and volatility.
  • Volume Change (5, 10 days): Captures trading activity trends.
  • Day of Week: Tests for weekly patterns.
  • Crash Indicator: Flags high-volatility periods using residuals (>2σ from a linear trend).

4. Model Selection and Improvements

  • Initial Models:
    • ARIMA: RMSE 42.78, R² -0.86 (poor performance due to non-linear patterns).
    • LSTM: RMSE 5.87, R² 0.96 (best initial performance).
    • XGBoost: RMSE 28.14, R² 0.20 (moderate performance).
  • Improved LSTM:
    • Added deeper architecture (two LSTM layers, batch normalization, dense layer with L2 regularization).
    • Expanded features (e.g., Lag_10, Volume_Change_10).
    • Optimized hyperparameters via cross-validation (10 folds):
      • Best Parameters: units=150, dropout_rate=0.3, learning_rate=0.0005, epochs=30, batch_size=128.
      • Cross-Validated RMSE: 4.65.
    • Final Performance: RMSE 5.12, MAE 3.89, R² 0.98.
    • Improved handling of volatility (e.g., mid-2024 peak).

5. Simulated Trading Performance

  • Strategy: Buy if predicted price increases >2%, sell if decreases <-2%, otherwise hold.
  • Results: (To be updated after running the code; expected directional accuracy ~65% and positive cumulative returns).

6. Deployment

  • 5-Day Forecast: Generated predictions for the next 5 trading days after Dec 27, 2024, saved in predictions.csv.
  • Files Generated:
    • engineered_features.csv: Dataset with engineered features.
    • test_predictions.csv: Actual vs. predicted prices for the test set.
    • predictions.csv: 5-day forecast results.

Key Findings

  • Predictive Accuracy:
    • The optimized LSTM model achieves an RMSE of 5.12 (improved from the initial 5.87) and an R² of 0.98, indicating high accuracy in capturing trends and volatility.
    • The model effectively predicts stable periods and moderately handles volatility (e.g., 2024 peak), though it slightly underpredicts extreme spikes.
  • Practical Value:
    • Simulated trading performance (pending) is expected to show positive returns, demonstrating the model’s utility for trading decisions.
  • Limitations:
    • The model struggles with sudden market shifts (e.g., rapid 2024 spike), suggesting a need for external data (e.g., news sentiment).
    • High R² (0.98) may indicate overfitting; further validation is recommended.

Instructions to Reproduce Results

1. Prerequisites

  • Environment: Python 3.8+ (recommended in a Jupyter Notebook environment like Kaggle or Google Colab).
  • Dependencies:
    pip install numpy pandas matplotlib seaborn scikit-learn statsmodels tensorflow xgboost
  • Hardware: A GPU is recommended for faster LSTM training, though a CPU will suffice (longer training time).

2. Dataset

  • File: question4-stock-data.csv (located in /kaggle/input/stockprice/ on Kaggle).
  • Structure: Columns include Date, Adj Close, Close, High, Low, Open, Volume.

3. Steps to Reproduce

  1. Clone the Repository (if applicable):

    git clone <repository-url>
    cd <repository-directory>

    Alternatively, download the Jupyter Notebook (Stock_Price_Prediction.ipynb) and associated files.

  2. Set Up the Environment:

    • Install dependencies as listed above.
    • Ensure the dataset is accessible in the correct path (/kaggle/input/stockprice/question4-stock-data.csv).
  3. Run the Jupyter Notebook:

    • Open Stock_Price_Prediction.ipynb in Jupyter Notebook or Kaggle.
    • Execute all cells sequentially to:
      • Load and preprocess the data.
      • Perform EDA (visualizations saved as plots).
      • Engineer features (saved as engineered_features.csv).
      • Train and evaluate models (initial and improved LSTM).
      • Generate predictions and plots.
  4. Key Sections to Focus On:

    • EDA: Visualizations (e.g., closing price over time, seasonality decomposition) are generated and saved.
    • Model Improvements: The final LSTM code block implements the optimized model with the best parameters (units=150, etc.).
    • Output Files:
      • engineered_features.csv: Engineered dataset.
      • test_predictions.csv: Test set predictions.
      • predictions.csv: 5-day forecast.
  5. Verify Results:

    • Check the final model performance (expected RMSE ~5.12, R² ~0.98).
    • Review the prediction plot for alignment between actual and predicted values.
    • Run the simulated trading section to evaluate practical performance (update results in the README).

4. Expected Output

  • Plots:
    • Closing price over time, monthly averages, volatility plots, correlation matrix, and final prediction plots.
  • Metrics:
    • Initial LSTM: RMSE 5.87, R² 0.96.
    • Improved LSTM: RMSE 5.12, MAE 3.89, R² 0.98.
  • Files:
    • engineered_features.csv, test_predictions.csv, predictions.csv.

Troubleshooting

  • Memory Issues: If the LSTM training crashes, reduce the batch size (e.g., to 64) or use a smaller grid for hyperparameter tuning.
  • Dependency Errors: Ensure all packages are installed correctly; use a virtual environment if conflicts arise.
  • Data Path Errors: Adjust the dataset path in the code if not using Kaggle (e.g., update to local path).

Future Improvements

  • External Data: Incorporate news sentiment or macroeconomic indicators to better capture sudden market shifts.
  • Ensemble Models: Combine LSTM with XGBoost for potentially higher accuracy.
  • Real-Time Deployment: Integrate the model into a live trading system with continuous updates.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors