This project applies data mining and machine learning to predict Player Efficiency Rating (PER) using advanced NBA statistics. Built using Python, the model draws from 11 seasons of NBA data (2014–2024) and was designed with a focus on predictive accuracy, feature analysis, and real-world usability.
- Goal: Predict a player's PER based on 18 statistical features
- Method: Linear Regression (with cross-validation and model comparison)
- Dataset: 5,792 NBA player-seasons, filtered to 4,490 after preprocessing
- Outcome: Reliable model (R² = 0.9542) with a real-time prediction tool
The code is meant to be run in the following order:
1. `split.py` – Prepares the normalized dataset
2. `modelTraining.py` – Trains the Linear Regression model
3. `crossValidation.py` – Evaluates performance using 5-fold validation
4. `featureImportance.py` – Displays most influential stats
5. `modelTesting.py` – Tests the model on holdout data
6. `modelComparison.py` – Compares performance of multiple models
7. `predictiveTool.py` – Interactive prediction tool using player inputs
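As a rough illustration of the cross-validation step, the sketch below runs 5-fold cross-validation on a Linear Regression model with scikit-learn. It uses synthetic data in place of the project's dataset; the array shapes, random seed, and variable names are assumptions, and the actual `crossValidation.py` may be structured differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the normalized dataset (assumption: 18 features in [0, 1]).
rng = np.random.default_rng(0)
X = rng.random((200, 18))
true_weights = rng.normal(size=18)
y = X @ true_weights + rng.normal(scale=0.05, size=200)

# 5-fold cross-validation, shuffling rows before splitting them into folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R² per fold:", np.round(scores, 4))
print("Mean R²:", round(scores.mean(), 4))
```

Reporting the per-fold scores alongside the mean makes it easier to spot a model whose accuracy depends heavily on which rows land in the training split.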
- R²: 0.9542
- MAE: 0.0283
- RMSE: 0.0363
- Avg Difference: ±1.81 PER points (based on test players)
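For reference, the three regression metrics above can be computed as in the following minimal sketch. The function name and the sample values are made up for illustration and are not taken from the project's test set; scikit-learn's `r2_score` and `mean_absolute_error` compute the same R² and MAE.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R², MAE, RMSE) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse

# Made-up values for illustration only.
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.25, 0.35, 0.65, 0.75]
r2, mae, rmse = regression_metrics(y_true, y_pred)
print(f"R²={r2:.4f}  MAE={mae:.4f}  RMSE={rmse:.4f}")
```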
- Minutes Played (MP), Field Goals (FG, FGA), Free Throws (FT, FTA)
- Rebounds (TRB), Assists (AST), Points (PTS)
- Turnovers (PTOV), Fouls Drawn (SFD), Assists Generated (PGA), And1s
- Advanced metrics like TS%, USG%, WS, BPM, VORP, ORtg
All features are scaled using min-max normalization for model compatibility.
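Min-max normalization maps each feature to the range [0, 1] via `(x - min) / (max - min)`. The sketch below shows the idea in plain Python; the function name and the sample minutes values are hypothetical, and the project itself may instead rely on a library routine such as scikit-learn's `MinMaxScaler`, which applies the same formula per column.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical minutes-played values, for illustration only.
minutes = [500, 1200, 2000, 2800]
scaled = min_max_scale(minutes)
print(scaled)
```

The smallest value maps to 0 and the largest to 1, so features measured on very different scales (minutes versus percentages, for example) become directly comparable.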
`pandas`, `numpy`, `scikit-learn`, `matplotlib`, `seaborn`, `joblib`
- Download the transformed CSV dataset
- Copy its file path and paste it into `split.py`
- Run `split.py` and `modelTraining.py` first (this step saves the model)
- Then run `predictiveTool.py` to generate PER predictions from user input
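The save-then-predict flow in the steps above can be sketched with `joblib`, which is among the listed dependencies. This is only an illustration under stated assumptions: the data is synthetic, and the file name `per_model.joblib` and all variable names are hypothetical rather than taken from the project's scripts.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data (assumption: 18 normalized features per player-season).
rng = np.random.default_rng(0)
X_train = rng.random((100, 18))
y_train = X_train.sum(axis=1)

# Training step: fit the model and save it to disk.
model = LinearRegression().fit(X_train, y_train)
model_path = os.path.join(tempfile.mkdtemp(), "per_model.joblib")  # hypothetical file name
joblib.dump(model, model_path)

# Prediction step: load the saved model and predict from one row of inputs.
loaded = joblib.load(model_path)
player_features = rng.random((1, 18))
prediction = loaded.predict(player_features)[0]
print(f"Predicted value: {prediction:.4f}")
```

Saving the fitted model once means the prediction tool can load it directly instead of retraining on every run, which is why the training script has to be run first.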
The predictive tool includes a built-in test set of 28 players (1997–2024) to verify model accuracy.