In this project, I use an Amazon Prime Video dataset with more than 4,000 rows and 16 features to build a model that predicts how well a movie will perform on the platform (cvt_per_day).
To arrive at the best prediction model, I carried out the following steps:
- Data Cleaning: Identify the missing values, replace them with NaN, and fill them with the column means.
- Feature Understanding & Feature Engineering: Convert the categorical features to dummy variables. For the "Genres" feature in particular, I combined the six least frequent genres into a single category for model training.
- Data Visualization: Use log-scale plots, because the values vary widely, and a heatmap to visualize the dataset.
- Model Training: Train linear models (Lasso and Ridge, each in linear and polynomial form) and a non-linear model (Random Forest), then fine-tune the hyperparameters to get the best results.
- Model Evaluation: Compare the five models using evaluation metrics such as R² score, MSE, and RMSE to decide on the best model.
- Feature Importance: After identifying the best model (Random Forest), rank the features by how much they contribute to its predictions.
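
The cleaning, genre-grouping, and evaluation steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function names (`fill_with_mean`, `group_rare_genres`, `to_dummies`, `regression_metrics`) are hypothetical, the toy values are invented, and a real run would more likely rely on libraries such as pandas and scikit-learn.

```python
from collections import Counter
from math import sqrt


def fill_with_mean(values):
    """Replace None (missing) entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def group_rare_genres(rows, n_rare=6, other_label="Other"):
    """Relabel the n_rare least frequent genres as one category.

    rows is a list of genre lists, one list per title.
    """
    counts = Counter(g for genres in rows for g in genres)
    # Sort by ascending frequency (ties broken alphabetically), keep the rarest.
    by_rarity = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    rare = {g for g, _ in by_rarity[:n_rare]}
    return [sorted({other_label if g in rare else g for g in genres})
            for genres in rows]


def to_dummies(rows):
    """One-hot encode multi-label genre lists into 0/1 columns."""
    columns = sorted({g for genres in rows for g in genres})
    return columns, [[int(c in genres) for c in columns] for genres in rows]


def regression_metrics(y_true, y_pred):
    """Return (R^2, MSE, RMSE); assumes y_true is not constant."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return 1 - ss_res / ss_tot, mse, sqrt(mse)


# Toy data, invented for illustration only.
budget = fill_with_mean([10.0, None, 30.0, 20.0])

genres = [["Drama", "Comedy"], ["Drama"], ["Comedy", "Anime"], ["Drama", "Western"]]
grouped = group_rare_genres(genres, n_rare=2)  # the project groups six
columns, dummies = to_dummies(grouped)

r2, mse, rmse = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 2.0])
print(budget, columns, dummies, round(r2, 3), mse, rmse)
```

Grouping the rare genres before encoding keeps the number of dummy columns small, so the models are not asked to learn from categories that appear in only a handful of rows.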