xema1901/car_purchase_data_analysis
Overview

Applying the various algorithms and methods discussed in class to a car-sales dataset obtained from Kaggle, our team focused the project on building a predictive model for the target variable, covering everything from initial exploratory visualization and analysis through hyperparameter tuning to find the best model, using an object-oriented programming approach. The goal was to keep all of the logic inside the classes we created, so the pipeline is easy to replicate and extend.

The task is a classification problem: the target variable indicates whether a customer purchases a given commodity. Estimating the probability that a customer will buy is an everyday question for any business; with these estimates, a business can adjust its sales strategy to target the groups most likely to purchase its product.

Across the scope of the project, we wrote class methods for every step of the model-building process, using the features of object-oriented programming to improve execution times compared with more traditional approaches, such as the built-in functions of packages like pandas. Keeping all of our functions within the class, we implemented several visualization packages and techniques for exploratory data analysis, ran Monte Carlo simulations to estimate the probability of purchase for each customer segment (see the sketch below), compared the time complexity of different sorting techniques, and compared brute-force and randomized-search algorithms for finding the best combination of hyperparameters.

The dataset contains roughly 23,500 observations and 15 attributes, with the target indicating whether the customer purchased a car. Because the Kaggle dataset is only about a year old, there are few related projects; those that exist focus primarily on building predictive models ranging from XGBoost to Random Forest to logistic regression, with accuracies of roughly 55% to 70%.

Our initial exploratory visualization surfaced several insights about the categories of customers that are more likely to purchase a car. The Monte Carlo simulation produced the average probability of purchase per customer segment over 1,000 simulations. The time-complexity comparison showed that although the brute-force algorithm always finds the best parameter combination, the randomized-search algorithm we built produced results nearly as good while taking substantially less time (a comparison sketch follows the Monte Carlo example below). Finally, of the nearly 10 models we evaluated, XGBoost performed best.
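
The following is a minimal sketch of the two ideas above: keeping the pipeline logic inside a class, and estimating per-segment purchase probabilities with a Monte Carlo (bootstrap) loop. The class name, column names, and method signature are hypothetical placeholders, not the actual classes defined in this repository.

```python
import numpy as np
import pandas as pd

class CarPurchaseAnalysis:
    """Toy illustration of keeping the pipeline logic inside one class."""

    def __init__(self, data: pd.DataFrame, target_col: str = "Purchased"):
        self.data = data
        self.target_col = target_col

    def monte_carlo_purchase_probability(self, segment_col: str,
                                         n_simulations: int = 1_000,
                                         sample_size: int = 100,
                                         seed: int = 42) -> pd.Series:
        """Average purchase rate per segment over repeated bootstrap samples."""
        rng = np.random.default_rng(seed)
        results = {}
        for segment, group in self.data.groupby(segment_col):
            purchases = group[self.target_col].to_numpy()
            rates = []
            for _ in range(n_simulations):
                # Resample the segment with replacement and record the purchase rate.
                idx = rng.integers(0, len(purchases), size=min(sample_size, len(purchases)))
                rates.append(purchases[idx].mean())
            results[segment] = float(np.mean(rates))
        return pd.Series(results, name="purchase_probability")

# Toy data standing in for the Kaggle car-purchase dataset.
toy = pd.DataFrame({"Gender": ["Male", "Female"] * 50,
                    "Purchased": np.random.default_rng(0).integers(0, 2, 100)})
print(CarPurchaseAnalysis(toy).monte_carlo_purchase_probability("Gender"))
```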

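The sketch below illustrates the brute-force versus randomized-search comparison using scikit-learn's off-the-shelf GridSearchCV and RandomizedSearchCV rather than the hand-written search methods in this repository; the synthetic data, parameter grid, and timing code are assumptions for illustration only.

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier

# Synthetic stand-in with the same shape as the car-purchase data (15 features).
X, y = make_classification(n_samples=2_000, n_features=15, random_state=0)

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
}

model = XGBClassifier(eval_metric="logloss", random_state=0)

# Brute force: evaluate every combination (27 candidates per CV fold).
start = time.perf_counter()
grid = GridSearchCV(model, param_grid, cv=3, scoring="accuracy").fit(X, y)
grid_time = time.perf_counter() - start

# Randomized search: sample only a handful of combinations from the same grid.
start = time.perf_counter()
rand = RandomizedSearchCV(model, param_grid, n_iter=8, cv=3,
                          scoring="accuracy", random_state=0).fit(X, y)
rand_time = time.perf_counter() - start

print(f"Grid search:       {grid.best_score_:.3f} in {grid_time:.1f}s")
print(f"Randomized search: {rand.best_score_:.3f} in {rand_time:.1f}s")
```

The point of the comparison is the trade-off: randomized search evaluates a fraction of the candidates, so it finishes much sooner while typically landing close to the best score found by the exhaustive search.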