This project evaluates the performance of various machine learning models on a dataset using different sampling techniques. The objective is to understand how different sampling methods impact the accuracy of machine learning models and to identify the best-performing combination.
The dataset used for this project is [Credit Card Data]. It is highly imbalanced, so it is first converted into a class-balanced dataset using sampling techniques.
The following sampling techniques were used:
- Simple Random Sampling: Selects a subset of individuals randomly from the larger dataset.
- Stratified Sampling: Divides the population into homogeneous subgroups before sampling.
- Cluster Sampling: Divides the population into clusters and randomly selects entire clusters.
- Bootstrap Sampling: Selects a subset of individuals randomly with replacement from a larger dataset, allowing the same individual to be selected multiple times.
- Systematic Sampling: Selects samples based on a fixed periodic interval.
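The five techniques above can be sketched with pandas and NumPy alone. This is a minimal illustration, not the project's actual implementation: the function names, the `Class` label column, the toy DataFrame, and the modulo-based cluster assignment are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def simple_random_sample(df, n, seed=0):
    # Draw n rows uniformly at random, without replacement.
    return df.sample(n=n, random_state=seed)


def stratified_sample(df, label_col, frac, seed=0):
    # Sample the same fraction from each class so class proportions are kept.
    return df.groupby(label_col, group_keys=False).sample(frac=frac, random_state=seed)


def cluster_sample(df, n_clusters, n_selected, seed=0):
    # Assumption: rows are assigned to clusters by position (index modulo n_clusters).
    rng = np.random.default_rng(seed)
    cluster_ids = np.arange(len(df)) % n_clusters
    chosen = rng.choice(n_clusters, size=n_selected, replace=False)
    # Keep every row that belongs to one of the chosen clusters.
    return df[np.isin(cluster_ids, chosen)]


def bootstrap_sample(df, n, seed=0):
    # Draw n rows with replacement, so a row may appear more than once.
    return df.sample(n=n, replace=True, random_state=seed)


def systematic_sample(df, step, start=0):
    # Take every step-th row, beginning at position start.
    return df.iloc[start::step]


# Toy data standing in for the real dataset (hypothetical column names).
df = pd.DataFrame({"x": range(100), "Class": [0] * 50 + [1] * 50})
print(len(simple_random_sample(df, 10)))
print(stratified_sample(df, "Class", 0.2)["Class"].value_counts())
```

Each helper returns a new DataFrame, so the same balanced input can be passed through all five and the resulting samples compared side by side.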
The following machine learning models were evaluated:
- Random Forest
- Logistic Regression
- Support Vector Machine (SVM)
- K-Nearest Neighbors (KNN)
- Decision Tree
This project is implemented in Python, using libraries such as Pandas, NumPy, scikit-learn, and imbalanced-learn. Ensure you have these installed:

    pip install pandas numpy scikit-learn imbalanced-learn

- Clone the repository or download the project to your local machine.
- Place your dataset in the root directory or modify the dataset path in the script.
- Run the script using Python:

      python sampling.py

The workflow consists of the following steps:

- Data Preprocessing:
  - Handle missing values, if any.
  - Normalize or standardize the data for models like SVM and KNN.
  - Encode categorical features as required.
- Balancing the Dataset:
  - Apply the sampling techniques to create balanced datasets.
- Model Training and Evaluation:
  - Split the dataset into training and testing sets.
  - Train models using each sampling technique.
  - Evaluate the models using accuracy as the performance metric.
- Result Analysis:
  - Compare the accuracy of models across different sampling techniques.
  - Identify the best combination of model and sampling method.
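
The training and evaluation steps might look roughly like the sketch below. It assumes scikit-learn is installed and uses synthetic data from `make_classification` in place of the real dataset; the helper names `build_models` and `evaluate`, the hyperparameters, the 80/20 split, and the two stand-in samples are illustrative assumptions rather than what `sampling.py` necessarily does.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(seed=0):
    # Scale-sensitive models are wrapped in a pipeline with StandardScaler.
    return {
        "Random Forest": RandomForestClassifier(n_estimators=50, random_state=seed),
        "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "SVM": make_pipeline(StandardScaler(), SVC()),
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
    }


def evaluate(samples, seed=0):
    # samples maps a sampling-technique name to an (X, y) pair.
    rows = []
    for sampling_name, (X, y) in samples.items():
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y
        )
        for model_name, model in build_models(seed).items():
            model.fit(X_train, y_train)
            accuracy = accuracy_score(y_test, model.predict(X_test))
            rows.append({"model": model_name, "sampling": sampling_name, "accuracy": accuracy})
    return pd.DataFrame(rows)


# Synthetic stand-in for the balanced dataset (not the real data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
samples = {
    "Simple Random": (X[:200], y[:200]),  # placeholder for a random sample
    "Systematic": (X[::2], y[::2]),       # every second row
}
results = evaluate(samples)
print(results)
```

Returning the scores in long format (one row per model and sampling technique) keeps the later comparison step simple, since the table can be reshaped or sorted as needed.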
Results are saved as pivot tables in the Results folder. In all__results, each row represents a machine learning model and each column a sampling technique. In best_results, the best combination of model and sampling method is shown.
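
As a rough illustration of how such tables could be produced with pandas, the sketch below pivots a long-format score table and picks the top-scoring row. The accuracy values are placeholders invented for the example, not results from this project, and the column names `model`, `sampling`, and `accuracy` are assumptions.

```python
import pandas as pd

# Placeholder scores for illustration only; these are NOT project results.
results = pd.DataFrame(
    {
        "model": ["Random Forest", "Random Forest", "KNN", "KNN"],
        "sampling": ["Bootstrap", "Systematic", "Bootstrap", "Systematic"],
        "accuracy": [0.10, 0.20, 0.30, 0.40],
    }
)

# One row per model, one column per sampling technique.
all_results = results.pivot(index="model", columns="sampling", values="accuracy")

# The single best (model, sampling) combination by accuracy.
best_results = results.loc[[results["accuracy"].idxmax()]]

print(all_results)
print(best_results)
```

Both objects are ordinary DataFrames, so they could then be written to the Results folder with a method such as `to_csv`.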