This project classifies users into different categories based on their interaction data, using several machine learning models. It covers data preprocessing, feature engineering, model training, and evaluation.

The dataset contains information about users' activity on a platform, such as the number of days on the platform, minutes watched, courses started, practice exams started and passed, and minutes spent on exams.
Data preprocessing is a crucial step in any machine learning project. In this project, the following preprocessing steps were performed (a code sketch follows the list):

- Importing the Data: The data was imported from a CSV file named `ml_datasource.csv`.
- Removing Outliers: Outliers were removed based on specific thresholds for `minutes_watched`, `courses_started`, `practice_exams_started`, and `minutes_spent_on_exams`.
- Checking for Multicollinearity: The Variance Inflation Factor (VIF) was calculated for each feature, and features with high VIF values were dropped to prevent multicollinearity.
- Dealing with NaN Values: Missing values in the `student_country` column were replaced with the string 'NAM'.
- Encoding Categorical Variables: The `student_country` column was encoded using an ordinal encoder.
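A minimal sketch of these steps, assuming the DataFrame is named `data` and using placeholder outlier thresholds (the project's actual cut-offs are not shown here):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Import the data
data = pd.read_csv('ml_datasource.csv')

# Remove outliers above the chosen thresholds (values below are placeholders)
data = data[(data['minutes_watched'] < 1000)
            & (data['courses_started'] < 10)
            & (data['practice_exams_started'] < 10)
            & (data['minutes_spent_on_exams'] < 40)]

# Replace missing country values with the string 'NAM'
data['student_country'] = data['student_country'].fillna('NAM')

# Compute a VIF per numeric feature; high-VIF features would then be dropped
numeric = data.select_dtypes(include='number')
vif = pd.Series(
    [variance_inflation_factor(numeric.values, i) for i in range(numeric.shape[1])],
    index=numeric.columns,
)
print(vif)

# Encode the categorical country column with an ordinal encoder
encoder = OrdinalEncoder()
data['student_country_enc'] = encoder.fit_transform(data[['student_country']]).ravel()
```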
Several machine learning models were trained and evaluated on the preprocessed data:
- Logistic Regression: A logistic regression model was trained using the `sm.Logit` function from the `statsmodels` library.
- K-Nearest Neighbors (KNN): A KNN model was trained using `KNeighborsClassifier` from the `sklearn` library. Grid search was used to find the optimal hyperparameters.
- Support Vector Machines (SVM): An SVM model was trained using `SVC` from the `sklearn` library, after scaling the features.
- Decision Trees: A decision tree model was trained using `DecisionTreeClassifier` from the `sklearn` library.
- Random Forests: A random forest model was trained using `RandomForestClassifier` from the `sklearn` library.

The core training code for the scikit-learn models:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Define the parameter grid for the grid search
parameters_knn = {'n_neighbors': range(1, 51), 'weights': ['uniform', 'distance']}

# Initialize a GridSearchCV object with a KNeighborsClassifier estimator
grid_search_knn = GridSearchCV(estimator=KNeighborsClassifier(),
                               param_grid=parameters_knn,
                               scoring='accuracy')

# Fit the grid search object on the training data
grid_search_knn.fit(x_train_array, y_train_array)

# Store the best estimator (model with optimal parameters) in knn_clf
knn_clf = grid_search_knn.best_estimator_

# Predict the target variable for the test dataset using the optimal model
y_test_pred_knn = knn_clf.predict(x_test_array)

# Create an instance of the MinMaxScaler
scaler = MinMaxScaler()

# Scale the training and testing data (the scaler is fit only on the training split)
x_train_scaled = scaler.fit_transform(x_train_array)
x_test_scaled = scaler.transform(x_test_array)

# Initialize and train the SVM model on the scaled data
svm_clf = SVC()
svm_clf.fit(x_train_scaled, y_train_array)

# Predict the target variable for the test dataset
y_test_pred_svm = svm_clf.predict(x_test_scaled)

# Initialize and train the decision tree model
dt_clf = DecisionTreeClassifier()
dt_clf.fit(x_train_array, y_train_array)

# Predict the target variable for the test dataset
y_test_pred_dt = dt_clf.predict(x_test_array)

# Initialize and train the random forest model
rf_clf = RandomForestClassifier()
rf_clf.fit(x_train_array, y_train_array)

# Predict the target variable for the test dataset
y_test_pred_rf = rf_clf.predict(x_test_array)
```
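The `statsmodels` logit model from the list above is not part of this snippet. The following is a minimal sketch of typical `sm.Logit` usage, assuming a binary target (for more than two categories, `sm.MNLogit` is the multinomial analogue) and reusing the array names from the code above:

```python
import statsmodels.api as sm

# Add an intercept column, which sm.Logit does not include by default
x_train_const = sm.add_constant(x_train_array)
x_test_const = sm.add_constant(x_test_array)

# Fit the logistic regression model and inspect the coefficients
logit_model = sm.Logit(y_train_array, x_train_const).fit()
print(logit_model.summary())

# Predicted probabilities, thresholded at 0.5 to obtain class labels
y_test_pred_logit = (logit_model.predict(x_test_const) >= 0.5).astype(int)
```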
The performance of each model was evaluated using accuracy, precision, recall, and F1-score, and confusion matrices were generated to visualize the results.
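A short sketch of that evaluation with standard `sklearn.metrics` helpers, shown for the KNN predictions and assuming the true test labels live in a `y_test_array` variable (the same calls apply to every model):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, accuracy_score,
                             classification_report, confusion_matrix)

# Accuracy plus per-class precision, recall, and F1-score
print(accuracy_score(y_test_array, y_test_pred_knn))
print(classification_report(y_test_array, y_test_pred_knn))

# Confusion matrix, printed and plotted
print(confusion_matrix(y_test_array, y_test_pred_knn))
ConfusionMatrixDisplay.from_predictions(y_test_array, y_test_pred_knn)
plt.show()
```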
The project depends on the following Python libraries:

- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- statsmodels